Building Junto: On-Device ML on iOS with MLX-Swift
The Problem with Finance Apps
Most personal finance apps fall into two categories: they're either glorified spreadsheets, or they send all your data to a server and call it "AI-powered." Neither felt right to me.
Financial data is among the most sensitive information people have. I wanted to build something that actually reasons about your spending behavior, without your data ever leaving your device.
That's what Junto is: a finance app where the AI runs entirely on your iPhone.
Why MLX-Swift?
When Apple released MLX-Swift, it opened something genuinely new: running fine-tuned language models on-device with Apple Silicon efficiency. Not just classification models — actual LLMs capable of reasoning in natural language.
The pitch for Junto was simple: analyze your transactions, answer questions about your finances, and generate personalized insights. All locally. No API calls, no server, no data leaving your phone.
The Architecture
The stack ended up being more layered than I initially planned:
- SwiftUI + SwiftData for the interface and local persistence
- MLX-Swift for on-device LLM inference
- QLoRA fine-tuning on a base Qwen model, specialized for Brazilian personal finance reasoning
- RAG pipeline (VectorStore + EmbeddingService) to give the model context about the user's actual transactions and goals
- Pluggy API for Open Finance bank syncing — connecting to 300+ Brazilian banks automatically
- Gemini and Claude as cloud fallback for premium features
- Supabase for auth and subscription tier management
Fine-Tuning: What Actually Happened
I trained a QLoRA adapter on top of Qwen using MLX's training pipeline. The goal was to make the model understand Brazilian financial terminology, transaction categories, and give advice grounded in real user data rather than generic financial tips.
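For context on what the adapter is: LoRA freezes the base weights W and learns a low-rank update, W′ = W + (α/r)·B·A, with B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), and rank r ≪ d; only A and B are trained. QLoRA does the same while keeping the frozen base weights quantized to 4-bit, which is what makes fine-tuning a model this size feasible on modest hardware.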
The adapter training worked. The problem came at inference time.
The base model I started with, Qwen3 4B at 4-bit quantization, came in around 2–3 GB. Even quantized, it created serious memory pressure on most iPhones: loading was slow, memory warnings were constant, and on older devices it simply wasn't viable.
The fix: migrate to a 1.5B parameter model. Smaller, faster, still capable enough for the classification and chat tasks Junto needs. The LoRA adapter needed to be retrained for the new architecture — not ideal, but the right tradeoff.
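The back-of-envelope math behind that decision is simple: quantized weight memory is roughly parameters × bits per weight ÷ 8, before counting KV cache, activations, and quantization scale overhead (so real usage runs noticeably higher). A minimal sketch:

```swift
import Foundation

/// Rough estimate of quantized weight memory in GB.
/// Ignores KV cache, activations, and quantization scale overhead,
/// so actual memory usage is noticeably higher than this.
func weightMemoryGB(params: Double, bitsPerWeight: Double) -> Double {
    params * bitsPerWeight / 8 / 1_000_000_000
}

// 4B parameters at 4-bit ≈ 2.0 GB of weights alone;
// 1.5B parameters at 4-bit ≈ 0.75 GB.
print(weightMemoryGB(params: 4.0e9, bitsPerWeight: 4))
print(weightMemoryGB(params: 1.5e9, bitsPerWeight: 4))
```

With system overhead on top, the 4B model pushes past what most iPhones will tolerate, while the 1.5B model leaves headroom.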
The RAG Layer
One thing that made a real difference in response quality: giving the model access to the user's actual data at inference time.
I built a local RAG pipeline from scratch:
- Transactions, goals, and chat summaries get embedded and stored in a local VectorStore (SwiftData)
- At query time, the top-K most relevant items are retrieved and injected into the model's context
- The embedding model runs on-device via Core ML, with a hash-based fallback
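The retrieval step above can be sketched in a few lines. This is a simplified, in-memory illustration, not Junto's actual VectorStore API: embeddings are plain float arrays, and `retrieveTopK` is a hypothetical name.

```swift
import Foundation

/// Cosine similarity between two embedding vectors of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA * normB)
}

/// Score every stored item against the query embedding and keep the
/// top-K texts for injection into the model's context.
func retrieveTopK(query: [Float],
                  store: [(text: String, embedding: [Float])],
                  k: Int) -> [String] {
    store
        .map { ($0.text, cosineSimilarity(query, $0.embedding)) }
        .sorted { $0.1 > $1.1 }
        .prefix(k)
        .map { $0.0 }
}
```

In the real app the store is SwiftData-backed rather than an in-memory array, but the scoring and top-K selection work the same way.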
This means when you ask "why did I overspend in January?", the model actually has your January transactions in context — not just a generic prompt.
What's Still Hard
On-device ML on iOS is genuinely difficult in ways that aren't obvious from the outside:
Model size vs. device capability is a constant negotiation. What runs fine on an iPhone 15 Pro may crash on an iPhone 13.
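One pragmatic way to handle that negotiation is to gate the model tier on physical RAM at launch. A hedged sketch using Foundation's `ProcessInfo.processInfo.physicalMemory`; the 4 GB threshold and the tier names are illustrative assumptions, not measured cutoffs:

```swift
import Foundation

enum ModelTier: Equatable {
    case onDevice1_5B   // small local model fits comfortably
    case cloudFallback  // device too constrained for local inference
}

/// Illustrative capability gate: pick a model tier from device RAM.
/// The 4 GB threshold is an assumption, not a measured cutoff.
func selectModelTier(
    physicalMemoryBytes: UInt64 = ProcessInfo.processInfo.physicalMemory
) -> ModelTier {
    let gb = Double(physicalMemoryBytes) / 1_073_741_824
    return gb >= 4 ? .onDevice1_5B : .cloudFallback
}
```

A gate like this lets newer devices run locally while older ones degrade to the cloud path instead of crashing.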
The simulator is useless for ML. MLX needs the real Apple Silicon GPU and doesn't run in the iOS simulator, so every inference test requires a physical device.
Memory management is brutal. SwiftUI's memory model and MLX's memory requirements don't always cooperate. Auto-unloading the model on memory warnings, cache limits, and graceful degradation all had to be built manually.
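The auto-unload pattern looks roughly like the sketch below. To keep it platform-neutral and self-contained, it observes a custom notification name; on iOS you would observe `UIApplication.didReceiveMemoryWarningNotification` instead. The `ModelHolder` type is a hypothetical stand-in, not Junto's actual code:

```swift
import Foundation

/// Stand-in for UIApplication.didReceiveMemoryWarningNotification so
/// this sketch runs anywhere; on iOS, observe that notification instead.
extension Notification.Name {
    static let memoryWarning = Notification.Name("memoryWarning")
}

/// Hypothetical holder that lazily loads a model and drops it on
/// memory pressure, reloading transparently on the next use.
final class ModelHolder<Model> {
    private var model: Model?
    private let load: () -> Model
    private var observer: NSObjectProtocol?

    init(load: @escaping () -> Model) {
        self.load = load
        // queue: nil runs the handler synchronously on the posting thread.
        observer = NotificationCenter.default.addObserver(
            forName: .memoryWarning, object: nil, queue: nil
        ) { [weak self] _ in
            self?.model = nil  // graceful degradation: free now, reload later
        }
    }

    deinit {
        if let observer { NotificationCenter.default.removeObserver(observer) }
    }

    var isLoaded: Bool { model != nil }

    /// Returns the model, loading it on demand.
    func get() -> Model {
        if let model { return model }
        let fresh = load()
        model = fresh
        return fresh
    }
}
```

The expensive part this hides is the reload latency after an unload; in practice that cost is the price of staying alive under pressure.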
Current Status
Junto is in final development before App Store beta. The core features — transaction tracking, AI chat, goal management, and the Marketplace (AI-generated financial reports) — are all working. Backend auth via Supabase is integrated. The model migration to 1.5B is in progress.
If you want early access or want to talk about on-device ML on iOS, reach out.
Gustavo Barra Felizardo
CS Student at UFMG · Researcher @ FutureLab · Founder of Solitus & Junto