Let's say your client asked for a spending accountability agent, and you had to design an MCP (Model Context Protocol) for an AI agent around that client's pain points.
What goes in the protocol:
- Persistent Memory Schema: Custom JSON structure for what should always be remembered, e.g. categories and user-preferred labels like "fun money" instead of "entertainment" (sketched after this list).
- Temporal Awareness Layer:
  - Know the current month, quarter, and year.
  - Be able to compare "April 2025 vs March 2025" and "April 2025 vs April 2024".
- Event Threading: Link spending events to real-world context (e.g. "bought flowers → friend's birthday" or "KES 26,500 on skincare → Sephora haul").
- Change Detection Rules: AI flags anomalies: "This is 3x your usual coffee spend." Not guesses; real, percent-based alerts.
- Response Memory Rules: Let the agent remember user reactions: "You told me to chill last time I flagged skincare, so I will, unless it's double."
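As a concrete starting point, here is a minimal sketch of what that persistent memory schema could look like in Python. Every field name is illustrative, not a fixed spec:

```python
from typing import TypedDict

class CategoryAlias(TypedDict):
    canonical: str    # e.g. "entertainment"
    user_label: str   # e.g. "fun money"

class PersistentMemory(TypedDict):
    categories: list[str]         # canonical spend categories
    aliases: list[CategoryAlias]  # user-preferred labels
    alert_threshold_pct: float    # e.g. 200.0 = only flag at 2x the usual spend
    muted_categories: list[str]   # "you told me to chill" -> stop flagging these

# Illustrative instance
memory: PersistentMemory = {
    "categories": ["coffee", "dining", "skincare", "transport"],
    "aliases": [{"canonical": "entertainment", "user_label": "fun money"}],
    "alert_threshold_pct": 200.0,
    "muted_categories": ["skincare"],
}
```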
Tech Stack
Data Layer
- PostgreSQL for structured, transactional data (spend logs, metadata, user preferences).
- TimescaleDB extension if you want time-series smarts built-in for auto-rollups & fast comparisons.
API Ingestion
- Plaid or Stripe for banking transactions.
- Custom webhook endpoints to capture spend from manual inputs, emails, or receipts.
- Normalize data into this shape:
```json
{
  "amount": 23.90,
  "currency": "USD",
  "date": "2025-04-12",
  "merchant": "Starbucks",
  "category": "Coffee",
  "tags": ["latte", "morning", "work"]
}
```
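A normalizer for that shape might look like the sketch below. The raw field names loosely follow Plaid's transaction payload, so treat them as assumptions to verify against the real API:

```python
def normalize_txn(raw: dict) -> dict:
    """Map a raw Plaid-style transaction onto the shared shape above.
    Field names on `raw` approximate Plaid's payload; adjust to the real API."""
    return {
        "amount": abs(float(raw["amount"])),
        "currency": raw.get("iso_currency_code") or "USD",
        "date": raw["date"],  # Plaid already returns ISO "YYYY-MM-DD"
        "merchant": raw.get("merchant_name") or raw.get("name", "unknown"),
        "category": (raw.get("category") or ["uncategorized"])[0],
        "tags": [],  # filled in later from user input or a tagging pass
    }
```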
AI Agent Context Design
Memory Breakdown
- Short-Term Memory: Active month, past 3 months (in RAM or temp cache).
- Long-Term Memory: Year-to-year logs (pull from PostgreSQL only when needed).
- User Style Profile: How you like summaries (bullets? graphs? sass?).
Retrieval Protocol
- User asks "How was March?" → Query March table → Compare with Feb + last year's March → Run difference detection → Format response.
- Use LangChain or LlamaIndex with smart retriever logic (a plain-Python sketch follows this list):
  - Pull relevant chunks only.
  - The context window holds: current month, top spend areas, anomalies, and the last user feedback.
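Stripped of the framework, the retrieval step could be as plain as this (psycopg 3 against the Postgres schema; table and column names are assumptions):

```python
import psycopg

def build_context(conn: psycopg.Connection, month: str, prev: str, last_year: str) -> dict:
    """Pull only the chunks the LLM will need: per-category totals for the
    periods being compared, plus the most recent user feedback."""
    def totals(period: str):
        return conn.execute(
            "SELECT category, SUM(amount) AS total FROM spend_log "
            "WHERE to_char(date, 'YYYY-MM') = %s "
            "GROUP BY category ORDER BY total DESC",
            (period,),
        ).fetchall()

    return {
        "current_month": totals(month),            # e.g. "2025-03"
        "previous_month": totals(prev),            # e.g. "2025-02"
        "same_month_last_year": totals(last_year), # e.g. "2024-03"
        "last_feedback": conn.execute(
            "SELECT note FROM user_feedback ORDER BY created_at DESC LIMIT 1"
        ).fetchone(),
    }
```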
Summary Generator
Example output:
```json
{
  "summary": "You spent $2,389 in March, up 12% from February. Top increase: Dining (+$180). Decline: Transport (-$75).",
  "graph_reference": "spend_trend_mar2025.png",
  "alerts": ["Dining spend increased significantly. Want to set a budget alert?"]
}
```
- Use OpenAI, Claude, or fine-tuned DeepSeek to generate summaries from structured input only (force the AI to read the clean data; don't ask it to interpret raw logs).
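With the OpenAI Python SDK, forcing the model to read only the clean data could look like this sketch (the model name is a placeholder):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_summary(payload: dict) -> str:
    """The LLM formats; the backend has already done all the analysis."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {
                "role": "system",
                "content": "You write short spending summaries. Use ONLY the "
                           "numbers in the JSON you are given. Never infer, "
                           "estimate, or invent figures.",
            },
            {"role": "user", "content": json.dumps(payload)},
        ],
    )
    return response.choices[0].message.content
```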
Integration Architecture
```
[User Input / Banking API]
            |
    [Ingestion Layer]
 (Webhook + Normalizer)
            |
     [PostgreSQL DB]
            |
     [AI Agent Core]
      - Retriever
      - Analyzer
      - Generator
            |
[Monthly Summary Output]
(Text + Graph + Optional Alerts)
```
Optional: add voice input, like "Hey, how much did I blow on brunch this month?" → same flow, different entry point.
I would start by defining a structured memory protocol that treats every transaction as a timestamped, categorized event with optional user context.
The agent needs to understand the current time and be able to perform comparisons across various time periods without ambiguity. I would use PostgreSQL for storing transaction data, with a TimescaleDB extension for efficient time-series handling.
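As a sketch of what TimescaleDB buys you here: time_bucket() does the per-month rollup inside the database, so every comparison starts from one tidy result set (the spend_log schema is an assumption):

```python
import psycopg

# One row per (month, category): the raw material for every comparison.
MONTHLY_ROLLUP = """
SELECT time_bucket('1 month', date) AS month,
       category,
       SUM(amount) AS total
FROM spend_log
WHERE date >= %s AND date < %s
GROUP BY month, category
ORDER BY month, category;
"""

with psycopg.connect("dbname=spend") as conn:
    rows = conn.execute(MONTHLY_ROLLUP, ("2024-01-01", "2026-01-01")).fetchall()
```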
Database Alternatives to PostgreSQL + TimescaleDB
It's nice to have options, so here are a few alternatives:
1. SQLite
- I would use this if the project is local or lightweight.
- No server needed, just a file.
- Ideal for mobile/desktop agents with offline use.
- Works everywhere.
2. DuckDB
- I'd pick this for fast analytics and local summaries.
- Built for OLAP-style (analysis-heavy) queries.
- Zero setup. Reads from CSVs/Parquet directly.
3. ClickHouse
- I would use this if the transaction data grows huge.
- It's columnar, insanely fast for reads.
- Open-source, runs on low-cost hardware.
4. InfluxDB (OSS version)
- I'd use this if I wanted to optimize for time-series specifically without relying on Timescale.
- Built-in time-based functions.
- Open-source tier works for solo agents.
5. QuestDB
- I'd pick this if I wanted a super-light, native time-series engine.
- SQL syntax.
- Built for speed, even on low-resource setups.
All of these can run on local machines or VPS servers (like Hetzner, Linode, or Contabo: cheap, with global reach).
For regions with unreliable internet, I'd lean on SQLite or DuckDB for local-first storage.
For cloud sync or team use, ClickHouse or InfluxDB OSS would handle scale without needing expensive infrastructure.
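To make the local-first route concrete, here is a minimal DuckDB sketch; the file names and schema are illustrative:

```python
import duckdb

con = duckdb.connect("spend.duckdb")  # a single local file, no server

# Load a CSV export straight in; DuckDB infers the schema.
con.execute("""
    CREATE TABLE IF NOT EXISTS spend_log AS
    SELECT * FROM read_csv_auto('transactions.csv')
""")

# Per-category totals for one month, entirely offline.
march = con.execute("""
    SELECT category, SUM(amount) AS total
    FROM spend_log
    WHERE strftime(date, '%Y-%m') = '2025-03'
    GROUP BY category
    ORDER BY total DESC
""").fetchall()
```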
Bonus Tools for Integration
- Supabase: I'd use it if I wanted a hosted PostgreSQL + realtime layer without managing infra.
- Airbyte: I'd use it to pull data from bank APIs or CSVs.
- Metabase: I'd use this to visualize summaries with a dashboard (even offline).
I would define three memory layers in the agentās logic:
- A short-term window that focuses on the current and previous two months.
- A long-term comparison store for the past three years.
- A user preference layer that includes tag translations, ignored categories, and sensitivity settings for alerts.
I would use three memory layers as a baseline because they cover the key timeframes an agent needs to function well without overcomplicating the architecture.
If you're asking why just three (short-term, long-term, and user preference): I wouldn't stop there if the use case demands more.
What else I would add depending on the complexity:
4. Relational Memory
- I'd use this to link spending to events, people, or recurring life patterns.
- Example: "You usually spend more on gifts around April. Related to birthdays?"
5. Exception Memory
- I'd store any flagged anomalies the user dismissed or explained.
- Example: "You told me that KES 191,200 electronics bill was a one-off, so I won't compare it next year."
6. Goal/Constraint Memory
- I'd store the client's budget rules, goals, or monthly caps here.
- Used for comparing actual vs expected.
- Auto-adapts when they override or adjust.
7. Calendar/Contextual Layer
- I'd add a layer that maps spend to holidays, weekends, travel periods, or personal tags like "on call" or "vacation".
- Makes spend summaries smarter and more aligned with life context.
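Taken together, the baseline layers and the optional ones could live in a single container like this; the structure is a sketch, not a schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Baseline layers
    short_term: dict = field(default_factory=dict)    # current + previous two months
    long_term: dict = field(default_factory=dict)     # three years of rollups, lazy-loaded
    user_profile: dict = field(default_factory=dict)  # tag translations, ignored categories, alert sensitivity

    # Optional layers, added as the use case demands
    relational: dict = field(default_factory=dict)    # spend linked to events and people
    exceptions: list = field(default_factory=list)    # dismissed anomalies ("one-off" bills)
    goals: dict = field(default_factory=dict)         # budget caps, actual vs expected
    calendar: dict = field(default_factory=dict)      # holidays, travel, "on call" tags
```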
To compare periods, I'd use raw SQL with window functions.
SQL window functions let me compare each row of data to the previous one in a clean, precise way, right inside the database. That means I can calculate trends, changes, and patterns instantly, without moving the data around or risking messy logic. It’s fast, accurate, and built for exactly this kind of time-based comparison.
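A minimal version of that comparison, month-over-month change per category via LAG(), could read like this (same assumed spend_log schema as before):

```python
import psycopg

# Percent change per category vs the previous month, computed in-database.
MONTH_OVER_MONTH = """
WITH monthly AS (
    SELECT date_trunc('month', date) AS month,
           category,
           SUM(amount) AS total
    FROM spend_log
    GROUP BY 1, 2
)
SELECT month,
       category,
       total,
       LAG(total) OVER w AS prev_total,
       ROUND((100.0 * (total - LAG(total) OVER w)
             / NULLIF(LAG(total) OVER w, 0))::numeric, 1) AS pct_change
FROM monthly
WINDOW w AS (PARTITION BY category ORDER BY month)
ORDER BY month, category;
"""

with psycopg.connect("dbname=spend") as conn:
    rows = conn.execute(MONTH_OVER_MONTH).fetchall()
```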
No LLM would touch raw transaction data. LLMs are probabilistic. The AI layer would receive only the distilled diffs, trends, and anomalies. For monthly summaries, I'd calculate percentage changes in each spend category, detect statistically significant deviations, and bundle all that into a tightly scoped JSON object for the LLM to turn into a readable summary.
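That tightly scoped object could be assembled from the window-function rows like this; the 25% significance threshold is an assumption to tune per user, not a statistical test:

```python
def build_llm_payload(month: str, rows, threshold_pct: float = 25.0) -> dict:
    """Turn (month, category, total, prev_total, pct_change) rows from the
    window-function query into the only input the LLM ever sees."""
    diffs, alerts = [], []
    for _, category, total, prev_total, pct in rows:
        if pct is None:  # first month on record for this category
            continue
        diffs.append({
            "category": category,
            "total": float(total),
            "pct_change": float(pct),
        })
        if abs(float(pct)) >= threshold_pct:  # assumed significance cutoff
            alerts.append(f"{category} changed {float(pct):+.1f}% vs last month")
    return {"month": month, "diffs": diffs, "alerts": alerts}
```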
I would train or prompt the LLM using few-shot examples to generate reports in the user's preferred tone, with logic baked in to avoid false assumptions or filler commentary. The LLM would not decide what's important; it would only surface what my backend marked as relevant.
The whole thing would run on a scheduler. On the first of each month, the agent would generate a new report, compare it to previous months and the same month in previous years, and deliver that as a push notification, email, or Slack message depending on how I configured delivery.
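One way to wire that schedule, using APScheduler as a stand-in (the source only says "a scheduler", so this library choice is an assumption; the helpers are the sketches above or hypothetical):

```python
from datetime import date, timedelta
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job("cron", day=1, hour=6)  # 06:00 on the 1st of each month
def monthly_report():
    just_ended = date.today().replace(day=1) - timedelta(days=1)
    rows = run_month_over_month_query()  # hypothetical: executes the LAG() SQL above
    payload = build_llm_payload(just_ended.strftime("%Y-%m"), rows)
    deliver(generate_summary(payload))   # hypothetical delivery: push, email, or Slack

sched.start()
```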
For integration, I'd run the AI agent in a containerized backend. The API layer would be exposed to a local dashboard or mobile app, using WebSockets or REST for sync. If the user asked, "How did April compare to March?" the agent would instantly query the context window, pull structured results, and pass a compact payload to the LLM for formatting only.
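The REST side of that API layer could be a thin FastAPI wrapper; everything it calls is one of the earlier sketches or a hypothetical helper:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/compare")
def compare(month: str, against: str):
    """'How did April compare to March?' -> GET /compare?month=2025-04&against=2025-03"""
    rows = run_comparison_query(month, against)  # hypothetical: scoped LAG() query
    payload = build_llm_payload(month, rows)
    return {
        "summary": generate_summary(payload),  # LLM formats only
        "trace": payload,                      # the verifiable query trace
    }
```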
If the user wanted real-time insight mid-month, I'd support that with a rolling projection system that uses current spend rate versus historical average pacing. Everything would be queryable with natural language, but every natural-language response would map back to a verifiable query trace and include a clickable breakdown.
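The projection itself is simple arithmetic: extrapolate the daily run rate so far across the month, then compare with the historical average (the 20% over-pace threshold below is an assumption):

```python
import calendar
from datetime import date
from typing import Optional

def project_month_end(spent_so_far: float, today: date) -> float:
    """Extrapolate month-end spend from the current daily run rate."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spent_so_far / today.day * days_in_month

def pacing_alert(spent_so_far: float, today: date, historical_avg: float) -> Optional[str]:
    """Return a warning if the projection runs more than 20% over the usual month."""
    projected = project_month_end(spent_so_far, today)
    if projected > historical_avg * 1.2:  # assumed over-pace threshold
        pct_over = projected / historical_avg - 1
        return f"On pace for {projected:,.0f} this month, {pct_over:.0%} above your usual."
    return None
```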
I would keep the system closed-loop. Every interaction from the user would be logged, not for tracking, but to improve context. If they dismissed a warning twice, the system would soften it. If they flagged a summary as inaccurate, it would trigger a review of the upstream logic, not just the LLM phrasing.