I work across too many contexts at once. Three email accounts. Two calendars. Work Teams. Personal Slack. A Gitea instance for side projects. Home Assistant watching over a thousand devices. Notes in Joplin. Tasks scattered across wherever I dumped them last week.
For years I tried to manage this with dashboards, integrations, and sheer willpower. None of it stuck. What I actually wanted was something that already knew what was going on and could just tell me, without me having to ask in exactly the right way or remember which system held which piece of information.
So I built FinkBot.
What It Does
FinkBot is a personal AI assistant that runs entirely on my home network. It continuously ingests data from every corner of my digital life: email, calendar, chat, code repos, smart home sensors, task lists, notes. It indexes everything into a searchable memory store and uses that context to:
- Send a morning and evening briefing every day
- Prep me for meetings 30 minutes before they start, pulling together context on who I’m meeting with and what we’ve been working on
- Alert me to things that matter: security sensor trips, appliances left on, unusual comms gaps, infrastructure pressure
- Answer ad-hoc questions in Discord (“what did I send Dave last week?”, “who is attending this meeting?”)
- Suggest home automation actions and let me approve them with a single reaction
- Announce context-aware briefs to my Echo Show when I walk into a room
- Run a weekly self-reflection that evaluates its own output quality and proposes improvements
All without sending a single byte of personal data outside my LAN.
The Stack
Data Sources
FinkBot polls and ingests from:
- Email: three accounts, IMAP for personal and university, Microsoft Graph API for work, Dovecot archive for historical backfill
- Calendar: Microsoft 365 plus Nextcloud CalDAV for personal events
- Chat: Microsoft Teams DMs and Slack channels
- Code: Gitea commits, issues, and PRs across all my repos
- Notes: Joplin (everything I have ever written down)
- Tasks: Nextcloud Tasks via CalDAV
- Smart home: Home Assistant with over 1,000 entities covering presence, appliances, energy, sensors, climate, plus dedicated monitors for TrueNAS pool health and Proxmox container pressure
Each source has its own Prefect flow with a schedule tuned to how often that source changes. Email checks every 5 to 10 minutes. Smart home every 5. Git repos every 2 hours. Joplin and Slack every 30 minutes.
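The cadence logic is simple enough to sketch with the standard library (interval values come from the text above; the table and function names are hypothetical, since the real schedules live in Prefect deployments):

```python
from datetime import datetime, timedelta

# Hypothetical poll intervals mirroring the cadences described above.
POLL_INTERVALS = {
    "email": timedelta(minutes=5),
    "smart_home": timedelta(minutes=5),
    "git": timedelta(hours=2),
    "joplin": timedelta(minutes=30),
    "slack": timedelta(minutes=30),
}

def is_due(source: str, last_run: datetime, now: datetime) -> bool:
    """Return True when a source's poll interval has elapsed."""
    return now - last_run >= POLL_INTERVALS[source]
```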
Two-Tier Memory
FinkBot uses a two-tier memory architecture, and the distinction matters.
ChromaDB (chroma.crosscreek) is the raw recall layer. Over 257,000 documents, everything ingested from every source, embedded and stored. When FinkBot needs to answer “what did I send Dave last week?”, this is where it looks. Semantic search, no structure, just relevance.
MemU (memu.crosscreek) is the distilled long-term layer. It accumulates synthesized facts over time: things I’ve manually added with /context add, outputs approved from the weekly self-reflection, curated patterns. It is not a replacement for Chroma. It is the part of memory that has been thought about. Raw search and distilled knowledge serve different purposes and live in different stores.
A middleware layer in the memory client handles all query logging transparently. Individual flows don’t think about it. Queries from Discord get tagged separately from background automation queries so the self-reflection loop can distinguish what I’m actually asking about from what the system is doing on its own.
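A minimal sketch of that middleware idea, with hypothetical names (`LoggingMemoryClient`, the JSONL path) standing in for FinkBot's actual memory client:

```python
import json
import time

class LoggingMemoryClient:
    """Wrap a memory backend so every query is logged with its origin tag.

    Hypothetical sketch: `backend` is anything with a .query() method;
    the source tag separates interactive Discord queries from
    background automation, so downstream analysis can tell them apart.
    """
    def __init__(self, backend, log_path="query_log.jsonl"):
        self.backend = backend
        self.log_path = log_path

    def query(self, text, source="automation"):
        # Log transparently; individual flows never think about this.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"ts": time.time(),
                                "source": source, "q": text}) + "\n")
        return self.backend.query(text)
```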
The Entity Graph
Layered on top of the vector stores is a structured knowledge graph backed by Neo4j Community Edition, running at neo4j.crosscreek. It tracks four node types: Person, Company, Project, and Topic. Relationships include EMAILED, WORKS_AT, INVOLVED_IN, COLLABORATES_WITH, and others.
The move to Neo4j from an earlier embedded graph database was driven by one practical problem: the embedded approach had a single-writer bottleneck. Multiple flows running concurrently would contend for the write lock. Neo4j’s MVCC gives concurrent readers and writers without coordination overhead, and the Bolt protocol means any flow or API endpoint can connect remotely without file locking concerns.
The graph gets populated automatically from email processing, Joplin notes, Slack analysis docs, calendar attendee extraction, and a deterministic upsert on every Gitea repo. A dedicated backfill flow processes the historical Dovecot email archive on an hourly schedule, steadily growing the graph from years of past correspondence. A separate curation flow runs weekly to merge duplicate nodes and filter noise.
The entity graph answers questions the vector store cannot. “Who am I meeting with today, and what do I know about them?” The daily briefing pulls today’s calendar attendees, looks each one up in the graph, and injects a “People you’ll meet today” section into the prompt before it ever touches Chroma. Structured facts first, semantic context second.
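The "structured facts first" step might look something like this sketch, where a plain dict stands in for the Neo4j lookup (the function name and fact fields are assumptions):

```python
def people_section(attendees, graph_facts):
    """Build the "People you'll meet today" prompt section.

    Hypothetical sketch: `graph_facts` stands in for per-person
    entity-graph lookups, mapping a name to facts like employer
    and shared projects.
    """
    lines = ["People you'll meet today:"]
    for name in attendees:
        facts = graph_facts.get(name)
        if facts:
            lines.append(f"- {name}: works at {facts['company']}, "
                         f"projects: {', '.join(facts['projects'])}")
        else:
            lines.append(f"- {name}: no graph entry yet")
    return "\n".join(lines)
```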
The Flows Layer
All orchestration runs on Prefect, deployed to a 4GB/4-core LXC (prefect.crosscreek). Prefect replaced an earlier n8n-based setup. The drag-and-drop GUI approach was fine until I needed version control, testability, and the ability to do something non-trivial in a node. Python and git won.
Active flows currently running:
| Flow | Schedule | Purpose |
|---|---|---|
| ha_monitor | every 5 min | Security, appliances, presence, TrueNAS, Proxmox |
| meeting_prep | every 5 min | 30-min lookahead prep briefs |
| calendar_actions | every 5 min | Rule engine over upcoming meetings × HA state (pauses media, etc.) |
| ha_suggestions | every 5 min | Rule-based HA suggestions with one-tap Discord approval |
| proactive_voice | every 60 sec | Presence transitions trigger TTS brief on Echo Show |
| pattern_briefings | every 5 min | Fires focused briefs 15 min before scheduled pattern moments |
| email_monitor_* | 5 to 15 min | Three accounts, IMAP + Graph API |
| calendar_sync | every 30 min | M365 + Nextcloud to memory |
| slack_ingestor | every 30 min | Slack channels to memory |
| joplin_ingestor | every 30 min | Notes to memory |
| chat_ingestor | hourly | Teams DMs to memory |
| entity_backfill | hourly | Historical Dovecot emails to Neo4j entity graph |
| daily_briefing | 8am + 5pm ET | Morning and evening briefings |
| graph_curation | weekly | Neo4j merge suggestions, noise filtering |
| memu_curation | weekly | MemU near-duplicate cleanup |
| self_reflection | Sunday 8pm ET | Weekly synthesis, proposals, draft PRs |
| memory_defrag | Saturday 7pm ET | Expire stale entries, corpus stats |
| watchdog | continuous | Auto-cancel stuck runs, hard-kill past threshold |
A few things I have learned running these in production:
In-process concurrency guards beat deployment-level limits. High-frequency flows call check_self_concurrency() at startup and exit cleanly if another instance is already running. Relying on Prefect’s deployment-level concurrency limits alone left edge cases where crashed runs didn’t release their slots. Explicit guards are more reliable.
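FinkBot's real guard queries Prefect, but the exit-early pattern can be illustrated with a stdlib `flock` sketch (hypothetical lock path; `flock` releases automatically when a crashed process's file descriptor closes, which is exactly the property that makes an explicit guard safer than a slot counter):

```python
import fcntl

def check_self_concurrency(lock_path):
    """Return a held lock file if we are the only instance, else None.

    Hedged sketch, not FinkBot's actual implementation: flock locks
    follow the open file description, so a crashed process releases
    its lock when the OS closes the fd -- no stale slots to leak.
    """
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f  # keep the handle alive for the flow's lifetime
    except BlockingIOError:
        f.close()
        return None

# Typical use at flow startup:
#     lock = check_self_concurrency("/tmp/ha_monitor.lock")
#     if lock is None:
#         raise SystemExit(0)  # another instance is running; exit cleanly
```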
Startup reconciliation matters. After a crash or restart, the server can have zombie “running” states for flows that are no longer actually running. A startup reconciliation pass cleans these before new runs start, preventing phantom concurrency blocks.
CalDAV clients need timeouts. My NAS can be slow. A DAVClient without timeout=10 will hang indefinitely. Flows that run every 60 seconds cannot afford that.
Don’t alert on things you can’t fix. Meeting prep runs every 5 minutes. CalDAV errors go to print(), not the Discord error channel. Nobody needs a ping every 5 minutes because the CalDAV server hiccupped.
The API Bridge
A FastAPI service on port 8003 acts as the hub connecting flows, the Discord bot, and the kiosk. The bot never touches memory or the entity graph directly. It POSTs to the API and gets a response. Logic stays centralized.
Key endpoint groups: briefing and prep triggers, Home Assistant action execution (with Discord approval gating), entity graph CRUD, MemU memory management, pattern automation management, Prefect watchdog controls, kiosk announcement queue, and a transcription and TTS pipeline for the Echo Show.
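A stripped-down sketch of the approval-gating idea, with hypothetical names; the real bridge does this over HTTP, with Discord reactions supplying the approvals:

```python
import uuid

class ApprovalGate:
    """Gate external-effect actions behind an explicit approval step.

    Hypothetical sketch of the Discord approval gating: actions are
    queued with an id, and only execute once an approval (a reaction,
    in FinkBot's case) references that id. Unknown or already-handled
    ids are ignored, so double-taps are harmless.
    """
    def __init__(self):
        self.pending = {}

    def propose(self, action):
        action_id = str(uuid.uuid4())
        self.pending[action_id] = action
        return action_id  # surfaced to Discord for a one-tap reaction

    def approve(self, action_id, execute):
        action = self.pending.pop(action_id, None)
        if action is None:
            return None  # unknown or already-handled id
        return execute(action)
```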
There is also an HTTPS endpoint on port 8443 serving the kiosk dashboard. It required HTTPS because getUserMedia only works in a secure context.
The Echo Show Kiosk
The dashboard on my Echo Show (running vanilla Android) has become more than a status screen. It shows live calendar, unread message counts, current tasks, and a Home Assistant home status panel. More interestingly, it now listens for a wake word via openwakeword, streaming audio from the browser to a Python WebSocket server. When the wake word fires, it triggers a context-aware TTS brief. The proactive voice flow also pushes announcements to the kiosk when presence transitions happen, so walking into the room can trigger a summary of what’s coming up.
The Discord Bot
Discord is my primary interface. Current slash commands:
- `/brief` — trigger a morning or evening briefing on demand
- `/prep` — meeting prep for a specific event
- `/search` and `/search-email` — semantic memory and Dovecot archive search
- `/remember` — store a memory manually
- `/who` — entity graph person lookup
- `/context` — manage persistent knowledge file without SSH
- `/task` and `/quicktask` — Nextcloud Tasks management
- `/cal` — natural language calendar query
- `/memory` — corpus management (stats, forget, curate)
- `/reflect` — trigger self-reflection immediately
- `/status` — system health check
- `/ark` — ARK server management (yes, the game server lives here too)
Reaction handlers let me take action on messages without typing. Thumbs up and thumbs down on briefings feed the engagement feedback loop. Checkmark or X on HA suggestion messages triggers or dismisses the action. A book reaction on meeting prep saves notes to Joplin. A no-entry reaction blocks an email sender. These reactions are the primary interface for approving anything the system proposes.
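Conceptually the reaction handlers reduce to a dispatch table; this sketch uses stub handlers and hypothetical names rather than the bot's real discord.py wiring:

```python
# Stub handlers standing in for the real actions described above.
def record_feedback(msg, positive):
    return f"feedback:{positive}"

def run_ha_action(msg, approved):
    return f"ha:{approved}"

def save_notes_to_joplin(msg):
    return "joplin:saved"

def block_sender(msg):
    return "sender:blocked"

# Emoji -> handler dispatch table (hypothetical emoji choices).
REACTION_HANDLERS = {
    "👍": lambda m: record_feedback(m, True),
    "👎": lambda m: record_feedback(m, False),
    "✅": lambda m: run_ha_action(m, True),
    "❌": lambda m: run_ha_action(m, False),
    "📖": save_notes_to_joplin,
    "⛔": block_sender,
}

def dispatch_reaction(emoji, message):
    """Route a reaction to its handler; unknown emoji are ignored."""
    handler = REACTION_HANDLERS.get(emoji)
    return handler(message) if handler else None
```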
Design Decisions
Local-First, Always
No personal data leaves 192.168.48.0/24. That is the constraint the architecture is built around. Email, calendar, chat, smart home state: none of it touches an external service.
For LLM inference, the system runs a tiered approach. A 16GB M4 Mac Mini runs Ollama with qwen2.5:14b for heavy inference (briefings, meeting prep, self-reflection) and qwen2.5:7b for lighter work. Anthropic’s API is configured as a fallback and is also used for high-value one-off tasks like code change proposals, where output quality matters more than token cost. But the default path for everything is local.
This split was a deliberate choice. High-volume extraction work like the entity backfill runs local exclusively. Running the full Dovecot email archive through the Anthropic API would cost real money and send personal email content to an external service. Neither is acceptable. With local Ollama it costs nothing and stays on the network.
The fallback to Anthropic exists for resilience, not as a cost optimization. If the Mac Mini is down, the system degrades gracefully rather than going silent.
Why Qwen2.5? Instruction following is strong, JSON output mode is reliable (critical for entity extraction), and the quantized models fit the hardware. The 14B at Q4_K_M runs comfortably within 16GB unified memory while leaving headroom for everything else.
The Entity Graph Upgrade
The move from an embedded graph database to Neo4j is the biggest architectural change in the last few months. The embedded approach worked fine for read-heavy lookups but fell apart when multiple flows needed to write simultaneously. Kuzu’s single-writer model meant contention, and flows running on tight schedules can’t afford to queue behind each other for graph writes.
Neo4j’s MVCC handles concurrent writers cleanly. The Bolt protocol means any process on the network can connect without worrying about file locking. And having a proper query interface makes ad-hoc exploration and curation much easier. The graph curation flow runs weekly, suggesting node merges and filtering noise via Cypher queries that would have been awkward to express in the embedded model.
Two-Tier Memory Is Not a Migration
An earlier version of the architecture treated MemU as a replacement for ChromaDB, something to migrate to once the hardware was ready. The current design treats them as doing different things.
Chroma is raw storage. Everything ingested lands there. Semantic search across 257,000 documents is fast and works well. The limitation is that it treats every document as equally relevant. There is no way to ask “what do I actually know about this person?” and get a distilled answer rather than a pile of email snippets.
MemU is for things that have been synthesized. Approved self-reflection outputs. Manually added context notes. Curated patterns. It is smaller, more intentional, and represents knowledge that has been validated rather than just observed. Briefings and prep queries can pull from both layers and get different things from each.
The Self-Learning Loop
Every Sunday at 8pm, self_reflection runs. It reads two weeks of feedback: briefing reaction rates, Discord query patterns, entity graph growth, MemU accumulation. It queries memory across all sources for a weekly snapshot. It passes everything to an LLM and asks what is working, what is not, and what should change.
The synthesis produces two actionable outputs.
Context proposals are facts or patterns that should become permanent knowledge. Each one appears in Discord as a bookmark message. I react with a checkmark to approve or X to reject. Approved items get appended to /opt/finkbot/finkbot_context.txt, which is injected into every future self-reflection prompt. The system accumulates knowledge from its own outputs over time.
Code change proposals are improvement ideas formatted as PR descriptions. Each one opens a draft PR in the FinkBot Gitea repo and posts the URL to #insights. I review through normal CI/CD, or close it if the idea isn’t worth pursuing.
The pattern automation system extends this further. self_reflection can also write to a patterns file, which pattern_briefings reads to fire focused context briefs on a schedule. If self-reflection notices that Monday mornings are always context-switching heavy, it can propose a pattern that fires a tailored brief every Monday at 7:45am. I approve it once and it runs every week.
Infrastructure
Everything runs on a Proxmox cluster on my home network.
| Service | Host | Notes |
|---|---|---|
| Prefect flows + API | prefect.crosscreek (LXC 203) | 4GB RAM, 4 cores |
| ChromaDB | chroma.crosscreek | raw memory backend |
| MemU | memu.crosscreek (LXC 204) | distilled memory |
| Neo4j | neo4j.crosscreek | entity graph, Bolt protocol |
| Discord bot | prefect.crosscreek | thin bot, talks to API |
| Home Assistant | ha.crosscreek:8123 | 1000+ entities |
| Gitea | git.mystikos.org | source of truth + CI/CD |
| Ollama (14B + 7B) | 16GB M4 Mac Mini | primary inference |
CI/CD runs through Gitea Actions. Pushing to main rsyncs the relevant files to each target host and restarts the appropriate systemd services. The servers are not git clones. They are deploy targets. Code changes go through the repo, not SSH sessions on the server.
What’s Next
Since original publication (May 2026)
A month later, several “What’s Next” items have shipped, and a few new ones have emerged. The headlines:
A/B prompt variants are live. A modifier-delta framework in flows/common/prompt_variants.py picks a deterministic variant per ET calendar day (sha256 of flow:iso-date, modulo variants), so morning and evening briefings on the same day always share a variant. Engagement is logged per-variant via the existing 👍/👎 pipeline; self_reflection surfaces a per-variant comparison block in its weekly LLM prompt only when more than one variant has fired. Wired into daily_briefing, meeting_prep, and pattern_briefings. Promotion is still manual — at two briefings per day, engagement rate is too noisy for auto-promotion.
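The deterministic picker is small enough to show in full; this sketch follows the stated recipe (sha256 of `flow:iso-date`, modulo the variant count), though the exact digest-to-integer step is my assumption:

```python
import hashlib

def pick_variant(flow: str, iso_date: str, variants: list[str]) -> str:
    """Deterministic per-day A/B variant selection.

    Sketch of the recipe described above: hash "flow:iso-date",
    take it modulo the variant count. Because the date (not the
    time) is hashed, morning and evening briefings on the same
    day always land on the same variant.
    """
    digest = hashlib.sha256(f"{flow}:{iso_date}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```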
Time-scoped /ask. Queries that carry both a question phrase (“what was going on”, “recap”, “tell me about”) and a temporal marker (“last week”, “in March”, “Q1”) now route through api/temporal_intent.py: window parsing (deterministic fast path with an LLM fallback), MemU recall + by-ID fetch of pattern/anomaly detection docs from that period, and Anthropic synthesis. The pipeline also cross-references current Nextcloud tasks so historical task mentions get noted as resolved when they’re no longer open. Intent detection requires both signals — either alone produced too many false positives.
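A deterministic sketch of the both-signals-required intent check (the marker lists are illustrative and far shorter than whatever `api/temporal_intent.py` actually matches, and the LLM fallback is omitted):

```python
import re

# Illustrative marker lists; the real detector is richer.
QUESTION_PHRASES = [
    r"what was going on",
    r"recap",
    r"tell me about",
]
TEMPORAL_MARKERS = [
    r"last week",
    r"in (january|february|march|april|may|june|july"
    r"|august|september|october|november|december)",
    r"q[1-4]",
]

def is_temporal_query(text: str) -> bool:
    """Require BOTH a question phrase and a temporal marker.

    Either signal alone produced too many false positives, per the
    text above, so only the conjunction routes to the temporal path.
    """
    t = text.lower()
    has_question = any(re.search(p, t) for p in QUESTION_PHRASES)
    has_time = any(re.search(p, t) for p in TEMPORAL_MARKERS)
    return has_question and has_time
```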
Pattern automations end-to-end. pattern_detector proposes ⏰ automations when it spots a recurring behaviour with a clear schedule. Mark taps ✅, the bot writes to /opt/finkbot/patterns.jsonl, and a new pattern_briefings flow polls every 5 minutes and fires a focused pre-briefing 15 minutes before each pattern’s next cron-scheduled time. /patterns list/add/remove Discord commands round it out.
Two new self-improvement loops.
Decision journal: every privileged reaction (task/context/graph-merge approvals, HA actions, watchdog kills, blocklist edits) writes a row to feedback_log.jsonl. self_reflection mines this weekly to propose suppression rules like “Mark rejected 4 task proposals from Client X this week — suppress them.”
Thumbs-down post-mortem: a 👎 on a briefing fires a fire-and-forget task that compares the disliked briefing to recent 👍’d baselines, asks the LLM for a single-sentence suppression rule, and routes it through the existing context-proposal approval flow. Approved rules append to finkbot_context.txt and feed every future reflection prompt.
Incident learning. Watchdog auto-cancels and startup-reconcile cleanups now log to feedback_log.jsonl with noise-filter thresholds (5 zombies, 20 backlog) so only anomalous resilience events surface. self_reflection mines them and proposes timeout/threshold tweaks as draft Gitea PRs. The thresholds are load-bearing: a clean restart after every deploy clears 1–3 zombies, and without the floor the weekly report would propose “fix deploy restart” every week and drown the real signal.
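The noise floor reduces to a couple of constants; whether the thresholds are inclusive is my assumption:

```python
# Thresholds from the text: surface only anomalous resilience events.
ZOMBIE_THRESHOLD = 5
BACKLOG_THRESHOLD = 20

def should_surface(zombies: int, backlog: int) -> bool:
    """A clean restart clears 1-3 zombies; only counts past the
    floor are worth a mention in the weekly report."""
    return zombies >= ZOMBIE_THRESHOLD or backlog >= BACKLOG_THRESHOLD
```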
Home Assistant action suggestions. Two new flows post 🏠 one-tap action suggestions to #alerts. ha_suggestions (rule engine over HA state, e.g. “everyone’s away and the front door is unlocked — lock it?”) and calendar_actions (rule engine over upcoming meetings × HA state, e.g. “pause the media player — standup starts in 5 min”). Both share a narrow allowlist keyed by domain/service so a misbehaving rule can at worst propose something the allowlist rejects.
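The allowlist check itself is tiny; these (domain, service) entries are illustrative, not FinkBot's actual list:

```python
# Hypothetical allowlist keyed by (domain, service), as described above.
HA_ALLOWLIST = {
    ("lock", "lock"),
    ("media_player", "media_pause"),
    ("light", "turn_off"),
}

def is_allowed(domain: str, service: str) -> bool:
    """A misbehaving rule can at worst propose something this rejects."""
    return (domain, service) in HA_ALLOWLIST
```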
Chroma HNSW rebuild (2026-04-25). Long-running where-filter 500s caused by orphaned IDs in the original collection are gone. scripts/rebuild_chroma_hnsw.py copied 260,022 docs into a fresh finkbot_memory_v2 collection — 255,835 with preserved embeddings, 4,200 (slack/gitea orphans) re-embedded via the proxy’s default embedding function. The 500-retry-without-where fallback in the client stays as a canary; if it fires again, something regressed.
Mac Mini hardening. Intermittent 1–2 hour Ollama outages through April were traced to two causes. First: launchctl setenv is per-launchd-session and lost on reboot, so an earlier OLLAMA_NUM_PARALLEL=2 quietly dropped to 1 after the next reboot, serializing every caller behind whichever flow was currently running and starving live flows during backfills. Fixed by baking the env var into a LaunchDaemon plist. Second: macOS’s manual Wi-Fi mode installs only interface-scoped default routes, which Go-based clients (Ollama included) ignore for outbound TCP. A second LaunchDaemon now installs a global default route at boot. A 2026-04-27 outage was a third issue: Ollama.app autostarted from Login Items, won the port-11434 race, and the LaunchDaemon failed to bind 78 times in a row. Removing the Login Item closed it. The new ollama_health_check flow catches any future variant in <10 minutes.
Qwen3 rollback (2026-04-21). Tried upgrading to qwen3:14b/qwen3:8b. Qwen3 ships with thinking-mode on by default, adding 60–180 s of internal monologue per call. Within two hours, three flows had blown their timeouts. Rolled back the same day. Models stay on disk pending an /api/chat-with-think:false switch — or the 48GB Mac Mini, where the thinking budget will be affordable.
Neo4j round-trip optimization. get_person_context() collapsed from six Cypher queries to one with CALL {} subqueries. /who warm latency dropped from ~40ms to ~12ms.
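For readers unfamiliar with `CALL {}` subqueries, a hedged sketch of the consolidated shape (node labels and relationship types come from the graph schema above; the property names and exact lookups are assumptions, not FinkBot's actual query):

```cypher
MATCH (p:Person {name: $name})
CALL {
  WITH p
  OPTIONAL MATCH (p)-[:WORKS_AT]->(c:Company)
  RETURN collect(DISTINCT c.name) AS companies
}
CALL {
  WITH p
  OPTIONAL MATCH (p)-[:INVOLVED_IN]->(pr:Project)
  RETURN collect(DISTINCT pr.name) AS projects
}
RETURN p, companies, projects
```

Each subquery runs against the already-matched person, so the driver pays one round trip instead of one per relationship type.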
IDBot bootstrapped. A sister project — same architecture, separate repo, separate everything — for Instrumental Identity. Slack instead of Discord, GitLab instead of Gitea, OneDrive instead of Joplin, qwen2.5:72b on the incoming 48GB Mac Mini, namespace "company" instead of "personal". Namespace isolation in Chroma and Neo4j is defence-in-depth; the primary isolation is physical (separate hosts, separate model weights).
What’s still next
- Multi-step tool-use loop in chat: bounded agent loop with a tool manifest (calendar, memory, HA read/write, task create), hard step cap, and an approval reaction before any external-effect tool fires. The hard problem isn’t the loop — it’s the UX. A 30-second synchronous reply is unacceptable; this needs a “working on it…” ack with async completion.
- Raw Chroma in temporal queries: the corpus has no indexed `timestamp` metadata, so post-filtering 260k docs by parsing bodies is too slow. Either backfill a `date` field or wait for the Chroma proxy to gain a date-range query.
- 48GB Mac Mini arriving ~early June: Ollama with `qwen2.5:72b`, IDBot’s primary inference host, second Neo4j and Chroma instances for the company namespace.
- IDBot ↔ FinkBot cross-pollination: deferred until IDBot has ≥1 month of real data. Transport is solved by namespacing; the hard problem is the summary taxonomy — what’s safe for one bot to surface to the other. Design cold and you invite leaks.
Closing Thoughts
The thing that surprised me most building this was how much of the value comes from the plumbing, not the LLM. The dedup tracker. The concurrency guards. The two-tier memory split. The entity graph that knows Dave works at the same company I do and we’ve emailed 47 times. The feedback log that quietly records every reaction and query without any individual flow caring about it.
The LLM is the tip of the iceberg. Everything below it is data engineering and operational discipline.
If I had to do it over I would have started with Prefect instead of n8n. The version control alone was worth the migration cost. I would have put the dedup tracker in on day one. And I would have moved to Neo4j earlier. The embedded graph was fine until it wasn’t, and the migration was more work than switching from the start would have been.
