When I first dipped my toes into the world of self‑hosting, the idea that kept me awake at night was simple yet powerful: unlimited, private chat memory. ChatGPT had been my go‑to for quick answers, brainstorming, and drafting. But I imagined a personal “second brain” that would remember every nuance of my dialogues without sending anything off the local network. The dream was privacy first, effectively infinite memory, and an AI companion that grew organically with me.
That was the spark. But the path from spark to reality is rarely a straight line. I created a sandbox for experimentation, and with each project I learned a new lesson that nudged the goalposts.
The first detour – RAG (Retrieval‑Augmented Generation)
I basically wanted something I could immediately use for referencing large amounts of documents. Hence, RAG.
I then tried Langflow, hoping it could orchestrate complex pipelines without wrestling with code. Langflow offered an intuitive node‑based UI, and I was excited to wire together data ingestion, embedding, and response generation in a few clicks.
Reality, however, delivered a sobering lesson: Langflow, while powerful, lacks the out‑of‑the‑box components I needed for legal workflows (think clause extraction, redlining, or document comparison). I had to write custom functions, debug JSON outputs, and sometimes revert to pure Python scripts. The detour was frustrating, but it sharpened my focus on the specific gaps in the ecosystem—particularly the absence of a true redlining tool.
In the middle of these detours, I stumbled upon Superdoc—a self‑hosted, open‑source, browser‑based DOCX editor. Superdoc was ideal for building a redlining extension because I didn't have to mess with MS Word itself. The architecture is lightweight, and I could inject a sidebar that connects to my LLM to highlight changes, track edits, and even generate diffs.
I spent months wrestling with the diff logic (the “nitty‑gritty” part that felt trivial to a “tech bro” but was actually a maze). Eventually, I had a prototype that could annotate a contract and suggest revisions. The whole process reminded me that building something valuable often means filling a gap that the community overlooked. And guess what, I found out today that an "old internet friend" had ALREADY built such a tool (redlines, in Python!).
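For context on what that tool does: the redlines package on PyPI compares two pieces of text and marks up the insertions and deletions, which is more or less the diff logic I'd been hand‑rolling. A minimal sketch, assuming the Redlines class and its output_markdown attribute work the way I remember (check the project's docs before relying on it):

```python
# Sketch using the PyPI "redlines" package (pip install redlines).
# Assumption: Redlines(source, revised) and .output_markdown behave as remembered.
from redlines import Redlines

original = "The Supplier shall deliver the Goods within 30 days of the Order Date."
revised = "The Supplier shall deliver the Goods within 14 days of the Order Date."

diff = Redlines(original, revised)

# Markdown with strikethrough/insert markup showing exactly what changed between clauses.
print(diff.output_markdown)
```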
Two days (or more!) of memory research
So I've been on and off researching how to obtain persistent, unlimited memory. The short answer is: It's Hard to do Perfectly. Think of things like creating ChromaDB vector databases and having a pipeline to embed EVERY chat window and everything you ever said. Not really possible in 2-3 days!
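To give a sense of the plumbing involved, here's roughly what "embed every chat message into ChromaDB" looks like. The sketch below is the easy part; the hard part is wiring this into every chat window, deciding what to retrieve, and keeping it all in sync. Collection name and metadata fields are made up for illustration:

```python
# Sketch of the "embed everything" approach with ChromaDB (pip install chromadb).
# Chroma's default embedding function (a small sentence-transformers model) does the embedding;
# the collection name and metadata fields here are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./chat_memory")
memory = client.get_or_create_collection(name="chat_history")

# Store every message with enough metadata to reconstruct context later.
memory.add(
    ids=["chat42-msg7"],
    documents=["I prefer concise answers and I'm running everything on a 24GB GPU."],
    metadatas=[{"chat_id": "chat42", "role": "user"}],
)

# At question time, pull the most relevant past messages back into the prompt.
results = memory.query(
    query_texts=["What hardware am I running my models on?"],
    n_results=3,
)
print(results["documents"][0])
```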
But perfection is the enemy of good, as they say, so I turned back to humble Open-WebUI (actually, no longer so humble: from a simple chat GUI, it's become a behemoth with a plethora of features) and found the Adaptive Memory extension for Open‑WebUI. This extension adds dynamic, evolving memory to the LLM session. Instead of storing every message, Adaptive Memory keeps a summarized and relevant subset of past interactions, growing and pruning itself as the conversation evolves. The description from the plugin page states:
“Adaptive Memory enables dynamic, evolving, personalized memory for LLMs in OpenWebUI, making conversations more natural and responsive over time.” (openwebui.com)
The beauty of this approach is that it’s low‑effort and low‑maintenance. I didn’t have to build a custom memory engine; I just enabled a plugin that does the heavy lifting. This aligns with my ethos: use ready‑made whenever possible, custom‑build only when necessary.
I was excited to test it. With a few configuration tweaks, the AI could remember my preferences. Well, maybe not just a few tweaks. I had to find a good model that could deal with the convoluted JSON system prompt (TLDR: Gemma3:27b works great. Qwen3, GPT-OSS all CMI). So now I have to allocate something like 12GB+ of memory to Gemma just as a helper model that processes each chat to store/retrieve memories. That's fine because Ollama can handle multiple models loaded at the same time as long as you have the juice/RAM for it, and I don't need a 100B model just for my personal chats. Sticking with GPT-OSS-20b for now, but up for recommendations for strong 20-30b models (Don't tell me Qwen3).
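Roughly, the division of labour is: the main model chats, and the smaller helper model quietly turns each exchange into structured memories. Here's a sketch using the ollama Python client; the prompt and JSON shape are invented for illustration (Adaptive Memory's actual system prompt is far more involved):

```python
# Sketch of a helper model extracting memories from a chat turn via Ollama (pip install ollama).
# The system prompt and JSON schema below are made up for illustration;
# Adaptive Memory's real prompt is much more elaborate.
import json
import ollama

chat_turn = "User: I'm a lawyer, mostly contracts work, and I hate verbose answers."

extraction = ollama.chat(
    model="gemma3:27b",  # the helper model that copes with structured JSON output
    format="json",       # ask Ollama to constrain the reply to valid JSON
    messages=[
        {"role": "system", "content": "Extract durable facts about the user as "
                                      'JSON: {"memories": ["..."]}. No prose.'},
        {"role": "user", "content": chat_turn},
    ],
)

memories = json.loads(extraction["message"]["content"])["memories"]
print(memories)  # e.g. ["User is a contracts lawyer", "User prefers concise answers"]
```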
The final reflection
Looking back, the original goal—an unlimited, private chat memory—has morphed into something more nuanced. It’s not a single, all‑encompassing memory system; it’s a layered, adaptive approach that works within my local infrastructure, respects the limits of my GPU and RAM, and stays true to the privacy I demanded.
The journey taught me that self‑hosting is as much about choosing the right abstractions as it is about hard work. Each detour (RAG, Langflow, Superdoc) added a new tool to my toolbox and a new lesson to my mind. And the final piece—Adaptive Memory—unlocks the door I had envisioned a year ago.
Going forward, what I hope to learn next is ACTUAL coding at some level. I've been hacking it a bit too much. Maybe it's time to sit down with an annotated version of my Superdoc diff extension (2000 lines of code!!) and read it to see what it's actually all about!