Your AI's Knowledge Graph Is Lying to You

I audited the knowledge graph behind my AI memory system. 45,000 nodes, 102,000 edges, and it still couldn't tell me what Qdrant was. Scale without integrity is noise.

I built a memory system for my AI agents. Named it Aianna. It runs on Qdrant for vector search and Neo4j for graph traversal. 14,600 chunks of conversation history, decisions, lessons, and meeting notes. A graph classifier automatically extracts entities and relationships from every chunk and writes them into the knowledge graph.

On paper, it looks impressive. 45,000 nodes across 12 entity types. 102,000 edges across 20 relationship types. Person, Machine, Tool, Project, Decision, Lesson. All connected.

On Friday, I ran a quality audit. Third attempt. The first two failed because the specs I wrote for my engineering agent weren't concrete enough. Third time, I got specific: exact curl commands, local execution only, print output after every step, and if something fails, skip it and keep going.

The audit completed. The results were not what I wanted to hear.


The graph had scale. It did not have integrity.

Here's what I found when I pulled 10 random chunks from Qdrant and cross-referenced them against Neo4j:

Qdrant, the vector database that stores all of Aianna's memories, appeared in the graph as 10 different entities. Ten. Across five different label types: Tool, Machine, Company, Concept, and Project. Same real-world thing. Ten nodes that don't know they're the same entity.
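Duplication like that is easy to surface once you look for it: group nodes by a normalized name and count how many nodes and label types each name maps to. Here's a minimal sketch of that check, not the actual audit script; the node dictionaries and field names are assumptions, and in a real audit the nodes would come from a Neo4j query rather than an inline list.

```python
from collections import defaultdict

def find_duplicate_entities(nodes):
    """Group nodes by a normalized name and flag any name that maps to
    multiple nodes or multiple label types."""
    groups = defaultdict(list)
    for node in nodes:
        key = node["name"].strip().lower()
        groups[key].append(node)

    duplicates = {}
    for key, members in groups.items():
        labels = {m["label"] for m in members}
        if len(members) > 1 or len(labels) > 1:
            duplicates[key] = {"count": len(members), "labels": sorted(labels)}
    return duplicates

# Hypothetical nodes illustrating the kind of fragmentation the audit found.
nodes = [
    {"name": "Qdrant", "label": "Tool"},
    {"name": "qdrant", "label": "Machine"},
    {"name": "Qdrant ", "label": "Company"},
    {"name": "Neo4j", "label": "Tool"},
]

print(find_duplicate_entities(nodes))
# {'qdrant': {'count': 3, 'labels': ['Company', 'Machine', 'Tool']}}
```

The normalization here is deliberately naive (lowercase plus whitespace trim); even that would have caught most of the ten Qdrant variants.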

Kush, the Mac Mini that runs the brain infrastructure, existed as 8 variants. It was simultaneously classified as a Machine, a Project, and a Tool. Aianna herself had 10 or more variants scattered across the graph.

The classifier was creating new nodes for every mention instead of resolving to canonical entities. Every time a conversation referenced "Qdrant," the classifier treated it like a new discovery.


The edge problem was just as bad.

Nearly half of all relationships in the graph, 48.8%, were typed as DISCUSSED. That's the graph equivalent of labeling every file on your computer "misc." When half your edges say the same thing, your graph has structure without meaning.

The remaining edge types were more specific: USES, BUILT_BY, DEPLOYED_ON, DECIDED. Those are the relationships that actually help an AI agent understand how things connect. But they were drowning in a sea of DISCUSSED.
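A simple distribution check catches this kind of skew before it takes over a graph. This is a hedged sketch with a hypothetical edge list shaped roughly like the audit's numbers; the 40% dominance threshold is an arbitrary choice, not a standard.

```python
from collections import Counter

def edge_type_report(edges, dominance_threshold=0.4):
    """Return each relationship type's share of all edges, plus any
    type whose share exceeds the dominance threshold."""
    counts = Counter(e["type"] for e in edges)
    total = sum(counts.values())
    shares = {t: n / total for t, n in counts.most_common()}
    dominant = [t for t, s in shares.items() if s > dominance_threshold]
    return shares, dominant

# Hypothetical edges mirroring the skew the audit found.
edges = (
    [{"type": "DISCUSSED"}] * 49
    + [{"type": "USES"}] * 20
    + [{"type": "BUILT_BY"}] * 16
    + [{"type": "DEPLOYED_ON"}] * 15
)

shares, dominant = edge_type_report(edges)
print(dominant)  # ['DISCUSSED'] — one type carrying ~half the graph
```

Run as a recurring check during backfill, a report like this would flag a catch-all edge type long before it reaches 48.8%.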

I also found session tag leakage. Tags from one conversation, like "Addium" or "AROYA," were bleeding onto completely unrelated chunks. Chinese medicine discussions tagged with cannabis industry metadata. Credit card conversations tagged with cultivation technology. The classifier was applying session-level context to individual chunks without checking whether the content actually matched.
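One cheap guard against that leakage is to apply a session-level tag to a chunk only when the chunk actually mentions it. This is a crude, hypothetical heuristic, not what Aianna ships; realistic matching would probably need alias lists or embedding similarity rather than substring checks.

```python
def filter_session_tags(chunk_text, session_tags):
    """Keep a session-level tag only if the chunk itself mentions it.
    Substring matching is a deliberately crude stand-in for real
    content-vs-tag relevance checking."""
    text = chunk_text.lower()
    return [tag for tag in session_tags if tag.lower() in text]

# Hypothetical chunk from a session whose tags don't all apply to it.
chunk = "Notes on a Chinese medicine consultation and herb dosages."
tags = ["AROYA", "Chinese medicine"]

print(filter_session_tags(chunk, tags))  # ['Chinese medicine']
```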


Scale made the problem worse, not better.

This is the part that matters for anyone building AI memory systems.

The instinct is always to add more data. More conversations ingested, more chunks embedded, more nodes in the graph. The assumption is that volume creates intelligence. It doesn't. Volume without entity resolution creates confusion at scale.

With 45,000 nodes, a graph traversal query for "what do we know about Qdrant" would return 10 different subgraphs that don't connect to each other. The AI agent following those paths would see 10 separate entities with their own relationship networks, none of them linking back to a single canonical truth.

That's worse than having no graph at all. At least without a graph, you fall back to vector similarity search, which returns relevant chunks ranked by semantic distance. With a broken graph, you get false precision. The system looks like it knows the answer, when that answer is actually fragmented across entities that should be one.


The fix was surgical, not incremental.

I didn't try to patch the existing graph. I tuned the classifier prompt with three specific changes: raised the confidence threshold from 0.6 to 0.7, added a canonical entity table to the system prompt so the model knows "Qdrant" is always a Tool, and added metadata isolation rules so session tags don't leak across chunks.
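Those three changes can be sketched as a post-processing filter over the classifier's raw extractions. The canonical table, field names, and function below are hypothetical illustrations of the approach, not Aianna's actual code; in the real system the canonical table lives in the prompt itself.

```python
# Hypothetical canonical table: lowercase alias -> (canonical name, label).
CANONICAL = {
    "qdrant": ("Qdrant", "Tool"),
    "neo4j": ("Neo4j", "Tool"),
    "kush": ("Kush", "Machine"),
    "aianna": ("Aianna", "Project"),
}

def resolve_extractions(extractions, min_confidence=0.7):
    """Drop low-confidence extractions and snap known mentions to their
    canonical entity and label, so repeat mentions share one node."""
    resolved = []
    for ex in extractions:
        if ex["confidence"] < min_confidence:
            continue  # threshold raised from 0.6 to 0.7
        name, label = CANONICAL.get(
            ex["name"].strip().lower(), (ex["name"], ex["label"])
        )
        resolved.append({"name": name, "label": label})
    return resolved

# Two Qdrant mentions with conflicting labels collapse to one identity;
# the low-confidence extraction is dropped entirely.
raw = [
    {"name": "qdrant", "label": "Company", "confidence": 0.9},
    {"name": "Qdrant", "label": "Machine", "confidence": 0.8},
    {"name": "maybe-a-tool", "label": "Tool", "confidence": 0.65},
]

print(resolve_extractions(raw))
# [{'name': 'Qdrant', 'label': 'Tool'}, {'name': 'Qdrant', 'label': 'Tool'}]
```

The point of the design: the classifier no longer decides what Qdrant is on every mention. It either matches the canonical table or proposes a genuinely new entity above the confidence bar.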

Then I wiped the entire graph and started the backfill from scratch.

That decision, wipe versus patch, is the one most teams get wrong. When your entity resolution is fundamentally broken, incremental fixes just add complexity on top of a bad foundation. You end up with merge rules, dedup jobs, and reconciliation logic that's harder to debug than the original problem. Sometimes the right call is to burn it down and rebuild with better rules.

The re-backfill ran overnight. 14,600 chunks through an LLM classifier, each one extracting entities and relationships with the tuned prompt.


What this means for anyone building AI memory.

Three things I'd tell any team standing up a knowledge graph for AI agents:

First, audit early. I should have run this quality check after the first 1,000 chunks, not after 14,600. The entity duplication problem was there from the beginning. It just wasn't visible until the graph got large enough for the duplicates to fragment meaningful queries.

Second, entity resolution is the foundation, not a feature. If your classifier can't reliably say "this mention of Qdrant is the same Qdrant as that mention," nothing built on top of the graph will work correctly. Graph traversal, relationship inference, multi-hop reasoning. All of it depends on canonical entities.

Third, edge type diversity matters. If one relationship type dominates your graph, your graph is essentially a flat list with extra steps. Specific, typed relationships are what make graph queries more powerful than vector search. USES, DEPLOYED_ON, DECIDED_AGAINST. Those edges encode knowledge that similarity scores can't capture.

The graph is rebuilding now. The tuned classifier is producing cleaner entities in testing. But the lesson is clear: I built a system that looked intelligent at scale and was actually fragmenting knowledge faster than it was organizing it.

The numbers were impressive. The understanding was broken. And I only found out because I stopped building and started auditing.