When DuckDB FTS Meets Async MCP: An AI-Assisted Debugging Nightmare
The Deceptively Simple Setup
I was using Claude Code to build a MCP server into Detection Nexus, a security detection rule database that I’m working on. After some research, the architecture seemed straightforward:
- FastMCP framework for protocol handling, I’d used it before on previous projects
- DuckDB for fast queries with Full-Text Search (FTS)
- BM25 scoring for relevance ranking
The server worked perfectly in isolation and the core repository features worked perfectly. Then I started on the MCP server, and the basic tools looked good, until I started the implementation of the actual search tool.
The Mystery
Once I connected the MCP Server to a client like Claude Desktop, the first search query would hang indefinitely. Not a slow response - an infinite hang. The MCP client would wait forever for a response that would never arrive. I even left it running overnight and came back to find it still waiting after 6+ hours.
This pattern was consistent:
- Server already running → Query works immediately ✅
- Server restart → First query hangs forever ❌
- Cancel and immediately retry → Query works immediately ✅
The critical detail: server logs showed the query completed successfully within milliseconds. DuckDB returned results almost instantly, everything appeared fine server-side. But somehow, the response never reached the MCP client.
# Server logs showed query completed in milliseconds:
2025-08-27 09:58:24.609 - Tool call received: "malware", limit=5
2025-08-27 09:58:24.662 - FTS validation completed (52ms) ✅
2025-08-27 09:58:24.741 - Search completed, 5 rules returned (132ms total) ✅
2025-08-27 09:58:24.741 - === MCP TOOL CALL COMPLETED === ✅
# But client hung indefinitely waiting for response
This critical detail didn’t immediately register with either me or Claude Code. It took a few days of chasing dead ends until I realised what was going on.
The Dead Ends That Cost Me Days
Dead End #1: The FTS Scapegoat
My first assumption was that DuckDB FTS BM25 operations were incompatible with async MCP response transmission. I disabled FTS entirely and fell back to LIKE pattern matching. Problem solved!
# WRONG: Bandaid solution that masks the real problem
def _validate_fts_requirements(self) -> bool:
# FTS permanently disabled for MCP compatibility
return False # This just made hangs less frequent!
Except it wasn’t solved. “It still hangs sometimes, just not as often.” I’d masked the symptom, not fixed the cause.
This is where working with Claude Code became challenging. Claude Code declared victory: “Great! The FTS incompatibility is resolved. Your server should work reliably now.” But real-world testing proved otherwise. Working with Claude Code on complex race conditions requires persistence - you need to push back when it declares premature victory.
Dead End #2: The Threading Timeout Disaster
Next theory: timeout mechanism was causing the hangs. The MCP server used run_with_timeout()
to prevent queries from running forever:
# CATASTROPHIC CODE (Removed):
def run_with_timeout(func, timeout_seconds: int, *args, **kwargs):
def target():
result[0] = func(*args, **kwargs) # Uses shared database connection!
thread = threading.Thread(target=target)
thread.start()
thread.join(timeout_seconds)
# Problem: Multiple threads accessing same DuckDB connection
Critical issues:
- Thread Safety Violation: DuckDB connections are not thread-safe
- Database Connection Contention: Multiple threads compete for same connection
- Deadlock Scenarios: Threads block indefinitely waiting for database locks
- Zombie Thread Leaks: Timed-out threads continue running
I spent a full day removing threading timeout mechanisms. Hangs persisted.
Dead End #3: The SearchService Red Herring
Performance testing revealed 43% overhead from an unnecessary connection pooling layer:
# WRONG ARCHITECTURE:
MCP Server → SearchService → Connection Pool → SearchEngine → SearchDatabase
(43% overhead) (unnecessary)
# BETTER:
MCP Server → SearchEngine → SearchDatabase
(direct)
Why SearchService was wrong:
- Single-threaded application: MCP server processes one request at a time
- Pool overhead: Locking and validation for no concurrency benefit
- Complex layering: 300+ lines of pooling code that wasn’t needed
Claude Code confidently suggested: “The connection pooling is definitely the issue. Remove it and your hangs will disappear.” Another premature declaration of victory. While removing it improved performance, the cold start hangs remained.
Dead End #4: The Query Optimisation Rabbit Hole
Convinced the problem was SQL performance, I optimised DuckDB queries:
-- WRONG: Correlated subquery anti-pattern
SELECT r.*,
COALESCE(
(SELECT MAX(fts_main_rules.match_bm25(rule_id, ?))
FROM rules r2 WHERE r2.rule_id = r.rule_id), -- RUNS N TIMES!
0.0
) as relevance_score
FROM rules r WHERE (conditions)
-- BETTER: CTE approach
WITH fts_scores AS (
SELECT rule_id, MAX(score) as relevance_score
FROM (SELECT rule_id, fts_main_rules.match_bm25(rule_id, ?) AS score FROM rules) fts
GROUP BY rule_id
)
SELECT r.*, COALESCE(fs.relevance_score, 0.0) as score
FROM rules r JOIN fts_scores fs ON r.rule_id = fs.rule_id
This improved performance by 40-60%, but the hanging problem persisted. Good optimisation, wrong problem.
The Actual Root Cause
After three days of dead ends, I noticed the pattern more carefully. This required pure stubbornness, Claude Code was no help in this regard at all!
The problem was a cold start initialisation race condition:
- MCP server starts up and declares itself “ready”
- Client immediately sends first query
- Database is still warming up (connection pool, indexes, cache)
- Query hits partially initialised database state
- Response gets lost in the async pipeline during warmup
- Client waits forever for a response that will never arrive
- Second query works because database is now fully ready
This wasn’t a performance issue - DuckDB was returning results in milliseconds. The response was simply getting lost in the uninitialised async pipeline.
The Comprehensive Fix
The server needed proper initialisation sequencing:
class DetectionNexusMcpServer:
def __init__(self):
# ... basic setup ...
# CRITICAL: Don't declare ready until database is warmed up
self._database_ready = False
self._initialization_error = None
# 4-phase warmup sequence
self._warmup_database()
def _warmup_database(self) -> bool:
try:
# Phase 1: Health check validation
health_status = self._search_database.health_check()
# Phase 2: Index warming (touch all major indexes)
self._warm_indexes()
# Phase 3: Search engine cache warming
self._warm_search_cache()
# Phase 4: Performance validation
self._validate_performance()
self._database_ready = True
return True
except Exception as e:
self._initialization_error = str(e)
return False
def search_rules(self, query: str, limit: int = 10):
# Check readiness before processing ANY requests
if not self._database_ready:
if self._initialization_error:
raise RuntimeError(f"Database initialization failed: {self._initialization_error}")
raise RuntimeError("Database still initializing, please wait")
# Now proceed with actual search
return self._search_engine.search(query, limit)
Database Health Checks That Actually Work
def health_check(self) -> Dict[str, Any]:
"""Comprehensive health check covering all failure modes"""
# Test actual query performance, not just connectivity
start_time = time.time()
try:
# Verify indexes are loaded and responsive
result = self.connection.execute("SELECT COUNT(*) FROM rules").fetchone()
index_time = (time.time() - start_time) * 1000
# Check WAL file status (can indicate lock issues)
wal_check = self._check_wal_status()
# Validate schema accessibility
schema_check = self._validate_schema_tables()
# Performance threshold check
if index_time > 1000: # More than 1 second is concerning
return {"status": "degraded", "reason": f"Slow index response: {index_time:.1f}ms"}
return {
"status": "healthy",
"index_response_time": f"{index_time:.1f}ms",
"wal_status": wal_check,
"schema_tables": schema_check
}
except Exception as e:
return {"status": "unhealthy", "error": str(e)}
Index Warmup Strategy
def _warm_indexes(self):
"""Pre-load database indexes into memory"""
warmup_queries = [
"SELECT COUNT(*) FROM rules",
"SELECT COUNT(*) FROM rules WHERE provider = 'splunk'",
"SELECT COUNT(*) FROM rules WHERE enabled = true",
"SELECT COUNT(*) FROM mitre_techniques"
# Add queries that touch your major indexes
]
for query in warmup_queries:
self.connection.execute(query).fetchone()
Results
Before (Broken):
- First query after restart: Hang indefinitely ❌
- User experience: Frustrating retry-until-it-works ❌
After (Fixed):
- Server startup: 60-77ms with comprehensive warmup ✅
- First query after restart: 2ms response time ✅
- User experience: Reliable every time ✅
Key Lessons
-
Claude Code declares victory too quickly. When Claude Code says “Problem solved!”, test thoroughly. Complex race conditions often require pushing back and insisting on deeper investigation.
-
Connection != ready. Your database connection being established doesn’t mean indexes are loaded and caches are warm.
-
“Works on retry” = race condition. This pattern is a dead giveaway you’re dealing with an initialisation race.
-
Performance optimisations don’t fix architectural problems. CTE queries and removing connection pooling improved performance but didn’t address the core race condition.
-
Fix the cause, not the symptom. Disabling features to avoid problems creates technical debt.
-
Test cold starts rigorously. Always restart your server between tests during development.
Architecture Anti-Patterns That Fooled Me
- Threading with shared database connections: Never use threading timeout mechanisms with database connections
- Connection pooling for single-threaded applications: Adds overhead without benefit
- Disabling features to avoid problems: Usually masks the real issue
- Optimising the wrong layer: SQL query performance won’t fix initialisation race conditions
Quick Implementation Checklist
If you’re building async servers with database backends:
- Implement proper readiness checks - don’t declare ready until your database actually is
- Add comprehensive health validation - test query performance, not just connectivity
- Include index warmup - pre-load critical indexes during startup
- Test cold start scenarios - restart your server between tests
- Monitor initialisation timing - log warmup phases to identify bottlenecks
- Push back on premature victory declarations - test thoroughly even when tools say “fixed”
Security note: Be careful with health check endpoints - they can leak database schema information if not properly secured.
The fix eliminated the cold start race condition entirely. No more mysterious hangs, no more debugging sessions that stretch into the night. The real lesson? Understanding that “server ready” and “database ready” are two different states that need proper synchronisation - and having the persistence to find the actual root cause even when your debugging assistant declares premature victory.