The silent buildIndex failure
Source: atrium/backend/lib/tasks.js (buildIndex with the silent skip) · atrium/backend/scripts/audit-tasks.js (the loud audit script that catches it)
Category: Pattern (Data & storage / observability)
Silent buildIndex failure: a failure mode specific to file-based data stores that build an in-memory id → path index at startup. If even one file fails to parse, the try/catch around the parser quietly skips it. The file still exists on disk and shows up in directory listings, but the API can’t find it because the index never knew about it.
What it is
A specific shape of bug:
- App boots, walks a data directory.
- For each file, parse it (YAML, JSON, whatever) and add id → filePath to an in-memory Map.
- Wrap the parse in try/catch so one bad file doesn’t crash the whole boot.
- On error, log a warn and move on.
- Done: index built, app starts serving.
The trap: nobody reads warn logs. The bad file is invisible to every API endpoint that uses the index. Listing endpoints that scan disk fresh still see the file (so it shows up in lists). Single-item lookups that go through the index return 404 for it. The asymmetry is what makes it confusing.
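The shape above can be reproduced in a few lines. This is a minimal sketch, not Atrium's actual buildIndex: it uses JSON.parse as a stand-in for the real parser so it runs on the Node standard library alone, and listTaskFiles is a hypothetical name for a listing path that scans disk without touching the index.

```javascript
// Sketch only: JSON.parse stands in for the real parser, and listTaskFiles
// is a hypothetical listing helper. Neither is taken from Atrium's code.
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Lookup path: parse every file once at boot and remember id -> filePath.
function buildIndex(dir, logger = console) {
  const index = new Map();
  for (const name of fs.readdirSync(dir)) {
    const filePath = path.join(dir, name);
    try {
      const { id } = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
      index.set(id, filePath);
    } catch (err) {
      // The silent skip: one warn line, then the file is forgotten.
      logger.warn(`skipping ${name}: ${err.message}`);
    }
  }
  return index;
}

// Listing path: scans disk fresh and never consults the index.
function listTaskFiles(dir) {
  return fs.readdirSync(dir).sort();
}

// Demo: one valid file, one truncated file.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'tasks-'));
fs.writeFileSync(path.join(dir, 'good.json'), '{"id":"good"}');
fs.writeFileSync(path.join(dir, 'bad.json'), '{"id":"bad",');

const demoIndex = buildIndex(dir, { warn() {} }); // warnings swallowed, as in practice
const listed = listTaskFiles(dir);

console.log(listed);                // [ 'bad.json', 'good.json' ]
console.log(demoIndex.has('good')); // true
console.log(demoIndex.has('bad'));  // false
```

The listing reports both files while the index only knows one of them, which is exactly the inconsistency an API consumer sees.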
Why it exists
The problem we hit: Atrium’s feat-terminal-claude-cli-001.md had duplicate YAML frontmatter keys (started_at and reviewed_at appeared twice). gray-matter threw on parse. buildIndex’s try/catch logged a logger.warn and moved on. The file kept appearing in GET /api/tasks (which uses a separate scanAllTasks walker) but GET /api/tasks/feat-terminal-claude-cli-001 returned 404 (“Task not found”). Symptoms looked like a routing bug; root cause was a one-line YAML typo nobody saw because the warning was buried.
The fix: convert silent skips into loud, queryable signals.
The recovery pattern
Three layers:
1. Audit script with non-zero exit
A standalone script that walks the same data directory using the same parser and reports every file that would fail. Exit non-zero so CI catches new corruption before it ships.
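```js
// backend/scripts/audit-tasks.js: runs on `npm run audit:tasks`
const fs = require('fs');
const matter = require('gray-matter');

// collectTaskFiles, TASKS_DIR, record, and totalIssues are defined elsewhere
// in the script (not shown here).
const files = collectTaskFiles(TASKS_DIR);
for (const filePath of files) {
  try {
    const { data, content } = matter(fs.readFileSync(filePath, 'utf-8'));
    // ...validation rules: required fields, valid status, format, duplicate ids
  } catch (err) {
    // Same failure buildIndex would swallow, but recorded instead of skipped.
    record('parse_error', filePath, err.message);
  }
}
// Non-zero exit fails the CI job when any issue was recorded.
process.exit(totalIssues === 0 ? 0 : 1);
```
2. Health endpoint that surfaces skip count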
Optional but recommended: a /api/health/index endpoint that returns the number of files skipped during the last buildIndex run. A monitoring alert on skipped_count > 0 catches drift in production.
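One way such an endpoint could look, as a sketch under assumptions: indexHealth, recordSkip, healthPayload, and handleHealthRequest are hypothetical names, and every response field other than skipped_count is invented for illustration. The idea is that buildIndex's catch block calls recordSkip in addition to logging the warning.

```javascript
// Sketch only: all names and the response shape (apart from skipped_count)
// are assumptions, not Atrium's actual implementation.

// State that buildIndex would fill in during boot.
const indexHealth = { indexed: 0, skipped: [] };

// Call this from buildIndex's catch block, next to the existing warn log.
function recordSkip(filePath, err) {
  indexHealth.skipped.push({ file: filePath, error: err.message });
}

function healthPayload() {
  return {
    status: indexHealth.skipped.length === 0 ? 'ok' : 'degraded',
    indexed_count: indexHealth.indexed,
    skipped_count: indexHealth.skipped.length,
    skipped_files: indexHealth.skipped,
  };
}

// Plain Node request handler; returns 404 for any other route.
function handleHealthRequest(req, res) {
  if (req.method === 'GET' && req.url === '/api/health/index') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(healthPayload()));
    return;
  }
  res.writeHead(404, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ error: 'Not found' }));
}
```

The handler can be passed to Node's http.createServer or adapted to whatever router the app already uses. Returning the skipped file paths alongside the count means the alert itself says which file to fix.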
3. Single-file fix script (or just edit the file)
When the audit finds a corrupted file, fix it. Since the file is unreachable through the API (which is what made it invisible in the first place), you have to edit it on disk directly, bypassing the usual “always go through the API” rule for this maintenance op.
When this pattern bites
Every file-based store is at risk:
- Markdown-as-database (see the related pattern)
- JSON-flat-file stores
- Per-user config directories
- Anything where directory listings and key-based lookups go through different code paths
If the listing path and the lookup path use the same indexed structure, the bad file disappears from both — easier to notice. The real trap is when listing scans disk fresh while lookup uses the index. The symptoms are inconsistent and the bug looks like routing.
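Because the trap is a disagreement between the disk and the index, it can be detected by comparing the two directly. A rough sketch, assuming a flat data directory and an index whose values are file paths; findUnindexedFiles is a hypothetical helper, not something from Atrium.

```javascript
// Sketch only: findUnindexedFiles is a hypothetical helper and assumes a
// flat directory whose entries are all data files.
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Returns files present on disk that no index entry points to.
function findUnindexedFiles(dir, index) {
  const indexedPaths = new Set(index.values());
  return fs
    .readdirSync(dir)
    .map((name) => path.join(dir, name))
    .filter((filePath) => !indexedPaths.has(filePath))
    .sort();
}

// Demo: two files on disk, but the index only knows about one of them.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'drift-'));
const okPath = path.join(demoDir, 'ok.md');
const skippedPath = path.join(demoDir, 'skipped.md');
fs.writeFileSync(okPath, 'ok');
fs.writeFileSync(skippedPath, 'broken');

const partialIndex = new Map([['ok', okPath]]);
const unindexed = findUnindexedFiles(demoDir, partialIndex);

console.log(unindexed.map((p) => path.basename(p))); // [ 'skipped.md' ]
```

A non-empty result means some file is visible to the listing path but unreachable through lookups.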
Gotchas / when not to apply
- Don’t remove the try/catch. A single corrupt file shouldn’t crash the whole boot. The try/catch is correct. The fix is making the consequence visible, not removing the safety net.
- Don’t blanket-fix to “fail loudly on boot.” That breaks production every time someone hand-edits a YAML file. The audit script + health endpoint give you observability without sacrificing resilience.
- Don’t re-architect to a database. If markdown-as-database is the right choice for the rest of the system, this failure mode is a design tradeoff to monitor, not a reason to migrate.