DecentDb SPEC (Engineering Specification)¶
Date: 2026-01-27
Status: Draft (v0.1)
Note: This repo is past the initial milestone. This document describes the current 0.x (pre-1.0) baseline scope.
1. Overview¶
This document defines the baseline engineering design for DecentDb:
- Embedded DB engine in Nim
- Strong correctness via Python-driven testing harness + unit/property/crash tests
- ACID via WAL-based design
- Storage: paged file + B+Trees, with trigram inverted index for search
- Mandatory overflow pages for BLOB/large TEXT support
Current scope (0.x, pre-1.0): single process, multi-threaded readers, single writer.
2. Module architecture¶
2.1 Engine modules (Nim)¶
- vfs/
  - OS file I/O abstraction: open/read/write/fsync/lock (intra-process lock only)
  - “Faulty VFS” hooks for tests (partial writes, dropped fsync, crash points)
- pager/
  - Fixed-size page manager (default page size: 4096 bytes, configurable at DB creation)
    - 4KB aligns with typical SSD block sizes and OS page sizes
    - Larger pages (8KB, 16KB) reduce internal fragmentation for wide rows
    - Smaller pages (2KB) reduce memory pressure for cache-constrained environments
  - Page cache (pin/unpin, dirty tracking, eviction)
  - Page allocation + freelist
- wal/
  - WAL file append and recovery
  - Frame checksums, commit markers
  - WAL index (in-memory map pageId -> latest frame offset for fast reads)
- btree/
  - B+Tree implementation for tables and secondary indexes
  - Cursors and iterators
  - Node split handling (merge/rebalance optional post-1.0)
- record/
  - Record encoding/decoding: varint lengths + typed fields
  - Overflow pages for large TEXT/BLOB (baseline requirement - see ADR-0020)
- catalog/
  - System tables and schema management
  - Stores table/column/index/FK metadata (including constraints and index kind)
  - Schema versioning cookie
- sql/
  - Parse SQL (Postgres-like) via embedded parser or FFI (see §6)
  - Bind/resolve names
  - Lower into a logical plan
- planner/
  - Rule-based planner:
    - IndexSeek when predicates match indexed columns
    - NestedLoopJoin + index on the inner side
    - Prefer selective predicates first (text index candidates)
- exec/
  - Volcano (iterator) engine operators:
    - TableScan, IndexSeek, Filter, Project
    - NestedLoopJoin (0.x baseline)
    - Sort (External Merge Sort capable, see ADR-0022), Limit/Offset
  - Row materialization in reusable buffers (avoid per-row heap alloc)
- search/
  - Trigram index build + query evaluation
  - Posting list storage format (see §8)
2.2 Testing modules (Python)¶
tests/harness/ orchestrates:
- black-box SQL tests
- crash injection scenarios
- differential checks vs PostgreSQL for the supported subset
- dataset generators and benchmarks
3. File layout and formats¶
3.1 Files¶
- dbfile — main database file (paged)
- dbfile.wal — write-ahead log (append-only)
- (optional future) dbfile.lock / shared-memory region for multi-process
3.2 Main DB header (page 1)¶
Store:
- magic bytes (16 bytes: "DECENTDB" + padding)
- format version (u32)
- page size (u32)
- header checksum (CRC-32C of header fields)
- schema cookie (increment on schema change)
- root page ids for system catalog B+Trees
- freelist head pointer + count
- last checkpoint LSN / WAL checkpoint info
Header layout (128 bytes total):

| Offset | Size | Field |
| ------ | ---- | ----- |
| 0 | 16 | Magic bytes |
| 16 | 4 | Format version |
| 20 | 4 | Page size |
| 24 | 4 | Header checksum (CRC-32C) |
| 28 | 4 | Schema cookie |
| 32 | 4 | Root page ID for catalog |
| 36 | 4 | Root page ID for freelist |
| 40 | 4 | Freelist head pointer |
| 44 | 4 | Freelist count |
| 48 | 8 | Last checkpoint LSN |
| 56 | 72 | Reserved |
See ADR-0016 for checksum calculation details.
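For illustration, a minimal sketch of decoding this header in the Python harness. Little-endian byte order and the exact field packing are assumptions, and checksum verification (ADR-0016) is elided:

```python
# Illustrative parse of the 128-byte header per the layout table above.
# Little-endian packing is an assumption; checksum verification is elided.
import struct

HEADER_FMT = "<16sIIIIIIIIQ72x"  # fields in table order; struct size is 128 bytes

def parse_header(page1: bytes) -> dict:
    (magic, version, page_size, checksum, schema_cookie, catalog_root,
     freelist_root, freelist_head, freelist_count, checkpoint_lsn) = \
        struct.unpack(HEADER_FMT, page1[:128])
    if not magic.startswith(b"DECENTDB"):
        raise ValueError("not a DecentDb file")
    return {
        "version": version, "page_size": page_size,
        "schema_cookie": schema_cookie, "catalog_root": catalog_root,
        "freelist_root": freelist_root, "freelist_head": freelist_head,
        "freelist_count": freelist_count, "checkpoint_lsn": checkpoint_lsn,
    }
```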
Format version notes:
- v2 adds catalog-encoded column constraints (NOT NULL/UNIQUE/PK/FK) and index metadata (kind + unique flag).
- v1 databases are not auto-migrated; open fails with ERR_CORRUPTION until upgraded.
3.3 Page types¶
- B+Tree internal page
- B+Tree leaf page
- Overflow page (baseline requirement)
- Freelist trunk/leaf (or single freelist chain for 0.x)
3.4 Catalog record encoding (v2)¶
Catalog records are stored in a B+Tree keyed by CRC-32C of record names.
Table record:
- kind = "table"
- name
- root page id
- next row id
- columns encoded as name:type[:flags] with ; separators
- flags: nn, unique, pk, ref=parent_table.parent_column
Index record:
- kind = "index"
- name
- table
- column
- root page id
- index kind ("btree" or "trigram")
- unique flag (0/1)
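A sketch of the column-list encoding, assuming multiple flags on one column are comma-joined; the spec fixes only the name:type[:flags] shape and the ; separator:

```python
# Illustrative sketch of the v2 column-list encoding described above.
# Comma-joining of multiple flags is an assumption.
def encode_columns(columns: list[dict]) -> str:
    parts = []
    for col in columns:
        piece = f"{col['name']}:{col['type']}"
        flags = col.get("flags", [])  # e.g. ["nn", "pk", "ref=artist.id"]
        if flags:
            piece += ":" + ",".join(flags)
        parts.append(piece)
    return ";".join(parts)

# Example: a child table with a NOT NULL FK to artist.id
print(encode_columns([
    {"name": "id", "type": "int", "flags": ["pk"]},
    {"name": "artistId", "type": "int", "flags": ["nn", "ref=artist.id"]},
    {"name": "name", "type": "text"},
]))
# -> "id:int:pk;artistId:int:nn,ref=artist.id;name:text"
```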
4. Transactions, durability, and recovery (WAL-only)¶
4.1 WAL frame format (0.x baseline)¶
Each frame appends:
- frame_type (u8): 0=page, 1=commit, 2=checkpoint
- page_id (u32, valid for page frames)
- payload_size (u32)
- payload (page image or commit metadata)
- frame_checksum (u64, CRC-32C of header + payload)
- lsn (u64, monotonically increasing)
Frame Types:
- PAGE (0): contains modified page data
  - page_id: ID of the page
  - payload: full page image
- COMMIT (1): transaction commit marker
  - page_id: 0 (unused)
  - payload: transaction_id (u64), timestamp (u64)
- CHECKPOINT (2): checkpoint completion marker
  - page_id: 0 (unused)
  - payload: checkpoint_lsn (u64)
Commit rule:
- A transaction is committed when a COMMIT frame is durably written.
- Default durability: fsync(wal) on commit.
Torn Write Detection:
- Frame header includes payload_size for validation
- Recovery ignores incomplete frames (checksum mismatch or size truncation)
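A minimal harness-style sketch of this frame format and its torn-write checks. Assumptions: little-endian packing, the lsn is treated as part of the checksummed header, and zlib.crc32 stands in for CRC-32C:

```python
# Minimal sketch of the frame layout described above (field order and sizes
# follow the spec; byte order and lsn placement are assumptions).
import struct, zlib  # zlib.crc32 stands in for CRC-32C here

HEADER = struct.Struct("<BIIQ")  # frame_type u8, page_id u32, payload_size u32, lsn u64

def encode_frame(frame_type: int, page_id: int, payload: bytes, lsn: int) -> bytes:
    header = HEADER.pack(frame_type, page_id, len(payload), lsn)
    checksum = zlib.crc32(header + payload)  # spec: CRC-32C of header + payload
    return header + payload + struct.pack("<Q", checksum)

def decode_frame(buf: bytes, offset: int):
    """Return (frame, next_offset), or None for a torn/incomplete frame."""
    if offset + HEADER.size > len(buf):
        return None                          # truncated header
    frame_type, page_id, payload_size, lsn = HEADER.unpack_from(buf, offset)
    end = offset + HEADER.size + payload_size + 8
    if end > len(buf):
        return None                          # size truncation: ignore frame
    payload = buf[offset + HEADER.size : end - 8]
    (stored,) = struct.unpack_from("<Q", buf, end - 8)
    if stored != zlib.crc32(buf[offset : offset + HEADER.size] + payload):
        return None                          # checksum mismatch: ignore frame
    return (frame_type, page_id, lsn, payload), end
```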
4.2 Snapshot reads¶
On read transaction start:
- capture snapshot_lsn = wal_end_lsn using an atomic load with acquire semantics (AtomicU64.load_acquire())
- this ensures the reader sees a consistent point-in-time view without acquiring locks
- when reading page P:
  - if the WAL has a version of P with lsn <= snapshot_lsn, return the latest such frame
  - else return the page from the main DB file
Maintain in-memory walIndex: page_id -> list/latest frame offsets (store latest; optionally chain for multiple versions during long read txns).
Atomicity guarantee: The wal_end_lsn is an AtomicU64 that is incremented only after the WAL frame is fully written and the in-memory index is updated. Readers use load_acquire() to ensure they see all prior writes.
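A sketch of the resulting read path, assuming a walIndex that chains (lsn, frame offset) versions per page, newest first; names are illustrative, not the engine's actual API:

```python
# Snapshot read path sketch per the rules above.
from typing import Optional

class Snapshot:
    def __init__(self, wal_index: dict[int, list[tuple[int, int]]], snapshot_lsn: int):
        self.wal_index = wal_index        # page_id -> [(lsn, frame_offset), ...] newest first
        self.snapshot_lsn = snapshot_lsn  # captured via load_acquire() at txn start

    def locate_page(self, page_id: int) -> Optional[int]:
        """Return the WAL frame offset to read, or None to read the main DB file."""
        for lsn, frame_offset in self.wal_index.get(page_id, []):
            if lsn <= self.snapshot_lsn:  # newest version visible to this snapshot
                return frame_offset
        return None                        # no visible WAL version: fall back to db file
```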
4.3 Checkpointing and WAL Retention¶
Checkpointed pages are copied to the main database file.
Reader Protection Rule (ADR-0019):
- The WAL must never be truncated if it contains frames required by an active reader (i.e., frame LSN > min(active_reader_snapshot_lsn)).
- If a reader falls far behind, the WAL file may grow indefinitely until the reader finishes.
Checkpoint Protocol:
1. Set the checkpoint_pending flag to block new write transactions.
2. Copy committed pages to the main DB file (up to the last commit).
3. Determine safe_truncate_lsn = min(active_readers_snapshot_lsn); if there are no readers, safe_truncate_lsn = last_commit_lsn (see the sketch after this list).
4. Write a CHECKPOINT frame to the WAL.
5. Conditionally truncate:
   - If safe_truncate_lsn allows, truncate the WAL file.
   - If readers are blocking truncation, the WAL remains large until the next checkpoint opportunistically truncates it.
6. Clear the checkpoint_pending flag.
Forced Checkpoint (Timeout):
- If we must checkpoint while readers are active, proceed with the copy based on the writer's LSN, but skip the WAL truncation step for any portion needed by readers.
WAL Growth Prevention (see ADR-0024):
- Implement reader tracking to monitor active readers and their snapshot LSNs
- Introduce a configurable timeout for long-running readers to prevent indefinite WAL growth
- Log warnings when readers hold snapshots for extended periods
- Optionally force WAL truncation with appropriate safeguards when readers become too stale
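A sketch of the safe_truncate_lsn computation from steps 3 and 5 above; the function shape is illustrative:

```python
# Step 3 of the checkpoint protocol: compute safe_truncate_lsn from tracked
# reader snapshots, per the reader protection rule.
def safe_truncate_lsn(active_reader_snapshots: list[int], last_commit_lsn: int) -> int:
    if not active_reader_snapshots:
        return last_commit_lsn           # no readers: everything up to last commit is safe
    return min(active_reader_snapshots)  # never truncate frames a reader may still need
```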
4.4 Bulk Load API¶
Dedicated API for high-throughput data loading with deferred durability.
Key characteristics:
- Holds the single-writer lock for the entire duration
- Batches rows and fsyncs only at configured intervals
- Readers see a consistent pre-load state (snapshot isolation)
- A crash during bulk load loses all progress (no partial commits)
See ADR-0017 for detailed design.
Configuration:
- batch_size: rows per batch (default: 10000)
- sync_interval: batches between fsyncs (default: 10)
- disable_indexes: skip index updates during load (default: true; rebuild after)
- checkpoint_on_complete: checkpoint after the load finishes (default: true)
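Hypothetical usage, for illustration only; the bulk_load/add_row names are placeholders, not the actual API (see ADR-0017):

```python
# Illustrative bulk load call using the configuration above.
# Assumes an open handle `db` and an iterable `rows`; names are hypothetical.
with db.bulk_load(batch_size=10000, sync_interval=10,
                  disable_indexes=True, checkpoint_on_complete=True) as loader:
    for row in rows:
        loader.add_row("track", row)   # buffered; fsync every 10 batches
# On success: indexes rebuilt, checkpoint performed.
# On crash mid-load: all progress is lost by design (no partial commits).
```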
4.5 Crash recovery¶
On open:
- scan the WAL from the last checkpoint
- validate frame checksums
- apply frames up to the last commit boundary into the walIndex view
- the DB becomes readable immediately using the WAL overlay
- optional: perform a checkpoint soon after open
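A sketch of this recovery scan, reusing decode_frame() from the §4.1 sketch; pages become visible only at commit boundaries, and uncommitted frames are dropped at the first torn frame or EOF:

```python
# Recovery scan sketch: replay WAL frames into an in-memory walIndex.
PAGE, COMMIT, CHECKPOINT = 0, 1, 2

def recover(wal_bytes: bytes, start_offset: int = 0) -> dict[int, int]:
    wal_index: dict[int, int] = {}   # page_id -> latest committed frame offset
    pending: dict[int, int] = {}     # frames of the txn currently being replayed
    offset = start_offset
    while (result := decode_frame(wal_bytes, offset)) is not None:
        (frame_type, page_id, lsn, _payload), next_offset = result
        if frame_type == PAGE:
            pending[page_id] = offset
        elif frame_type == COMMIT:
            wal_index.update(pending)  # txn is durable: publish its pages
            pending.clear()
        # CHECKPOINT frames carry no page data; nothing to replay
        offset = next_offset
    # A torn/invalid frame or EOF stops the scan; 'pending' (uncommitted) is dropped.
    return wal_index
```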
5. Concurrency model (single process)¶
5.1 Writer¶
- Exactly one active writer transaction at a time
- Recommended: a single “writer thread” owning write txn state (actor pattern)
5.2 Readers¶
- Multiple concurrent readers
- Each reader uses snapshot_lsn and does not block on writer except for brief schema locks
5.3 Locks and latches (0.x baseline)¶
- schemaLock: RW lock around catalog changes
- Page cache: per-page latch + global eviction lock
- WAL append: mutex for serializing frame writes (enforces single-writer)
Note: The WAL append mutex enforces the single-writer constraint at the storage layer. While the architecture supports only one writer transaction at a time by design, the mutex provides defense-in-depth against programming errors.
Multi-process locking is out of scope for 0.x.
5.4 Deadlock detection and prevention¶
- Writer acquires locks in a consistent order to avoid deadlocks
- Readers never block other readers (snapshot isolation)
- If a deadlock is detected (via timeout), abort the newer transaction
- Document lock acquisition order for all operations
6. SQL parsing & compatibility (Postgres-like)¶
6.1 Parser choice¶
0.x recommendation:
- Use a Postgres-compatible parser via FFI (e.g., libpg_query) to accept familiar syntax.
- Normalize parse trees into DecentDb’s internal AST immediately.
Alternative:
- Use Nim-native parsesql for faster iteration, then migrate to libpg_query later.
6.2 Supported SQL subset (0.x baseline)¶
- DDL: CREATE TABLE, CREATE INDEX, DROP TABLE, DROP INDEX
- DML: SELECT, INSERT, UPDATE, DELETE
- Aggregate functions: COUNT(*), COUNT(col), SUM(col), AVG(col), MIN(col), MAX(col) with GROUP BY and HAVING
- Joins: LEFT JOIN, INNER JOIN on equality predicates
- Filters: basic comparisons, boolean ops, LIKE
- Ordering: ORDER BY (multi-column), LIMIT, OFFSET
6.3 Parameterization¶
- $1, $2, ... positional (Postgres style) — chosen for the 0.x baseline
  - Consistent with the libpg_query parser choice
  - Simple to implement and test
  - Well-understood by developers familiar with PostgreSQL
- Document and test thoroughly.
7. Relational features¶
7.1 Primary keys¶
- Default rowid PK if none specified
- Unique index enforces PK uniqueness
7.2 Foreign keys¶
- Enforced at statement time (Note: the SQL standard's default is also immediate checking; deferred checking at transaction commit is an optional mode)
- Requires index on referenced parent key (enforced by CREATE TABLE)
- Auto-create index on child FK columns if not present
  - Index name: fk_<table>_<column>_idx
  - This ensures FK checks are efficient and avoids full table scans
Statement-time enforcement means:
- Each INSERT/UPDATE/DELETE statement validates FK constraints immediately
- Violations cause an immediate error and statement rollback
- This matches the default behavior of PostgreSQL/MySQL; only constraints declared DEFERRABLE INITIALLY DEFERRED (not supported in 0.x) defer validation to COMMIT
0.x actions:
- RESTRICT / NO ACTION on delete/update
Optional later:
- CASCADE, SET NULL
- Deferred constraint checking (at transaction-commit time)
8. Trigram substring search index¶
8.1 Why trigrams¶
LIKE '%pattern%' cannot use a normal B+Tree index efficiently. Trigrams allow:
- candidate retrieval via an inverted index
- intersection of posting lists
- final verification using an actual substring match
8.2 Index data model¶
For each indexed TEXT column:
- Normalize to uppercase (configurable; 0.x assumes uppercase inputs are already normalized)
- Generate trigrams across a canonical form (define whitespace/punctuation handling)
Store:
- trigram (3-byte token or packed u32)
- postings list: sorted list of row ids (or record ids)
- optional payload: store albumId with trackId for fast join pruning
8.3 Query evaluation¶
Given a pattern:
- if len(pattern) < 3: do not use trigrams (require additional filters or fall back with a cap)
- generate the trigrams of the pattern
- order trigrams by increasing postings frequency
- intersect progressively until the candidate set <= threshold
- verify each candidate by substring match
Case Sensitivity Rules:
- The trigram index stores uppercase-normalized trigrams
- Patterns are uppercased before query processing
- LIKE '%abc%' and LIKE '%ABC%' use the same trigram index entries
- Case-insensitive matching is the default for trigram-indexed columns
- For case-sensitive search, use a B+Tree index with exact match predicates
Pattern length guardrails:
- Patterns < 3 characters: never use the trigram index (too many matches)
- Patterns 3-5 characters: use trigrams only if combined with other filters or if the rarest trigram count < threshold
- Patterns > 5 characters: use the trigram index, but cap results if the rarest trigram exceeds the threshold
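A sketch of trigram extraction plus the progressive intersection and guardrails above; normalization details are an assumption, since the canonical form is still to be defined:

```python
# Trigram extraction and rarest-first intersection per the rules above.
from typing import Optional

def trigrams(text: str) -> set[str]:
    s = text.upper()                       # index stores uppercase-normalized trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def candidates(pattern: str, postings: dict[str, list[int]],
               threshold: int = 100_000) -> Optional[list[int]]:
    """Return candidate row ids, or None if the trigram index should not be used."""
    grams = trigrams(pattern)
    if not grams:                          # pattern shorter than 3 chars
        return None
    ordered = sorted(grams, key=lambda g: len(postings.get(g, [])))
    if len(postings.get(ordered[0], [])) > threshold:
        return None                        # rarest trigram too broad: require another filter
    result = set(postings.get(ordered[0], []))
    for g in ordered[1:]:                  # intersect cheapest-first
        result &= set(postings.get(g, []))
        if not result:
            break
    return sorted(result)                  # each candidate still needs substring verification
```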
8.4 Broad-pattern guardrails¶
Maintain a postings count per trigram.
- If the rarest trigram count exceeds a threshold (e.g., >100k), require an additional predicate or cap results.
- Provide an engine setting for thresholds.
8.5 Storage format for postings (0.x baseline)¶
- Store postings lists in a dedicated B+Tree keyed by trigram:
  - key: trigram
  - value: compressed postings blob (delta-encoded varints)
- Updates (0.x choice: in-memory buffers):
  - Maintain small in-memory buffers per trigram (max buffer size: 4KB)
  - Flush buffers to the B+Tree during transaction commit
  - If a buffer exceeds its size, flush immediately and create a new buffer
  - This provides bounded write amplification and a simple implementation
Postings compression:
- Use delta encoding between consecutive row IDs
- Encode deltas as varints (smaller deltas = fewer bytes)
- Typical music library IDs are sequential, yielding high compression ratios
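A sketch of the delta + varint encoding; unsigned LEB128 is an assumption, the spec only says delta-encoded varints:

```python
# Delta + varint postings encoding per the description above.
def encode_postings(row_ids: list[int]) -> bytes:
    out = bytearray()
    prev = 0
    for rid in sorted(row_ids):
        delta = rid - prev                 # consecutive ids -> small deltas
        prev = rid
        while True:                        # unsigned LEB128-style varint
            byte = delta & 0x7F
            delta >>= 7
            out.append(byte | (0x80 if delta else 0))
            if not delta:
                break
    return bytes(out)

# Sequential ids compress to about one byte each:
assert encode_postings([100, 101, 102, 103]) == bytes([100, 1, 1, 1])
```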
Write amplification:
- Each INSERT/UPDATE may update multiple trigram postings
- Buffering reduces per-row write overhead
- Future optimization: batch multiple transactions before flush
9. Planner rules for target workload¶
For queries like artist→album→track:
- If artist.id = ?:
  - drive from an artist PK seek
  - then the album index (artistId, name, id)
  - then the track index (albumId, trackNumber, id)
- avoid a full sort by streaming in index order
For contains predicates:
- Use the trigram index to produce candidate sets first, then join outward.
- Prefer driving from the most selective (smallest) candidate set.
9.1 Index statistics (heuristic-based)¶
0.x uses simple heuristics without full statistics collection:
- Assume uniform distribution for equality predicates
- For trigram indexes: use actual posting list counts (stored in index metadata)
- For B+Tree indexes: estimate selectivity based on index type:
  - Primary key: assume high selectivity (1 row)
  - Unique index: assume high selectivity (1 row)
  - Non-unique index: assume moderate selectivity (0.1% of rows)
Future enhancement: Collect and maintain statistics (histograms, distinct counts) for cost-based optimization.
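As a sketch of these heuristics; the index kinds and the 0.1% figure come from the list above, while the function shape and names are illustrative:

```python
# 0.x selectivity heuristics per §9.1.
def estimated_rows(index_kind: str, table_rows: int, posting_count: int = 0) -> float:
    if index_kind in ("primary_key", "unique"):
        return 1.0                           # equality on a unique key: one row
    if index_kind == "trigram":
        return float(posting_count)          # actual count from index metadata
    return max(1.0, table_rows * 0.001)      # non-unique B+Tree: ~0.1% of rows
```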
10. Testing strategy (critical)¶
10.1 Test layers¶
- Pure unit tests (Nim)
  - Pager read/write, freelist correctness
  - WAL frame encoding, checksum verification
  - B+Tree invariants (ordering, search correctness, cursor iteration)
  - Trigram generation and postings encode/decode
- Property tests (Nim + Python)
  - Random sequences of operations preserve invariants
  - “index results == scan results”
  - “FK never violated” under random mutations
- Crash-injection tests (Python harness + Faulty VFS)
  - Kill at every critical write step during commit
  - Reopen and verify:
    - committed txns visible
    - uncommitted txns not visible
    - no structural corruption
- Differential tests (Python)
  - For the supported SQL subset, run each query on:
    - DecentDb
    - PostgreSQL
  - Compare result sets for deterministic queries (no timestamps/random)
10.2 Faulty VFS requirements¶
Test hooks must simulate:
- partial writes (write N bytes then fail)
- re-ordered writes (optional)
- dropped fsync (simulate fsync “success” without durability)
- abrupt crash at labeled “failpoints”
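A sketch of such hooks in the Python harness; class and hook names are illustrative:

```python
# Faulty VFS wrapper sketch: partial writes, dropped fsync, labeled failpoints.
import os

class CrashPoint(Exception):
    """Simulates abrupt process death at a labeled failpoint."""

class FaultyFile:
    def __init__(self, path: str, fail_at: str = "", partial_bytes: int = 0):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)
        self.fail_at = fail_at              # e.g. "before_commit_fsync" (hypothetical label)
        self.partial_bytes = partial_bytes  # if set, truncate the next write to N bytes

    def failpoint(self, label: str) -> None:
        if label and label == self.fail_at:
            raise CrashPoint(label)         # harness then reopens the DB and verifies

    def write(self, data: bytes) -> None:
        if self.partial_bytes:
            data = data[: self.partial_bytes]  # torn write: only a prefix lands on disk
        os.write(self.fd, data)

    def fsync(self, dropped: bool = False) -> None:
        if not dropped:                     # dropped=True: report success, no durability
            os.fsync(self.fd)
```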
10.3 CI requirements¶
- Run fast unit/property suites on every PR (Linux/Windows/macOS)
- Nightly jobs run extended crash + fuzz suites
- Coverage reporting target (project-specific; track trend)
11. Benchmarks and performance budgets¶
Define microbenchmarks:
- PK point lookup latency
- FK join expansion latency (artist->albums->tracks)
- Contains search (trigram) latency with typical patterns
- Bulk load throughput
Track:
- P50/P95 latency
- WAL size growth
- checkpoint time
- recovery time after crash
Memory Usage Budgets:
- Peak memory during queries: must not exceed 2x the configured cache size
- Query execution memory: abort if it exceeds max_query_memory (default: 64MB)
- Sort operations: External Merge Sort (ADR-0022); spills to disk if sort_buffer_size (default: 16MB) is exceeded. Max 64 spill files (approx. 1GB capacity with the default buffer).
- Join operations: use nested-loop with index; materialize only when necessary
- Memory tracking: report in benchmark results; fail if >20% over budget
12. Future compatibility: Npgsql / PostgreSQL wire protocol¶
Not planned for 0.x. If pursued:
- Implement a pgwire subset as a server endpoint
- Add minimal catalog responses for clients
- Maintain dialect compatibility with the libpg_query parser
13. Error handling¶
13.1 Error codes¶
Define error categories:
- ERR_IO: file I/O errors (disk full, permissions, corruption)
- ERR_CORRUPTION: database corruption detected (checksum mismatch)
- ERR_CONSTRAINT: constraint violation (FK, unique, NOT NULL)
- ERR_TRANSACTION: transaction errors (deadlock, timeout)
- ERR_SQL: SQL syntax or semantic errors
- ERR_INTERNAL: internal engine errors (should not occur)
13.2 Error propagation¶
- Errors propagate through the execution pipeline using Result types
- Each layer (SQL → planner → exec → storage) can add context
- Transaction rollback on errors:
  - Statement-level errors: rollback the current statement only
  - Transaction-level errors: rollback the entire transaction
  - Corruption errors: mark the database as read-only and require recovery
13.3 Error messages¶
- Include error code, human-readable message, and context
- For constraint violations: include table/column and violating value
- For corruption: include page ID and expected vs actual checksum
14. Memory management¶
14.1 Memory pools¶
- Page cache: fixed-size pool of page buffers (configurable, default: 1000 pages)
- Row materialization: reusable buffers per operator to avoid per-row alloc
- Trigram buffers: small per-trigram buffers (max 4KB each)
- Sort buffers: managed pool for external merge sort operations (see ADR-0025)
14.2 Memory limits¶
- Configurable maximum memory usage (default: 256MB)
- Page cache eviction when limit reached (LRU policy)
- Query execution aborts if memory limit exceeded during execution
- Per-query memory limits to prevent single queries from consuming all memory
14.3 Out-of-memory handling¶
- Pre-allocate memory at startup where possible
- If allocation fails:
  - Abort the current query with ERR_INTERNAL
  - Rollback the current transaction
  - Log the error with memory usage statistics
- Do not crash the process; allow graceful degradation
14.4 Memory leak prevention and monitoring¶
- Track memory allocations with tags for different subsystems
- Periodic scanning for unreleased resources
- Integration with leak detection tools during testing
- Connection-scoped cleanup for per-connection resources (see ADR-0025)
15. Schema versioning and evolution¶
15.1 Schema cookie¶
- Stored in main DB header (page 1)
- Incremented on any DDL operation (CREATE TABLE, CREATE INDEX, DROP)
- Used to detect schema changes and invalidate cached metadata
15.2 Backward compatibility¶
- File format version stored in header
- Engine can read older format versions (read-only compatibility)
- Writing to older formats may trigger automatic upgrade (with user confirmation)
15.3 Schema changes (0.x baseline)¶
- Supported: CREATE TABLE, CREATE INDEX, DROP TABLE, DROP INDEX, ALTER TABLE
- ALTER TABLE operations: ADD COLUMN, DROP COLUMN
- Not supported (post-1.0): RENAME COLUMN, MODIFY COLUMN, ADD CONSTRAINT
- Schema changes require exclusive lock (no active readers or writers)
15.4 Migration strategy¶
- When opening a database with an older schema version:
  - Read and validate the existing schema
  - Apply any necessary migrations (e.g., new system tables)
  - Update the schema cookie
- Migrations are idempotent and crash-safe
16. Configuration system¶
16.1 Configuration options¶
Database-level configuration (set at open time):
- page_size: 4096, 8192, or 16384 (default: 4096)
- cache_size_mb: maximum page cache size in MB (default: 4MB)
- wal_sync_mode: FULL (fsync), NORMAL (fdatasync), TESTING_ONLY_UNSAFE_NOSYNC (requires compile flag)
- checkpoint_timeout_sec: max wait for readers before a forced checkpoint (default: 30)
- trigram_postings_threshold: max postings before requiring additional filters (default: 100000)
- temp_dir: directory for temporary sort files (default: system temp)
Safety Note: TESTING_ONLY_UNSAFE_NOSYNC requires the engine to be compiled with -d:allowUnsafeSyncMode flag. This mode provides no durability guarantees and must never be used in production.
16.2 Runtime configuration¶
Some options can be changed at runtime:
- trigram_postings_threshold: adjust based on workload
- checkpoint_threshold_mb: adjust for write-heavy workloads
16.3 Configuration API¶
```
# Open database with configuration
db = open("dbfile", config{
  page_size: 8192,
  cache_size_mb: 16,
  wal_sync_mode: FULL
})

# Change runtime configuration
db.set_config("trigram_postings_threshold", 50000)
```
17. B+Tree space management¶
17.1 Node split¶
- When a B+Tree node exceeds capacity, split into two nodes
- Split point: middle key (balanced split)
- Propagate split up the tree as needed
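A sketch of the balanced middle-key split for a leaf node; the node representation is illustrative, and as usual for a B+Tree leaf split, the right node's first key is copied up as the separator:

```python
# Balanced middle-key split per §17.1, for a leaf holding sorted (key, value) pairs.
def split_leaf(entries: list) -> tuple:
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    separator = right[0][0]   # first key of the right node is copied up to the parent
    return left, right, separator
```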
17.2 Page utilization monitoring¶
- Track average page utilization per B+Tree
- If utilization drops below 50% (configurable), trigger compaction
- Compaction: rebuild B+Tree from scratch, freeing empty pages
17.3 Merge/rebalance (post-1.0)¶
- Not implemented in 0.x to simplify code
- Compaction provides equivalent space recovery
- Future: implement merge for delete-heavy workloads