Storage Engine¶
The storage engine manages how data is organized on disk.
Page-Based Storage¶
DecentDb uses a page-based storage model where all data is stored in fixed-size pages.
Page Size¶
Default: 4096 bytes (4KB)
Valid sizes: 2048, 4096, 8192, 16384 bytes
Set at database creation and cannot be changed.
Why 4KB? - Matches most OS page sizes - Matches SSD block sizes - Good balance of I/O efficiency and memory usage - Low internal fragmentation for typical rows
Page Types¶
- Database Header Page (Page 1)
- Magic number, format version
- Page size, schema cookie
- Root page pointers
- Freelist head
-
Checksums
-
B+Tree Internal Pages
- Store routing keys
- Point to child pages
-
Compact varint-based cell format
-
B+Tree Leaf Pages
- Store actual data
- Variable-length values
-
Overflow pointers for large data
-
Overflow Pages
- Store values too large to fit inline in a B+Tree leaf
- Chained for very large values
-
Separate allocation from B+Tree
-
Freelist Pages
- Track free pages
- Reused for new allocations
- Reduces file growth
Record Format¶
Row Storage¶
Rows are stored as key-value pairs in B+Tree leaves:
Encoding¶
Row values are encoded as a small self-describing record:
This is defined in src/record/record.nim and ADR 0030.
Value kinds (payload encoding): - NULL: empty payload - BOOL: 1 byte (0/1) - INT64: ZigZag + varint-encoded uint64 (compact for small signed ints) - FLOAT64: 8 bytes (IEEE 754, little-endian) - TEXT/BLOB: raw bytes - TEXT/BLOB overflow pointers: 8-byte payload containing [overflow_page_id u32][overflow_len u32]
Some builds may also store opportunistically-compressed TEXT/BLOB payloads and transparently decompress on decode.
Overflow Handling¶
DecentDB uses overflow pages in two places:
1) Record value overflow (large TEXT/BLOB values stored out-of-line) 2) B+Tree value overflow (very large encoded records stored out-of-line)
Record value overflow¶
Large TEXT/BLOB payloads may be stored as an overflow chain. In that case, the record stores a small overflow pointer payload:
Overflow page layout¶
Overflow pages store chunks of bytes and link to the next page:
For record value overflow, overflow_len is the total length of the logical value bytes.
B+Tree value overflow¶
If the entire encoded record does not fit inline in a leaf cell, the leaf cell can store an overflow page id and the record bytes are stored out-of-line as an overflow chain.
B+Tree Structure¶
Node Layout¶
Leaf Page:
Each cell (FormatVersion 4):
control packs both “inline length” and “overflow pointer”: - is_overflow = control & 1 - value = control >> 1 - If is_overflow == 0: value is the inline payload length in bytes - If is_overflow == 1: value is the overflow page id (no inline payload)
Internal Page:
Each cell (FormatVersion 4):
Tree Operations¶
Search: 1. Start at root page 2. If internal page, find child pointer for key range 3. Repeat until leaf page 4. Scan leaf for exact key
Insert: 1. Find target leaf page 2. If space available, insert sorted 3. If full, split leaf into two pages 4. Update parent with new separator key 5. If root splits, create new root level
Delete: 1. Find target leaf page 2. Remove cell 3. Re-encode remaining cells 4. Note: Merge not implemented yet
Split: - Split point: Middle of sorted keys - Creates two roughly equal pages - Updates parent with separator key - May cascade up to root
Fanout¶
With 4KB pages: - Internal nodes: ~340 keys (12 bytes each) - Leaf nodes: ~50-100 entries (depends on value size) - Tree height: Usually 2-3 levels - Can store millions of rows with 2-3 disk reads
Page Cache¶
The pager maintains an in-memory cache of recently used pages.
Cache Management¶
Pin/Unpin: - Pages in active use are "pinned" - Pinned pages are not evicted - Operations must unpin when done
Dirty Tracking: - Modified pages marked "dirty" - Written to WAL immediately - Written to main file at checkpoint
Eviction: - Unpinned, clean pages can be evicted - Simple policy: First unpinned found - Future: LRU or clock algorithm
Cache Configuration¶
Sizing Guidelines: - Small DB (< 100MB): 1K-4K pages - Medium DB (100MB - 1GB): 4K-16K pages
- Large DB (> 1GB): 16K+ pages - Aim for 20-30% of working set in cache
Free Space Management¶
Freelist¶
Tracks pages that are no longer in use:
Allocation: 1. Check freelist first 2. If empty, extend file 3. Return page ID
Deallocation: 1. Add page to freelist 2. Clear page content 3. Available for reuse
B+Tree Compaction¶
Pages with low utilization can be rebuilt:
This: 1. Scans all entries 2. Builds new compact tree 3. Frees old pages to freelist
Storage Durability¶
Page Write Order¶
- Write dirty pages to WAL
- fsync WAL
- Mark transaction committed
- (Later) Copy to main file at checkpoint
Checksums¶
Each page has a CRC-32C checksum: - Verified on every read - Detects corruption immediately - Fail fast for data integrity
Recovery¶
On open: 1. Read database header 2. Verify header checksum 3. Scan WAL from last checkpoint 4. Apply committed frames 5. Discard uncommitted frames 6. Database ready
Storage Statistics¶
Monitor storage health:
# Database stats
decentdb exec --db=my.ddb --dbInfo --verbose
# Shows:
# - Page size
# - Total pages
# - Cache usage
# - WAL size
# - Free pages
Best Practices¶
- Choose appropriate page size
- 4KB for most workloads
- 8KB if many large rows
-
2KB for memory-constrained
-
Size cache appropriately
- Monitor hit rate
- Increase if many disk reads
-
Balance with other app memory
-
Checkpoint regularly
- Prevents large WAL
- Faster recovery
-
Reclaims WAL space
-
Monitor page utilization
- Rebuild indexes if < 50%
-
Check after bulk deletes
-
Use overflow for large data
- Don't store huge blobs in main table
- Consider file storage with paths
Further Reading¶
- B+Tree Details - Node structure, splits
- WAL & Recovery - Durability, checkpoints
- Configuration - Cache settings