Transaction Log
IndexTables uses a Delta Lake-style transaction log for atomic operations and time travel.
Overview
The transaction log is stored in the _transaction_log/ directory and contains JSON files that record all changes to the index.
s3://bucket/my_index/
  _transaction_log/
    00000000000000000001.json
    00000000000000000002.json
    00000000000000000003.checkpoint.json
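To make the naming scheme concrete, here is a minimal Python sketch that splits a log file name into its version number and kind. It assumes, based only on the listing above, that names are zero-padded version numbers followed by `.json` or `.checkpoint.json`. `parse_log_name` and `LogFile` are hypothetical helpers for illustration, not part of IndexTables.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the listing above:
# digits, an optional ".checkpoint" marker, then ".json".
_LOG_NAME = re.compile(r"^(\d+)(\.checkpoint)?\.json$")


class LogFile(NamedTuple):
    version: int
    is_checkpoint: bool


def parse_log_name(name: str) -> Optional[LogFile]:
    """Return the version and kind encoded in a log file name, or None if it does not match."""
    match = _LOG_NAME.match(name)
    if match is None:
        return None
    return LogFile(version=int(match.group(1)), is_checkpoint=match.group(2) is not None)
```

As long as every name is zero-padded to the same width, sorting the names as strings also sorts them by version.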
Transaction Types
AddAction
Records a new split being added:
{
  "add": {
    "path": "partition=2024-01-01/abc123.split",
    "size": 104857600,
    "stats": { "numRecords": 10000 }
  }
}
RemoveAction
Records a split being logically deleted:
{
  "remove": {
    "path": "partition=2024-01-01/abc123.split",
    "deletionTimestamp": 1704067200000
  }
}
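The following Python sketch illustrates how a reader could derive the current set of splits by replaying add and remove actions in order. It is an illustration of the general Delta Lake-style replay idea, not the IndexTables implementation. It assumes each action is available as one JSON string, and the second split path (`def456.split`) is made up for the example.

```python
import json
from typing import Dict, Iterable


def replay(actions: Iterable[str]) -> Dict[str, dict]:
    """Apply add/remove actions in order and return the live splits keyed by path."""
    live: Dict[str, dict] = {}
    for line in actions:
        action = json.loads(line)
        if "add" in action:
            live[action["add"]["path"]] = action["add"]
        elif "remove" in action:
            # A remove is logical: the split simply stops being part of the snapshot.
            live.pop(action["remove"]["path"], None)
    return live


log = [
    '{"add": {"path": "partition=2024-01-01/abc123.split", "size": 104857600, "stats": {"numRecords": 10000}}}',
    # Hypothetical second split, added only so that something remains after the remove.
    '{"add": {"path": "partition=2024-01-01/def456.split", "size": 52428800, "stats": {"numRecords": 5000}}}',
    '{"remove": {"path": "partition=2024-01-01/abc123.split", "deletionTimestamp": 1704067200000}}',
]
live = replay(log)
print(sorted(live))  # prints ['partition=2024-01-01/def456.split']
```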
Checkpoints
Checkpoints consolidate transaction log state for faster reads:
// Configure checkpoint interval
spark.conf.set("spark.indextables.checkpoint.enabled", "true")
spark.conf.set("spark.indextables.checkpoint.interval", "10")
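To see why a checkpoint can make reads faster, consider the Python sketch below. It assumes, as in Delta Lake, that a checkpoint at version N holds the consolidated state up to N, so a reader only needs the newest checkpoint plus any version files written after it. Whether IndexTables resolves files exactly this way is an assumption, and `files_to_read` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def files_to_read(names: List[str]) -> Tuple[Optional[str], List[str]]:
    """Pick the newest checkpoint and the version files written after it."""

    def version(name: str) -> int:
        # The version is the numeric prefix before the first dot.
        return int(name.split(".", 1)[0])

    checkpoints = [n for n in names if n.endswith(".checkpoint.json")]
    commits = [
        n for n in names
        if n.endswith(".json") and not n.endswith(".checkpoint.json")
    ]

    latest = max(checkpoints, key=version, default=None)
    start = version(latest) if latest is not None else -1
    later = sorted((n for n in commits if version(n) > start), key=version)
    return latest, later
```

Without a checkpoint the function returns every version file, which is the slow path that checkpoints are meant to avoid.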
Compression
Transaction log files are GZIP-compressed by default, which reduces their size by roughly 60-70%:
spark.conf.set("spark.indextables.transaction.compression.enabled", "true")
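If you inspect log files outside Spark, a table may hold a mix of compressed and uncompressed files. The Python sketch below handles both by checking for the standard gzip magic bytes. It assumes compressed files are ordinary gzip streams and that a file holds one JSON action per line; neither detail is confirmed here, and `read_actions` is a hypothetical helper.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def read_actions(raw: bytes) -> list:
    """Decode a log file body into a list of actions, decompressing gzip if present."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Assumption: one JSON action per non-empty line.
    return [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]
```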
SQL Commands
CHECKPOINT INDEXTABLES
Force a checkpoint at the current version. This consolidates transaction log state and upgrades the table to the latest protocol version.
CHECKPOINT INDEXTABLES 's3://bucket/my_index';
Use this to:
- Optimize read performance by creating a checkpoint
- Force protocol upgrade on existing tables
- Create a checkpoint at a time you choose, rather than waiting for the configured interval
TRUNCATE INDEXTABLES TIME TRAVEL
Remove all historical transaction log versions, keeping only the current state. After truncation, time travel to earlier versions is no longer possible.
-- Preview what would be deleted
TRUNCATE INDEXTABLES TIME TRAVEL 's3://bucket/my_index' DRY RUN;
-- Actually truncate
TRUNCATE INDEXTABLES TIME TRAVEL 's3://bucket/my_index';
This command:
- Creates a checkpoint at the current version (if none exists)
- Deletes all transaction log version files older than the checkpoint
- Deletes all older checkpoint files
- Preserves all data files (splits); only metadata is affected
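The selection rule in the list above can be sketched in Python as a dry-run planner over a directory listing. This is only an illustration of the described behavior, not the command's implementation. `truncation_plan` is a hypothetical helper, and it does not model the step that creates a checkpoint when none exists.

```python
from typing import List


def truncation_plan(names: List[str]) -> List[str]:
    """Return the log files a truncation would delete, given the directory listing."""

    def version(name: str) -> int:
        # The version is the numeric prefix before the first dot.
        return int(name.split(".", 1)[0])

    checkpoints = [n for n in names if n.endswith(".checkpoint.json")]
    if not checkpoints:
        # The real command would first create a checkpoint; that is not modeled here.
        return []

    keep = max(checkpoints, key=version)
    # Everything older than the newest checkpoint goes: version files and old checkpoints.
    return sorted(n for n in names if n != keep and version(n) < version(keep))
```

Data files never appear in the plan because only names from the `_transaction_log/` listing are considered.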
Use this to:
- Reduce transaction log storage overhead
- Clean up after many small write operations
- Prepare a table for archival (remove history)
Benefits
- Atomicity: Writes are all-or-nothing
- Consistency: Readers see consistent snapshots
- Durability: Committed writes survive failures
- Audit trail: Full history of changes