Purge-On-Write
Purge-on-write automatically cleans up orphaned splits and old transaction logs as part of write operations.
Configuration
// Enable purge-on-write
spark.conf.set("spark.indextables.purgeOnWrite.enabled", "true")
// Run after merge-on-write
spark.conf.set("spark.indextables.purgeOnWrite.triggerAfterMerge", "true")
// Run after N writes (0 = disabled)
spark.conf.set("spark.indextables.purgeOnWrite.triggerAfterWrites", "10")
// Retention periods
spark.conf.set("spark.indextables.purgeOnWrite.splitRetentionHours", "168") // 7 days
spark.conf.set("spark.indextables.purgeOnWrite.txLogRetentionHours", "720") // 30 days
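To make the two retention settings concrete, the sketch below converts the hour values into cutoff timestamps. It is only an illustration: `retentionCutoff` is a hypothetical helper, not part of IndexTables, and it assumes a file only becomes a purge candidate once it is older than the cutoff.

```scala
import java.time.{Duration, Instant}

object RetentionCutoffs {
  // Hypothetical helper (not an IndexTables API): the instant before which
  // a file counts as older than the given retention period.
  def retentionCutoff(now: Instant, retentionHours: Long): Instant =
    now.minus(Duration.ofHours(retentionHours))

  def main(args: Array[String]): Unit = {
    // Fixed reference time so the example is reproducible.
    val now = Instant.parse("2024-03-31T00:00:00Z")

    val splitCutoff = retentionCutoff(now, 168L) // 168 hours = 7 days earlier
    val txLogCutoff = retentionCutoff(now, 720L) // 720 hours = 30 days earlier

    println(s"split cutoff: $splitCutoff")
    println(s"txlog cutoff: $txLogCutoff")
  }
}
```

Under that assumption, with the values shown, an orphaned split written 6 days ago would be kept, while one written 8 days ago would be eligible for removal.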
Trigger Modes
- After merge-on-write: Runs automatically after each merge when triggerAfterMerge is "true"
- After N writes: Runs after the configured number of writes (triggerAfterWrites; 0 disables this trigger)
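The two trigger modes can be pictured as a small decision function. The sketch below is a hypothetical model, not the library's actual implementation: `PurgeTriggerConfig` and `PurgeTrigger` are illustrative names, and resetting the write counter after every purge is an assumption.

```scala
// Hypothetical model of the two trigger modes (illustrative names only).
final case class PurgeTriggerConfig(
    enabled: Boolean,
    triggerAfterMerge: Boolean,
    triggerAfterWrites: Int // 0 disables the write-count trigger
)

final class PurgeTrigger(config: PurgeTriggerConfig) {
  // Counts writes since the last purge decision that fired.
  private var writesSinceLastPurge = 0

  /** Call once per completed write; returns true when a purge should run. */
  def onWriteCompleted(mergeRan: Boolean): Boolean =
    if (!config.enabled) false
    else {
      writesSinceLastPurge += 1
      val afterMerge = config.triggerAfterMerge && mergeRan
      val afterWrites =
        config.triggerAfterWrites > 0 &&
          writesSinceLastPurge >= config.triggerAfterWrites
      val fire = afterMerge || afterWrites
      // Assumption: any purge resets the counter.
      if (fire) writesSinceLastPurge = 0
      fire
    }
}

object PurgeTriggerDemo {
  def main(args: Array[String]): Unit = {
    val trigger = new PurgeTrigger(
      PurgeTriggerConfig(enabled = true, triggerAfterMerge = true, triggerAfterWrites = 3)
    )
    // Three plain writes, then one write that also ran a merge.
    val decisions = Seq(false, false, false, true).map(trigger.onWriteCompleted)
    println(decisions)
  }
}
```

In this model the third write fires the count-based trigger and the fourth fires because a merge ran, so enabling both modes means a purge runs on whichever condition is met first.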
Usage
// Complete automatic table hygiene
df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .option("spark.indextables.mergeOnWrite.enabled", "true")
  .option("spark.indextables.purgeOnWrite.enabled", "true")
  .option("spark.indextables.purgeOnWrite.triggerAfterMerge", "true")
  .option("spark.indextables.purgeOnWrite.triggerAfterWrites", "20")
  .save("s3://bucket/path")
Safety Features
- Disabled by default
- Respects minimum retention periods
- Propagates credentials automatically
- Handles failures gracefully: a failed purge does not fail the write
- Write counters are per session and reset between sessions
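The "doesn't fail writes" guarantee can be illustrated with a wrapper that runs the purge step and reports, rather than rethrows, any error. This is a minimal sketch under assumptions: `runPurgeSafely` and its parameters are hypothetical names, and the real implementation may log or recover differently.

```scala
import scala.util.{Failure, Success, Try}

object SafePurge {
  /** Runs `purge` after a write has already succeeded.
    * Returns the number of deleted files, or None if the purge failed.
    * Note: Try only captures non-fatal exceptions; fatal errors such as
    * OutOfMemoryError still propagate.
    */
  def runPurgeSafely(purge: () => Int, log: String => Unit): Option[Int] =
    Try(purge()) match {
      case Success(deleted) =>
        log(s"purge removed $deleted files")
        Some(deleted)
      case Failure(e) =>
        // The write is already committed, so the failure is only reported.
        log(s"purge failed, write is unaffected: ${e.getMessage}")
        None
    }

  def main(args: Array[String]): Unit = {
    runPurgeSafely(() => 3, println)
    runPurgeSafely(() => throw new RuntimeException("storage unavailable"), println)
  }
}
```

Keeping the purge outside the write's success path means a transient storage or permission error during cleanup costs only a delayed purge, not a failed job.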
When to Use
- High-frequency write workloads
- Tables with frequent merge operations
- Long-running Spark applications
- Development/testing with rapid iteration