Index Writer Settings

Fine-tune index writer performance for your workload.

Memory Settings

// Total heap allocated for indexing
spark.conf.set("spark.indextables.indexWriter.heapSize", "100M")

// Maximum buffer size before flush (prevents 100MB native limit)
spark.conf.set("spark.indextables.indexWriter.maxBatchBufferSize", "90M")

Memory Sizing
  • Start with 100M heap for most workloads
  • Increase to 500M-1G for large documents
  • Keep maxBatchBufferSize below 100M to avoid native errors
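The sizing rules above can be sanity-checked before applying them. A minimal sketch, assuming a hypothetical `parseSize` helper (not part of the library) that converts size strings like "100M" into bytes:

```scala
// Hypothetical helper: parse size strings like "100M" or "1G" into bytes,
// then check the buffer stays under both the heap and the 100MB native limit.
def parseSize(s: String): Long = {
  val num = s.dropRight(1).toLong
  s.last.toUpper match {
    case 'K' => num * 1024L
    case 'M' => num * 1024L * 1024L
    case 'G' => num * 1024L * 1024L * 1024L
    case _   => s.toLong
  }
}

val heapSize    = parseSize("100M")
val bufferMax   = parseSize("90M")
val nativeLimit = 100L * 1024 * 1024

// Buffer must stay below both the heap and the native limit.
assert(bufferMax < heapSize && bufferMax < nativeLimit)
```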

Batching

// Documents per batch (affects memory usage and latency)
spark.conf.set("spark.indextables.indexWriter.batchSize", "10000")

Batch Size | Memory | Latency | Use Case
1,000      | Low    | Low     | Small documents
10,000     | Medium | Medium  | Default
50,000     | High   | High    | Large bulk loads
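The table can be encoded as a small lookup whose result feeds `spark.conf.set("spark.indextables.indexWriter.batchSize", ...)`. The workload names below are illustrative, not part of the library:

```scala
// Hypothetical lookup mirroring the table above; the chosen value would be
// passed as the indexWriter.batchSize setting.
val batchSizes: Map[String, Int] = Map(
  "small-docs" -> 1000,   // low memory, low latency
  "default"    -> 10000,
  "bulk-load"  -> 50000   // high memory, high latency
)

val chosen = batchSizes("default")
assert(chosen == 10000)
```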

Threading

// Parallel indexing threads
spark.conf.set("spark.indextables.indexWriter.threads", "2")

A good starting point is 1-2 indexing threads per executor core.
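The per-core recommendation can be turned into a sizing rule. This is a sketch under stated assumptions: the helper name and the upper cap are ours, not the library's:

```scala
// Hypothetical sizing rule for 1-2 indexing threads per executor core.
// The cap of 16 is a defensive assumption, not a documented limit.
def indexWriterThreads(executorCores: Int, perCore: Int = 1): Int =
  math.max(1, math.min(executorCores * perCore, 16))

// e.g. a 4-core executor with 2 threads per core -> 8 threads
val threads = indexWriterThreads(4, perCore = 2)
```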

Split Conversion

// Parallelism for tantivy -> split conversion
spark.conf.set("spark.indextables.splitConversion.maxParallelism", "4")
// Default: max(1, availableProcessors / 4)
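The documented default, max(1, availableProcessors / 4), can be sketched directly so the behavior at low core counts is clear:

```scala
// The default split-conversion parallelism formula from the comment above.
// Integer division, floored at 1 so small machines still get one worker.
def defaultSplitParallelism(availableProcessors: Int): Int =
  math.max(1, availableProcessors / 4)

// 16 cores -> 4 workers; 2 cores -> 1 worker
```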

Statistics

// Truncate long text in stats (prevents transaction log bloat)
spark.conf.set("spark.indextables.stats.truncation.enabled", "true")
spark.conf.set("spark.indextables.stats.truncation.maxLength", "32")

// Data skipping statistics
spark.conf.set("spark.indextables.dataSkippingNumIndexedCols", "32")
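The truncation behavior can be illustrated with a minimal sketch; the function below is hypothetical and only models the setting's effect, not the library's internals:

```scala
// Hypothetical model of stats truncation: values longer than maxLength are
// cut before being written to the transaction log; disabled passes through.
def truncateStat(value: String, maxLength: Int = 32, enabled: Boolean = true): String =
  if (enabled && value.length > maxLength) value.take(maxLength)
  else value
```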

All Settings Reference

Setting                        | Default | Description
indexWriter.heapSize           | 100M    | Memory for indexing
indexWriter.batchSize          | 10000   | Documents per batch
indexWriter.maxBatchBufferSize | 90M     | Max buffer before flush
indexWriter.threads            | 2       | Indexing threads
splitConversion.maxParallelism | auto    | Split conversion parallelism
stats.truncation.enabled       | true    | Truncate long values
stats.truncation.maxLength     | 32      | Max chars for stats
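All of the settings above can be applied together from one map, which keeps the configuration in a single place. The values are the defaults from the table; adjust them for your workload:

```scala
// Apply the full set of writer settings (defaults from the table above).
val writerConf = Map(
  "spark.indextables.indexWriter.heapSize"           -> "100M",
  "spark.indextables.indexWriter.batchSize"          -> "10000",
  "spark.indextables.indexWriter.maxBatchBufferSize" -> "90M",
  "spark.indextables.indexWriter.threads"            -> "2",
  "spark.indextables.splitConversion.maxParallelism" -> "4",
  "spark.indextables.stats.truncation.enabled"       -> "true",
  "spark.indextables.stats.truncation.maxLength"     -> "32"
)
writerConf.foreach { case (k, v) => spark.conf.set(k, v) }
```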