Essential Settings

Key configuration settings for IndexTables.

Index Writer

// Heap size for indexing (supports G, M, K suffixes)
spark.conf.set("spark.indextables.indexWriter.heapSize", "100M")

// Documents per batch
spark.conf.set("spark.indextables.indexWriter.batchSize", "10000")

// Maximum batch buffer size (prevents native memory errors)
spark.conf.set("spark.indextables.indexWriter.maxBatchBufferSize", "90M")

// Indexing threads
spark.conf.set("spark.indextables.indexWriter.threads", "2")
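These settings take effect on subsequent writes. As a sketch (assuming an existing DataFrame `df` and the provider class shown in the Field Indexing section below; the path is illustrative):

```scala
// Writer settings are read from the session configuration at write time
spark.conf.set("spark.indextables.indexWriter.heapSize", "100M")
spark.conf.set("spark.indextables.indexWriter.batchSize", "10000")

df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .save("s3://bucket/my_index")
```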

Transaction Log

// Enable checkpoints for faster reads
spark.conf.set("spark.indextables.checkpoint.enabled", "true")
spark.conf.set("spark.indextables.checkpoint.interval", "10")

// Enable compression (default: true)
spark.conf.set("spark.indextables.transaction.compression.enabled", "true")
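With an interval of 10, a checkpoint is presumably written every tenth commit, so readers replay at most nine incremental log entries before reaching a checkpoint. For write-heavy tables, a tighter interval trades more checkpoint writes for faster reads (a sketch, assuming that tradeoff holds):

```scala
// Checkpoint more often on tables with frequent commits
spark.conf.set("spark.indextables.checkpoint.enabled", "true")
spark.conf.set("spark.indextables.checkpoint.interval", "5")
```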

Read Settings

// Default result limit when no LIMIT specified
spark.conf.set("spark.indextables.read.defaultLimit", "250")
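A read-side sketch of how this cap behaves (assuming an explicit `limit` overrides the configured default, and an illustrative path):

```scala
val hits = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("s3://bucket/my_index")

hits.show()              // results capped at the configured defaultLimit (250 here)
hits.limit(1000).show()  // an explicit LIMIT presumably overrides the default
```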

Working Directories

Working directories are auto-detected on Databricks and EMR; configure them manually elsewhere:

spark.conf.set("spark.indextables.indexWriter.tempDirectoryPath", "/local_disk0/temp")
spark.conf.set("spark.indextables.cache.directoryPath", "/local_disk0/cache")
spark.conf.set("spark.indextables.merge.tempDirectoryPath", "/local_disk0/merge-temp")

Field Indexing

df.write
.format("io.indextables.spark.core.IndexTables4SparkTableProvider")
// Field types
.option("spark.indextables.indexing.typemap.title", "string")
.option("spark.indextables.indexing.typemap.content", "text")
// Fast fields for aggregations
.option("spark.indextables.indexing.fastfields", "score,timestamp")
.save("s3://bucket/my_index")
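Fast fields are the columns available to aggregations. A hedged read-side sketch using the fields declared above (whether aggregations are pushed down to the index is an assumption about the provider):

```scala
import org.apache.spark.sql.functions._

val indexed = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("s3://bucket/my_index")

// score and timestamp were declared as fast fields at write time
indexed.agg(avg("score"), max("timestamp")).show()
```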

Quick Start Template

// Recommended settings for production
spark.conf.set("spark.indextables.indexWriter.heapSize", "200M")
spark.conf.set("spark.indextables.checkpoint.enabled", "true")
spark.conf.set("spark.indextables.checkpoint.interval", "10")