Complete reference of all IndexTables configuration settings. Settings can be configured as:

- **Write options**: `.option("spark.indextables.setting", "value")` on a DataFrameWriter
- **Read options**: `.option("spark.indextables.setting", "value")` on a DataFrameReader
- **Spark config**: `spark.conf.set("spark.indextables.setting", "value")`
- **Cluster properties**: set in your cluster configuration
## Preferred Configuration Method

For write operations, prefer using `.option()` on the DataFrameWriter rather than Spark properties. This makes the configuration explicit and avoids affecting other operations.
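For example, a sketch of both styles (the `indextables` format name, the DataFrame `df`, and the S3 path are illustrative assumptions):

```scala
// Preferred: explicit per-write configuration via .option()
df.write
  .format("indextables")                                        // format name assumed
  .option("spark.indextables.indexWriter.heapSize", "2G")
  .save("s3://bucket/path/table")

// Alternative: session-wide Spark config (affects every IndexTables operation)
spark.conf.set("spark.indextables.indexWriter.heapSize", "2G")
```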
## Index Writer

Settings that control how data is indexed during write operations.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexWriter.heapSize` | `100M` | Memory allocated for indexing (supports "100M", "2G" syntax) |
| `spark.indextables.indexWriter.batchSize` | `10000` | Documents per batch during indexing (1-1000000) |
| `spark.indextables.indexWriter.maxBatchBufferSize` | `90M` | Maximum buffer size before flush |
| `spark.indextables.indexWriter.threads` | `2` | Indexing threads per executor (1-16) |
| `spark.indextables.indexWriter.tempDirectoryPath` | `auto` | Working directory for index creation |
| `spark.indextables.splitConversion.maxParallelism` | `auto` | Parallelism for split conversion |
## Write Optimization

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.optimizeWrite.enabled` | `true` | Enable optimized write with automatic partitioning |
| `spark.indextables.optimizeWrite.targetRecordsPerSplit` | `1000000` | Target records per split file |
| `spark.indextables.autoSize.enabled` | `false` | Enable auto-sizing based on historical data |
| `spark.indextables.autoSize.targetSplitSize` | `100M` | Target split size (supports "100M", "1G" syntax) |
| `spark.indextables.autoSize.inputRowCount` | `estimated` | Explicit row count for V2 API |
## Shuffle-Based Write Optimization

Produces consistently sized splits by requesting a shuffle via Spark's RequiresDistributionAndOrdering interface and letting AQE coalesce partitions. See Optimized Writes for full details.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.write.optimizeWrite.enabled` | `false` | Enable shuffle-based write optimization |
| `spark.indextables.write.optimizeWrite.targetSplitSize` | `1G` | Target on-disk split size (supports "512M", "1G", "2G" syntax) |
| `spark.indextables.write.optimizeWrite.distributionMode` | `hash` | Distribution mode: "hash", "balanced", or "none" |
| `spark.indextables.write.optimizeWrite.maxSplitSize` | `4G` | Maximum split size before rolling (balanced mode only) |
| `spark.indextables.write.optimizeWrite.samplingRatio` | `1.1` | Split-to-shuffle-data ratio for size estimation |
| `spark.indextables.write.optimizeWrite.minRowsForEstimation` | `10000` | Minimum rows for history-based estimation |
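A hedged sketch of enabling shuffle-based optimized writes for a single write (the format name, `df`, and path are illustrative assumptions):

```scala
// Request shuffle-based optimization targeting ~1 GB splits,
// using balanced distribution so oversized groups roll at maxSplitSize
df.write
  .format("indextables")                                                // format name assumed
  .option("spark.indextables.write.optimizeWrite.enabled", "true")
  .option("spark.indextables.write.optimizeWrite.targetSplitSize", "1G")
  .option("spark.indextables.write.optimizeWrite.distributionMode", "balanced")
  .save("s3://bucket/path/table")
```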
## Arrow FFI Write Path

Settings for the zero-copy Arrow FFI columnar write path, which replaces the legacy TANT batch serialization with the Arrow C Data Interface (FFI) for a ~31% average improvement in write throughput.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.write.arrowFfi.enabled` | `true` | Enable Arrow FFI columnar ingestion. Set to `false` to use the legacy TANT batch path. |
| `spark.indextables.write.arrowFfi.batchSize` | `8192` | Rows per Arrow batch sent to Rust via FFI |
## Arrow FFI Aggregation Reads

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.aggregation.arrowFfi.enabled` | `true` | Enable Arrow FFI for all aggregation paths (simple, GROUP BY, bucket). Set to `false` for legacy per-bucket JNI reads. |
## Transaction Log

Settings for transaction log management, checkpointing, and caching.

### Checkpointing

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.checkpoint.enabled` | `true` | Enable automatic checkpoints |
| `spark.indextables.checkpoint.interval` | `10` | Create a checkpoint every N transactions |
| `spark.indextables.checkpoint.parallelism` | `8` | Thread pool size for parallel I/O |
| `spark.indextables.checkpoint.read.timeoutSeconds` | `30` | Timeout for parallel reads |
| `spark.indextables.transaction.compression.enabled` | `true` | Enable GZIP compression for transaction files |
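For example, a sketch of adjusting checkpoint cadence session-wide (the values here are illustrative, not recommendations):

```scala
// Checkpoint less frequently (every 20 transactions) and widen the
// parallel-read timeout for slower object stores
spark.conf.set("spark.indextables.checkpoint.interval", "20")
spark.conf.set("spark.indextables.checkpoint.read.timeoutSeconds", "60")
```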
### Transaction Log Cache

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.transaction.cache.enabled` | `true` | Enable transaction log caching |
| `spark.indextables.transaction.cache.expirationSeconds` | `300` | Legacy TTL for all caches (overrides individual TTLs) |
| `spark.indextables.cache.checkpoint.ttl` | `5` | Checkpoint info cache TTL in minutes |
| `spark.indextables.cache.checkpoint.size` | `200` | Maximum checkpoint cache entries |
| `spark.indextables.cache.log.ttl` | `5` | Version log cache TTL in minutes |
| `spark.indextables.cache.log.size` | `1000` | Maximum version log cache entries |
| `spark.indextables.cache.snapshot.ttl` | `10` | Snapshot cache TTL in minutes |
| `spark.indextables.cache.snapshot.size` | `100` | Maximum snapshot cache entries |
| `spark.indextables.cache.filelist.ttl` | `2` | File list cache TTL in minutes |
| `spark.indextables.cache.filelist.size` | `50` | Maximum file list cache entries |
| `spark.indextables.cache.metadata.ttl` | `30` | Metadata cache TTL in minutes |
| `spark.indextables.cache.metadata.size` | `100` | Maximum metadata cache entries |
### State Format

Settings for the high-performance Avro-based transaction log state format.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.format` | `avro` | State format: "avro" or "json" |
| `spark.indextables.state.compression` | `zstd` | Compression: "zstd", "snappy", or "none" |
| `spark.indextables.state.compressionLevel` | `3` | Zstd compression level (1-22) |
| `spark.indextables.state.entriesPerManifest` | `50000` | Maximum entries per manifest file |
### State Compaction

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.compaction.tombstoneThreshold` | `0.10` | Compact when tombstones exceed 10% |
| `spark.indextables.state.compaction.maxManifests` | `20` | Compact when manifest count exceeds limit |
| `spark.indextables.state.compaction.afterMerge` | `true` | Auto-compact after MERGE SPLITS |
### State Retention

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.retention.versions` | `2` | Keep N old state versions |
| `spark.indextables.state.retention.hours` | `168` | State file retention (7 days) |
### Concurrent Write Retry

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.retry.maxAttempts` | `10` | Retry attempts on concurrent write conflict |
| `spark.indextables.state.retry.baseDelayMs` | `100` | Initial backoff delay in milliseconds |
| `spark.indextables.state.retry.maxDelayMs` | `5000` | Maximum backoff delay in milliseconds |
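The base/max delay pair suggests capped exponential backoff. A minimal sketch of the implied schedule, assuming doubling per attempt (the exact formula is an assumption, not taken from the implementation):

```scala
// Capped exponential backoff implied by the retry settings (formula assumed):
// delay(n) = min(baseDelayMs * 2^n, maxDelayMs), for attempts n = 0..maxAttempts-1
def backoffDelays(maxAttempts: Int, baseDelayMs: Long, maxDelayMs: Long): Seq[Long] =
  (0 until maxAttempts).map(n => math.min(baseDelayMs * (1L << n), maxDelayMs))

// With the defaults (10 attempts, 100 ms base, 5000 ms cap) the schedule is:
// 100, 200, 400, 800, 1600, 3200, 5000, 5000, 5000, 5000
```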
### Legacy Retention

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.logRetention.duration` | `2592000000` | Log retention in milliseconds (30 days) |
| `spark.indextables.checkpointRetention.duration` | `7200000` | Checkpoint retention in milliseconds (2 hours) |
| `spark.indextables.cleanup.enabled` | `true` | Enable automatic file cleanup |
## Read Settings

Settings that control read operations and query execution.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.mode` | `fast` | Read mode: "fast" (default limit 250) or "complete" (no limit, streams all results). See Read Mode below. |
| `spark.indextables.read.defaultLimit` | `250` | Default result limit when there is no LIMIT clause (applies in fast mode only) |
| `spark.indextables.read.columnar.enabled` | `true` | Enable Arrow FFI columnar reads for all split types. Set to `false` to force legacy row-based reads. |
| `spark.indextables.docBatch.enabled` | `true` | Enable batch document retrieval |
| `spark.indextables.docBatch.maxSize` | `1000` | Documents per batch (1-10000) |
### Read Task Optimization

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.splitsPerTask` | `auto` | Splits per task: "auto" for dynamic selection, or a fixed numeric value |
| `spark.indextables.read.maxSplitsPerTask` | `8` | Maximum splits per task when using auto mode |
| `spark.indextables.read.aggregate.splitsPerTask` | (fallback) | Override for aggregate queries |
| `spark.indextables.read.aggregate.maxSplitsPerTask` | (fallback) | Max splits for aggregate queries |
With `auto`, the system dynamically adjusts split batching based on cluster size:

- **Small tables** (splits ≤ 2× default parallelism): uses 1 split per task for maximum parallelism
- **Large tables**: batches splits to maintain ~4× over-subscription, capped at `maxSplitsPerTask`

Set a fixed number (e.g., `"1"`) to disable auto mode.
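The auto heuristic described above can be sketched as follows. This is a hypothetical reconstruction based solely on this description, not the actual implementation:

```scala
// Hypothetical sketch of the "auto" splits-per-task heuristic:
// small tables get 1 split per task; large tables batch splits so that
// roughly 4x defaultParallelism tasks remain, capped at maxSplitsPerTask.
def splitsPerTask(numSplits: Int, defaultParallelism: Int, maxSplitsPerTask: Int): Int =
  if (numSplits <= 2 * defaultParallelism) 1
  else {
    val targetTasks = 4 * defaultParallelism                 // ~4x over-subscription
    val batch = math.ceil(numSplits.toDouble / targetTasks).toInt
    math.min(math.max(batch, 1), maxSplitsPerTask)
  }
```
For example, with 64 default parallelism and the default cap of 8, a 100-split table gets 1 split per task, while a 10,000-split table is capped at 8 splits per task.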
### Read Mode

The `read.mode` setting controls whether IndexTables applies a default result limit or streams all matching results.

| Mode | Default Limit | Behavior |
|---|---|---|
| `fast` (default) | 250 rows | Applies `defaultLimit` when there is no explicit LIMIT clause. Best for interactive queries. |
| `complete` | No limit | Streams all matching results in ~128K-row batches with bounded ~24 MB memory. No artificial row cap. |
When to use complete mode: Use complete mode when IndexTables is used as a data source for extracts, ETL pipelines, or any workload where the default 250-row limit would cause correctness issues — for example, when using Companion Mode to query entire partitions or large date ranges against Delta/Iceberg tables. In fast mode, results beyond the default limit are silently truncated, which can produce incomplete extracts.
```scala
// Set complete mode for ETL workloads
spark.conf.set("spark.indextables.read.mode", "complete")
```
complete mode removes the safety net of the default 250-row limit. For interactive ad-hoc queries, fast mode prevents accidental full-table scans. Switch to complete only when you need all matching rows.
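The mode can also be scoped to a single read via a reader option, leaving the session default untouched (the format name and path here are illustrative assumptions):

```scala
// Per-read override: stream all rows for this extract only
val extract = spark.read
  .format("indextables")                            // format name assumed
  .option("spark.indextables.read.mode", "complete")
  .load("s3://bucket/path/table")
```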
### Prescan Filtering

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.prescan.enabled` | `false` | Enable prescan filtering |
| `spark.indextables.read.prescan.minSplitThreshold` | 2 × defaultParallelism | Minimum splits to trigger prescan |
| `spark.indextables.read.prescan.maxConcurrency` | 4 × availableProcessors | Maximum concurrent prescan threads |
| `spark.indextables.read.prescan.timeoutMs` | `30000` | Timeout per split in milliseconds |
## Filter Pushdown

Settings that control which filter operations are pushed down to the index.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.filter.stringPattern.pushdown` | `false` | Master switch: enables all string pattern filters |
| `spark.indextables.filter.stringStartsWith.pushdown` | `false` | Enable prefix matching (most efficient) |
| `spark.indextables.filter.stringEndsWith.pushdown` | `false` | Enable suffix matching (less efficient) |
| `spark.indextables.filter.stringContains.pushdown` | `false` | Enable substring matching (least efficient) |
### IndexQuery Safety

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexquery.indexall.maxUnqualifiedFields` | `10` | Maximum fields for an unqualified `_indexall` indexquery. Set to 0 to disable. |
### String Pattern Performance

- `startsWith`: most efficient; uses sorted index terms
- `endsWith`: less efficient; requires term scanning
- `contains`: least efficient; cannot leverage index structure

Enable only the patterns you need for best performance.
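Following that advice, a minimal sketch enabling only the cheapest pattern session-wide:

```scala
// Enable prefix pushdown only; leave the costlier suffix and substring
// matching disabled so those filters stay in Spark
spark.conf.set("spark.indextables.filter.stringStartsWith.pushdown", "true")
```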
## Prewarm Cache

Settings for the PREWARM CACHE SQL command.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.prewarm.enabled` | `false` | Enable prewarm on read |
| `spark.indextables.prewarm.segments` | `TERM_DICT,FAST_FIELD,POSTINGS,FIELD_NORM` | Segments to prewarm |
| `spark.indextables.prewarm.fields` | (all) | Specific fields to prewarm |
| `spark.indextables.prewarm.splitsPerTask` | `2` | Splits per Spark task |
| `spark.indextables.prewarm.partitionFilter` | (empty) | Partition filter clause |
| `spark.indextables.prewarm.failOnMissingField` | `true` | Fail if a specified field doesn't exist |
| `spark.indextables.prewarm.catchUpNewHosts` | `false` | Prewarm on new hosts added to the cluster |
### Async Prewarm

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.prewarm.async.maxConcurrent` | `1` | Maximum concurrent async prewarm jobs per worker |
| `spark.indextables.prewarm.async.completedJobRetentionMs` | `3600000` | Retention period for completed job metadata (1 hour) |
## Disk Cache (L2)

Settings for the L2 disk cache on NVMe storage.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.cache.disk.enabled` | `auto` | Enable disk cache (auto-enabled when NVMe is detected) |
| `spark.indextables.cache.disk.path` | `auto` | Cache directory path |
| `spark.indextables.cache.disk.maxSize` | `0` (auto) | Maximum cache size (0 = auto, 2/3 of disk) |
| `spark.indextables.cache.disk.manifestSyncInterval` | `30` | Seconds between manifest writes |
| `spark.indextables.cache.disk.writeQueue.mode` | `size` | Write queue mode: "fragment" (bounded slots) or "size" (byte-based backpressure) |
| `spark.indextables.cache.disk.writeQueue.capacity` | `1G` | Write queue capacity: slot count (fragment mode) or byte limit like 1G (size mode) |
| `spark.indextables.cache.disk.dropWritesWhenFull` | `true` | Drop query-path writes instead of blocking when the write queue is full |
| `spark.indextables.cache.disk.writeQueue.maxBudget` | (unlimited) | Maximum native memory budget for the write queue. Limits how much Rust-side memory the write queue can consume. |
| `spark.indextables.cache.coalesceMaxGap` | `512K` | Maximum gap between parquet byte ranges to coalesce into a single fetch. Lower values reduce over-fetch for narrow projections on wide tables. |
## In-Memory Cache

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.cache.maxSize` | `200000000` | Maximum in-memory cache size in bytes |
| `spark.indextables.cache.directoryPath` | `auto` | Custom cache directory |
| `spark.indextables.cache.prewarm.enabled` | `false` | Enable proactive cache warming |
## S3 Configuration

Settings for Amazon S3 storage access.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.aws.accessKey` | - | AWS access key ID |
| `spark.indextables.aws.secretKey` | - | AWS secret access key |
| `spark.indextables.aws.sessionToken` | - | AWS session token (for temporary credentials) |
| `spark.indextables.aws.credentialsProviderClass` | - | Custom credential provider class (FQN) |
| `spark.indextables.s3.maxConcurrency` | `4` | Parallel upload threads (1-32) |
| `spark.indextables.s3.partSize` | `64M` | Multipart upload part size |
| `spark.indextables.s3.streamingThreshold` | `100M` | Threshold for streaming upload |
| `spark.indextables.s3.multipartThreshold` | `100M` | Threshold for multipart upload |
If no credentials are configured, IndexTables automatically uses the EC2 instance's IAM role. This is the recommended approach for production.
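Where explicit credentials are unavoidable (e.g., outside AWS), they can be passed as write options so they never land in cluster-wide config. A sketch, assuming the credentials are available as environment variables and the format name and path are illustrative:

```scala
// Explicit credentials per write; prefer IAM roles in production
df.write
  .format("indextables")                                              // format name assumed
  .option("spark.indextables.aws.accessKey", sys.env("AWS_ACCESS_KEY_ID"))
  .option("spark.indextables.aws.secretKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  .save("s3://bucket/path/table")
```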
## Azure Configuration

Settings for Azure Blob Storage access.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.azure.accountName` | - | Storage account name |
| `spark.indextables.azure.accountKey` | - | Storage account key |
| `spark.indextables.azure.connectionString` | - | Full connection string |
| `spark.indextables.azure.tenantId` | - | Azure AD tenant ID for OAuth |
| `spark.indextables.azure.clientId` | - | Service Principal client ID |
| `spark.indextables.azure.clientSecret` | - | Service Principal client secret |
| `spark.indextables.azure.bearerToken` | - | Explicit OAuth bearer token |
| `spark.indextables.azure.endpoint` | - | Custom Azure endpoint |
## Databricks Unity Catalog

Settings for Unity Catalog credential provider integration.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.databricks.workspaceUrl` | - | Databricks workspace URL (required) |
| `spark.indextables.databricks.apiToken` | - | Databricks API token (required) |
| `spark.indextables.databricks.credential.refreshBuffer.minutes` | `40` | Minutes before expiration to refresh credentials |
| `spark.indextables.databricks.cache.maxSize` | `100` | Maximum cached credential entries |
| `spark.indextables.databricks.fallback.enabled` | `true` | Fall back to READ if READ_WRITE fails |
| `spark.indextables.databricks.retry.attempts` | `3` | Retry attempts on API failure |
See Unity Catalog Configuration for setup instructions.
## Field Indexing

Settings that control how fields are indexed.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexing.typemap.<field>` | `string` | Field indexing type: string, text, json, ip |
| `spark.indextables.indexing.fastfields` | (auto) | Comma-separated list of fast fields |
| `spark.indextables.indexing.storeonlyfields` | (empty) | Fields stored but not indexed |
| `spark.indextables.indexing.indexonlyfields` | (empty) | Fields indexed but not stored |
| `spark.indextables.indexing.tokenizer.<field>` | `default` | Tokenizer: default, whitespace, raw |
| `spark.indextables.indexing.json.mode` | `full` | JSON indexing mode |
| `spark.indextables.indexing.text.indexRecordOption` | `position` | Index record option |
| `spark.indextables.indexing.indexrecordoption.<field>` | - | Per-field index record option |
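A sketch of per-field indexing configuration at write time. The field names (`body`, `client_ip`, `title`, `timestamp`, `severity`), the format name, and the path are hypothetical:

```scala
// Hypothetical write: full-text index "body", IP-type "client_ip",
// whitespace-tokenize "title", and mark two fast fields for aggregations
df.write
  .format("indextables")                                            // format name assumed
  .option("spark.indextables.indexing.typemap.body", "text")
  .option("spark.indextables.indexing.typemap.client_ip", "ip")
  .option("spark.indextables.indexing.tokenizer.title", "whitespace")
  .option("spark.indextables.indexing.fastfields", "timestamp,severity")
  .save("s3://bucket/path/table")
```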
### Text Token Length

Controls the maximum token length for text fields. Tokens exceeding this limit are filtered out (not truncated).

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexing.text.maxTokenLength` | `255` | Global default token length limit |
| `spark.indextables.indexing.tokenLength.<length>` | - | List-based config: `spark.indextables.indexing.tokenLength.255: "content,body"` |
| `spark.indextables.indexing.tokenLength.<field>` | - | Per-field config: `spark.indextables.indexing.tokenLength.content: "255"` |
Named constants:

- `tantivy_max` = 65,530 (the maximum Tantivy supports; use for URLs/base64)
- `default` = 255 (Quickwit-compatible)
- `legacy` = 40 (original tantivy4java default)
The default token length changed from 40 to 255 bytes. Tokens between 41 and 255 bytes that were previously filtered out will now be indexed. To maintain the previous behavior, set:
```scala
spark.conf.set("spark.indextables.indexing.text.maxTokenLength", "legacy")
```
## Merge Settings

Settings for the MERGE SPLITS operation.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.merge.heapSize` | `1G` | Heap size for merge operations |
| `spark.indextables.merge.batchSize` | (defaultParallelism) | Merge groups per batch |
| `spark.indextables.merge.maxConcurrentBatches` | `2` | Maximum concurrent merge batches |
| `spark.indextables.merge.maxSourceSplitsPerMerge` | `1000` | Maximum source splits per merge operation |
| `spark.indextables.merge.tempDirectoryPath` | `auto` | Temporary directory for merge working files |
| `spark.indextables.merge.debug` | `false` | Enable merge debug logging |
### Merge-on-Write

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.mergeOnWrite.enabled` | `false` | Enable automatic merging during writes |
| `spark.indextables.mergeOnWrite.targetSize` | `4G` | Target merged split size |
## Purge Settings

Settings for the PURGE operation and automatic cleanup.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.purge.defaultRetentionHours` | `168` | Default retention period (7 days) |
| `spark.indextables.purge.minRetentionHours` | `24` | Minimum retention period (1 day) |
| `spark.indextables.purge.retentionCheckEnabled` | `true` | Enable retention validation |
| `spark.indextables.purge.parallelism` | `auto` | Parallelism for purge operations |
| `spark.indextables.purge.maxFilesToDelete` | `1000000` | Maximum files to delete per operation |
### Purge-on-Write

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.purgeOnWrite.enabled` | `false` | Enable automatic purging during writes |
| `spark.indextables.purgeOnWrite.triggerAfterMerge` | `true` | Trigger purge after merge operations |
| `spark.indextables.purgeOnWrite.triggerAfterWrites` | `0` | Trigger purge after N writes (0 = disabled) |
| `spark.indextables.purgeOnWrite.splitRetentionHours` | `168` | Split file retention (7 days) |
| `spark.indextables.purgeOnWrite.txLogRetentionHours` | `720` | Transaction log retention (30 days) |
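For example, a sketch of enabling purge-on-write with a shorter split retention (the 72-hour value is illustrative; note the minimum retention settings above still apply):

```scala
// Purge stale split files automatically after writes, keeping 3 days of splits
spark.conf.set("spark.indextables.purgeOnWrite.enabled", "true")
spark.conf.set("spark.indextables.purgeOnWrite.splitRetentionHours", "72")
```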
## Protocol Management

Advanced settings for protocol version management.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.protocol.checkEnabled` | `true` | Enable protocol version checking |
| `spark.indextables.protocol.autoUpgrade` | `false` | Automatically upgrade protocol version |
| `spark.indextables.protocol.enforceReaderVersion` | `true` | Enforce minimum reader version |
| `spark.indextables.protocol.enforceWriterVersion` | `true` | Enforce minimum writer version |
## Companion Mode

Settings for the BUILD INDEXTABLES COMPANION command. See Companion Mode for full details.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.companion.writerHeapSize` | `1G` | Tantivy writer heap size per executor task |
| `spark.indextables.companion.readerBatchSize` | `8192` | Parquet reader batch size |
| `spark.indextables.companion.sync.batchSize` | `defaultParallelism` | Tasks per Spark job |
| `spark.indextables.companion.sync.maxConcurrentBatches` | `6` | Maximum concurrent Spark jobs during sync |
| `spark.indextables.companion.schedulerPool` | `indextables-companion` | Spark scheduler pool name for batch parallelism |
| `spark.indextables.companion.sync.distributedLogRead.enabled` | `true` | Distribute transaction log reads across executors (avoids driver OOM for tables with millions of files) |
| `spark.indextables.companion.sync.arrowFfi.enabled` | `true` | Use Arrow FFI for distributed log reads (zero-copy columnar export) |
| `spark.indextables.companion.tableRootDesignator` | - | Named table root for cross-region reads. When set, companion reads resolve parquet paths from the named root instead of the default source path. Fails if the designated root is not found. |
### Streaming Companion Sync

Settings for continuous streaming sync via WITH STREAMING POLL INTERVAL.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.companion.sync.maxConsecutiveErrors` | `10` | Abort streaming after N consecutive errors |
| `spark.indextables.companion.sync.errorBackoffMultiplier` | `2` | Base for exponential backoff on error (capped at 10× poll interval) |
| `spark.indextables.companion.sync.quietPollLogInterval` | `10` | Log no-change polls every N cycles (suppresses noise) |
| `spark.indextables.companion.sync.maxIncrementalCommits` | `100` | Fall back to a full scan when the Delta version gap exceeds this |
## Iceberg

Settings for Apache Iceberg catalog integration in companion mode.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.iceberg.catalogType` | - | Catalog type: rest, glue, or hive |
| `spark.indextables.iceberg.uri` | - | Catalog URI |
| `spark.indextables.iceberg.warehouse` | - | Warehouse location |
| `spark.indextables.iceberg.token` | - | Authentication token |
| `spark.indextables.iceberg.credential` | - | Authentication credential |
| `spark.indextables.iceberg.s3Endpoint` | - | S3-compatible endpoint (e.g., MinIO) |
| `spark.indextables.iceberg.s3PathStyleAccess` | `false` | Enable S3 path-style access |
## Skipped Files Tracking

Settings for handling problematic files during operations.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.skippedFiles.trackingEnabled` | `true` | Enable skipped files tracking |
| `spark.indextables.skippedFiles.cooldownDuration` | `24` | Hours before retrying skipped files |