Configuration Reference

Complete reference of all IndexTables configuration settings. Settings can be configured as:

  • Write options: .option("spark.indextables.setting", "value") on DataFrameWriter
  • Read options: .option("spark.indextables.setting", "value") on DataFrameReader
  • Spark config: spark.conf.set("spark.indextables.setting", "value")
  • Cluster properties: Set in your cluster configuration

Preferred Configuration Method

For write operations, prefer using .option() on the DataFrameWriter rather than Spark properties. This makes the configuration explicit and avoids affecting other operations.
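
As an illustration, options can be attached directly to the writer. The format name `indextables` and the output path below are assumptions for the sketch — substitute the format string and path your deployment actually uses; only the option keys come from this reference.

```scala
// Hypothetical write: "indextables" format name and s3 path are placeholders.
df.write
  .format("indextables")
  .option("spark.indextables.indexWriter.heapSize", "2G")
  .option("spark.indextables.optimizeWrite.enabled", "true")
  .mode("append")
  .save("s3://bucket/path/table")
```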

Index Writer

Settings that control how data is indexed during write operations.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexWriter.heapSize` | 100M | Memory allocated for indexing (supports "100M", "2G" syntax) |
| `spark.indextables.indexWriter.batchSize` | 10000 | Documents per batch during indexing (1-1000000) |
| `spark.indextables.indexWriter.maxBatchBufferSize` | 90M | Maximum buffer size before flush |
| `spark.indextables.indexWriter.threads` | 2 | Indexing threads per executor (1-16) |
| `spark.indextables.indexWriter.tempDirectoryPath` | auto | Working directory for index creation |
| `spark.indextables.splitConversion.maxParallelism` | auto | Parallelism for split conversion |
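
For example, a memory-heavy batch write might raise the indexer heap and thread count. Session-level config is shown for brevity; the values are illustrative, not recommendations.

```scala
// Illustrative indexing tuning; stay within the documented ranges.
spark.conf.set("spark.indextables.indexWriter.heapSize", "2G")     // more memory per indexer
spark.conf.set("spark.indextables.indexWriter.threads", "4")       // allowed range: 1-16
spark.conf.set("spark.indextables.indexWriter.batchSize", "50000") // allowed range: 1-1000000
```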

Write Optimization

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.optimizeWrite.enabled` | true | Enable optimized write with automatic partitioning |
| `spark.indextables.optimizeWrite.targetRecordsPerSplit` | 1000000 | Target records per split file |
| `spark.indextables.autoSize.enabled` | false | Enable auto-sizing based on historical data |
| `spark.indextables.autoSize.targetSplitSize` | 100M | Target split size (supports "100M", "1G" syntax) |
| `spark.indextables.autoSize.inputRowCount` | estimated | Explicit row count for V2 API |

Shuffle-Based Write Optimization

Produces consistently sized splits by requesting a shuffle through Spark's RequiresDistributionAndOrdering interface and letting AQE coalesce partitions. See Optimized Writes for full details.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.write.optimizeWrite.enabled` | false | Enable shuffle-based write optimization |
| `spark.indextables.write.optimizeWrite.targetSplitSize` | 1G | Target on-disk split size (supports "512M", "1G", "2G" syntax) |
| `spark.indextables.write.optimizeWrite.distributionMode` | hash | Distribution mode: "hash", "balanced", or "none" |
| `spark.indextables.write.optimizeWrite.maxSplitSize` | 4G | Maximum split size before rolling (balanced mode only) |
| `spark.indextables.write.optimizeWrite.samplingRatio` | 1.1 | Split-to-shuffle-data ratio for size estimation |
| `spark.indextables.write.optimizeWrite.minRowsForEstimation` | 10000 | Minimum rows for history-based estimation |
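
Enabling shuffle-based optimization for a large nightly write could be sketched as follows (target size and mode are illustrative choices, not recommendations):

```scala
// Illustrative: request balanced distribution with a 2G target split size.
spark.conf.set("spark.indextables.write.optimizeWrite.enabled", "true")
spark.conf.set("spark.indextables.write.optimizeWrite.targetSplitSize", "2G")
spark.conf.set("spark.indextables.write.optimizeWrite.distributionMode", "balanced")
```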

Arrow FFI Write Path

Settings for the zero-copy Arrow FFI columnar write path, which replaces legacy TANT batch serialization with the Arrow C Data Interface (FFI) for a ~31% average improvement in write throughput.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.write.arrowFfi.enabled` | true | Enable Arrow FFI columnar ingestion. Set to false to use the legacy TANT batch path. |
| `spark.indextables.write.arrowFfi.batchSize` | 8192 | Rows per Arrow batch sent to Rust via FFI |

Arrow FFI Aggregation Reads

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.aggregation.arrowFfi.enabled` | true | Enable Arrow FFI for all aggregation paths (simple, GROUP BY, bucket). Set to false for legacy per-bucket JNI reads. |

Transaction Log

Settings for transaction log management, checkpointing, and caching.

Checkpointing

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.checkpoint.enabled` | true | Enable automatic checkpoints |
| `spark.indextables.checkpoint.interval` | 10 | Create checkpoint every N transactions |
| `spark.indextables.checkpoint.parallelism` | 8 | Thread pool size for parallel I/O |
| `spark.indextables.checkpoint.read.timeoutSeconds` | 30 | Timeout for parallel reads |
| `spark.indextables.transaction.compression.enabled` | true | Enable GZIP compression for transaction files |

Transaction Log Cache

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.transaction.cache.enabled` | true | Enable transaction log caching |
| `spark.indextables.transaction.cache.expirationSeconds` | 300 | Legacy TTL for all caches (overrides individual TTLs) |
| `spark.indextables.cache.checkpoint.ttl` | 5 | Checkpoint info cache TTL in minutes |
| `spark.indextables.cache.checkpoint.size` | 200 | Maximum checkpoint cache entries |
| `spark.indextables.cache.log.ttl` | 5 | Version log cache TTL in minutes |
| `spark.indextables.cache.log.size` | 1000 | Maximum version log cache entries |
| `spark.indextables.cache.snapshot.ttl` | 10 | Snapshot cache TTL in minutes |
| `spark.indextables.cache.snapshot.size` | 100 | Maximum snapshot cache entries |
| `spark.indextables.cache.filelist.ttl` | 2 | File list cache TTL in minutes |
| `spark.indextables.cache.filelist.size` | 50 | Maximum file list cache entries |
| `spark.indextables.cache.metadata.ttl` | 30 | Metadata cache TTL in minutes |
| `spark.indextables.cache.metadata.size` | 100 | Maximum metadata cache entries |

State Format (Avro)

Settings for the high-performance Avro-based transaction log state format.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.format` | avro | State format: "avro" or "json" |
| `spark.indextables.state.compression` | zstd | Compression: "zstd", "snappy", or "none" |
| `spark.indextables.state.compressionLevel` | 3 | Zstd compression level (1-22) |
| `spark.indextables.state.entriesPerManifest` | 50000 | Maximum entries per manifest file |

State Compaction

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.compaction.tombstoneThreshold` | 0.10 | Compact when tombstones exceed 10% |
| `spark.indextables.state.compaction.maxManifests` | 20 | Compact when manifest count exceeds limit |
| `spark.indextables.state.compaction.afterMerge` | true | Auto-compact after MERGE SPLITS |

State Retention

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.retention.versions` | 2 | Keep N old state versions |
| `spark.indextables.state.retention.hours` | 168 | State file retention (7 days) |

Concurrent Write Retry

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.state.retry.maxAttempts` | 10 | Retry attempts on concurrent write conflict |
| `spark.indextables.state.retry.baseDelayMs` | 100 | Initial backoff delay in milliseconds |
| `spark.indextables.state.retry.maxDelayMs` | 5000 | Maximum backoff delay in milliseconds |
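
With the defaults above, retries back off from 100 ms toward the 5000 ms cap. A sketch for a high-contention writer that tolerates longer waits (values are illustrative):

```scala
// Illustrative: allow more retries and a higher backoff ceiling
// when many writers commit to the same table concurrently.
spark.conf.set("spark.indextables.state.retry.maxAttempts", "20")
spark.conf.set("spark.indextables.state.retry.maxDelayMs", "15000")
```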

Legacy Retention

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.logRetention.duration` | 2592000000 | Log retention in milliseconds (30 days) |
| `spark.indextables.checkpointRetention.duration` | 7200000 | Checkpoint retention in milliseconds (2 hours) |
| `spark.indextables.cleanup.enabled` | true | Enable automatic file cleanup |

Read Settings

Settings that control read operations and query execution.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.mode` | fast | Read mode: fast (default limit 250) or complete (no limit, streams all results). See Read Mode below. |
| `spark.indextables.read.defaultLimit` | 250 | Default result limit when no LIMIT clause (applies in fast mode only) |
| `spark.indextables.read.columnar.enabled` | true | Enable Arrow FFI columnar reads for all split types. Set to false to force legacy row-based reads. |
| `spark.indextables.docBatch.enabled` | true | Enable batch document retrieval |
| `spark.indextables.docBatch.maxSize` | 1000 | Documents per batch (1-10000) |

Read Task Optimization

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.splitsPerTask` | auto | Splits per task: "auto" for dynamic selection, or fixed numeric value |
| `spark.indextables.read.maxSplitsPerTask` | 8 | Maximum splits per task when using auto mode |
| `spark.indextables.read.aggregate.splitsPerTask` | (fallback) | Override for aggregate queries |
| `spark.indextables.read.aggregate.maxSplitsPerTask` | (fallback) | Max splits for aggregate queries |

Auto Mode Behavior

With auto, the system dynamically adjusts split batching based on cluster size:

  • Small tables (splits ≤ 2× default parallelism): Uses 1 split per task for maximum parallelism
  • Large tables: Batches splits to maintain ~4× over-subscription, capped at maxSplitsPerTask

Set to a fixed number (e.g., "1") to disable auto mode.
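
For instance, one split per task can be pinned explicitly, or auto mode kept with a higher batching cap (values are illustrative):

```scala
// Pin one split per task, disabling auto batching entirely:
spark.conf.set("spark.indextables.read.splitsPerTask", "1")
// Or keep auto mode but allow larger batches on very large tables:
spark.conf.set("spark.indextables.read.maxSplitsPerTask", "16")
```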

Read Mode

The read.mode setting controls whether IndexTables applies a default result limit or streams all matching results.

| Mode | Default Limit | Behavior |
|---|---|---|
| `fast` (default) | 250 rows | Applies defaultLimit when no explicit LIMIT clause. Best for interactive queries. |
| `complete` | No limit | Streams all matching results in ~128K-row batches with bounded ~24MB memory. No artificial row cap. |

When to use complete mode: Use complete mode when IndexTables is used as a data source for extracts, ETL pipelines, or any workload where the default 250-row limit would cause correctness issues — for example, when using Companion Mode to query entire partitions or large date ranges against Delta/Iceberg tables. In fast mode, results beyond the default limit are silently truncated, which can produce incomplete extracts.

// Set complete mode for ETL workloads
spark.conf.set("spark.indextables.read.mode", "complete")
warning

complete mode removes the safety net of the default 250-row limit. For interactive ad-hoc queries, fast mode prevents accidental full-table scans. Switch to complete only when you need all matching rows.

Prescan Filtering

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.read.prescan.enabled` | false | Enable prescan filtering |
| `spark.indextables.read.prescan.minSplitThreshold` | 2 × defaultParallelism | Minimum splits to trigger prescan |
| `spark.indextables.read.prescan.maxConcurrency` | 4 × availableProcessors | Maximum concurrent prescan threads |
| `spark.indextables.read.prescan.timeoutMs` | 30000 | Timeout per split in milliseconds |

Filter Pushdown

Settings that control which filter operations are pushed down to the index.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.filter.stringPattern.pushdown` | false | Master switch that enables all string pattern filters |
| `spark.indextables.filter.stringStartsWith.pushdown` | false | Enable prefix matching (most efficient) |
| `spark.indextables.filter.stringEndsWith.pushdown` | false | Enable suffix matching (less efficient) |
| `spark.indextables.filter.stringContains.pushdown` | false | Enable substring matching (least efficient) |

IndexQuery Safety

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexquery.indexall.maxUnqualifiedFields` | 10 | Max fields for unqualified _indexall indexquery. Set to 0 to disable. |

String Pattern Performance
  • startsWith: Most efficient - uses sorted index terms
  • endsWith: Less efficient - requires term scanning
  • contains: Least efficient - cannot leverage index structure

Enable only the patterns you need for best performance.
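
For example, a workload that only issues prefix queries would enable just the prefix filter rather than the master switch:

```scala
// Enable only prefix pushdown; leave suffix/substring pushdown disabled.
spark.conf.set("spark.indextables.filter.stringStartsWith.pushdown", "true")
```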

Prewarm Cache

Settings for the PREWARM CACHE SQL command.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.prewarm.enabled` | false | Enable prewarm on read |
| `spark.indextables.prewarm.segments` | TERM_DICT,FAST_FIELD,POSTINGS,FIELD_NORM | Segments to prewarm |
| `spark.indextables.prewarm.fields` | (all) | Specific fields to prewarm |
| `spark.indextables.prewarm.splitsPerTask` | 2 | Splits per Spark task |
| `spark.indextables.prewarm.partitionFilter` | (empty) | Partition filter clause |
| `spark.indextables.prewarm.failOnMissingField` | true | Fail if specified field doesn't exist |
| `spark.indextables.prewarm.catchUpNewHosts` | false | Prewarm on new hosts added to cluster |

Async Prewarm

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.prewarm.async.maxConcurrent` | 1 | Maximum concurrent async prewarm jobs per worker |
| `spark.indextables.prewarm.async.completedJobRetentionMs` | 3600000 | Retention period for completed job metadata (1 hour) |

Disk Cache (L2)

Settings for the L2 disk cache on NVMe storage.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.cache.disk.enabled` | auto | Enable disk cache (auto-enabled when NVMe detected) |
| `spark.indextables.cache.disk.path` | auto | Cache directory path |
| `spark.indextables.cache.disk.maxSize` | 0 (auto) | Maximum cache size (0 = auto 2/3 of disk) |
| `spark.indextables.cache.disk.manifestSyncInterval` | 30 | Seconds between manifest writes |
| `spark.indextables.cache.disk.writeQueue.mode` | size | Write queue mode: fragment (bounded slots) or size (byte-based backpressure) |
| `spark.indextables.cache.disk.writeQueue.capacity` | 1G | Write queue capacity: slot count (fragment mode) or byte limit like 1G (size mode) |
| `spark.indextables.cache.disk.dropWritesWhenFull` | true | Drop query-path writes instead of blocking when the write queue is full |
| `spark.indextables.cache.disk.writeQueue.maxBudget` | (unlimited) | Maximum native memory budget for the write queue. Limits how much Rust-side memory the write queue can consume. |
| `spark.indextables.cache.coalesceMaxGap` | 512K | Maximum gap between parquet byte ranges to coalesce into a single fetch. Lower values reduce over-fetch for narrow projections on wide tables. |
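
A sketch of explicit disk-cache configuration for a host with a dedicated NVMe volume. The path and size are illustrative, and the size string assumes the same byte-size syntax ("100M", "1G") used by the other settings here:

```scala
// Illustrative: force the disk cache onto a known NVMe mount with a fixed cap.
spark.conf.set("spark.indextables.cache.disk.enabled", "true")
spark.conf.set("spark.indextables.cache.disk.path", "/mnt/nvme/indextables-cache") // illustrative path
spark.conf.set("spark.indextables.cache.disk.maxSize", "500G")                     // illustrative cap
```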

In-Memory Cache

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.cache.maxSize` | 200000000 | Maximum in-memory cache size in bytes |
| `spark.indextables.cache.directoryPath` | auto | Custom cache directory |
| `spark.indextables.cache.prewarm.enabled` | false | Enable proactive cache warming |

S3 Configuration

Settings for Amazon S3 storage access.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.aws.accessKey` | - | AWS access key ID |
| `spark.indextables.aws.secretKey` | - | AWS secret access key |
| `spark.indextables.aws.sessionToken` | - | AWS session token (for temporary credentials) |
| `spark.indextables.aws.credentialsProviderClass` | - | Custom credential provider class (FQN) |
| `spark.indextables.s3.maxConcurrency` | 4 | Parallel upload threads (1-32) |
| `spark.indextables.s3.partSize` | 64M | Multipart upload part size |
| `spark.indextables.s3.streamingThreshold` | 100M | Threshold for streaming upload |
| `spark.indextables.s3.multipartThreshold` | 100M | Threshold for multipart upload |

AWS Credentials

If no credentials are configured, IndexTables automatically uses the EC2 instance's IAM role. This is the recommended approach for production.
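
When explicit credentials are unavoidable, they can be scoped to a single read instead of the whole session. The format name `indextables`, the path, and all credential values below are placeholders:

```scala
// Hypothetical scoped-credential read; every value here is a placeholder.
val df = spark.read
  .format("indextables") // format name assumed; use your deployment's format string
  .option("spark.indextables.aws.accessKey", "<access-key>")
  .option("spark.indextables.aws.secretKey", "<secret-key>")
  .option("spark.indextables.aws.sessionToken", "<session-token>")
  .load("s3://bucket/path/table")
```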

Azure Configuration

Settings for Azure Blob Storage access.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.azure.accountName` | - | Storage account name |
| `spark.indextables.azure.accountKey` | - | Storage account key |
| `spark.indextables.azure.connectionString` | - | Full connection string |
| `spark.indextables.azure.tenantId` | - | Azure AD tenant ID for OAuth |
| `spark.indextables.azure.clientId` | - | Service Principal client ID |
| `spark.indextables.azure.clientSecret` | - | Service Principal client secret |
| `spark.indextables.azure.bearerToken` | - | Explicit OAuth bearer token |
| `spark.indextables.azure.endpoint` | - | Custom Azure endpoint |

Databricks Unity Catalog

Settings for Unity Catalog credential provider integration.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.databricks.workspaceUrl` | - | Databricks workspace URL (required) |
| `spark.indextables.databricks.apiToken` | - | Databricks API token (required) |
| `spark.indextables.databricks.credential.refreshBuffer.minutes` | 40 | Minutes before expiration to refresh credentials |
| `spark.indextables.databricks.cache.maxSize` | 100 | Maximum cached credential entries |
| `spark.indextables.databricks.fallback.enabled` | true | Fallback to READ if READ_WRITE fails |
| `spark.indextables.databricks.retry.attempts` | 3 | Retry attempts on API failure |

See Unity Catalog Configuration for setup instructions.

Field Indexing

Settings that control how fields are indexed.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexing.typemap.<field>` | string | Field indexing type: string, text, json, ip |
| `spark.indextables.indexing.fastfields` | (auto) | Comma-separated list of fast fields |
| `spark.indextables.indexing.storeonlyfields` | (empty) | Fields stored but not indexed |
| `spark.indextables.indexing.indexonlyfields` | (empty) | Fields indexed but not stored |
| `spark.indextables.indexing.tokenizer.<field>` | default | Tokenizer: default, whitespace, raw |
| `spark.indextables.indexing.json.mode` | full | JSON indexing mode |
| `spark.indextables.indexing.text.indexRecordOption` | position | Index record option |
| `spark.indextables.indexing.indexrecordoption.<field>` | - | Per-field index record option |
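
A combined sketch mapping per-field types, a tokenizer, and fast fields on a write. The field names (`body`, `client_ip`, `timestamp`, `status`), the format name, and the path are illustrative:

```scala
// Hypothetical log-table write; field names and path are placeholders.
df.write
  .format("indextables") // format name assumed
  .option("spark.indextables.indexing.typemap.body", "text")      // full-text search field
  .option("spark.indextables.indexing.typemap.client_ip", "ip")   // IP-typed field
  .option("spark.indextables.indexing.tokenizer.body", "default")
  .option("spark.indextables.indexing.fastfields", "timestamp,status")
  .save("s3://bucket/path/logs")
```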

Text Token Length

Control the maximum token length for text fields. Tokens exceeding this limit are filtered out (not truncated).

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.indexing.text.maxTokenLength` | 255 | Global default token length limit |
| `spark.indextables.indexing.tokenLength.<length>` | - | List-based config: `spark.indextables.indexing.tokenLength.255: "content,body"` |
| `spark.indextables.indexing.tokenLength.<field>` | - | Per-field config: `spark.indextables.indexing.tokenLength.content: "255"` |

Named constants:

  • tantivy_max = 65,530 (maximum Tantivy supports, use for URLs/base64)
  • default = 255 (Quickwit-compatible)
  • legacy = 40 (original tantivy4java default)
Breaking Change (v0.4.5)

The default token length changed from 40 to 255 bytes. Tokens of 41-255 bytes that were previously filtered out are now indexed. To maintain the previous behavior, set:

spark.conf.set("spark.indextables.indexing.text.maxTokenLength", "legacy")

Merge Settings

Settings for the MERGE SPLITS operation.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.merge.heapSize` | 1G | Heap size for merge operations |
| `spark.indextables.merge.batchSize` | (defaultParallelism) | Merge groups per batch |
| `spark.indextables.merge.maxConcurrentBatches` | 2 | Maximum concurrent merge batches |
| `spark.indextables.merge.maxSourceSplitsPerMerge` | 1000 | Maximum source splits per merge operation |
| `spark.indextables.merge.tempDirectoryPath` | auto | Temporary directory for merge working files |
| `spark.indextables.merge.debug` | false | Enable merge debug logging |

Merge-on-Write

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.mergeOnWrite.enabled` | false | Enable automatic merging during writes |
| `spark.indextables.mergeOnWrite.targetSize` | 4G | Target merged split size |

Purge Settings

Settings for the PURGE operation and automatic cleanup.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.purge.defaultRetentionHours` | 168 | Default retention period (7 days) |
| `spark.indextables.purge.minRetentionHours` | 24 | Minimum retention period (1 day) |
| `spark.indextables.purge.retentionCheckEnabled` | true | Enable retention validation |
| `spark.indextables.purge.parallelism` | auto | Parallelism for purge operations |
| `spark.indextables.purge.maxFilesToDelete` | 1000000 | Maximum files to delete per operation |

Purge-on-Write

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.purgeOnWrite.enabled` | false | Enable automatic purging during writes |
| `spark.indextables.purgeOnWrite.triggerAfterMerge` | true | Trigger purge after merge operations |
| `spark.indextables.purgeOnWrite.triggerAfterWrites` | 0 | Trigger purge after N writes (0 = disabled) |
| `spark.indextables.purgeOnWrite.splitRetentionHours` | 168 | Split file retention (7 days) |
| `spark.indextables.purgeOnWrite.txLogRetentionHours` | 720 | Transaction log retention (30 days) |

Protocol Management

Advanced settings for protocol version management.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.protocol.checkEnabled` | true | Enable protocol version checking |
| `spark.indextables.protocol.autoUpgrade` | false | Automatically upgrade protocol version |
| `spark.indextables.protocol.enforceReaderVersion` | true | Enforce minimum reader version |
| `spark.indextables.protocol.enforceWriterVersion` | true | Enforce minimum writer version |

Companion Mode

Settings for the BUILD INDEXTABLES COMPANION command. See Companion Mode for full details.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.companion.writerHeapSize` | 1G | Tantivy writer heap size per executor task |
| `spark.indextables.companion.readerBatchSize` | 8192 | Parquet reader batch size |
| `spark.indextables.companion.sync.batchSize` | defaultParallelism | Tasks per Spark job |
| `spark.indextables.companion.sync.maxConcurrentBatches` | 6 | Maximum concurrent Spark jobs during sync |
| `spark.indextables.companion.schedulerPool` | indextables-companion | Spark scheduler pool name for batch parallelism |
| `spark.indextables.companion.sync.distributedLogRead.enabled` | true | Distribute transaction log reads across executors (avoids driver OOM for tables with millions of files) |
| `spark.indextables.companion.sync.arrowFfi.enabled` | true | Use Arrow FFI for distributed log reads (zero-copy columnar export) |
| `spark.indextables.companion.tableRootDesignator` | - | Named table root for cross-region reads. When set, companion reads resolve parquet paths from the named root instead of the default source path. Fails if the designated root is not found. |
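
Session-level companion tuning before running the build command might look like this (values are illustrative; see Companion Mode for the command syntax):

```scala
// Illustrative: more writer heap, less sync concurrency on a small cluster.
spark.conf.set("spark.indextables.companion.writerHeapSize", "2G")
spark.conf.set("spark.indextables.companion.sync.maxConcurrentBatches", "4")
```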

Streaming Companion Sync

Settings for continuous streaming sync via WITH STREAMING POLL INTERVAL.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.companion.sync.maxConsecutiveErrors` | 10 | Abort streaming after N consecutive errors |
| `spark.indextables.companion.sync.errorBackoffMultiplier` | 2 | Base for exponential backoff on error (capped at 10x poll interval) |
| `spark.indextables.companion.sync.quietPollLogInterval` | 10 | Log no-change polls every N cycles (suppress noise) |
| `spark.indextables.companion.sync.maxIncrementalCommits` | 100 | Fall back to full scan when Delta version gap exceeds this |

Iceberg

Settings for Apache Iceberg catalog integration in companion mode.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.iceberg.catalogType` | - | Catalog type: rest, glue, or hive |
| `spark.indextables.iceberg.uri` | - | Catalog URI |
| `spark.indextables.iceberg.warehouse` | - | Warehouse location |
| `spark.indextables.iceberg.token` | - | Authentication token |
| `spark.indextables.iceberg.credential` | - | Authentication credential |
| `spark.indextables.iceberg.s3Endpoint` | - | S3-compatible endpoint (e.g., MinIO) |
| `spark.indextables.iceberg.s3PathStyleAccess` | false | Enable S3 path-style access |
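
A REST-catalog setup against a local MinIO endpoint could be sketched as follows; every URI and the warehouse location are placeholders:

```scala
// Hypothetical REST catalog + MinIO configuration; all values are placeholders.
spark.conf.set("spark.indextables.iceberg.catalogType", "rest")
spark.conf.set("spark.indextables.iceberg.uri", "http://localhost:8181")        // catalog URI
spark.conf.set("spark.indextables.iceberg.warehouse", "s3://warehouse/")        // warehouse location
spark.conf.set("spark.indextables.iceberg.s3Endpoint", "http://localhost:9000") // MinIO endpoint
spark.conf.set("spark.indextables.iceberg.s3PathStyleAccess", "true")           // required for MinIO-style URLs
```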

Skipped Files Tracking

Settings for handling problematic files during operations.

| Setting | Default | Description |
|---|---|---|
| `spark.indextables.skippedFiles.trackingEnabled` | true | Enable skipped files tracking |
| `spark.indextables.skippedFiles.cooldownDuration` | 24 | Hours before retrying skipped files |