Complete reference of all IndexTables configuration settings. Settings can be configured as:

- **Write options**: `.option("spark.indextables.setting", "value")` on a `DataFrameWriter`
- **Read options**: `.option("spark.indextables.setting", "value")` on a `DataFrameReader`
- **Spark config**: `spark.conf.set("spark.indextables.setting", "value")`
- **Cluster properties**: set in your cluster configuration
> **Preferred Configuration Method**
>
> For write operations, prefer `.option()` on the `DataFrameWriter` rather than Spark session properties. This makes the configuration explicit and avoids affecting other operations.
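For example, the same setting can be supplied per operation or session-wide. A minimal Scala sketch; the format name `indextables` and the table path are illustrative assumptions:

```scala
// Preferred: explicit, per-operation configuration on the writer
df.write
  .format("indextables")                                   // assumed data source name
  .option("spark.indextables.indexWriter.heapSize", "2G")
  .save("s3://my-bucket/events")                           // hypothetical path

// Alternative: session-wide configuration (affects every subsequent operation)
spark.conf.set("spark.indextables.indexWriter.heapSize", "2G")
```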
## Index Writer

Settings that control how data is indexed during write operations.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.indexWriter.heapSize` | `100M` | Memory allocated for indexing (supports "100M", "2G" syntax) |
| `spark.indextables.indexWriter.batchSize` | `10000` | Documents per batch during indexing (1-1000000) |
| `spark.indextables.indexWriter.maxBatchBufferSize` | `90M` | Maximum buffer size before flush |
| `spark.indextables.indexWriter.threads` | `2` | Indexing threads per executor (1-16) |
| `spark.indextables.indexWriter.tempDirectoryPath` | auto | Working directory for index creation |
| `spark.indextables.splitConversion.maxParallelism` | auto | Parallelism for split conversion |
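A write job on memory-rich executors might raise these limits. A hedged sketch (format name and path are illustrative, as above):

```scala
// Sketch: more indexing memory and threads per executor, larger batches
df.write
  .format("indextables")
  .option("spark.indextables.indexWriter.heapSize", "2G")      // up from the 100M default
  .option("spark.indextables.indexWriter.threads", "4")        // within the 1-16 range
  .option("spark.indextables.indexWriter.batchSize", "50000")  // within 1-1000000
  .save("s3://my-bucket/events")                               // hypothetical path
```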
## Write Optimization

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.optimizeWrite.enabled` | `true` | Enable optimized write with automatic partitioning |
| `spark.indextables.optimizeWrite.targetRecordsPerSplit` | `1000000` | Target records per split file |
| `spark.indextables.autoSize.enabled` | `false` | Enable auto-sizing based on historical data |
| `spark.indextables.autoSize.targetSplitSize` | `100M` | Target split size (supports "100M", "1G" syntax) |
| `spark.indextables.autoSize.inputRowCount` | estimated | Explicit row count for the V2 API |
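For instance, to aim for smaller splits than the one-million-record default (a sketch; path illustrative):

```scala
// Sketch: target 500K records per split file
df.write
  .format("indextables")
  .option("spark.indextables.optimizeWrite.enabled", "true")
  .option("spark.indextables.optimizeWrite.targetRecordsPerSplit", "500000")
  .save("s3://my-bucket/events")  // hypothetical path
```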
## Transaction Log

Settings for transaction log management, checkpointing, and caching.

### Checkpointing

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.checkpoint.enabled` | `true` | Enable automatic checkpoints |
| `spark.indextables.checkpoint.interval` | `10` | Create a checkpoint every N transactions |
| `spark.indextables.checkpoint.parallelism` | `8` | Thread pool size for parallel I/O |
| `spark.indextables.checkpoint.read.timeoutSeconds` | `30` | Timeout for parallel reads (seconds) |
| `spark.indextables.transaction.compression.enabled` | `true` | Enable GZIP compression for transaction files |
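A table with many small commits might checkpoint more aggressively. A sketch using session-wide configuration:

```scala
// Sketch: checkpoint every 5 transactions with a wider I/O thread pool
spark.conf.set("spark.indextables.checkpoint.enabled", "true")
spark.conf.set("spark.indextables.checkpoint.interval", "5")     // down from the default of 10
spark.conf.set("spark.indextables.checkpoint.parallelism", "16") // up from the default of 8
```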
### Transaction Log Cache

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.transaction.cache.enabled` | `true` | Enable transaction log caching |
| `spark.indextables.transaction.cache.expirationSeconds` | `300` | Cache TTL (5 minutes) |
### Retention

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.logRetention.duration` | `2592000000` | Log retention in milliseconds (30 days) |
| `spark.indextables.checkpointRetention.duration` | `7200000` | Checkpoint retention in milliseconds (2 hours) |
| `spark.indextables.cleanup.enabled` | `true` | Enable automatic file cleanup |
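Since these durations are in milliseconds, it can be clearer to compute them. A sketch:

```scala
// Sketch: shorten log retention from the 30-day default to 7 days
val sevenDaysMs = 7L * 24 * 60 * 60 * 1000  // 604800000 ms
spark.conf.set("spark.indextables.logRetention.duration", sevenDaysMs.toString)
```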
## Read Settings

Settings that control read operations and query execution.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.read.defaultLimit` | `250` | Default result limit when no LIMIT clause is present |
| `spark.indextables.docBatch.enabled` | `true` | Enable batch document retrieval |
| `spark.indextables.docBatch.maxSize` | `1000` | Documents per batch (1-10000) |
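A scan-heavy read might raise the batch size. A sketch (format name and path illustrative):

```scala
// Sketch: larger document batches for a scan-heavy read
val events = spark.read
  .format("indextables")
  .option("spark.indextables.docBatch.maxSize", "5000")  // within the 1-10000 range
  .load("s3://my-bucket/events")                         // hypothetical path
```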
### Prescan Filtering

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.read.prescan.enabled` | `false` | Enable prescan filtering |
| `spark.indextables.read.prescan.minSplitThreshold` | 2 × defaultParallelism | Minimum number of splits that triggers a prescan |
| `spark.indextables.read.prescan.maxConcurrency` | 4 × availableProcessors | Maximum concurrent prescan threads |
| `spark.indextables.read.prescan.timeoutMs` | `30000` | Timeout per split in milliseconds |
## Filter Pushdown

Settings that control which filter operations are pushed down to the index.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.filter.stringPattern.pushdown` | `false` | Master switch that enables all string pattern filters |
| `spark.indextables.filter.stringStartsWith.pushdown` | `false` | Enable prefix matching (most efficient) |
| `spark.indextables.filter.stringEndsWith.pushdown` | `false` | Enable suffix matching (less efficient) |
| `spark.indextables.filter.stringContains.pushdown` | `false` | Enable substring matching (least efficient) |
> **String Pattern Performance**
>
> - `startsWith`: most efficient; uses sorted index terms
> - `endsWith`: less efficient; requires term scanning
> - `contains`: least efficient; cannot leverage the index structure
>
> Enable only the patterns you need for best performance.
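For example, enabling only prefix pushdown lets prefix filters be served by the sorted term index. A sketch (format name, path, and column name are illustrative):

```scala
import org.apache.spark.sql.functions.col

// Sketch: enable only prefix pushdown, the cheapest pattern
val events = spark.read
  .format("indextables")
  .option("spark.indextables.filter.stringStartsWith.pushdown", "true")
  .load("s3://my-bucket/events")  // hypothetical path

// startsWith can now be served by the index; a contains() filter
// would still be evaluated by Spark after the scan
events.filter(col("url").startsWith("https://example.com/")).show()
```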
## Prewarm Cache

Settings for the `PREWARM CACHE` SQL command.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.prewarm.enabled` | `false` | Enable prewarm on read |
| `spark.indextables.prewarm.segments` | `TERM_DICT,FAST_FIELD,POSTINGS,FIELD_NORM` | Segments to prewarm |
| `spark.indextables.prewarm.fields` | (all) | Specific fields to prewarm |
| `spark.indextables.prewarm.splitsPerTask` | `2` | Splits per Spark task |
| `spark.indextables.prewarm.partitionFilter` | (empty) | Partition filter clause |
| `spark.indextables.prewarm.failOnMissingField` | `true` | Fail if a specified field doesn't exist |
| `spark.indextables.prewarm.catchUpNewHosts` | `false` | Prewarm on new hosts added to the cluster |
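A sketch that prewarms only a subset of the default segments on read (format name and path illustrative):

```scala
// Sketch: prewarm only term dictionaries and fast fields when reading
val events = spark.read
  .format("indextables")
  .option("spark.indextables.prewarm.enabled", "true")
  .option("spark.indextables.prewarm.segments", "TERM_DICT,FAST_FIELD")
  .load("s3://my-bucket/events")  // hypothetical path
```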
## Disk Cache (L2)

Settings for the L2 disk cache on NVMe storage.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.cache.disk.enabled` | auto | Enable disk cache (auto-enabled when NVMe is detected) |
| `spark.indextables.cache.disk.path` | auto | Cache directory path |
| `spark.indextables.cache.disk.maxSize` | `0` (auto) | Maximum cache size (0 = auto, 2/3 of disk capacity) |
| `spark.indextables.cache.disk.manifestSyncInterval` | `30` | Seconds between manifest writes |
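A sketch pinning the cache to a specific mount; the mount point is hypothetical, and it assumes `maxSize` accepts the same size-string syntax as the other size settings:

```scala
// Sketch: pin the disk cache to a known NVMe mount with an explicit cap
spark.conf.set("spark.indextables.cache.disk.enabled", "true")
spark.conf.set("spark.indextables.cache.disk.path", "/mnt/nvme/indextables-cache") // hypothetical mount
spark.conf.set("spark.indextables.cache.disk.maxSize", "500G")                     // assumed size-string syntax
```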
## In-Memory Cache

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.cache.maxSize` | `200000000` | Maximum in-memory cache size in bytes |
| `spark.indextables.cache.directoryPath` | auto | Custom cache directory |
| `spark.indextables.cache.prewarm.enabled` | `false` | Enable proactive cache warming |
## S3 Configuration

Settings for Amazon S3 storage access.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.aws.accessKey` | - | AWS access key ID |
| `spark.indextables.aws.secretKey` | - | AWS secret access key |
| `spark.indextables.aws.sessionToken` | - | AWS session token (for temporary credentials) |
| `spark.indextables.aws.credentialsProviderClass` | - | Custom credential provider class (fully qualified name) |
| `spark.indextables.s3.maxConcurrency` | `4` | Parallel upload threads (1-32) |
| `spark.indextables.s3.partSize` | `64M` | Multipart upload part size |
| `spark.indextables.s3.streamingThreshold` | `100M` | Threshold for streaming upload |
| `spark.indextables.s3.multipartThreshold` | `100M` | Threshold for multipart upload |
If no credentials are configured, IndexTables automatically uses the EC2 instance's IAM role. This is the recommended approach for production.
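A sketch for the explicit-credential case, e.g. local development; the environment variable names and path are illustrative:

```scala
// Sketch: explicit credentials for development; in production, omit them
// so the EC2 instance's IAM role is used instead
df.write
  .format("indextables")
  .option("spark.indextables.aws.accessKey", sys.env("AWS_ACCESS_KEY_ID"))
  .option("spark.indextables.aws.secretKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  .option("spark.indextables.s3.maxConcurrency", "8")  // within the 1-32 range
  .save("s3://my-bucket/events")                       // hypothetical path
```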
## Azure Configuration

Settings for Azure Blob Storage access.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.azure.accountName` | - | Storage account name |
| `spark.indextables.azure.accountKey` | - | Storage account key |
| `spark.indextables.azure.connectionString` | - | Full connection string |
| `spark.indextables.azure.tenantId` | - | Azure AD tenant ID for OAuth |
| `spark.indextables.azure.clientId` | - | Service Principal client ID |
| `spark.indextables.azure.clientSecret` | - | Service Principal client secret |
| `spark.indextables.azure.bearerToken` | - | Explicit OAuth bearer token |
| `spark.indextables.azure.endpoint` | - | Custom Azure endpoint |
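A sketch for Service Principal (OAuth) authentication; the account name and environment variable names are placeholders:

```scala
// Sketch: Service Principal (OAuth) authentication
spark.conf.set("spark.indextables.azure.accountName", "mystorageaccount") // hypothetical account
spark.conf.set("spark.indextables.azure.tenantId", sys.env("AZURE_TENANT_ID"))
spark.conf.set("spark.indextables.azure.clientId", sys.env("AZURE_CLIENT_ID"))
spark.conf.set("spark.indextables.azure.clientSecret", sys.env("AZURE_CLIENT_SECRET"))
```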
## Databricks Unity Catalog

Settings for Unity Catalog credential provider integration.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.databricks.workspaceUrl` | - | Databricks workspace URL (required) |
| `spark.indextables.databricks.apiToken` | - | Databricks API token (required) |
| `spark.indextables.databricks.credential.refreshBuffer.minutes` | `40` | Minutes before expiration at which credentials are refreshed |
| `spark.indextables.databricks.cache.maxSize` | `100` | Maximum cached credential entries |
| `spark.indextables.databricks.fallback.enabled` | `true` | Fall back to READ if READ_WRITE fails |
| `spark.indextables.databricks.retry.attempts` | `3` | Retry attempts on API failure |
See Unity Catalog Configuration for setup instructions.
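A minimal sketch of the two required settings; the workspace URL and token source are placeholders:

```scala
// Sketch: minimal Unity Catalog credential provider configuration
spark.conf.set("spark.indextables.databricks.workspaceUrl",
  "https://myworkspace.cloud.databricks.com")                // hypothetical workspace
spark.conf.set("spark.indextables.databricks.apiToken", sys.env("DATABRICKS_TOKEN"))
```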
## Field Indexing

Settings that control how fields are indexed.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.indexing.typemap.<field>` | `string` | Field indexing type: `string`, `text`, or `json` |
| `spark.indextables.indexing.fastfields` | (auto) | Comma-separated list of fast fields |
| `spark.indextables.indexing.storeonlyfields` | (empty) | Fields stored but not indexed |
| `spark.indextables.indexing.indexonlyfields` | (empty) | Fields indexed but not stored |
| `spark.indextables.indexing.tokenizer.<field>` | `default` | Tokenizer: `default`, `whitespace`, or `raw` |
| `spark.indextables.indexing.json.mode` | `full` | JSON indexing mode |
| `spark.indextables.indexing.text.indexRecordOption` | `position` | Index record option |
| `spark.indextables.indexing.indexrecordoption.<field>` | - | Per-field index record option |
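The `<field>` placeholder is replaced with an actual column name. A sketch (all column names and the path are hypothetical):

```scala
// Sketch: per-field indexing choices at write time
df.write
  .format("indextables")
  .option("spark.indextables.indexing.typemap.title", "text")         // hypothetical "title" column
  .option("spark.indextables.indexing.tokenizer.title", "whitespace")
  .option("spark.indextables.indexing.typemap.payload", "json")       // hypothetical "payload" column
  .option("spark.indextables.indexing.fastfields", "timestamp,score") // hypothetical columns
  .save("s3://my-bucket/events")                                      // hypothetical path
```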
## Merge Settings

Settings for the `MERGE SPLITS` operation.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.merge.heapSize` | `1G` | Heap size for merge operations |
| `spark.indextables.merge.batchSize` | (defaultParallelism) | Merge groups per batch |
| `spark.indextables.merge.maxConcurrentBatches` | `2` | Maximum concurrent merge batches |
| `spark.indextables.merge.maxSourceSplitsPerMerge` | `1000` | Maximum source splits per merge operation |
| `spark.indextables.merge.tempDirectoryPath` | auto | Temporary directory for merge working files |
| `spark.indextables.merge.debug` | `false` | Enable merge debug logging |
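A sketch of session-wide tuning applied before running `MERGE SPLITS`:

```scala
// Sketch: give merges more heap and throttle concurrency
// before invoking the MERGE SPLITS command
spark.conf.set("spark.indextables.merge.heapSize", "2G")
spark.conf.set("spark.indextables.merge.maxConcurrentBatches", "1")
```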
### Merge-on-Write

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.mergeOnWrite.enabled` | `false` | Enable automatic merging during writes |
| `spark.indextables.mergeOnWrite.targetSize` | `4G` | Target merged split size |
## Purge Settings

Settings for the PURGE operation and automatic cleanup.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.purge.defaultRetentionHours` | `168` | Default retention period (7 days) |
| `spark.indextables.purge.minRetentionHours` | `24` | Minimum retention period (1 day) |
| `spark.indextables.purge.retentionCheckEnabled` | `true` | Enable retention validation |
| `spark.indextables.purge.parallelism` | auto | Parallelism for purge operations |
| `spark.indextables.purge.maxFilesToDelete` | `1000000` | Maximum files to delete per operation |
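A sketch that widens the retention window while leaving the validation check on:

```scala
// Sketch: lengthen purge retention from 7 to 14 days, keep the safety check
spark.conf.set("spark.indextables.purge.defaultRetentionHours", "336") // 14 days
spark.conf.set("spark.indextables.purge.retentionCheckEnabled", "true")
```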
### Purge-on-Write

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.purgeOnWrite.enabled` | `false` | Enable automatic purging during writes |
| `spark.indextables.purgeOnWrite.triggerAfterMerge` | `true` | Trigger purge after merge operations |
| `spark.indextables.purgeOnWrite.triggerAfterWrites` | `0` | Trigger purge after N writes (0 = disabled) |
| `spark.indextables.purgeOnWrite.splitRetentionHours` | `168` | Split file retention (7 days) |
| `spark.indextables.purgeOnWrite.txLogRetentionHours` | `720` | Transaction log retention (30 days) |
## Protocol Management

Advanced settings for protocol version management.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.protocol.checkEnabled` | `true` | Enable protocol version checking |
| `spark.indextables.protocol.autoUpgrade` | `false` | Automatically upgrade the protocol version |
| `spark.indextables.protocol.enforceReaderVersion` | `true` | Enforce the minimum reader version |
| `spark.indextables.protocol.enforceWriterVersion` | `true` | Enforce the minimum writer version |
## Skipped Files Tracking

Settings for handling problematic files during operations.

| Setting | Default | Description |
|---------|---------|-------------|
| `spark.indextables.skippedFiles.trackingEnabled` | `true` | Enable skipped files tracking |
| `spark.indextables.skippedFiles.cooldownDuration` | `24` | Hours before skipped files are retried |