Configuration Reference

A complete reference of all IndexTables configuration settings. Settings can be supplied in any of the following ways:

  • Write options: .option("spark.indextables.setting", "value") on DataFrameWriter
  • Read options: .option("spark.indextables.setting", "value") on DataFrameReader
  • Spark config: spark.conf.set("spark.indextables.setting", "value")
  • Cluster properties: Set in your cluster configuration

Preferred Configuration Method

For write operations, prefer .option() on the DataFrameWriter over session-level Spark properties. This keeps the configuration explicit, scopes it to a single write, and avoids unintentionally affecting other operations in the same session.
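
A minimal sketch in Scala (the format name "indextables", the DataFrame df, and the paths are assumptions for illustration, not confirmed by this reference):

```scala
// Preferred: scope the setting to a single write with .option().
df.write
  .format("indextables") // assumed format name
  .option("spark.indextables.indexWriter.heapSize", "1G")
  .save("s3://my-bucket/my-table")

// Alternative: session-wide config, which affects every IndexTables
// operation in this SparkSession.
spark.conf.set("spark.indextables.indexWriter.heapSize", "1G")
```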

Index Writer

Settings that control how data is indexed during write operations.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.indexWriter.heapSize | 100M | Memory allocated for indexing (supports "100M", "2G" syntax) |
| spark.indextables.indexWriter.batchSize | 10000 | Documents per batch during indexing (1-1000000) |
| spark.indextables.indexWriter.maxBatchBufferSize | 90M | Maximum buffer size before flush |
| spark.indextables.indexWriter.threads | 2 | Indexing threads per executor (1-16) |
| spark.indextables.indexWriter.tempDirectoryPath | auto | Working directory for index creation |
| spark.indextables.splitConversion.maxParallelism | auto | Parallelism for split conversion |
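
A sketch tuning the index writer on a single write (values are illustrative; valid ranges are in the table above, and the "indextables" format name is assumed):

```scala
df.write
  .format("indextables")
  .option("spark.indextables.indexWriter.heapSize", "2G")     // size syntax: "100M", "2G"
  .option("spark.indextables.indexWriter.batchSize", "50000") // 1-1000000
  .option("spark.indextables.indexWriter.threads", "4")       // per executor, 1-16
  .save("s3://my-bucket/events")
```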

Write Optimization

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.optimizeWrite.enabled | true | Enable optimized write with automatic partitioning |
| spark.indextables.optimizeWrite.targetRecordsPerSplit | 1000000 | Target records per split file |
| spark.indextables.autoSize.enabled | false | Enable auto-sizing based on historical data |
| spark.indextables.autoSize.targetSplitSize | 100M | Target split size (supports "100M", "1G" syntax) |
| spark.indextables.autoSize.inputRowCount | estimated | Explicit row count for V2 API |
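
For example, switching from the fixed records-per-split target to size-based auto-sizing (a sketch; the path is hypothetical):

```scala
df.write
  .format("indextables") // assumed format name
  .option("spark.indextables.autoSize.enabled", "true")
  .option("spark.indextables.autoSize.targetSplitSize", "100M") // "100M", "1G" syntax
  .save("s3://my-bucket/events")
```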

Transaction Log

Settings for transaction log management, checkpointing, and caching.

Checkpointing

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.checkpoint.enabled | true | Enable automatic checkpoints |
| spark.indextables.checkpoint.interval | 10 | Create checkpoint every N transactions |
| spark.indextables.checkpoint.parallelism | 8 | Thread pool size for parallel I/O |
| spark.indextables.checkpoint.read.timeoutSeconds | 30 | Timeout for parallel reads |
| spark.indextables.transaction.compression.enabled | true | Enable GZIP compression for transaction files |
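
Checkpointing settings are typically set session-wide; a sketch lowering checkpoint frequency (values illustrative):

```scala
// Create a checkpoint every 20 transactions instead of the default 10.
spark.conf.set("spark.indextables.checkpoint.interval", "20")
// Widen the thread pool used for parallel checkpoint I/O.
spark.conf.set("spark.indextables.checkpoint.parallelism", "16")
```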

Transaction Log Cache

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.transaction.cache.enabled | true | Enable transaction log caching |
| spark.indextables.transaction.cache.expirationSeconds | 300 | Cache TTL (5 minutes) |

Retention

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.logRetention.duration | 2592000000 | Log retention in milliseconds (30 days) |
| spark.indextables.checkpointRetention.duration | 7200000 | Checkpoint retention in milliseconds (2 hours) |
| spark.indextables.cleanup.enabled | true | Enable automatic file cleanup |
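
Both durations are raw milliseconds: the 30-day default is 30 × 24 × 60 × 60 × 1000 = 2592000000. A sketch shortening log retention to 7 days:

```scala
// 7 days = 7 * 24 * 60 * 60 * 1000 = 604800000 ms
spark.conf.set("spark.indextables.logRetention.duration", "604800000")
```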

Read Settings

Settings that control read operations and query execution.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.read.defaultLimit | 250 | Default result limit when no LIMIT clause |
| spark.indextables.docBatch.enabled | true | Enable batch document retrieval |
| spark.indextables.docBatch.maxSize | 1000 | Documents per batch (1-10000) |
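
Read-side settings go on the DataFrameReader; a sketch (format name and path assumed, as in the earlier examples):

```scala
val results = spark.read
  .format("indextables")
  .option("spark.indextables.read.defaultLimit", "1000") // applies when no LIMIT clause
  .option("spark.indextables.docBatch.maxSize", "2000")  // 1-10000
  .load("s3://my-bucket/events")
```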

Prescan Filtering

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.read.prescan.enabled | false | Enable prescan filtering |
| spark.indextables.read.prescan.minSplitThreshold | 2 × defaultParallelism | Minimum splits to trigger prescan |
| spark.indextables.read.prescan.maxConcurrency | 4 × availableProcessors | Maximum concurrent prescan threads |
| spark.indextables.read.prescan.timeoutMs | 30000 | Timeout per split in milliseconds |

Filter Pushdown

Settings that control which filter operations are pushed down to the index.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.filter.stringPattern.pushdown | false | Master switch - enables all string pattern filters |
| spark.indextables.filter.stringStartsWith.pushdown | false | Enable prefix matching (most efficient) |
| spark.indextables.filter.stringEndsWith.pushdown | false | Enable suffix matching (less efficient) |
| spark.indextables.filter.stringContains.pushdown | false | Enable substring matching (least efficient) |

String Pattern Performance

  • startsWith: Most efficient - uses sorted index terms
  • endsWith: Less efficient - requires term scanning
  • contains: Least efficient - cannot leverage index structure

Enable only the patterns you need for best performance.
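
A sketch enabling only the efficient prefix pattern and leaving the costlier ones off (the column name is illustrative):

```scala
import org.apache.spark.sql.functions.col

val titles = spark.read
  .format("indextables") // assumed format name
  .option("spark.indextables.filter.stringStartsWith.pushdown", "true")
  .load("s3://my-bucket/docs")
  .filter(col("title").startsWith("intro")) // now eligible for pushdown
```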

Prewarm Cache

Settings for the PREWARM CACHE SQL command.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.prewarm.enabled | false | Enable prewarm on read |
| spark.indextables.prewarm.segments | TERM_DICT,FAST_FIELD,POSTINGS,FIELD_NORM | Segments to prewarm |
| spark.indextables.prewarm.fields | (all) | Specific fields to prewarm |
| spark.indextables.prewarm.splitsPerTask | 2 | Splits per Spark task |
| spark.indextables.prewarm.partitionFilter | (empty) | Partition filter clause |
| spark.indextables.prewarm.failOnMissingField | true | Fail if specified field doesn't exist |
| spark.indextables.prewarm.catchUpNewHosts | false | Prewarm on new hosts added to cluster |
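
A sketch warming only term dictionaries and fast fields via session config (segment names taken from the table above):

```scala
spark.conf.set("spark.indextables.prewarm.enabled", "true")
spark.conf.set("spark.indextables.prewarm.segments", "TERM_DICT,FAST_FIELD")
spark.conf.set("spark.indextables.prewarm.splitsPerTask", "2")
```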

Disk Cache (L2)

Settings for the L2 disk cache on NVMe storage.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.cache.disk.enabled | auto | Enable disk cache (auto-enabled when NVMe detected) |
| spark.indextables.cache.disk.path | auto | Cache directory path |
| spark.indextables.cache.disk.maxSize | 0 (auto) | Maximum cache size (0 = auto, 2/3 of disk) |
| spark.indextables.cache.disk.manifestSyncInterval | 30 | Seconds between manifest writes |
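
A sketch pinning the cache to an explicit NVMe mount (the path is hypothetical, and the "500G" size syntax is an assumption based on the other size settings):

```scala
spark.conf.set("spark.indextables.cache.disk.enabled", "true")
spark.conf.set("spark.indextables.cache.disk.path", "/mnt/nvme/indextables-cache")
spark.conf.set("spark.indextables.cache.disk.maxSize", "500G") // size syntax assumed
```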

In-Memory Cache

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.cache.maxSize | 200000000 | Maximum in-memory cache size in bytes |
| spark.indextables.cache.directoryPath | auto | Custom cache directory |
| spark.indextables.cache.prewarm.enabled | false | Enable proactive cache warming |

S3 Configuration

Settings for Amazon S3 storage access.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.aws.accessKey | - | AWS access key ID |
| spark.indextables.aws.secretKey | - | AWS secret access key |
| spark.indextables.aws.sessionToken | - | AWS session token (for temporary credentials) |
| spark.indextables.aws.credentialsProviderClass | - | Custom credential provider class (FQN) |
| spark.indextables.s3.maxConcurrency | 4 | Parallel upload threads (1-32) |
| spark.indextables.s3.partSize | 64M | Multipart upload part size |
| spark.indextables.s3.streamingThreshold | 100M | Threshold for streaming upload |
| spark.indextables.s3.multipartThreshold | 100M | Threshold for multipart upload |

AWS Credentials

If no credentials are configured, IndexTables automatically uses the EC2 instance's IAM role. This is the recommended approach for production.
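
When explicit credentials are unavoidable, pass them as write options and source them from the environment or a secret manager, never hard-coded. A sketch:

```scala
df.write
  .format("indextables") // assumed format name
  .option("spark.indextables.aws.accessKey", sys.env("AWS_ACCESS_KEY_ID"))
  .option("spark.indextables.aws.secretKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  .option("spark.indextables.s3.maxConcurrency", "8") // upload threads, 1-32
  .save("s3://my-bucket/events")
```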

Azure Configuration

Settings for Azure Blob Storage access.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.azure.accountName | - | Storage account name |
| spark.indextables.azure.accountKey | - | Storage account key |
| spark.indextables.azure.connectionString | - | Full connection string |
| spark.indextables.azure.tenantId | - | Azure AD tenant ID for OAuth |
| spark.indextables.azure.clientId | - | Service Principal client ID |
| spark.indextables.azure.clientSecret | - | Service Principal client secret |
| spark.indextables.azure.bearerToken | - | Explicit OAuth bearer token |
| spark.indextables.azure.endpoint | - | Custom Azure endpoint |
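
A sketch of Service Principal (OAuth) authentication; all identifiers are placeholders:

```scala
spark.conf.set("spark.indextables.azure.accountName", "mystorageaccount")
spark.conf.set("spark.indextables.azure.tenantId", "<tenant-id>")
spark.conf.set("spark.indextables.azure.clientId", "<client-id>")
// Pull the secret from the environment rather than source control.
spark.conf.set("spark.indextables.azure.clientSecret", sys.env("AZURE_CLIENT_SECRET"))
```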

Databricks Unity Catalog

Settings for Unity Catalog credential provider integration.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.databricks.workspaceUrl | - | Databricks workspace URL (required) |
| spark.indextables.databricks.apiToken | - | Databricks API token (required) |
| spark.indextables.databricks.credential.refreshBuffer.minutes | 40 | Minutes before expiration to refresh credentials |
| spark.indextables.databricks.cache.maxSize | 100 | Maximum cached credential entries |
| spark.indextables.databricks.fallback.enabled | true | Fallback to READ if READ_WRITE fails |
| spark.indextables.databricks.retry.attempts | 3 | Retry attempts on API failure |
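
A sketch wiring up the two required settings (the workspace URL is a placeholder; keep the token out of source control):

```scala
spark.conf.set("spark.indextables.databricks.workspaceUrl",
  "https://dbc-example.cloud.databricks.com") // placeholder URL
spark.conf.set("spark.indextables.databricks.apiToken", sys.env("DATABRICKS_TOKEN"))
```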

See Unity Catalog Configuration for setup instructions.

Field Indexing

Settings that control how fields are indexed.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.indexing.typemap.<field> | string | Field indexing type: string, text, json |
| spark.indextables.indexing.fastfields | (auto) | Comma-separated list of fast fields |
| spark.indextables.indexing.storeonlyfields | (empty) | Fields stored but not indexed |
| spark.indextables.indexing.indexonlyfields | (empty) | Fields indexed but not stored |
| spark.indextables.indexing.tokenizer.<field> | default | Tokenizer: default, whitespace, raw |
| spark.indextables.indexing.json.mode | full | JSON indexing mode |
| spark.indextables.indexing.text.indexRecordOption | position | Index record option |
| spark.indextables.indexing.indexrecordoption.<field> | - | Per-field index record option |
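
A sketch of per-field indexing options on a write; the field names are illustrative:

```scala
df.write
  .format("indextables") // assumed format name
  .option("spark.indextables.indexing.typemap.body", "text")        // full-text field
  .option("spark.indextables.indexing.tokenizer.body", "whitespace")
  .option("spark.indextables.indexing.fastfields", "timestamp,score")
  .option("spark.indextables.indexing.storeonlyfields", "payload")  // stored, not indexed
  .save("s3://my-bucket/docs")
```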

Merge Settings

Settings for the MERGE SPLITS operation.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.merge.heapSize | 1G | Heap size for merge operations |
| spark.indextables.merge.batchSize | (defaultParallelism) | Merge groups per batch |
| spark.indextables.merge.maxConcurrentBatches | 2 | Maximum concurrent merge batches |
| spark.indextables.merge.maxSourceSplitsPerMerge | 1000 | Maximum source splits per merge operation |
| spark.indextables.merge.tempDirectoryPath | auto | Temporary directory for merge working files |
| spark.indextables.merge.debug | false | Enable merge debug logging |
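
A sketch raising merge resources session-wide before running MERGE SPLITS (the SQL syntax itself is not covered here):

```scala
spark.conf.set("spark.indextables.merge.heapSize", "2G")
spark.conf.set("spark.indextables.merge.maxConcurrentBatches", "4")
```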

Merge-on-Write

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.mergeOnWrite.enabled | false | Enable automatic merging during writes |
| spark.indextables.mergeOnWrite.targetSize | 4G | Target merged split size |

Purge Settings

Settings for the PURGE operation and automatic cleanup.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.purge.defaultRetentionHours | 168 | Default retention period (7 days) |
| spark.indextables.purge.minRetentionHours | 24 | Minimum retention period (1 day) |
| spark.indextables.purge.retentionCheckEnabled | true | Enable retention validation |
| spark.indextables.purge.parallelism | auto | Parallelism for purge operations |
| spark.indextables.purge.maxFilesToDelete | 1000000 | Maximum files to delete per operation |

Purge-on-Write

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.purgeOnWrite.enabled | false | Enable automatic purging during writes |
| spark.indextables.purgeOnWrite.triggerAfterMerge | true | Trigger purge after merge operations |
| spark.indextables.purgeOnWrite.triggerAfterWrites | 0 | Trigger purge after N writes (0 = disabled) |
| spark.indextables.purgeOnWrite.splitRetentionHours | 168 | Split file retention (7 days) |
| spark.indextables.purgeOnWrite.txLogRetentionHours | 720 | Transaction log retention (30 days) |

Protocol Management

Advanced settings for protocol version management.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.protocol.checkEnabled | true | Enable protocol version checking |
| spark.indextables.protocol.autoUpgrade | false | Automatically upgrade protocol version |
| spark.indextables.protocol.enforceReaderVersion | true | Enforce minimum reader version |
| spark.indextables.protocol.enforceWriterVersion | true | Enforce minimum writer version |

Skipped Files Tracking

Settings for handling problematic files during operations.

| Setting | Default | Description |
|---------|---------|-------------|
| spark.indextables.skippedFiles.trackingEnabled | true | Enable skipped files tracking |
| spark.indextables.skippedFiles.cooldownDuration | 24 | Hours before retrying skipped files |