Your First Production Index

This guide walks through creating a production-ready index on cloud storage.

Configure Cloud Access

S3

If no credentials are configured, IndexTables will automatically use the EC2 instance's IAM role. This is the recommended approach for production.

To explicitly configure credentials:

spark.conf.set("spark.indextables.aws.accessKey", "YOUR_ACCESS_KEY")
spark.conf.set("spark.indextables.aws.secretKey", "YOUR_SECRET_KEY")

Azure

spark.conf.set("spark.indextables.azure.accountName", "YOUR_ACCOUNT")
spark.conf.set("spark.indextables.azure.accountKey", "YOUR_KEY")

Databricks Unity Catalog

If using Unity Catalog External Locations to access S3 data, configure the Unity Catalog credential provider. We recommend setting these as cluster Spark properties:

spark.indextables.databricks.workspaceUrl https://<workspace>.cloud.databricks.com
spark.indextables.databricks.apiToken <your-token>
spark.indextables.aws.credentialsProviderClass io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider

Alternatively, you can configure these in your notebook:

spark.conf.set("spark.indextables.databricks.workspaceUrl", "https://<workspace>.cloud.databricks.com")
spark.conf.set("spark.indextables.databricks.apiToken", dbutils.secrets.get("scope", "token"))

// Or use your notebook's token directly
spark.conf.set("spark.indextables.databricks.apiToken",
dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get())

spark.conf.set("spark.indextables.aws.credentialsProviderClass",
"io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider")

External Location Requirements

Your S3 path must be configured as a Unity Catalog External Location. The following are required:

  • The metastore must have external_access_enabled set to true
  • You must have the EXTERNAL_USE_LOCATION privilege on the external location

See the generateTemporaryPathCredentials API for details.

Credentials are resolved on the driver and broadcast to executors — no network calls from executors to Databricks.
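
If the EXTERNAL_USE_LOCATION privilege has not been granted yet, an admin can grant it on the external location. Below is a minimal sketch of the Unity Catalog GRANT statement; logs_location and data_engineers are placeholder names, not values from this guide.

// Placeholder names: substitute your own external location and principal
spark.sql(
  "GRANT EXTERNAL USE LOCATION ON EXTERNAL LOCATION logs_location TO data_engineers")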

Write with Partitioning
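
The write below assumes a DataFrame named df with date, message, timestamp, and severity columns. A minimal sketch that builds such a DataFrame for experimentation (the sample rows are purely illustrative):

import spark.implicits._

// Illustrative sample data matching the columns used in this guide
val df = Seq(
  ("2024-01-15", "database connection error while opening pool", "2024-01-15T10:32:00Z", "ERROR"),
  ("2024-01-15", "request completed", "2024-01-15T10:33:00Z", "INFO")
).toDF("date", "message", "timestamp", "severity")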

df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .partitionBy("date")
  // Index the message column as tokenized text for full-text search
  .option("spark.indextables.indexing.typemap.message", "text")
  // Columns used in aggregations should be configured as fast fields
  .option("spark.indextables.indexing.fastfields", "timestamp,severity")
  .save("s3://my-bucket/logs")

Query with Partition Pruning

val logs = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("s3://my-bucket/logs")

// Partition filter + full-text search
logs.filter($"date" === "2024-01-15")
  .filter($"message" indexquery "error AND database")
  .show()
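
To confirm that the date filter is pruning partitions, you can inspect the physical plan; a minimal sketch (the exact plan output depends on your Spark version):

// Pushed partition filters typically appear in the scan node of the plan
logs.filter($"date" === "2024-01-15")
  .filter($"message" indexquery "error AND database")
  .explain()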

Monitor with SQL Commands

-- Check disk cache usage
DESCRIBE INDEXTABLES DISK CACHE;

-- View storage statistics
DESCRIBE INDEXTABLES STORAGE STATS;
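
The same commands can also be issued from Scala through the SQL API; a minimal sketch:

// Run the monitoring commands and print the full result rows
spark.sql("DESCRIBE INDEXTABLES DISK CACHE").show(truncate = false)
spark.sql("DESCRIBE INDEXTABLES STORAGE STATS").show(truncate = false)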

Performance Tips

  1. Use partitioning for time-series data
  2. Configure fast fields for aggregation columns (see the aggregation sketch after this list)
  3. Enable L2 disk cache on Databricks/EMR (auto-enabled on NVMe)
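
Tip 2 applies directly to the example in this guide: timestamp and severity were declared as fast fields at write time, so aggregating over them is the intended pattern. A minimal sketch reusing the logs DataFrame from the query section:

// Aggregate over the fast-field column severity within a single partition
logs.filter($"date" === "2024-01-15")
  .groupBy($"severity")
  .count()
  .show()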

IndexTables benefits from NVMe storage for the L2 disk cache. Here are configurations we've tested.

For all clusters, set the following to ensure caching and prewarming work properly:

spark.locality.wait 30s

Query Clusters

For query workloads, use instances with high memory and NVMe storage:

Instance Type    vCPUs    Memory    Storage
r6id.2xlarge     8        64 GB     NVMe
i4i.2xlarge      8        64 GB     NVMe

Suggested executor memory for these instances:

spark.executor.memory 27016m

Indexing Clusters

For write/indexing workloads, compute-optimized instances work well:

Instance Type    vCPUs    Memory    Storage
c6id.2xlarge     8        32 GB     NVMe

Suggested executor memory for these instances:

spark.executor.memory 16348m

EMR NVMe Configuration

IndexTables auto-enables NVMe caching when /local_disk0 is detected (standard on Databricks). On EMR, NVMe is typically mounted elsewhere, so use one of the following options:

Option 1: Create a symlink (recommended)

In your EMR bootstrap script, create a symlink to your NVMe mount:

ln -s /mnt/nvme /local_disk0

Option 2: Configure paths manually

Set the temp and cache directories explicitly:

spark.indextables.cache.disk.path /mnt/nvme/indextables_cache
spark.indextables.indexWriter.tempDirectoryPath /mnt/nvme/indextables_temp
spark.indextables.merge.tempDirectoryPath /mnt/nvme/indextables_merge
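
These keys can also be set from a notebook session using the spark.conf.set pattern shown earlier in this guide; a sketch is below. Cluster-level configuration is generally preferable, since the cache and temp directories may be initialized before session-level settings take effect.

// Same keys and paths as the cluster-level configuration above, set on the active session
spark.conf.set("spark.indextables.cache.disk.path", "/mnt/nvme/indextables_cache")
spark.conf.set("spark.indextables.indexWriter.tempDirectoryPath", "/mnt/nvme/indextables_temp")
spark.conf.set("spark.indextables.merge.tempDirectoryPath", "/mnt/nvme/indextables_merge")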

Next Steps