Databricks Deployment

IndexTables is optimized for Databricks with automatic detection of local NVMe storage.

Installation

  1. Download the shaded JAR from Maven Central:
    https://repo1.maven.org/maven2/io/indextables/indextables_spark/0.5.5_spark_3.5.3/indextables_spark-0.5.5_spark_3.5.3-linux-x86_64-shaded.jar
  2. Upload it to a Unity Catalog volume (e.g., /Volumes/my_catalog/my_schema/artifacts/)
  3. Create an init script that copies the JAR to the Databricks jars directory:
#!/bin/sh
cp /Volumes/my_catalog/my_schema/artifacts/indextables_spark-0.5.5_spark_3.5.3-linux-x86_64-shaded.jar /databricks/jars
  4. Upload the init script to your volume and configure it in your cluster settings under Advanced Options > Init Scripts

Requirements

Component            Version
Databricks Runtime   15.4 LTS or 16.4 LTS
Scala                2.12

Register SQL Extensions

SET spark.sql.extensions=io.indextables.spark.extensions.IndexTables4SparkExtensions

Auto-Detected Settings

When /local_disk0 is detected, these settings are automatically configured:

  • Temp directory: /local_disk0/temp
  • Cache directory: /local_disk0/cache
  • Disk cache: Enabled
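
The detection behavior can be sketched as follows. This is a minimal illustration of the fallback logic, not the library's actual implementation; the function name and the `/tmp` fallback are assumptions:

```python
import os

def resolve_storage_dirs(nvme_root: str = "/local_disk0") -> dict:
    """Illustrative sketch: prefer local NVMe when present, else fall back to /tmp."""
    if os.path.isdir(nvme_root):
        return {
            "temp_dir": os.path.join(nvme_root, "temp"),
            "cache_dir": os.path.join(nvme_root, "cache"),
            "disk_cache_enabled": True,
        }
    # No NVMe detected: fall back to the system temp dir with disk cache off
    return {"temp_dir": "/tmp", "cache_dir": "/tmp", "disk_cache_enabled": False}
```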

Unity Catalog Integration

If you access S3 data through Unity Catalog External Locations, configure the Unity Catalog credential provider. We recommend setting these as cluster Spark properties:

spark.sql.extensions io.indextables.spark.extensions.IndexTables4SparkExtensions
spark.indextables.databricks.workspaceUrl https://<workspace>.cloud.databricks.com
spark.indextables.databricks.apiToken <your-token>
spark.indextables.aws.credentialsProviderClass io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider

Alternatively, configure in your notebook:

spark.conf.set("spark.indextables.databricks.workspaceUrl", "https://<workspace>.cloud.databricks.com")
spark.conf.set("spark.indextables.databricks.apiToken", dbutils.secrets.get("scope", "token"))

# Or use your notebook's token directly
spark.conf.set("spark.indextables.databricks.apiToken",
dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get())

spark.conf.set("spark.indextables.aws.credentialsProviderClass",
"io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider")

External Location Requirements

Your S3 path must be configured as a Unity Catalog External Location. The following are required:

  • The metastore must have external_access_enabled set to true
  • You must have the EXTERNAL_USE_LOCATION privilege on the external location

See the generateTemporaryPathCredentials API for details.
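
For example, the privilege can be granted with Unity Catalog SQL; the location and principal names below are placeholders:

GRANT EXTERNAL USE LOCATION ON EXTERNAL LOCATION my_external_location TO `data_engineers`;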

Credentials are resolved on the driver and broadcast to executors — no network calls from executors to Databricks.
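
The resolve-once pattern is analogous to the following plain-Python sketch, where a counter stands in for the Databricks API call; all names and values are illustrative:

```python
def make_resolver():
    """Stand-in for the driver-side call to the Databricks credentials API."""
    calls = {"count": 0}

    def resolve():
        calls["count"] += 1
        return {"access_key": "example-key", "secret_key": "example-secret"}

    return resolve, calls

resolve, calls = make_resolver()

# Driver: resolve once...
creds = resolve()

# ...then "broadcast": every simulated executor task reads the same pre-resolved
# credentials instead of calling the API itself.
results = [("task-%d" % i, creds["access_key"]) for i in range(4)]
```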

Photon

We recommend disabling Photon as it doesn't accelerate IndexTables workloads.

For all clusters, set the following to ensure caching and prewarming work properly:

spark.locality.wait 30s

Query Clusters

For query workloads, use instances with high memory and NVMe storage:

Instance Type   vCPUs   Memory   Storage
r6id.2xlarge    8       64 GB    NVMe
i4i.2xlarge     8       64 GB    NVMe

spark.executor.memory 27016m

Indexing Clusters

For write/indexing workloads, compute-optimized instances work well:

Instance Type   vCPUs   Memory   Storage
c6id.2xlarge    8       32 GB    NVMe

spark.executor.memory 16348m

Companion Mode with Unity Catalog

Companion Mode on Databricks supports Unity Catalog table name resolution for Delta tables. Pass a table name instead of a storage path, and IndexTables resolves the storage location and credentials automatically:

BUILD INDEXTABLES COMPANION FOR DELTA 'schema.events'
CATALOG 'my_catalog'
INDEXING MODES ('message':'text', 'src_ip':'ipaddress')
AT LOCATION 's3://warehouse/companion/events'

This requires the Unity Catalog credential provider to be configured (see Unity Catalog Integration above).

Scheduler Mode

Companion mode runs sync batches as concurrent Spark jobs. For this to work correctly, set spark.scheduler.mode=FAIR in your cluster configuration. This is the default on Databricks, but verify it is set if companion batches appear to run sequentially.
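
Driving several Spark jobs concurrently follows the usual thread-pool pattern; a generic Python sketch where the batch function is a placeholder, not an IndexTables API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_sync_batch(batch_id: int) -> str:
    # Placeholder for one companion sync batch; in a real notebook this would
    # trigger a Spark job, and FAIR scheduling lets those jobs share the cluster
    # instead of queuing behind one another.
    return f"batch-{batch_id}: ok"

# Submit several batches concurrently, as companion mode does with Spark jobs.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_sync_batch, range(4)))
```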

Performance Settings

# Recommended for Databricks
spark.conf.set("spark.indextables.indexWriter.heapSize", "200M")
spark.conf.set("spark.indextables.s3.maxConcurrency", "8")