
Databricks Deployment

IndexTables is optimized for Databricks with automatic detection of local NVMe storage.

Installation

  1. Download the shaded JAR from the releases page
  2. Upload it to a Unity Catalog volume (e.g., /Volumes/my_catalog/my_schema/artifacts/)
  3. Create an init script that copies the JAR to the Databricks jars directory:
#!/bin/sh
cp /Volumes/my_catalog/my_schema/artifacts/indextables_spark-0.4.0-linux-x86_64-shaded.jar /databricks/jars
  4. Upload the init script to your volume and configure it in your cluster settings under Advanced Options > Init Scripts
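
After the cluster restarts with the init script attached, it is worth confirming that the JAR actually landed in /databricks/jars. A minimal notebook check (this only inspects the driver node, and the file-name match is a loose heuristic):

import os

# List anything in the Databricks JAR directory that looks like the IndexTables shaded JAR.
jar_dir = "/databricks/jars"
matches = [f for f in os.listdir(jar_dir) if "indextables" in f.lower()]
print(matches if matches else "IndexTables JAR not found - check the init script logs")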

Requirements

Component       Version
Apache Spark    3.5.3
Java            11 or later
Scala           2.12

Register SQL Extensions

Add the extensions class to your cluster's Spark configuration (spark.sql.extensions is a static setting and must be in place when the Spark session is created):

spark.sql.extensions io.indextables.spark.extensions.IndexTables4SparkExtensions
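
To confirm the setting took effect on a running cluster, read the value back from the session. This only verifies the configuration; a quick query against an IndexTables table is the definitive check:

# Prints the configured extensions class, or "not set" if the cluster config is missing.
print(spark.conf.get("spark.sql.extensions", "not set"))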

Auto-Detected Settings

When /local_disk0 is detected, these settings are automatically configured:

  • Temp directory: /local_disk0/temp
  • Cache directory: /local_disk0/cache
  • Disk cache: Enabled
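
To check whether a cluster's node type exposes the local NVMe mount that triggers these defaults, a quick driver-side check is enough (executors of the same instance type will normally match the driver):

import os

# /local_disk0 is the local NVMe mount on NVMe-backed Databricks instance types.
if os.path.isdir("/local_disk0"):
    print("Local NVMe detected: IndexTables will use /local_disk0/temp and /local_disk0/cache")
else:
    print("No /local_disk0 mount: the disk cache will not be enabled automatically")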

Unity Catalog Integration

If using Unity Catalog External Locations to access S3 data, configure the Unity Catalog credential provider. We recommend setting these as cluster Spark properties:

spark.indextables.databricks.workspaceUrl https://<workspace>.cloud.databricks.com
spark.indextables.databricks.apiToken <your-token>
spark.indextables.aws.credentialsProviderClass io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider
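
To keep the token out of plain-text cluster properties, a Databricks secret reference can be used as the property value (the scope and key names below are placeholders):

spark.indextables.databricks.apiToken {{secrets/my_scope/my_token}}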

Alternatively, configure in your notebook:

spark.conf.set("spark.indextables.databricks.workspaceUrl", "https://<workspace>.cloud.databricks.com")
spark.conf.set("spark.indextables.databricks.apiToken", dbutils.secrets.get("scope", "token"))

# Or use your notebook's token directly
spark.conf.set("spark.indextables.databricks.apiToken",
dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get())

spark.conf.set("spark.indextables.aws.credentialsProviderClass",
"io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider")
External Location Requirements

Your S3 path must be configured as a Unity Catalog External Location. The following are required:

  • The metastore must have external_access_enabled set to true
  • You must have the EXTERNAL_USE_LOCATION privilege on the external location

See the generateTemporaryPathCredentials API for details.
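
If the privilege is missing, the location owner (or another principal allowed to manage grants on it) can add it. A sketch of the grant, assuming a hypothetical external location my_external_location and group engineers; verify the exact syntax against your workspace's Unity Catalog documentation:

# Hypothetical names; this grants the EXTERNAL_USE_LOCATION privilege required for path credential vending.
spark.sql("""
  GRANT EXTERNAL USE LOCATION
  ON EXTERNAL LOCATION my_external_location
  TO `engineers`
""")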

Credentials are resolved on the driver and broadcast to executors — no network calls from executors to Databricks.

For all clusters, set the following to ensure caching and prewarming work properly:

spark.locality.wait 30s

Query Clusters

For query workloads, use instances with high memory and NVMe storage:

Instance Type   vCPUs   Memory   Storage
r6id.2xlarge    8       64 GB    NVMe
i4i.2xlarge     8       64 GB    NVMe

Recommended executor memory:

spark.executor.memory 27016m

Indexing Clusters

For write/indexing workloads, compute-optimized instances work well:

Instance Type   vCPUs   Memory   Storage
c6id.2xlarge    8       32 GB    NVMe

Recommended executor memory:

spark.executor.memory 16348m

Performance Settings

# Recommended for Databricks
spark.conf.set("spark.indextables.indexWriter.heapSize", "200M")
spark.conf.set("spark.indextables.s3.maxConcurrency", "8")

Photon Compatibility

IndexTables works with Photon-enabled clusters. Aggregations and filters are pushed down before Photon processes results.
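
One way to observe the pushdown on a Photon-enabled cluster is to inspect a query plan. The table name below is a placeholder, and the exact plan text varies by Spark and Photon version:

# Any IndexTables-backed table works here; the name is hypothetical.
df = spark.table("my_catalog.my_schema.my_index_table")

# The scan node of the physical plan should list the pushed filter (and, where supported,
# the pushed aggregation), so Photon only receives the already-reduced results.
df.filter("level = 'ERROR'").groupBy("service").count().explain()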