# Databricks Deployment
IndexTables is optimized for Databricks with automatic detection of local NVMe storage.
## Installation

- Download the shaded JAR from the releases page.
- Upload it to a Unity Catalog volume (e.g., `/Volumes/my_catalog/my_schema/artifacts/`).
- Create an init script that copies the JAR to the Databricks jars directory:

  ```sh
  #!/bin/sh
  cp /Volumes/my_catalog/my_schema/artifacts/indextables_spark-0.4.0-linux-x86_64-shaded.jar /databricks/jars
  ```

- Upload the init script to your volume and configure it in your cluster settings under Advanced Options > Init Scripts (a notebook sketch for this step follows below).
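If you prefer to create the init script from a notebook rather than uploading a file, here is a minimal sketch using `dbutils.fs.put`; the volume path reuses the example above, and the script filename is a placeholder you should adapt to your environment.

```python
# Write the init script into a Unity Catalog volume from a notebook.
# The filename "install-indextables.sh" is a placeholder; the volume path
# reuses the example from this guide.
init_script = """#!/bin/sh
cp /Volumes/my_catalog/my_schema/artifacts/indextables_spark-0.4.0-linux-x86_64-shaded.jar /databricks/jars
"""

dbutils.fs.put(
    "/Volumes/my_catalog/my_schema/artifacts/install-indextables.sh",
    init_script,
    True,  # overwrite
)
```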
## Requirements
| Component | Version |
|---|---|
| Apache Spark | 3.5.3 |
| Java | 11 or later |
| Scala | 2.12 |
## Register SQL Extensions

```
SET spark.sql.extensions=io.indextables.spark.extensions.IndexTables4SparkExtensions
```
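Because `spark.sql.extensions` is applied when the SparkSession is built, it is typically set in the cluster's Spark config (Advanced Options > Spark) rather than at runtime. A quick sanity check from a notebook, as a sketch:

```python
# Confirm the IndexTables SQL extension is registered on the running session.
ext = spark.conf.get("spark.sql.extensions", "")
assert "IndexTables4SparkExtensions" in ext, f"extension not registered: {ext!r}"
```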
## Auto-Detected Settings

When `/local_disk0` is detected, these settings are automatically configured:

- Temp directory: `/local_disk0/temp`
- Cache directory: `/local_disk0/cache`
- Disk cache: enabled
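A quick way to confirm that NVMe-backed local storage is present is to check for `/local_disk0`. This is only a sketch: detection happens per node, so a notebook check covers the driver, not the executors.

```python
import os

# /local_disk0 is the local scratch volume on Databricks instance types with
# attached NVMe disks; IndexTables auto-detection keys off its presence.
if os.path.isdir("/local_disk0"):
    print("Local NVMe detected; temp/cache will default to /local_disk0")
else:
    print("No /local_disk0 found; consider an instance type with NVMe storage")
```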
## Unity Catalog Integration

If using Unity Catalog External Locations to access S3 data, configure the Unity Catalog credential provider. We recommend setting these as cluster Spark properties:

```
spark.indextables.databricks.workspaceUrl https://<workspace>.cloud.databricks.com
spark.indextables.databricks.apiToken <your-token>
spark.indextables.aws.credentialsProviderClass io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider
```
Alternatively, configure in your notebook:

```python
spark.conf.set("spark.indextables.databricks.workspaceUrl", "https://<workspace>.cloud.databricks.com")
spark.conf.set("spark.indextables.databricks.apiToken", dbutils.secrets.get("scope", "token"))

# Or use your notebook's token directly
spark.conf.set(
    "spark.indextables.databricks.apiToken",
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get(),
)

spark.conf.set(
    "spark.indextables.aws.credentialsProviderClass",
    "io.indextables.spark.auth.unity.UnityCatalogAWSCredentialProvider",
)
```
Your S3 path must be configured as a Unity Catalog External Location. The following are required:

- The metastore must have `external_access_enabled` set to `true`
- You must have the `EXTERNAL_USE_LOCATION` privilege on the external location

See the `generateTemporaryPathCredentials` API for details.
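To check that credential vending works for your path before wiring it into IndexTables, you can call the temporary path credentials API directly from a notebook. The sketch below makes assumptions about the endpoint version and payload based on the `generateTemporaryPathCredentials` documentation, so verify them against the API reference for your workspace.

```python
import requests

workspace_url = "https://<workspace>.cloud.databricks.com"
token = dbutils.secrets.get("scope", "token")

# Assumed endpoint and payload for generateTemporaryPathCredentials; confirm
# against the Unity Catalog API reference before relying on this.
resp = requests.post(
    f"{workspace_url}/api/2.1/unity-catalog/temporary-path-credentials",
    headers={"Authorization": f"Bearer {token}"},
    json={"url": "s3://my-bucket/my-external-location/path", "operation": "PATH_READ"},
)
resp.raise_for_status()
print(resp.json())  # should contain short-lived AWS credentials for the path
```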
Credentials are resolved on the driver and broadcast to executors — no network calls from executors to Databricks.
## Recommended Cluster Configuration

For all clusters, set the following to ensure caching and prewarming work properly:

```
spark.locality.wait 30s
```
### Query Clusters

For query workloads, use instances with high memory and NVMe storage:

| Instance Type | vCPUs | Memory | Storage |
|---|---|---|---|
| r6id.2xlarge | 8 | 64 GB | NVMe |
| i4i.2xlarge | 8 | 64 GB | NVMe |

```
spark.executor.memory 27016m
```
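As one way to capture these recommendations in code, the sketch below creates a query cluster through the Clusters API. The cluster name, runtime version, init-script filename, and API version are placeholders or assumptions; adapt them to your workspace and check the Clusters API reference for the exact fields.

```python
import requests

workspace_url = "https://<workspace>.cloud.databricks.com"
token = dbutils.secrets.get("scope", "token")

cluster_spec = {
    "cluster_name": "indextables-query",      # placeholder name
    "spark_version": "15.4.x-scala2.12",      # placeholder; pick a runtime matching the Spark/Scala requirements above
    "node_type_id": "r6id.2xlarge",           # memory-optimized with NVMe
    "num_workers": 4,
    "spark_conf": {
        "spark.executor.memory": "27016m",
        "spark.locality.wait": "30s",
        "spark.sql.extensions": "io.indextables.spark.extensions.IndexTables4SparkExtensions",
    },
    "init_scripts": [
        # Volume-based init script; filename matches the placeholder used earlier.
        {"volumes": {"destination": "/Volumes/my_catalog/my_schema/artifacts/install-indextables.sh"}}
    ],
}

# API version may differ in your workspace; see the Clusters API reference.
resp = requests.post(
    f"{workspace_url}/api/2.1/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json().get("cluster_id"))
```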
### Indexing Clusters

For write/indexing workloads, compute-optimized instances work well:

| Instance Type | vCPUs | Memory | Storage |
|---|---|---|---|
| c6id.2xlarge | 8 | 32 GB | NVMe |

```
spark.executor.memory 16348m
```
## Performance Settings

```python
# Recommended for Databricks
spark.conf.set("spark.indextables.indexWriter.heapSize", "200M")
spark.conf.set("spark.indextables.s3.maxConcurrency", "8")
```
## Photon Compatibility
IndexTables works with Photon-enabled clusters. Aggregations and filters are pushed down before Photon processes results.
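For illustration, in a query like the one below the filter and the aggregation are candidates for pushdown into IndexTables, so only the reduced results reach Photon. The data source short name, table path, and column names are illustrative assumptions, not taken from this page.

```python
from pyspark.sql import functions as F

# Hypothetical short name and path; substitute the format and location you use
# elsewhere in your IndexTables setup.
df = spark.read.format("indextables").load("s3://my-bucket/my-index-table")

# Filter and count are evaluated by IndexTables before Photon sees the results.
result = df.filter(F.col("status") == "ERROR").groupBy("service").count()
result.show()
```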