L2 Disk Cache

Persistent caching on local NVMe storage, for 10-50x faster repeated queries.

Auto-Detection

The disk cache is enabled automatically when /local_disk0 (the local NVMe storage mount on Databricks and EMR) is detected.
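
As a rough illustration of the kind of check auto-detection implies, the following Scala sketch tests whether a local cache directory exists and is writable. The `DiskCacheDetection` object and its `hasLocalDisk` helper are hypothetical names introduced here; the library's actual detection logic is not described on this page and may differ.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper for illustration; not part of the IndexTables API.
object DiskCacheDetection {
  // True when `path` is an existing, writable directory.
  def hasLocalDisk(path: String = "/local_disk0"): Boolean = {
    val dir = Paths.get(path)
    Files.isDirectory(dir) && Files.isWritable(dir)
  }
}
```

If you prefer not to rely on auto-detection, the result could be passed to the explicit flag, for example `spark.conf.set("spark.indextables.cache.disk.enabled", DiskCacheDetection.hasLocalDisk().toString)`. Note that this check runs on whichever node executes it (typically the driver), so executors with a different disk layout would need their own check.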

Configuration

// Explicitly enable/disable
spark.conf.set("spark.indextables.cache.disk.enabled", "true")

// Cache path
spark.conf.set("spark.indextables.cache.disk.path",
  "/local_disk0/tantivy4spark_slicecache")

// Maximum size (0 = auto, which uses 2/3 of available disk)
spark.conf.set("spark.indextables.cache.disk.maxSize", "100G")

// Compression (lz4, zstd, none)
spark.conf.set("spark.indextables.cache.disk.compression", "lz4")
spark.conf.set("spark.indextables.cache.disk.minCompressSize", "4K")

// Manifest sync interval (seconds)
spark.conf.set("spark.indextables.cache.disk.manifestSyncInterval", "30")
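
To get a feel for what the auto setting (maxSize of 0) would choose on a given node, the sketch below computes two thirds of the usable space on the volume holding the cache path. `DiskCacheSizing` and its methods are hypothetical names, and treating "available disk" as the JVM-reported usable space is an assumption; the library may measure it differently.

```scala
import java.io.File

// Hypothetical helper for illustration; not part of the IndexTables API.
object DiskCacheSizing {
  // Two thirds of a byte count, using integer arithmetic (rounds down).
  def twoThirds(usableBytes: Long): Long = usableBytes / 3 * 2

  // Two thirds of the usable space on the volume containing `path`.
  // getUsableSpace returns 0 when the path does not name a partition.
  def autoMaxSizeBytes(path: String): Long =
    twoThirds(new File(path).getUsableSpace)
}
```

For example, `DiskCacheSizing.autoMaxSizeBytes("/local_disk0")` on a node with about 300 GB usable would return roughly 200 GB, expressed in bytes. Comparing that figure with an explicit value such as "100G" can help decide whether to cap the cache manually.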

Monitoring

DESCRIBE INDEXTABLES DISK CACHE;

Flushing

FLUSH INDEXTABLES DISK CACHE;

Benefits

  • Persistent: Survives JVM restarts
  • Fast: NVMe speeds vs S3 latency
  • Automatic: No manual warming needed
  • Compressed: LZ4 reduces storage usage

When NOT to Use

  • Spinning disk systems (no benefit, may hurt performance)
  • Ephemeral clusters with no local storage
  • One-time queries (cache won't be reused)

To Disable

spark.conf.set("spark.indextables.cache.disk.enabled", "false")