Skip to main content

PREWARM CACHE

Pre-warm index caches across all executors to make queries instant and dramatically reduce object storage calls.

Why Prewarm?

Without prewarming, the first query on each executor must fetch index segments from object storage (S3/Azure), which can take 50-200ms per segment. Prewarming loads these segments into the local disk cache ahead of time, so queries execute in 1-5ms instead.

Benefits:

  • Instant queries: Eliminate cold-start latency
  • Reduced S3/Azure costs: 90%+ fewer GET requests
  • Predictable performance: No variance between first and subsequent queries
Storage Requirements

Prewarming requires fast local disks. Use NVMe instance storage (recommended) or a fast EBS volume class. Slow disks will negate the benefits of prewarming.

Syntax

PREWARM INDEXTABLES CACHE '<path>'
[FOR SEGMENTS (<segments>)]
[ON FIELDS (<fields>)]
[WITH PERWORKER PARALLELISM OF <n>]
[WHERE <partition_predicate>]

Segment Selection

Choose which index segments to prewarm based on your query patterns:

Basic Queries (Default)

For simple keyword searches, the default segments are sufficient:

-- Default: TERM_DICT, POSTINGS
PREWARM INDEXTABLES CACHE 's3://bucket/logs';

Range Queries and Aggregations

For queries with >, <, >=, <= filters or aggregations (COUNT, SUM, AVG, MIN, MAX), add FAST_FIELD and FIELD_NORM:

PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD, FIELD_NORM);

All Segments

For comprehensive prewarming (largest cache footprint):

PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD, FIELD_NORM, POSITIONS, DOC_STORE);

Field Selection for Wide Tables

For tables with many columns, prewarm only the fields used in your WHERE clauses to reduce cache usage:

-- Only prewarm fields used in filters
PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD)
ON FIELDS (timestamp, status, user_id, error_code);

This is especially important for wide tables (50+ columns) where prewarming all fields would consume excessive disk space.

Examples

Basic Prewarm

PREWARM INDEXTABLES CACHE 's3://bucket/logs';

Prewarm for Analytics Workloads

-- Optimized for COUNT, SUM, AVG and range filters
PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD, FIELD_NORM);

Prewarm Specific Partition

PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD)
WHERE date = '2024-01-15';

Full Options

PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, FAST_FIELD, POSTINGS, FIELD_NORM)
ON FIELDS (timestamp, status, message)
WITH PERWORKER PARALLELISM OF 4
WHERE region = 'us-east';

Async Mode

For large tables or production environments, use async mode to start prewarming in the background without blocking your session. This allows you to continue running queries while cache prewarming happens in separate threads.

Async Syntax

PREWARM INDEXTABLES CACHE '<path>'
[FOR SEGMENTS (<segments>)]
[ON FIELDS (<fields>)]
[WITH PERWORKER PARALLELISM OF <n>]
[WHERE <partition_predicate>]
ASYNC MODE

Starting an Async Prewarm

-- Start background prewarming
PREWARM INDEXTABLES CACHE 's3://bucket/logs' ASYNC MODE;

This returns immediately with a job ID that you can use to monitor progress:

job_idasync_modestatus
prewarm-abc123truestarted

Monitoring Async Jobs

Check the status of all async prewarm jobs across the cluster:

DESCRIBE INDEXTABLES PREWARM JOBS;

See DESCRIBE PREWARM JOBS for output details.

Waiting for Completion

If you need to block until async prewarming completes (e.g., before running benchmarks):

-- Wait indefinitely
WAIT FOR INDEXTABLES PREWARM JOBS;

-- Wait with timeout (in seconds)
WAIT FOR INDEXTABLES PREWARM JOBS TIMEOUT 300;

See WAIT FOR PREWARM JOBS for more details.

Async Examples

-- Start async prewarm for analytics workload
PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD, FIELD_NORM)
ASYNC MODE;

-- Start async prewarm for specific partition
PREWARM INDEXTABLES CACHE 's3://bucket/logs'
FOR SEGMENTS (TERM_DICT, POSTINGS)
WHERE date >= '2024-01-01'
ASYNC MODE;

-- Check job status
DESCRIBE INDEXTABLES PREWARM JOBS;

-- Wait for all jobs to complete before running benchmark
WAIT FOR INDEXTABLES PREWARM JOBS TIMEOUT 600;

Companion Mode Segments

When prewarming companion mode indexes, two additional segment types are available for parquet-backed data:

Parquet Fast Fields

For companion splits using HYBRID or PARQUET_ONLY fast field mode, prewarm parquet fast field data to accelerate aggregations:

PREWARM INDEXTABLES CACHE 's3://warehouse/events_index'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD, PARQUET_FAST_FIELDS);
Auto-Detection

When FAST_FIELD is requested on companion splits with HYBRID or PARQUET_ONLY mode, parquet fast fields are automatically included — you don't need to specify PARQUET_FAST_FIELDS explicitly.

Parquet Columns

For document retrieval workloads, prewarm the parquet column data:

PREWARM INDEXTABLES CACHE 's3://warehouse/events_index'
FOR SEGMENTS (TERM_DICT, POSTINGS, PARQUET_COLUMNS);

Full Companion Prewarm

PREWARM INDEXTABLES CACHE 's3://warehouse/events_index'
FOR SEGMENTS (TERM_DICT, POSTINGS, FAST_FIELD, PARQUET_FAST_FIELDS, PARQUET_COLUMNS)
ON FIELDS (timestamp, status, message);

Monitor Disk Cache Usage

Check how much disk space your cache is using:

DESCRIBE INDEXTABLES DISK CACHE;

This shows cache size, hit rate, and available capacity per executor.

Segment Reference

SQL NameDescriptionUse Case
TERM_DICT, TERMTerm dictionary (FST)All queries (default)
POSTINGS, POSTING_LISTSInverted index postingsAll queries (default)
FAST_FIELD, FASTFIELDFast fields for aggregationsRange queries, aggregations
FIELD_NORM, FIELDNORMField norms for scoringRange queries, aggregations
POSITIONS, POSITION_LISTSTerm positionsPhrase queries
DOC_STORE, STOREDocument storageRetrieving field values
PARQUET_FAST_FIELDS, PARQUET_FASTParquet fast field dataAggregations on companion splits
PARQUET_COLUMNS, PARQUET_COLSParquet column dataDocument retrieval on companion splits

Default segments: TERM_DICT, POSTINGS

Output

ColumnDescription
hostExecutor hostname
assigned_hostExpected hostname for locality
locality_hitsTasks that ran on correct host
locality_missesTasks that ran on wrong host
splits_prewarmedCount of splits prewarmed
segmentsComma-separated segment names
fieldsField names (or "all")
duration_msDuration in milliseconds
statussuccess, partial, no_splits
job_idAsync job identifier (async mode only)
async_modeWhether async mode was used

Configuration

spark.conf.set("spark.indextables.prewarm.splitsPerTask", "2")
spark.conf.set("spark.indextables.prewarm.maxRetries", "10")
spark.conf.set("spark.indextables.prewarm.failOnMissingField", "true")

Async Mode Configuration

// Maximum concurrent async prewarm jobs per worker (default: 1)
spark.conf.set("spark.indextables.prewarm.async.maxConcurrent", "2")

// How long to retain completed job metadata in milliseconds (default: 3600000 = 1 hour)
spark.conf.set("spark.indextables.prewarm.async.completedJobRetentionMs", "3600000")