Companion Mode

Build full-text search indexes over your existing Delta Lake tables, Apache Iceberg tables, or Parquet datasets — without duplicating data.

Overview

Companion Mode creates index-only splits that reference the parquet files already backing your table. The Tantivy inverted index (term dictionaries, postings lists, positions) lives in the companion split. Column data stays exactly where it is — in your Delta table, Iceberg table, or Parquet directory.

Key benefits:

  • No data duplication — the source table remains the single system of record
  • 45–70% smaller indexes — companion splits contain only index structures, not document data
  • Incremental sync — re-running the command indexes only new or changed files
  • Transparent reads — queries work identically to standalone IndexTables; companion mode is auto-detected
  • Format-agnostic — same model for Delta Lake, Apache Iceberg, and raw Parquet

Supported Source Formats

| Format | Source Identifier | Example |
| --- | --- | --- |
| Delta Lake | Storage path or Unity Catalog table name | 's3://warehouse/events' or 'schema.events' |
| Apache Iceberg | Namespace-qualified table name | 'prod.web_events' |
| Parquet | Directory path | 's3://logs/firewall/' |

Syntax

Delta Lake (Path-Based)

BUILD INDEXTABLES COMPANION FOR DELTA '<storage_path>'
[INDEXING MODES ('<field>':'<mode>' [, ...])]
[FASTFIELDS MODE {HYBRID | PARQUET_ONLY | DISABLED}]
[HASHED FASTFIELDS {INCLUDE | EXCLUDE} ('<field>' [, ...])]
[TARGET INPUT SIZE <size>]
[WRITER HEAP SIZE <size>]
[FROM VERSION <number>]
[WHERE <partition_predicates>]
[INVALIDATE ALL PARTITIONS]
AT LOCATION '<destination_path>'
[DRY RUN]

Delta Lake (Unity Catalog)

BUILD INDEXTABLES COMPANION FOR DELTA '<schema.table>'
CATALOG '<catalog_name>' [TYPE '<catalog_type>']
[INDEXING MODES ('<field>':'<mode>' [, ...])]
[FASTFIELDS MODE {HYBRID | PARQUET_ONLY | DISABLED}]
[HASHED FASTFIELDS {INCLUDE | EXCLUDE} ('<field>' [, ...])]
[TARGET INPUT SIZE <size>]
[WRITER HEAP SIZE <size>]
[FROM VERSION <number>]
[WHERE <partition_predicates>]
[INVALIDATE ALL PARTITIONS]
AT LOCATION '<destination_path>'
[DRY RUN]

Apache Iceberg

BUILD INDEXTABLES COMPANION FOR ICEBERG '<namespace.table_name>'
[CATALOG '<catalog_name>' [TYPE '<catalog_type>']]
[WAREHOUSE '<warehouse_path>']
[INDEXING MODES ('<field>':'<mode>' [, ...])]
[FASTFIELDS MODE {HYBRID | PARQUET_ONLY | DISABLED}]
[HASHED FASTFIELDS {INCLUDE | EXCLUDE} ('<field>' [, ...])]
[TARGET INPUT SIZE <size>]
[WRITER HEAP SIZE <size>]
[FROM SNAPSHOT <snapshot_id>]
[WHERE <partition_predicates>]
[INVALIDATE ALL PARTITIONS]
AT LOCATION '<destination_path>'
[DRY RUN]

Parquet

BUILD INDEXTABLES COMPANION FOR PARQUET '<parquet_directory>'
[SCHEMA SOURCE '<parquet_file>']
[INDEXING MODES ('<field>':'<mode>' [, ...])]
[FASTFIELDS MODE {HYBRID | PARQUET_ONLY | DISABLED}]
[HASHED FASTFIELDS {INCLUDE | EXCLUDE} ('<field>' [, ...])]
[TARGET INPUT SIZE <size>]
[WRITER HEAP SIZE <size>]
[WHERE <partition_predicates>]
[INVALIDATE ALL PARTITIONS]
AT LOCATION '<destination_path>'
[DRY RUN]

Format-Specific Clause Restrictions

| Clause | Delta | Iceberg | Parquet |
| --- | --- | --- | --- |
| FROM VERSION | Yes | No | No |
| FROM SNAPSHOT | No | Yes | No |
| SCHEMA SOURCE | No | No | Yes |
| WAREHOUSE | No | Yes | No |
| CATALOG / TYPE | Yes | Yes | No |

Parameters Reference

| Parameter | Default | Description |
| --- | --- | --- |
| INDEXING MODES | All fields as string | Per-field indexing mode: 'field':'mode' pairs |
| FASTFIELDS MODE | HYBRID | Fast field strategy: HYBRID, PARQUET_ONLY, or DISABLED |
| HASHED FASTFIELDS | all eligible | Controls which string fields get U64 hashed fast fields for aggregations; use INCLUDE to whitelist or EXCLUDE to blacklist specific fields |
| TARGET INPUT SIZE | 2G | Maximum cumulative parquet file size per companion split |
| WRITER HEAP SIZE | 1G | Tantivy writer memory budget per executor task |
| FROM VERSION | (none) | Start sync from a specific Delta version (Delta only) |
| FROM SNAPSHOT | (none) | Time-travel to a specific Iceberg snapshot ID (Iceberg only) |
| WHERE | (none) | Partition predicates to filter which files are indexed |
| INVALIDATE ALL PARTITIONS | off | Override WHERE-scoped invalidation to invalidate splits across all partitions |
| DRY RUN | off | Preview the sync plan without creating splits |
| AT LOCATION | (required) | Destination path for the companion index |
| CATALOG | (none) | Catalog name for Unity Catalog (Delta) or Iceberg catalogs |
| TYPE | (none) | Catalog type (e.g., rest, glue, hive) |
| WAREHOUSE | (none) | Warehouse location (Iceberg only) |
| SCHEMA SOURCE | (none) | Parquet file to use for schema detection (Parquet only) |
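
TARGET INPUT SIZE caps the cumulative parquet bytes assigned to each companion split. Conceptually this is a greedy first-fit grouping of source files by size; the sketch below is illustrative only (the function name and shapes are not the product's API):

```python
def plan_splits(files, target_bytes):
    """Greedily group (path, size) pairs into splits whose
    cumulative size stays at or under target_bytes."""
    splits, current, current_bytes = [], [], 0
    for path, size in files:
        # Start a new split once adding this file would exceed the cap.
        if current and current_bytes + size > target_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        splits.append(current)
    return splits

files = [("a.parquet", 1_500), ("b.parquet", 700), ("c.parquet", 900)]
print(plan_splits(files, target_bytes=2_000))
# → [['a.parquet'], ['b.parquet', 'c.parquet']]
```

A larger TARGET INPUT SIZE produces fewer, larger companion splits; a smaller value produces more splits and more task-level parallelism during the build.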

Indexing Modes

Control how each field is indexed in the companion split:

| Mode | Behavior | Use Case |
| --- | --- | --- |
| text | Full-text search with tokenization | Log messages, descriptions, free-form text |
| string | Exact-match indexing (default) | Status codes, IDs, categories |
| json | JSON field indexing | Structured JSON payloads |
| ipaddress / ip | IP address field type | Source IPs, destination IPs |

INDEXING MODES ('message':'text', 'src_ip':'ipaddress', 'severity':'string', 'payload':'json')

Fields not listed in INDEXING MODES default to string.

Compact String Indexing Modes

For high-cardinality string fields (trace IDs, UUIDs, request IDs), standard string indexing can produce large term dictionaries. Compact string indexing modes reduce index size by hashing values or stripping high-cardinality tokens from text.

| Mode | What Gets Indexed | Query Support |
| --- | --- | --- |
| exact_only | xxHash64 of the raw string (U64 field) | Term queries only (search values are auto-hashed) |
| text_uuid_exactonly | Tokenized text with UUIDs stripped, plus a companion U64 hash per UUID | Full-text on text, exact match on UUIDs |
| text_uuid_strip | Tokenized text with UUIDs stripped (UUIDs discarded) | Full-text only |
| text_custom_exactonly | Tokenized text with regex matches stripped, plus a companion U64 hash per match | Full-text on text, exact match on regex pattern |
| text_custom_strip | Tokenized text with regex matches stripped (matches discarded) | Full-text only |

INDEXING MODES (
  'trace_id':'exact_only',
  'message':'text_uuid_exactonly',
  'error_log':'text_custom_exactonly(ERR-\\d{4})',
  'notes':'text_uuid_strip'
)
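
The text_uuid_exactonly transform amounts to: find UUIDs, remove them from the tokenized text, and keep a 64-bit hash of each in a companion field. A minimal Python sketch of the idea follows; xxHash64 is not in the standard library, so a truncated SHA-1 stands in for it here, and all names are illustrative:

```python
import re
import hashlib

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}")

def hash64(value):
    # Stand-in for xxHash64: first 8 bytes of SHA-1 as an unsigned 64-bit int.
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

def index_text_uuid_exactonly(text):
    """Split a message into (stripped_text, uuid_hashes): the stripped
    text feeds the tokenizer; the hashes go to the companion U64 field."""
    uuids = UUID_RE.findall(text)
    stripped = UUID_RE.sub("", text)
    return stripped, [hash64(u) for u in uuids]

msg = "request 123e4567-e89b-12d3-a456-426614174000 failed"
stripped, hashes = index_text_uuid_exactonly(msg)
# At query time, an exact UUID lookup is rewritten to a term query on
# hash64(<search value>) against the companion field, so it matches here:
assert hash64("123e4567-e89b-12d3-a456-426614174000") in hashes
```

This is why exact match still works after stripping: equality survives hashing, even though the original string is gone from the index.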

Query Behavior

Queries on compact string fields are transparently rewritten at search time:

  • Term queries on exact_only fields automatically hash the search value before matching
  • Term queries on *_exactonly fields redirect UUID/pattern matches to the companion hash field
  • Full-text queries (parseQuery()) on exact_only fields are converted to hashed term queries
  • Full-text queries on *_exactonly fields work normally on the stripped text, with UUID/pattern matches redirected to the companion hash

No changes to your query code are needed — rewriting is handled internally.

Query Limitations

Wildcard, regex, and phrase prefix queries are not supported on exact_only fields because only the hash is stored, not the original string. These queries return a clear error:

Cannot use wildcard query on exact_only field 'trace_id'...

Range queries on exact_only fields are handled by Spark as a post-filter on the underlying parquet data rather than being pushed down to the index.

Examples

-- High-cardinality trace IDs: hash-only indexing (smallest index size)
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/traces'
INDEXING MODES ('trace_id':'exact_only', 'span_id':'exact_only', 'message':'text')
AT LOCATION 's3://warehouse/traces_index'

-- Log messages with UUIDs: full-text search + exact UUID lookup
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/logs'
INDEXING MODES ('message':'text_uuid_exactonly', 'request_id':'exact_only')
AT LOCATION 's3://warehouse/logs_index'

-- Custom pattern: extract and hash error codes from log lines
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/errors'
INDEXING MODES ('error_log':'text_custom_exactonly(ERR-\\d{4})')
AT LOCATION 's3://warehouse/errors_index'

Fast Field Modes

Fast fields control how columnar data is stored for aggregations and range queries:

| Mode | Companion Split Contains | Tradeoffs |
| --- | --- | --- |
| HYBRID (default) | Fast fields in both tantivy index and parquet | Best read performance; moderate split size |
| PARQUET_ONLY | Fast fields only in parquet source files | Smallest companion splits (60–70% savings); aggregations read from parquet |
| DISABLED | No fast fields | Index-only; no aggregation or range query support |

-- Smallest possible companion splits
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/events'
FASTFIELDS MODE PARQUET_ONLY
INDEXING MODES ('message':'text')
AT LOCATION 's3://warehouse/events_index'

Unity Catalog Integration

On Databricks, you can pass a Unity Catalog table name instead of a raw storage path. IndexTables resolves the table's storage location and credentials automatically:

BUILD INDEXTABLES COMPANION FOR DELTA 'schema.events'
CATALOG 'unity_catalog'
INDEXING MODES ('message':'text', 'src_ip':'ipaddress')
AT LOCATION 's3://warehouse/companion/events'

  • The table name format is 'schema.table' (there is no three-part catalog.schema.table form; the catalog is specified in the CATALOG clause)
  • Storage location is resolved from Unity Catalog metadata
  • Credentials are resolved automatically via the Unity Catalog credential provider
  • Path-based syntax ('s3://...') continues to work unchanged

Prerequisites

Unity Catalog integration requires the credential provider to be configured. See Databricks Deployment for setup instructions.

Iceberg Catalog Configuration

Iceberg tables require a catalog for metadata resolution. Configure via SQL clauses or Spark properties:

SQL Clauses

BUILD INDEXTABLES COMPANION FOR ICEBERG 'prod.web_events'
CATALOG 'rest_catalog' TYPE 'rest'
WAREHOUSE 's3://iceberg-warehouse'
AT LOCATION 's3://warehouse/companion/web_events'

Spark Properties

| Property | Description |
| --- | --- |
| spark.indextables.iceberg.catalogType | Catalog type: rest, glue, hive |
| spark.indextables.iceberg.uri | Catalog URI |
| spark.indextables.iceberg.warehouse | Warehouse location |
| spark.indextables.iceberg.token | Authentication token |
| spark.indextables.iceberg.credential | Authentication credential |
| spark.indextables.iceberg.s3Endpoint | S3-compatible endpoint (e.g., MinIO) |
| spark.indextables.iceberg.s3PathStyleAccess | Enable S3 path-style access |

Supported Catalog Types

| Type | Description |
| --- | --- |
| rest | REST catalog (e.g., Tabular, Polaris) |
| glue | AWS Glue Data Catalog |
| hive | Hive Metastore (HMS) |

Incremental Sync

Companion mode automatically detects changes and indexes only new or modified files:

  1. First run — indexes all parquet files in the source table
  2. Subsequent runs — performs a file-level anti-join against existing companion splits to identify:
    • New files from appends → indexed
    • Rewritten files from OPTIMIZE, DELETE, UPDATE, or MERGE INTO → affected companion splits invalidated and re-indexed
    • Unchanged files → skipped entirely
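
The file-level anti-join in step 2 amounts to set arithmetic over parquet file paths. A minimal sketch, with illustrative names only:

```python
def plan_incremental_sync(source_files, indexed_files):
    """Anti-join the source table's current parquet files against the
    files already covered by companion splits."""
    source = set(source_files)
    indexed = set(indexed_files)
    to_index = source - indexed        # new files from appends or rewrites
    to_invalidate = indexed - source   # files removed by OPTIMIZE/DELETE/UPDATE/MERGE
    return sorted(to_index), sorted(to_invalidate)

new, stale = plan_incremental_sync(
    source_files=["part-0.parquet", "part-2.parquet", "part-3.parquet"],
    indexed_files=["part-0.parquet", "part-1.parquet", "part-2.parquet"],
)
# part-3 is new and gets indexed; part-1 was rewritten away, so the
# companion splits covering it are invalidated; part-0 and part-2 are skipped.
assert new == ["part-3.parquet"]
assert stale == ["part-1.parquet"]
```

Files in both sets are skipped entirely, which is what makes re-running the command cheap when little has changed.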

Re-run the same command to sync:

-- Only new or modified files are processed
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/events'
AT LOCATION 's3://warehouse/events_index'

No separate pipelines, CDC streams, or version tracking required. If a sync is interrupted, restarting picks up where it left off.

WHERE-Scoped Invalidation

When a WHERE clause is specified, only splits whose partition values fall within the WHERE range are candidates for invalidation. Splits outside the range are untouched — even if their source files no longer exist. This avoids unnecessary re-indexing when you only care about a subset of partitions.

To override this behavior and invalidate splits across all partitions, add INVALIDATE ALL PARTITIONS:

BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/events'
WHERE date >= '2024-02-01'
INVALIDATE ALL PARTITIONS
AT LOCATION 's3://warehouse/events_index'
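
The scoping rule can be pictured as a filter over invalidation candidates; this is a conceptual sketch with made-up names, not the product's internals:

```python
def invalidation_candidates(splits, predicate, invalidate_all=False):
    """Select splits eligible for invalidation. With a WHERE clause,
    only splits whose partition values satisfy the predicate are
    candidates; INVALIDATE ALL PARTITIONS widens this to every split."""
    if invalidate_all:
        return list(splits)
    return [s for s in splits if predicate(s["partition"])]

splits = [
    {"id": "s1", "partition": {"date": "2024-01-15"}},
    {"id": "s2", "partition": {"date": "2024-02-10"}},
]
# WHERE date >= '2024-02-01' leaves s1 untouched even if its source
# files no longer exist; only s2 is a candidate.
in_scope = invalidation_candidates(splits, lambda p: p["date"] >= "2024-02-01")
assert [s["id"] for s in in_scope] == ["s2"]
```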

Distributed Log Reading

For source tables with millions of files, reading the transaction log on the driver can cause OOM errors. Distributed log reading distributes checkpoint and manifest reads across Spark executors via RDDs, and pushes WHERE predicates to Rust via native PartitionFilter so filtered-out entries never cross the JNI boundary.

Arrow FFI (zero-copy columnar export) is used by default for all distributed log reads, eliminating serialization overhead.

Both features are enabled by default:

spark.conf.set("spark.indextables.companion.sync.distributedLogRead.enabled", "true")
spark.conf.set("spark.indextables.companion.sync.arrowFfi.enabled", "true")

Read Path

Companion mode is transparent at read time:

  • Auto-detected from transaction log metadata — no user configuration needed
  • Document data is resolved from the original parquet files automatically
  • All standard filters, aggregations, and IndexQuery operations work identically to standalone mode
  • A write guard prevents accidental direct writes (non-companion INSERT/APPEND) to companion-mode tables

Arrow FFI Columnar Reads

Companion splits use zero-copy Arrow FFI columnar reads: data flows directly from parquet through Rust Arrow into Spark's columnar engine with no row-by-row serialization. This is enabled by default:

spark.conf.set("spark.indextables.read.columnar.enabled", "true")

Set to false to force the row-based path. Non-companion splits always use the row path regardless of this setting.

MERGE SPLITS with Companion

MERGE SPLITS works with companion splits and preserves companion metadata:

  • companionSourceFiles are concatenated from all source splits
  • The maximum companionDeltaVersion / source_version is retained
  • companionFastFieldMode is preserved (must be consistent across merged splits)
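
The three merge rules above can be expressed directly; this sketch uses plain dicts and illustrative names for the metadata fields:

```python
def merge_companion_metadata(splits):
    """Combine companion metadata when merging splits: concatenate
    source files, keep the maximum source version, and require a
    consistent fast field mode across all inputs."""
    modes = {s["companionFastFieldMode"] for s in splits}
    assert len(modes) == 1, "fast field mode must match across merged splits"
    return {
        "companionSourceFiles": [f for s in splits for f in s["companionSourceFiles"]],
        "companionDeltaVersion": max(s["companionDeltaVersion"] for s in splits),
        "companionFastFieldMode": modes.pop(),
    }

merged = merge_companion_metadata([
    {"companionSourceFiles": ["a.parquet"], "companionDeltaVersion": 3,
     "companionFastFieldMode": "HYBRID"},
    {"companionSourceFiles": ["b.parquet", "c.parquet"], "companionDeltaVersion": 5,
     "companionFastFieldMode": "HYBRID"},
])
assert merged["companionDeltaVersion"] == 5
```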

PREWARM CACHE with Companion

PREWARM CACHE supports two additional segments for companion splits:

| Segment | Aliases | Description |
| --- | --- | --- |
| PARQUET_FAST_FIELDS | PARQUET_FAST | Preload parquet fast field data for aggregations |
| PARQUET_COLUMNS | PARQUET_COLS | Preload parquet column data for document retrieval |

PREWARM INDEXTABLES CACHE 's3://warehouse/events_index'
FOR SEGMENTS (TERM_DICT, FAST_FIELD, PARQUET_FAST_FIELDS, PARQUET_COLUMNS);

Auto-detection: When FAST_FIELD is requested on companion splits using HYBRID or PARQUET_ONLY mode, parquet fast fields are automatically included — no need to specify PARQUET_FAST_FIELDS explicitly.

Output Schema

| Column | Type | Description |
| --- | --- | --- |
| table_path | String | IndexTables destination path |
| source_path | String | Source table path |
| status | String | success, no_action, dry_run, or error |
| source_version | Long | Delta version, Iceberg snapshot ID, or null (Parquet) |
| splits_created | Int | Number of companion splits created |
| splits_invalidated | Int | Number of old splits removed |
| parquet_files_indexed | Int | Number of parquet files indexed |
| parquet_bytes_downloaded | Long | Total parquet bytes downloaded |
| split_bytes_uploaded | Long | Total companion split bytes uploaded |
| duration_ms | Long | Wall-clock duration |
| message | String | Human-readable status message |

Configuration Reference

| Property | Default | Description |
| --- | --- | --- |
| spark.indextables.companion.writerHeapSize | 1G | Writer heap size (overridden by the SQL WRITER HEAP SIZE clause) |
| spark.indextables.companion.readerBatchSize | 8192 | Parquet reader batch size |
| spark.indextables.companion.sync.batchSize | defaultParallelism | Tasks per Spark job |
| spark.indextables.companion.sync.maxConcurrentBatches | 6 | Maximum concurrent Spark jobs during sync |
| spark.indextables.companion.schedulerPool | indextables-companion | Spark scheduler pool name for batch parallelism |
| spark.indextables.companion.sync.distributedLogRead.enabled | true | Distribute transaction log reads across executors |
| spark.indextables.companion.sync.arrowFfi.enabled | true | Use Arrow FFI for distributed log reads |
| spark.indextables.read.columnar.enabled | true | Enable Arrow FFI columnar reads for companion splits |

Scheduler Mode

Concurrent batch execution requires spark.scheduler.mode=FAIR. This is the default on Databricks. On open-source Spark, set it explicitly in your cluster configuration.
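
On open-source Spark, FAIR scheduling is a standard Spark setting that can be supplied at submit time. The script name below is a hypothetical placeholder:

```shell
# spark.scheduler.mode is a standard Spark property; FAIR enables the
# pooled scheduling that concurrent companion sync batches rely on.
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.indextables.companion.sync.maxConcurrentBatches=6 \
  your_companion_sync_job.py
```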

Examples

Delta Lake (Path-Based)

BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/events'
INDEXING MODES ('message':'text', 'src_ip':'ipaddress', 'severity':'string')
AT LOCATION 's3://warehouse/events_index'

Delta Lake (Unity Catalog)

BUILD INDEXTABLES COMPANION FOR DELTA 'security.events'
CATALOG 'unity_catalog'
INDEXING MODES ('message':'text', 'src_ip':'ipaddress')
FASTFIELDS MODE HYBRID
AT LOCATION 's3://warehouse/companion/security_events'

Apache Iceberg (REST Catalog)

BUILD INDEXTABLES COMPANION FOR ICEBERG 'prod.web_events'
CATALOG 'rest_catalog' TYPE 'rest'
WAREHOUSE 's3://iceberg-warehouse'
INDEXING MODES ('message':'text', 'user_agent':'text')
AT LOCATION 's3://warehouse/companion/web_events'

Parquet

BUILD INDEXTABLES COMPANION FOR PARQUET 's3://logs/firewall/'
SCHEMA SOURCE 's3://logs/firewall/part-00000.parquet'
INDEXING MODES ('message':'text', 'src_ip':'ipaddress')
AT LOCATION 's3://warehouse/companion/firewall_logs'

Incremental Sync

-- First run: indexes all files
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/events'
INDEXING MODES ('message':'text')
AT LOCATION 's3://warehouse/events_index'

-- Subsequent runs: only new/changed files
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/events'
INDEXING MODES ('message':'text')
AT LOCATION 's3://warehouse/events_index'

Dry Run

-- Preview what would be indexed without making changes
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/events'
INDEXING MODES ('message':'text', 'src_ip':'ipaddress')
AT LOCATION 's3://warehouse/events_index'
DRY RUN

Hashed Fastfields

-- Only generate hashed fast fields for specific columns
BUILD INDEXTABLES COMPANION FOR PARQUET 's3://logs/events/'
HASHED FASTFIELDS INCLUDE ('title', 'category')
AT LOCATION 's3://warehouse/companion/events'

-- Exclude large or irrelevant string fields from hashed fast fields
BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/documents'
HASHED FASTFIELDS EXCLUDE ('raw_html', 'full_body')
INDEXING MODES ('title':'text', 'summary':'text')
AT LOCATION 's3://warehouse/companion/documents'

Custom Sizing

BUILD INDEXTABLES COMPANION FOR DELTA 's3://warehouse/large_events'
INDEXING MODES ('message':'text')
FASTFIELDS MODE PARQUET_ONLY
TARGET INPUT SIZE 4G
WRITER HEAP SIZE 2G
WHERE year >= 2025
AT LOCATION 's3://warehouse/companion/large_events'