Skip to main content

Filter Pushdown

IndexTables pushes filters down to the search engine for efficient execution.

Equality Filters

// Exact equality
df.filter($"status" === "active")

// Null-safe equality
df.filter($"status" <=> "active")

// NOT equal
df.filter($"status" =!= "deleted")
// Or using NOT
df.filter(!($"status" === "deleted"))

Range Filters

All numeric, date, and timestamp fields support range queries:

// Greater than / less than
df.filter($"score" > 0.5)
df.filter($"count" < 100)

// Greater than or equal / less than or equal
df.filter($"count" >= 10)
df.filter($"count" <= 1000)

// Combined ranges
df.filter($"count" >= 10 && $"count" < 100)

// Date/timestamp ranges
df.filter($"timestamp" >= "2024-01-01")
df.filter($"event_date" < "2024-12-31")

IN Clause

df.filter($"region".isin("us-east", "us-west", "eu-west"))
df.filter($"status".isin("pending", "active", "completed"))

NULL Checks

// IS NOT NULL - fully pushed down
df.filter($"category".isNotNull)

// IS NULL - evaluated by Spark (not pushed down)
df.filter($"category".isNull)
note

IS NULL filters are not pushed down because the search engine doesn't index null values. Spark will post-filter these results.

Compound Filters

Combine filters with AND, OR, and NOT:

// AND
df.filter($"status" === "active" && $"region" === "us-east")

// OR
df.filter($"priority" === "high" || $"priority" === "critical")

// NOT
df.filter(!($"status" === "deleted"))

// Complex combinations
df.filter(($"status" === "error" || $"status" === "warning") && $"region" === "us-east")

String Pattern Filters

String pattern filters are supported but disabled by default for performance reasons. Enable via configuration:

// Enable all string pattern pushdowns
spark.conf.set("spark.indextables.filter.stringPattern.pushdown", "true")

// Or enable individually
spark.conf.set("spark.indextables.filter.stringStartsWith.pushdown", "true") // Most efficient
spark.conf.set("spark.indextables.filter.stringEndsWith.pushdown", "true")
spark.conf.set("spark.indextables.filter.stringContains.pushdown", "true") // Least efficient

Once enabled:

// StartsWith (most efficient - uses prefix queries)
df.filter($"message".startsWith("ERROR"))

// EndsWith
df.filter($"filename".endsWith(".log"))

// Contains
df.filter($"content".contains("exception"))

Full-Text Search with IndexQuery

For text fields, use the indexquery operator for full-text search:

import org.apache.spark.sql.indextables.IndexQueryExpression._

// Single term
df.filter($"content" indexquery "error")

// Boolean queries
df.filter($"content" indexquery "error AND database")
df.filter($"content" indexquery "error OR warning")
df.filter($"content" indexquery "error NOT timeout")

// Phrase search
df.filter($"content" indexquery "\"connection refused\"")

See IndexQuery Syntax for full details.

Partition Filters

Partition column filters enable partition pruning:

// Only reads matching partitions
df.filter($"date" === "2024-01-15")
.filter($"message" indexquery "error")

// Range on partition columns
df.filter($"date" >= "2024-01-01" && $"date" < "2024-02-01")

Nested JSON Field Filters

Nested fields in Struct, Array, and Map types support filter pushdown:

// Struct fields
df.filter($"user.name" === "Alice")
df.filter($"user.age" > 28)

// Deeply nested
df.filter($"request.headers.contentType" === "application/json")

// Range on nested fields
df.filter($"metadata.score" >= 0.5)

What Gets Pushed Down

Filter TypePushed DownNotes
= (EqualTo)YesString fields only (text fields use IndexQuery)
<=> (EqualNullSafe)YesString fields only
> (GreaterThan)YesAll fields including nested JSON
>= (GreaterThanOrEqual)YesAll fields including nested JSON
< (LessThan)YesAll fields including nested JSON
<= (LessThanOrEqual)YesAll fields including nested JSON
INYesFull pushdown
IS NOT NULLYesRegular fields only (not nested JSON)
IS NULLNoSearch engine doesn't index nulls
ANDYesIf both children are supported
ORYesIf both children are supported
NOTYesIf child is supported
LIKE 'prefix%'ConfigEnable with stringStartsWith.pushdown
LIKE '%suffix'ConfigEnable with stringEndsWith.pushdown
LIKE '%sub%'ConfigEnable with stringContains.pushdown
indexqueryYesText fields - full-text search
UDFNoEvaluated by Spark