Aggregate Pushdown

IndexTables pushes supported aggregation operations down to the search engine, so results are computed at the index level instead of by transferring documents to Spark.

Supported Aggregations

Function      Description              Requirements
COUNT(*)      Count all documents      None
COUNT(col)    Count non-null values    None
SUM(col)      Sum of values            Fast field
AVG(col)      Average of values        Fast field
MIN(col)      Minimum value            Fast field
MAX(col)      Maximum value            Fast field

Basic Usage

// Simple count
df.agg(count("*")).show()

// Multiple aggregations
df.agg(
  count("*").as("total"),
  sum("score").as("total_score"),
  avg("score").as("avg_score")
).show()
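These examples assume `org.apache.spark.sql.functions._` is imported and that `df` was loaded from an IndexTables path. A minimal setup sketch (the path is a placeholder, and the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("agg-pushdown").getOrCreate()
import spark.implicits._

// Load an IndexTables table; "path" is a placeholder for your table location
val df = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("path")
```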

With Filters

// Count with filter
df.filter($"status" === "error")
  .agg(count("*"))
  .show()

// Aggregation with IndexQuery
df.filter($"message" indexquery "database connection")
  .agg(count("*"), avg("latency"))
  .show()

Fast Fields Requirement

For SUM, AVG, MIN, and MAX, the aggregated column must be configured as a fast field at write time:

df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .option("spark.indextables.indexing.fastfields", "score,latency,timestamp")
  .save("path")
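Once the fast fields are configured, aggregations over those columns can be pushed down when reading the table back. A sketch, assuming the SparkSession and the "path" placeholder from the write example above:

```scala
import org.apache.spark.sql.functions._

// Read back the table written above and aggregate over the fast fields
val stats = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("path")
  .agg(
    min("timestamp").as("first_seen"),
    max("timestamp").as("last_seen"),
    avg("latency").as("avg_latency")
  )
stats.show()
```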

SQL Usage

-- Count all documents
SELECT COUNT(*) FROM logs;

-- Aggregations with filter
SELECT COUNT(*), AVG(latency), MAX(latency)
FROM logs
WHERE status = 'error';

-- With IndexQuery
SELECT COUNT(*), SUM(bytes)
FROM logs
WHERE message indexquery 'timeout';

GROUP BY

Simple GROUP BY on string fields:

df.groupBy("region")
  .agg(count("*"), avg("latency"))
  .show()

For bucket-based grouping (time intervals, ranges), see Bucket Aggregations.
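The same grouping can also be expressed through Spark SQL, mirroring the SQL Usage section above; this sketch assumes `df` is the loaded DataFrame and registers it under the view name "logs" (the name is arbitrary):

```scala
// Register the DataFrame so it can be queried by name
df.createOrReplaceTempView("logs")

spark.sql("""
  SELECT region, COUNT(*) AS total, AVG(latency) AS avg_latency
  FROM logs
  GROUP BY region
""").show()
```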

Performance

Aggregate pushdown provides significant speedups:

  • Avoids transferring all documents to Spark
  • Executes directly in optimized search engine code
  • Works with partition pruning for further optimization
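To confirm that an aggregation was actually pushed down rather than executed in Spark, inspect the physical plan. `explain()` is standard Spark API; the exact label the pushed-down scan node uses in the plan output is implementation-specific, so treat this as a diagnostic sketch:

```scala
import org.apache.spark.sql.functions._

// If pushdown succeeded, the plan's scan node reports the pushed aggregates
// instead of a full Spark-side HashAggregate over raw rows
df.agg(count("*"), avg("latency")).explain()
```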