Bucket Aggregations

Bucket aggregations group rows into numeric, time-based, or custom range buckets via SQL GROUP BY, enabling time-series and distribution analysis.

Available Functions

Function | Description
indextables_histogram(col, interval) | Fixed-interval numeric bucketing
indextables_date_histogram(col, interval) | Time-based bucketing
indextables_range(col, ...) | Custom named range buckets

Histogram

Bucket numeric values by fixed intervals:

SELECT
indextables_histogram(price, 50.0) as price_bucket,
COUNT(*) as cnt,
SUM(quantity) as total_qty
FROM products
GROUP BY indextables_histogram(price, 50.0)

Date Histogram

Bucket timestamps by time intervals:

SELECT
indextables_date_histogram(event_time, '1d') as day_bucket,
COUNT(*) as cnt
FROM events
GROUP BY indextables_date_histogram(event_time, '1d')

Supported Intervals

Suffix | Meaning
ms | Milliseconds
s | Seconds
m | Minutes
h | Hours
d | Days

Examples: '1h', '30m', '7d', '100ms'
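
Any of these suffixes can be used in place of the '1d' interval shown above. As an illustration, a finer-grained version of the same date histogram buckets events into half-hour windows (reusing the events table and event_time column from the earlier example):

```sql
SELECT
indextables_date_histogram(event_time, '30m') as half_hour_bucket,
COUNT(*) as cnt
FROM events
GROUP BY indextables_date_histogram(event_time, '30m')
```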

Range

Create custom named buckets:

SELECT
indextables_range(price,
'cheap', NULL, 50.0,
'mid', 50.0, 100.0,
'expensive', 100.0, NULL
) as tier,
COUNT(*) as cnt
FROM products
GROUP BY indextables_range(price,
'cheap', NULL, 50.0,
'mid', 50.0, 100.0,
'expensive', 100.0, NULL
)

A NULL bound makes that edge of the bucket unbounded: a NULL lower bound matches all values below the upper bound, and a NULL upper bound matches all values above the lower bound.

Requirements

Columns used in bucket aggregations must be indexed as fast fields when the table is written:

df.write
.format("io.indextables.spark.core.IndexTables4SparkTableProvider")
.option("spark.indextables.indexing.fastfields", "price,event_time,quantity")
.save("path")

With Sub-Aggregations

Combine bucket aggregations with other aggregates:

SELECT
indextables_date_histogram(timestamp, '1h') as hour,
COUNT(*) as requests,
AVG(latency) as avg_latency,
MAX(latency) as max_latency
FROM logs
WHERE status = 'error'
GROUP BY indextables_date_histogram(timestamp, '1h')
ORDER BY hour