Bucket Aggregations

Bucket aggregations enable time-series and distribution analysis using SQL GROUP BY.

Available Functions

Function	Description
`indextables_histogram(col, interval)`	Fixed-interval numeric bucketing
`indextables_date_histogram(col, interval)`	Time-based bucketing
`indextables_range(col, ...)`	Custom named range buckets

Histogram

Bucket numeric values by fixed intervals:

SELECT
  indextables_histogram(price, 50.0) as price_bucket,
  COUNT(*) as cnt,
  SUM(quantity) as total_qty
FROM products
GROUP BY indextables_histogram(price, 50.0)

Date Histogram

Bucket timestamps by time intervals:

SELECT
  indextables_date_histogram(event_time, '1d') as day_bucket,
  COUNT(*) as cnt
FROM events
GROUP BY indextables_date_histogram(event_time, '1d')

Supported Intervals

Suffix	Meaning
`ms`	Milliseconds
`s`	Seconds
`m`	Minutes
`h`	Hours
`d`	Days

Examples: '1h', '30m', '7d', '100ms'

Range

Create custom named buckets:

SELECT
  indextables_range(price,
    'cheap', NULL, 50.0,
    'mid', 50.0, 100.0,
    'expensive', 100.0, NULL
  ) as tier,
  COUNT(*) as cnt
FROM products
GROUP BY indextables_range(price,
  'cheap', NULL, 50.0,
  'mid', 50.0, 100.0,
  'expensive', 100.0, NULL
)

NULL indicates unbounded (no lower/upper limit).

Requirements

Fields must be configured as fast fields:

df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .option("spark.indextables.indexing.fastfields", "price,event_time,quantity")
  .save("path")

With Sub-Aggregations

Combine bucket aggregations with other aggregates:

SELECT
  indextables_date_histogram(timestamp, '1h') as hour,
  COUNT(*) as requests,
  AVG(latency) as avg_latency,
  MAX(latency) as max_latency
FROM logs
WHERE status = 'error'
GROUP BY indextables_date_histogram(timestamp, '1h')
ORDER BY hour

Available Functions​

Histogram​

Date Histogram​

Supported Intervals​

Range​

Requirements​

With Sub-Aggregations​