Partitioning

Partitioning organizes data into directories for efficient query pruning.

Basic Partitioning

df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .partitionBy("date", "region")
  .save("s3://bucket/logs")

This creates the following directory structure:

s3://bucket/logs/
  _transaction_log/
  date=2024-01-01/region=us-east/
    abc123.split
  date=2024-01-01/region=eu-west/
    def456.split

Partition Pruning

Queries with partition filters skip irrelevant directories:

// Only reads the date=2024-01-15 partition
df.filter($"date" === "2024-01-15")
  .filter($"message" indexquery "error")
  .show()
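
To confirm that pruning actually occurs, inspect the physical plan. A minimal sketch, assuming the provider surfaces partition filters in Spark's explain output the way built-in file-based sources do:

val logs = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("s3://bucket/logs")

// Look for the partition filter on date in the plan output
logs.filter($"date" === "2024-01-15").explain()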

Best Practices

Choose Partition Columns Wisely

  • Use columns that are frequently filtered on (date, region, tenant)
  • Avoid high-cardinality columns (user_id)
  • Keep the partition count manageable (< 10,000); see the cardinality check sketched after this list
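
A quick way to estimate how many directories a layout will produce is to count distinct values per candidate column. A minimal sketch, assuming the date, region, and user_id columns from the examples above:

import org.apache.spark.sql.functions.countDistinct

// A (date, region) layout creates roughly dates * regions directories
df.agg(
  countDistinct($"date").as("dates"),
  countDistinct($"region").as("regions"),
  countDistinct($"user_id").as("users") // high cardinality: a poor partition key
).show()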

Partition by Time

For time-series data, partition by date or hour:

import org.apache.spark.sql.functions.to_date

df.withColumn("date", to_date($"timestamp"))
  .write
  .partitionBy("date")
  .save("path")

Compound Partitions

Combine multiple dimensions:

df.write
  .partitionBy("year", "month", "day") // Hierarchical
  .save("path")

df.write
  .partitionBy("region", "date") // Multi-dimensional
  .save("path")

Managing Partitions

Drop Old Partitions

DROP INDEXTABLES PARTITIONS FROM 's3://bucket/logs'
WHERE date < '2023-01-01';

Merge Within Partitions

MERGE SPLITS 's3://bucket/logs'
WHERE date = '2024-01-01'
TARGET SIZE 4G;
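
Both commands can also be issued programmatically. A minimal sketch, assuming the IndexTables SQL extensions are registered with the active SparkSession:

// Assumes the IndexTables SQL extensions are registered with this session
spark.sql(
  """DROP INDEXTABLES PARTITIONS FROM 's3://bucket/logs'
    |WHERE date < '2023-01-01'""".stripMargin)

spark.sql(
  """MERGE SPLITS 's3://bucket/logs'
    |WHERE date = '2024-01-01'
    |TARGET SIZE 4G""".stripMargin)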

See DROP PARTITIONS for more details.