Partitioning
Partitioning organizes split files into a directory hierarchy so that queries filtering on partition columns can skip entire directories.
Basic Partitioning
// Write Hive-style date=/region= partition directories
df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .partitionBy("date", "region")
  .save("s3://bucket/logs")
This produces the following directory structure:
s3://bucket/logs/
  _transaction_log/
  date=2024-01-01/region=us-east/
    abc123.split
  date=2024-01-01/region=eu-west/
    def456.split
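Reading the table back is a plain load; assuming this source follows Spark's usual Hive-style partition discovery, date and region reappear as ordinary columns reconstructed from the directory names:
val logs = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("s3://bucket/logs")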
Partition Pruning
Queries with partition filters skip irrelevant directories:
// Only reads the date=2024-01-15 partition
df.filter($"date" === "2024-01-15")
  .filter($"message" indexquery "error")
  .show()
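To confirm pruning is actually happening, Spark's explain() prints the physical plan; the exact output varies by source, but the date predicate should appear as a pushed-down partition filter rather than a filter evaluated after the scan. This check uses only standard Spark APIs:
// Inspect the physical plan for the pushed partition predicate
df.filter($"date" === "2024-01-15")
  .filter($"message" indexquery "error")
  .explain()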
Best Practices
Choose Partition Columns Wisely
- Use columns that queries frequently filter on (date, region, tenant)
- Avoid high-cardinality columns such as user_id; bucket them first (see the sketch after this list)
- Keep the total partition count manageable (< 10,000)
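If a high-cardinality key is the natural filter, one workaround is hashing it into a fixed number of buckets and partitioning on the bucket instead. A minimal sketch using standard Spark functions; the user_bucket column name and 64-bucket count are illustrative choices, not library defaults:
import org.apache.spark.sql.functions.{hash, lit, pmod}

// Derive 64 stable buckets from user_id rather than partitioning on it directly
df.withColumn("user_bucket", pmod(hash($"user_id"), lit(64)))
  .write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .partitionBy("date", "user_bucket")
  .save("s3://bucket/logs")
Queries that filter on user_id must compute the same bucket expression to benefit from pruning.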
Partition by Time
For time-series data, partition by date or hour:
import org.apache.spark.sql.functions.to_date

// Derive a date column from the event timestamp, then partition on it
df.withColumn("date", to_date($"timestamp"))
  .write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .partitionBy("date")
  .save("path")
Compound Partitions
Combine multiple dimensions:
df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .partitionBy("year", "month", "day")  // Hierarchical
  .save("path")

df.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .partitionBy("region", "date")  // Multi-dimensional
  .save("path")
Managing Partitions
Drop Old Partitions
DROP INDEXTABLES PARTITIONS FROM 's3://bucket/logs'
WHERE date < '2023-01-01';
Merge Within Partitions
MERGE SPLITS 's3://bucket/logs'
WHERE date = '2024-01-01'
TARGET SIZE 4G;
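Both maintenance statements can also be issued programmatically through spark.sql, assuming the library's SQL extensions are registered with the session (a sketch, not a verified configuration):
// Drop partitions older than the retention cutoff
spark.sql(
  """DROP INDEXTABLES PARTITIONS FROM 's3://bucket/logs'
    |WHERE date < '2023-01-01'""".stripMargin)

// Consolidate small splits within a single day's partition
spark.sql(
  """MERGE SPLITS 's3://bucket/logs'
    |WHERE date = '2024-01-01'
    |TARGET SIZE 4G""".stripMargin)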
See DROP PARTITIONS for more details.