AWS EMR Deployment

IndexTables runs on Amazon EMR and automatically detects NVMe instance storage, which it uses as a disk cache.

Installation

Add the following classification to the EMR cluster configuration:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars.packages": "io.indextables:indextables4spark_2.12:1.0.0",
      "spark.sql.extensions": "io.indextables.spark.extensions.IndexTables4SparkExtensions"
    }
  }
]
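The classification above can also be generated from a script, which keeps the package version in one place. The following is a minimal Python sketch; the helper name `build_emr_configurations` and the file name mentioned in the comment are illustrative assumptions and not part of IndexTables.

```python
import json

# Maven coordinates and extension class copied from the configuration above.
INDEXTABLES_PACKAGE = "io.indextables:indextables4spark_2.12:1.0.0"
INDEXTABLES_EXTENSION = "io.indextables.spark.extensions.IndexTables4SparkExtensions"


def build_emr_configurations(package=INDEXTABLES_PACKAGE, extension=INDEXTABLES_EXTENSION):
    """Return the EMR configuration list with a single spark-defaults classification."""
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.jars.packages": package,
                "spark.sql.extensions": extension,
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON; redirect it to a file such as emr-config.json (hypothetical name).
    print(json.dumps(build_emr_configurations(), indent=2))
```

The saved output can then be supplied when creating a cluster, for example with the `--configurations file://...` option of `aws emr create-cluster`.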

Instance Storage

On EMR instances with NVMe instance storage (i3, r5d, etc.), the disk cache is enabled automatically:

Instance	Storage	Cache Size
i3.xlarge	950GB NVMe	~630GB
i3.2xlarge	1.9TB NVMe	~1.2TB
r5d.xlarge	150GB NVMe	~100GB
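The figures in the table suggest that the cache takes up roughly two-thirds of the NVMe capacity. This is only an inference from the listed values, not a documented sizing rule. The short Python sketch below computes the ratio for each row, which may help when estimating cache size for an instance type not shown here.

```python
# (instance type, NVMe storage in GB, approximate cache size in GB), from the table above.
TABLE = [
    ("i3.xlarge", 950, 630),
    ("i3.2xlarge", 1900, 1200),
    ("r5d.xlarge", 150, 100),
]


def cache_fraction(storage_gb, cache_gb):
    """Fraction of the NVMe volume that the table lists as cache."""
    return cache_gb / storage_gb


for name, storage_gb, cache_gb in TABLE:
    print(f"{name}: {cache_fraction(storage_gb, cache_gb):.0%} of NVMe storage")
```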

IAM Role Configuration

Use the EC2 instance profile for S3 access, so no credentials need to be configured. Attach a policy such as the following to the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket"
    }
  ]
}
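When several buckets or environments are involved, the policy can be generated rather than edited by hand. The Python sketch below reproduces the policy above for a given bucket name; the function name `build_s3_policy` is a hypothetical helper introduced for illustration.

```python
import json


def build_s3_policy(bucket):
    """Return the IAM policy document shown above for the given bucket name."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # ListBucket applies to the bucket itself, so there is no trailing /*.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


print(json.dumps(build_s3_policy("your-bucket"), indent=2))
```

Note that the two statements use different resource ARNs: object actions target `bucket/*`, while `s3:ListBucket` targets the bucket ARN itself.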

Performance Settings

# Recommended for EMR
spark.conf.set("spark.indextables.indexWriter.heapSize", "200M")
spark.conf.set("spark.indextables.s3.maxConcurrency", "8")
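The same settings can be kept in one dictionary and rendered as `--conf key=value` arguments for `spark-submit`. The sketch below is a minimal Python example; the helper name `to_spark_submit_args` is hypothetical, and it assumes IndexTables reads these properties from the Spark configuration regardless of how they were set.

```python
# Settings copied from the snippet above.
EMR_SETTINGS = {
    "spark.indextables.indexWriter.heapSize": "200M",
    "spark.indextables.s3.maxConcurrency": "8",
}


def to_spark_submit_args(settings):
    """Turn a settings dict into a flat list of --conf key=value arguments."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


print(" ".join(to_spark_submit_args(EMR_SETTINGS)))
```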

EMR Serverless

IndexTables also works with EMR Serverless. Pass the package to the job through sparkSubmitParameters:

aws emr-serverless start-job-run \
  --application-id $APP_ID \
  --execution-role-arn $ROLE_ARN \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://bucket/app.jar",
      "sparkSubmitParameters": "--packages io.indextables:indextables4spark_2.12:1.0.0"
    }
  }'
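Quoting inline JSON in a shell command is error-prone, so the job driver can be built in a script instead. The Python sketch below produces the same structure as the command above; `build_job_driver` is a hypothetical helper, and the S3 path is the placeholder from the example.

```python
import json


def build_job_driver(entry_point, package="io.indextables:indextables4spark_2.12:1.0.0"):
    """Return the sparkSubmit job driver structure used in the command above."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": f"--packages {package}",
        }
    }


# The S3 path is the placeholder from the example above.
job_driver_json = json.dumps(build_job_driver("s3://bucket/app.jar"))
print(job_driver_json)
```

The printed JSON can be passed as the value of `--job-driver` in the `aws emr-serverless start-job-run` command.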
