Quickstart

Get up and running with IndexTables in 5 minutes.

1. Create Sample Data

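// Assumes `spark` is the active SparkSession (as in spark-shell).
// Brings .toDF and the $"column" syntax into scope.
import spark.implicits._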

val data = Seq(
  (1, "Introduction to Machine Learning", "machine learning basics tutorial"),
  (2, "Advanced Deep Learning", "neural networks deep learning AI"),
  (3, "Data Engineering Best Practices", "spark hadoop data pipelines"),
  (4, "Search Engine Architecture", "tantivy lucene search indexing")
).toDF("id", "title", "content")

2. Write an Index

data.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .option("spark.indextables.indexing.typemap.title", "string")
  .option("spark.indextables.indexing.typemap.content", "text")
  .mode("overwrite")
  .save("/tmp/my_index")

Field Types

String fields (default): Exact matching, full filter pushdown
Text fields: Full-text search with IndexQuery
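
To make the distinction concrete, the sketch below mimics the two matching styles in plain Scala, with no Spark or IndexTables involved. It is an illustration of the idea only, not how IndexTables is implemented: the object name `FieldTypeSketch`, its helper functions, and the whitespace/punctuation tokenizer are all hypothetical, and the real tokenizer and query semantics may differ.

```scala
// Illustration only: plain Scala, not the IndexTables implementation.
object FieldTypeSketch {
  // String-style field: the whole value is treated as one term,
  // so only an exact, case-sensitive match succeeds.
  def exactMatch(value: String, query: String): Boolean = value == query

  // Hypothetical simple tokenizer: lowercase, split on non-word characters.
  def tokenize(s: String): Set[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  // Text-style field: the value is tokenized, and the query matches when
  // every query token is present (comparable to an AND query).
  def allTermsMatch(value: String, query: String): Boolean =
    tokenize(query).subsetOf(tokenize(value))

  def main(args: Array[String]): Unit = {
    val title = "Introduction to Machine Learning"
    println(exactMatch(title, "Introduction to Machine Learning")) // true
    println(exactMatch(title, "machine learning"))                 // false
    println(allTermsMatch(title, "machine learning"))              // true
  }
}
```

In this sketch, an equality filter on a string-style value only matches the full title, while a tokenized value can be found by any subset of its words, which is why the quickstart indexes title as string and content as text.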

3. Query the Index

val df = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("/tmp/my_index")

// Standard SQL filters (pushed down for string fields)
df.filter($"title" === "Introduction to Machine Learning").show()

// Full-text search with IndexQuery
import org.apache.spark.sql.indextables.IndexQueryExpression._
df.filter($"content" indexquery "machine learning").show()

4. Run Aggregations

// count comes from Spark's built-in functions
import org.apache.spark.sql.functions.count

// Aggregations are pushed down to the search engine
df.agg(count("*")).show()

// With filters
df.filter($"content" indexquery "deep learning")
  .agg(count("*"))
  .show()

5. Create a Temp View

df.createOrReplaceTempView("articles")

6. Use SQL

-- Query with IndexQuery
SELECT * FROM articles
WHERE content indexquery 'machine AND learning'

-- Aggregations
SELECT COUNT(*) FROM articles
WHERE content indexquery 'neural AND networks'
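
The statements above can also be run from the same Scala session through `spark.sql`. The following is a minimal sketch that continues from step 5; it assumes `spark` is the active SparkSession, that the `articles` temp view exists, and that the session's SQL parser recognizes `indexquery`, as the statements above imply. The value names `matches` and `total` are only illustrative.

```scala
// Continues from step 5: requires the "articles" temp view in this session.
val matches = spark.sql(
  """SELECT * FROM articles
    |WHERE content indexquery 'machine AND learning'""".stripMargin)
matches.show()

// The aggregation from the second statement, returned as a DataFrame.
val total = spark.sql(
  """SELECT COUNT(*) FROM articles
    |WHERE content indexquery 'neural AND networks'""".stripMargin)
total.show()
```

Each call returns an ordinary DataFrame, so the results can be filtered or joined further with the DataFrame API shown in steps 3 and 4.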

Next Steps

Field Types - Understand string vs text fields
Configuration - Tune for your workload
IndexQuery Syntax - Master full-text search
Your First Production Index - Deploy to S3
