
Quickstart

Get up and running with IndexTables in 5 minutes.

1. Create Sample Data

import spark.implicits._

val data = Seq(
  (1, "Introduction to Machine Learning", "machine learning basics tutorial"),
  (2, "Advanced Deep Learning", "neural networks deep learning AI"),
  (3, "Data Engineering Best Practices", "spark hadoop data pipelines"),
  (4, "Search Engine Architecture", "tantivy lucene search indexing")
).toDF("id", "title", "content")

2. Write an Index

data.write
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .option("spark.indextables.indexing.typemap.title", "string")
  .option("spark.indextables.indexing.typemap.content", "text")
  .mode("overwrite")
  .save("/tmp/my_index")

Field Types
  • String fields (default): Exact matching, full filter pushdown
  • Text fields: Full-text search with IndexQuery
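
The difference between the two field types can be illustrated with a small plain-Scala sketch. This is a conceptual illustration only, not IndexTables' implementation: the tokenizer (lowercase, split on non-alphanumeric characters) and the helper names `exactMatch`, `tokenize`, and `allTermsMatch` are assumptions made for the example.

```scala
// Conceptual illustration only: this is NOT how IndexTables is implemented.
// Assumption: a "text" field is split into lowercase word tokens, while a
// "string" field is kept as a single unmodified value.
object FieldTypeSketch {
  // String-style match: the whole stored value must equal the query value.
  def exactMatch(stored: String, query: String): Boolean =
    stored == query

  // Hypothetical tokenizer: lowercase, then split on anything that is not
  // a letter or a digit, dropping empty fragments.
  def tokenize(text: String): Set[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSet

  // Text-style match: every query term must appear among the stored tokens,
  // regardless of order.
  def allTermsMatch(stored: String, query: String): Boolean =
    tokenize(query).subsetOf(tokenize(stored))

  def main(args: Array[String]): Unit = {
    val title = "Introduction to Machine Learning"
    println(exactMatch(title, "Introduction to Machine Learning")) // true
    println(exactMatch(title, "Machine Learning"))                 // false
    println(allTermsMatch("machine learning basics tutorial", "learning machine")) // true
  }
}
```

Under these assumptions, a partial title never matches a string field, while a text field matches on individual terms in any order.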

3. Query the Index

val df = spark.read
  .format("io.indextables.spark.core.IndexTables4SparkTableProvider")
  .load("/tmp/my_index")

// Standard SQL filters (pushed down for string fields)
df.filter($"title" === "Introduction to Machine Learning").show()

// Full-text search with IndexQuery
import org.apache.spark.sql.indextables.IndexQueryExpression._
df.filter($"content" indexquery "machine learning").show()

4. Run Aggregations

import org.apache.spark.sql.functions.count

// Aggregations are pushed down to the search engine
df.agg(count("*")).show()

// With filters
df.filter($"content" indexquery "deep learning")
  .agg(count("*"))
  .show()

5. Create a Temp View

df.createOrReplaceTempView("articles")

6. Use SQL

-- Query with IndexQuery
SELECT * FROM articles
WHERE content indexquery 'machine AND learning';

-- Aggregations
SELECT COUNT(*) FROM articles
WHERE content indexquery 'neural AND networks';
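
To run the two statements above from the same Scala session used in the earlier steps, one option is `spark.sql`, which executes one statement per call. The following is a minimal sketch, assuming the `articles` temp view from step 5 exists and that the session is configured so the `indexquery` operator is recognized in SQL; the object name `QuickstartSql` and its members are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object QuickstartSql {
  // The two statements from this step, kept as separate strings because
  // spark.sql executes a single statement per call.
  val searchQuery: String =
    "SELECT * FROM articles WHERE content indexquery 'machine AND learning'"
  val countQuery: String =
    "SELECT COUNT(*) FROM articles WHERE content indexquery 'neural AND networks'"

  // Assumption: `spark` already has the "articles" temp view registered and
  // is set up so that the indexquery operator parses in SQL.
  def run(spark: SparkSession): Unit = {
    spark.sql(searchQuery).show() // prints the matching rows
    spark.sql(countQuery).show()  // prints the aggregated count
  }
}
```

In a Spark shell, `QuickstartSql.run(spark)` would print both result sets.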

Next Steps