The entry point to programming Spark with the Dataset and DataFrame API.

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read data from various sources. Create one with the static SparkSession.builder() method.

The SparkSession provides access to:

  • sql - Execute SQL queries and return results as DataFrames
  • read - Read data from external sources (CSV, JSON, Parquet, tables, etc.)
  • createDataFrame - Create DataFrames from in-memory data
  • table - Load tables as DataFrames
  • range - Generate DataFrames with sequences of numbers
  • catalog - Access the Spark catalog for metadata operations
  • udf - Register and manage user-defined functions
  • conf - Configure Spark runtime settings
// Create a SparkSession
const spark = await SparkSession.builder()
  .remote("sc://localhost:15002")
  .appName("MyApp")
  .getOrCreate();

// Execute SQL
const df = await spark.sql("SELECT * FROM users WHERE age > 21");
await df.show();

// Read data
const csvDf = spark.read.csv("data.csv");

// Create a DataFrame from in-memory data
const schema = new StructType([
  new StructField("name", DataTypes.StringType),
  new StructField("age", DataTypes.IntegerType)
]);
const data = [new Row(schema, ["Alice", 30]), new Row(schema, ["Bob", 25])];
const df2 = spark.createDataFrame(data, schema);

Since 1.0.0

Properties

catalog: Catalog
udf: UDFRegistration
client: Client
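
For example, the catalog and udf properties can be used directly from a session. A minimal sketch; the listTables and register signatures shown here mirror Spark's Catalog and UDFRegistration APIs in other languages and are assumptions, not confirmed by this page:

// List catalog tables (assumes Catalog.listTables() exists, as in Spark's Catalog API)
const tables = await spark.catalog.listTables();

// Register a UDF (assumes a register(name, fn, returnType) signature, as in Spark's API)
spark.udf.register("plusOne", (x: number) => x + 1, DataTypes.IntegerType);
const df = await spark.sql("SELECT plusOne(age) AS agePlusOne FROM users");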

Methods

builder

  • Creates a SparkSessionBuilder for constructing a SparkSession.

    Returns SparkSessionBuilder

    A new SparkSessionBuilder instance

    const spark = await SparkSession.builder()
      .remote("sc://localhost:15002")
      .appName("MyApplication")
      .config("spark.sql.shuffle.partitions", "10")
      .getOrCreate();

config

  • get conf(): RuntimeConfig

    Runtime configuration interface for this SparkSession.

    Returns RuntimeConfig

    The RuntimeConfig instance for getting/setting Spark configuration

    await spark.conf.set("spark.sql.shuffle.partitions", "200");
    const value = await spark.conf.get("spark.sql.shuffle.partitions");

core

  • Returns the Spark version this SparkSession is connected to.

    Returns Promise<string>

    A promise that resolves to the Spark version string

    const version = await spark.version();
    console.log(`Connected to Spark ${version}`);
  • Executes a SQL query and returns the result as a DataFrame.

    Parameters

    • sqlStr: string

      The SQL query string to execute

    Returns Promise<DataFrame>

    A promise that resolves to a DataFrame containing the query results

    const df = await spark.sql("SELECT * FROM users WHERE age > 21");
    await df.show();

    // Use with table joins
    const joined = await spark.sql(`
      SELECT u.name, o.amount
      FROM users u
      JOIN orders o ON u.id = o.user_id
    `);
  • Creates a DataFrame from an array of Row objects with a specified schema.

    Parameters

    • data: Row[]

      An array of Row objects

    • schema: StructType

      The schema defining the structure of the DataFrame

    Returns DataFrame

    A new DataFrame containing the provided data

    const schema = new StructType([
      new StructField("name", DataTypes.StringType),
      new StructField("age", DataTypes.IntegerType)
    ]);

    const data = [
      new Row(schema, { name: "Alice", age: 30 }),
      new Row(schema, { name: "Bob", age: 25 })
    ];

    const df = spark.createDataFrame(data, schema);
    await df.show();
  • Creates a DataFrame from an Apache Arrow Table.

    Parameters

    • table: Table

      An Apache Arrow Table object

    • schema: StructType

      The Spark schema defining the structure of the DataFrame

    Returns DataFrame

    A new DataFrame containing the Arrow table data

    import { Table } from 'apache-arrow';

    // Assuming you have an Arrow table
    const arrowTable = ...; // Arrow Table
    const schema = new StructType([...]);

    const df = spark.createDataFrameFromArrowTable(arrowTable, schema);
    await df.show();
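
    If you first need to build an Arrow table in memory, the apache-arrow package's tableFromArrays helper is one way to do it. A minimal sketch; tableFromArrays and the Int32Array column typing come from apache-arrow, and the Arrow Int32 to Spark IntegerType mapping is an assumption:

    import { tableFromArrays } from 'apache-arrow';

    // Build a two-column Arrow table from plain JavaScript arrays
    const arrowTable = tableFromArrays({
      name: ['Alice', 'Bob'],
      age: Int32Array.from([30, 25]) // Int32Array yields an Arrow Int32 column
    });

    const schema = new StructType([
      new StructField("name", DataTypes.StringType),
      new StructField("age", DataTypes.IntegerType) // assumed to match Arrow Int32
    ]);

    const df = spark.createDataFrameFromArrowTable(arrowTable, schema);
    await df.show();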
  • Creates a DataFrame with a single column of Long values from 0 (inclusive) to end (exclusive).

    Parameters

    • end: number | bigint

      The end value (exclusive)

    Returns DataFrame

    A DataFrame with a single column named "id" containing values from 0 to end-1

  • Creates a DataFrame with a single column of Long values.

    Parameters

    • start: number | bigint

      The start value (inclusive)

    • end: number | bigint

      The end value (exclusive)

    Returns DataFrame

    A DataFrame with a single column named "id" containing values from start to end-1

  • Creates a DataFrame with a single column of Long values with a custom step.

    Parameters

    • start: number | bigint

      The start value (inclusive)

    • end: number | bigint

      The end value (exclusive)

    • step: number | bigint

      The increment between consecutive values

    Returns DataFrame

    A DataFrame with a single column named "id"

  • Creates a DataFrame with a single column of Long values with custom step and partitions.

    Parameters

    • start: number | bigint

      The start value (inclusive)

    • end: number | bigint

      The end value (exclusive)

    • step: number | bigint

      The increment between consecutive values

    • numPartitions: number

      The number of partitions for the resulting DataFrame

    Returns DataFrame

    A DataFrame with a single column named "id"

    // Generate 0 to 9
    const df1 = spark.range(10);

    // Generate 5 to 14
    const df2 = spark.range(5, 15);

    // Generate 0, 2, 4, 6, 8
    const df3 = spark.range(0, 10, 2);

    // Generate with 4 partitions
    const df4 = spark.range(0, 100, 1, 4);
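
    The result is an ordinary DataFrame, so it composes with the rest of the API; for example:

    // Display the generated ids 0, 2, 4, 6, 8
    await spark.range(0, 10, 2).show();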

io

  • get read(): DataFrameReader

    Returns a DataFrameReader for reading data from external sources.

    Returns DataFrameReader

    A DataFrameReader instance

    // Read CSV
    const df = spark.read.csv("data.csv");

    // Read JSON
    const df2 = spark.read.json("data.json");

    // Read Parquet with options
    const df3 = spark.read
      .option("mergeSchema", "true")
      .parquet("data.parquet");

    // Read from table
    const df4 = spark.read.table("my_table");
  • Returns the specified table as a DataFrame.

    Parameters

    • name: string

      The name of the table to load

    Returns DataFrame

    A DataFrame representing the table

    const users = spark.table("users");
    await users.show();

    // With database qualifier
    const logs = spark.table("production.logs");