Returns the schema of this DataFrame.
Persist this DataFrame with the default storage level (MEMORY_AND_DISK).
Persist this DataFrame with the given storage level.
The storage level to use.
Persist this DataFrame with the default storage level (MEMORY_AND_DISK).
Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.
Whether to block until all blocks are deleted.
Get the DataFrame's current storage level, or StorageLevel.NONE if not persisted.
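A minimal usage sketch (assuming the methods follow Spark's usual Dataset naming of persist, unpersist, and count, and that they return promises; none of this is confirmed by the signatures above):
await df.persist();              // default storage level: MEMORY_AND_DISK
const total = await df.count();  // first action materializes and caches the data
await df.show();                 // served from the cache
await df.unpersist(true);        // block until all blocks are deleted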
Create a write builder for writing to a table using the V2 API.
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form.
Number of rows to show
Displays the Dataset in a tabular form.
Number of rows to show
If set to true, truncate displayed column values to 20 characters. Default is true.
Displays the Dataset in a tabular form.
Number of rows to show
If set to true, truncate displayed column values to 20 characters. Default is true.
If set to true, print output rows vertically (one line per column value)
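For instance, a hedged sketch of the show overloads described above (the exact signatures and promise-based return values are assumptions):
await df.show();               // 20 rows, values truncated to 20 characters
await df.show(5);              // first 5 rows
await df.show(5, false);       // first 5 rows, no truncation
await df.show(5, true, true);  // vertical output: one line per column value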
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
{{{
  // The following are equivalent:
  df.selectExpr("colA", "colB as newName", "abs(colC)")
  df.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
  // TODO: support expr(..) function
}}}
Selects column based on the column name and returns it as a [[org.apache.spark.sql.Column]].
string column name
Column
Selects column based on the column name specified as a regex and returns it as [[org.apache.spark.sql.Column]].
string column name specified as a regex
Column
Selects a metadata column based on its logical column name, and returns it as a [[org.apache.spark.sql.Column]].
A metadata column can be accessed this way even if the underlying data source defines a data column with a conflicting name.
string column name
Column
Returns all rows in this DataFrame as an array of Row objects.
A promise that resolves to an array of Row objects
Returns the first n rows.
The number of rows to return
A promise that resolves to an array of Row objects
Returns the first row. Alias for head().
A promise that resolves to the first Row
Returns the first n rows. Alias for head(n).
The number of rows to return
A promise that resolves to an array of Row objects
Returns the last n rows in the DataFrame.
The number of rows to return from the end
A promise that resolves to an array of Row objects
Returns the number of rows in the Dataset.
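A short sketch of these actions, assuming the method names collect, first, take, tail, and count described above, each returning a promise:
const allRows = await df.collect();  // every row as an array of Row objects
const firstRow = await df.first();   // alias for head()
const top3 = await df.take(3);       // alias for head(3)
const last2 = await df.tail(2);      // last two rows
const numRows = await df.count();    // total row count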
Returns a checkpointed version of this DataFrame. Checkpointing can be used to truncate the
logical plan of this DataFrame, which is especially useful in iterative algorithms where the
plan may grow exponentially. It will be saved to files inside the checkpoint
directory set with spark.sql.checkpoint.location.
Whether to checkpoint this DataFrame immediately (default is true). If false, the checkpoint will be performed when the DataFrame is first materialized.
Returns a locally checkpointed version of this DataFrame. Checkpointing can be used to truncate the logical plan of this DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to a local temporary directory.
This is a local checkpoint and is less reliable than a regular checkpoint because it is stored in executor storage and may be lost if executors fail.
Whether to checkpoint this DataFrame immediately (default is true). If false, the checkpoint will be performed when the DataFrame is first materialized.
Optional storageLevel: StorageLevel. The storage level to use for the local checkpoint. If not specified, the default storage level is used.
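A minimal sketch of eager versus lazy checkpointing, assuming the methods are named checkpoint and localCheckpoint and resolve to the checkpointed DataFrame (the StorageLevel constant name is also an assumption):
// Eager checkpoint: truncates the logical plan immediately.
const checkpointed = await df.checkpoint();
// Lazy local checkpoint with an explicit storage level; performed when the
// DataFrame is first materialized and stored in executor storage.
const localCp = await df.localCheckpoint(false, StorageLevel.MEMORY_AND_DISK);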
Specifies some hint on the current DataFrame. As an example, the following code specifies that one side of the join can be broadcast:
{{{ df1.join(df2.hint("broadcast")) }}}
The following code specifies that this Dataset could be rebalanced with the given number of partitions:
{{{ df1.hint("rebalance", 10) }}}
The name of the hint.
The parameters of the hint. Each parameter should be a Column or Expression, or a value that can be converted into a Literal.
Returns a new DataFrame with columns renamed.
New column names. If empty, returns this DataFrame unchanged.
A new DataFrame with the specified column names
Returns a new DataFrame with the specified schema applied.
The schema to apply to this DataFrame
A new DataFrame with the specified schema
Returns a new DataFrame by taking the first n rows.
The number of rows to take
A new DataFrame with at most n rows
Returns a new DataFrame by skipping the first n rows.
The number of rows to skip
A new DataFrame with the first n rows removed
Filters rows using the given SQL expression string.
A SQL expression string representing the filter condition
A new DataFrame with rows matching the condition
Filters rows using the given SQL expression string. Alias for filter().
A SQL expression string representing the filter condition
A new DataFrame with rows matching the condition
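For example (a sketch assuming filter and where accept a SQL expression string as documented above):
const adults = df.filter("age >= 18 AND country = 'US'");
const sameThing = df.where("age >= 18 AND country = 'US'"); // where() is an alias for filter()
await adults.show();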
Returns true if this DataFrame contains one or more sources that continuously return data as it arrives.
A promise that resolves to true if this is a streaming DataFrame
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.
This is equivalent to UNION DISTINCT in SQL.
To do a SQL-style union that keeps duplicates, use [[unionAll]].
Also as standard in SQL, this function resolves columns by position (not by name):
{{{
  val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
  val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
  df1.union(df2).show()
  // output:
  // +----+----+----+
  // |col0|col1|col2|
  // +----+----+----+
  // |   1|   2|   3|
  // |   4|   5|   6|
  // +----+----+----+
}}}
Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use [[unionByName]] to resolve columns by field name in the typed objects.
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL.
To do a SQL-style set union (that does deduplication of elements), use [[union]].
Also as standard in SQL, this function resolves columns by position (not by name).
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.
Unlike [[union]], this function resolves columns by name (not by position).
This is equivalent to UNION ALL in SQL with column name matching.
When the parameter allowMissingColumns is true, the set of column names
in this and other Dataset can differ; missing columns will be filled with null.
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.
Unlike [[union]], this function resolves columns by name (not by position).
This is equivalent to UNION ALL in SQL with column name matching.
When the parameter allowMissingColumns is true, the set of column names
in this and other Dataset can differ; missing columns will be filled with null.
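A hedged sketch of name-based union, assuming the method is named unionByName and takes the optional allowMissingColumns flag described above:
// Columns are matched by name, not by position.
const combined = df1.unionByName(df2);
// With allowMissingColumns = true, columns present in only one input are filled with null.
const lenient = df1.unionByName(df2, true);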
Returns a new DataFrame that has exactly numPartitions partitions.
This operation requires a shuffle, making it a wide transformation.
The target number of partitions. Must be positive.
Returns a new DataFrame partitioned by the given partitioning expressions,
using numPartitions partitions. The resulting DataFrame is hash partitioned.
This operation requires a shuffle, making it a wide transformation.
The target number of partitions
Column expressions to partition by
Returns a new DataFrame that has exactly numPartitions partitions, when fewer partitions are
requested. If a larger number of partitions is requested, it will stay at the current number of
partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency,
e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each
of the 100 new partitions will claim 10 of the current partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
The target number of partitions. Must be positive.
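A sketch contrasting the two, assuming the methods are named coalesce and repartition and that col() can be used as in the examples elsewhere on this page:
// Narrow dependency: no shuffle, never more than the current number of partitions.
const fewer = df.coalesce(10);
// Full shuffle: exactly 200 hash partitions; upstream work stays parallel.
const reshuffled = df.repartition(200);
// Hash-partition by an expression.
const byDept = df.repartition(50, col('department'));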
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting DataFrame.
This operation requires a shuffle, making it a wide transformation.
Column expressions to partition by
Returns a new DataFrame partitioned by the given partitioning expressions into
numPartitions. The resulting DataFrame is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting DataFrame.
This operation requires a shuffle, making it a wide transformation.
The target number of partitions
Column expressions to partition by
Apply a function to each partition of the DataFrame.
This method applies a user-defined function to each partition of the DataFrame. The function should take an iterator of rows and return an iterator of rows.
Python code as a string defining the partition processing function
The output schema for the transformed DataFrame
Python version (default: '3.11')
A new DataFrame with the function applied to each partition
const pythonCode = `
def process_partition(partition):
    for row in partition:
        yield (row.id * 2, row.value)
`;
const schema = DataTypes.createStructType([
DataTypes.createStructField('id', DataTypes.IntegerType, false),
DataTypes.createStructField('value', DataTypes.StringType, false),
]);
const result = df.mapPartitions(pythonCode, schema);
Co-group two DataFrames and apply a function to each group.
This method groups two DataFrames by the specified columns and applies a user-defined function to each group pair. The function receives the group key and iterators for rows from both DataFrames.
The other DataFrame to co-group with
Columns to group by for this DataFrame
Columns to group by for the other DataFrame
Python code as a string defining the co-group processing function
The output schema for the transformed DataFrame
Python version (default: '3.11')
A new DataFrame with the function applied to each co-group
const pythonCode = `
def cogroup_func(key, left_rows, right_rows):
    for l in left_rows:
        for r in right_rows:
            yield (key.id, l.value, r.value)
`;
const schema = DataTypes.createStructType([
DataTypes.createStructField('id', DataTypes.IntegerType, false),
DataTypes.createStructField('left_value', DataTypes.StringType, false),
DataTypes.createStructField('right_value', DataTypes.StringType, false),
]);
const result = df1.coGroupMap(df2, [col('id')], [col('id')], pythonCode, schema);
Groups the Dataset using the specified columns, so we can run aggregation on them. See [[RelationalGroupedDataset]] for all the available aggregate functions.
{{{
  // Compute the average for all numeric columns grouped by department.
  ds.groupBy($"department").avg()

  // Compute the max age and average salary, grouped by department and gender.
  ds.groupBy($"department", $"gender").agg(Map(
    "salary" -> "avg",
    "age" -> "max"
  ))
}}}
Groups the Dataset using the specified columns, so we can run aggregation on them. See [[RelationalGroupedDataset]] for all the available aggregate functions.
{{{
  // Compute the average for all numeric columns grouped by department.
  ds.groupBy($"department").avg()

  // Compute the max age and average salary, grouped by department and gender.
  ds.groupBy($"department", $"gender").agg(Map(
    "salary" -> "avg",
    "age" -> "max"
  ))
}}}
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See [[RelationalGroupedDataset]] for all the available aggregate functions.
{{{
  // Compute the average for all numeric columns rolled up by department and group.
  ds.rollup(col("department"), col("group")).avg()
}}}
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See [[RelationalGroupedDataset]] for all the available aggregate functions.
{{{
  // Compute the average for all numeric columns rolled up by department and group.
  ds.rollup(col("department"), col("group")).avg()
}}}
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See [[RelationalGroupedDataset]] for all the available aggregate functions.
{{{
  // Compute the average for all numeric columns cubed by department and group.
  ds.cube($"department", $"group").avg()

  // Compute the max age and average salary, cubed by department and gender.
  ds.cube($"department", $"gender").agg(Map(
    "salary" -> "avg",
    "age" -> "max"
  ))
}}}
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See [[RelationalGroupedDataset]] for all the available aggregate functions.
{{{
  // Compute the average for all numeric columns cubed by department and group.
  ds.cube($"department", $"group").avg()

  // Compute the max age and average salary, cubed by department and gender.
  ds.cube($"department", $"gender").agg(Map(
    "salary" -> "avg",
    "age" -> "max"
  ))
}}}
Join with another DataFrame, using the given join expression. The following performs a full
outer join between df1 and df2.
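A sketch of that join (the Column.equalTo comparison used to build the join expression is an assumption, not documented in this section):
const joined = df1.join(df2, df1.col('key').equalTo(df2.col('key')), 'full_outer');
await joined.show();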
Right side of the join.
Join expression.
Type of join to perform. Default inner. Must be one of:
inner, cross, outer, full, fullouter, full_outer, left, leftouter,
left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi,
anti, leftanti, left_anti.
Join with another DataFrame using the list of columns to join on.
Right side of the join.
Names of columns to join on. These columns must exist on both sides.
Type of join to perform. Default inner. Must be one of:
inner, cross, outer, full, fullouter, full_outer, left, leftouter,
left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi,
anti, leftanti, left_anti.
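For example (a sketch assuming the column-name variant accepts an array of names and one of the join type strings listed above):
// Inner join (the default) on a single shared column.
const inner = df1.join(df2, ['id']);
// Left outer join on two shared columns.
const left = df1.join(df2, ['id', 'date'], 'left_outer');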
Perform an as-of join between this DataFrame and another DataFrame.
This is similar to a left-join except that we match on nearest key rather than equal keys. For each row in the left DataFrame, we find the closest match in the right DataFrame based on the as-of column(s) and join condition.
Right side of the join.
Column to join on from the left DataFrame.
Column to join on from the right DataFrame.
Optional joinExprs: Column. An additional join expression.
Optional joinType: string. Type of join to perform. Default inner.
Optional tolerance: Column. Tolerance for inexact matches.
Optional allowExactMatches: boolean. Whether to allow exact matches. Default true.
Optional direction: string. Direction of search. One of: backward, forward, nearest. Default backward.
Perform an as-of join between this DataFrame and another DataFrame using column names.
Right side of the join.
Column name to join on from the left DataFrame.
Column name to join on from the right DataFrame.
Optional usingColumns: string[]. Names of columns to join on. These columns must exist on both sides.
Optional joinType: string. Type of join to perform. Default inner.
Optional tolerance: Column. Tolerance for inexact matches.
Optional allowExactMatches: boolean. Whether to allow exact matches. Default true.
Optional direction: string. Direction of search. One of: backward, forward, nearest. Default backward.
Perform a lateral join between this DataFrame and another DataFrame.
Lateral joins allow the right side to reference columns from the left side. This is useful for operations like exploding arrays or applying table-valued functions.
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns
set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation,
which cannot be reversed.
This function is useful to massage a DataFrame into a format where some columns are
identifier columns ("ids"), while all other columns ("values") are "unpivoted" to the rows,
leaving just two non-id columns, named as given by variableColumnName and
valueColumnName.
When no "id" columns are given, the unpivoted DataFrame consists of only the "variable" and "value" columns.
All "value" columns must share a least common data type. Unless they are the same data type,
all "value" columns are cast to the nearest common data type. For instance, types
IntegerType and LongType are cast to LongType, while IntegerType and StringType do
not have a common data type and unpivot fails with an AnalysisException.
Id columns
Name of the variable column
Name of the value column
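A hedged sketch, assuming the method is named unpivot and takes the id columns, variable column name, and value column name listed above:
// Wide to long: keep 'id' as the identifier, turn every other column into (variable, value) rows.
const longDf = df.unpivot([col('id')], 'variable', 'value');
await longDf.show();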
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns
set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation,
which cannot be reversed.
This function is useful to massage a DataFrame into a format where some columns are
identifier columns ("ids"), while all other columns ("values") are "unpivoted" to the rows,
leaving just two non-id columns, named as given by variableColumnName and
valueColumnName.
When no "id" columns are given, the unpivoted DataFrame consists of only the "variable" and "value" columns.
All "value" columns must share a least common data type. Unless they are the same data type,
all "value" columns are cast to the nearest common data type. For instance, types
IntegerType and LongType are cast to LongType, while IntegerType and StringType do
not have a common data type and unpivot fails with an AnalysisException.
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns
set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation,
which cannot be reversed. This is an alias for unpivot.
Id columns
Name of the variable column
Name of the value column
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns
set. This is the reverse to groupBy(...).pivot(...).agg(...), except for the aggregation,
which cannot be reversed. This is an alias for unpivot.
Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.
For example:
{{{
  val df = Seq(("A", 1, 2), ("B", 3, 4)).toDF("id", "val1", "val2")
  df.show()
  // output:
  // +---+----+----+
  // | id|val1|val2|
  // +---+----+----+
  // |  A|   1|   2|
  // |  B|   3|   4|
  // +---+----+----+

  df.transpose($"id").show()
  // output:
  // +----+---+---+
  // | key|  A|  B|
  // +----+---+---+
  // |val1|  1|  3|
  // |val2|  2|  4|
  // +----+---+---+
  // schema:
  // root
  //  |-- key: string (nullable = false)
  //  |-- A: integer (nullable = true)
  //  |-- B: integer (nullable = true)

  df.transpose().show()
  // output:
  // +----+---+---+
  // | key|  A|  B|
  // +----+---+---+
  // |val1|  1|  3|
  // |val2|  2|  4|
  // +----+---+---+
  // schema:
  // root
  //  |-- key: string (nullable = false)
  //  |-- A: integer (nullable = true)
  //  |-- B: integer (nullable = true)
}}}
Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.
For example:
{{{
  val df = Seq(("A", 1, 2), ("B", 3, 4)).toDF("id", "val1", "val2")
  df.show()
  // output:
  // +---+----+----+
  // | id|val1|val2|
  // +---+----+----+
  // |  A|   1|   2|
  // |  B|   3|   4|
  // +---+----+----+

  df.transpose($"id").show()
  // output:
  // +----+---+---+
  // | key|  A|  B|
  // +----+---+---+
  // |val1|  1|  3|
  // |val2|  2|  4|
  // +----+---+---+
  // schema:
  // root
  //  |-- key: string (nullable = false)
  //  |-- A: integer (nullable = true)
  //  |-- B: integer (nullable = true)

  df.transpose().show()
  // output:
  // +----+---+---+
  // | key|  A|  B|
  // +----+---+---+
  // |val1|  1|  3|
  // |val2|  2|  4|
  // +----+---+---+
  // schema:
  // root
  //  |-- key: string (nullable = false)
  //  |-- A: integer (nullable = true)
  //  |-- B: integer (nullable = true)
}}}
The single column that will be treated as the index for the transpose operation. This column will be used to pivot the data, transforming the DataFrame such that the values of the indexColumn become the new columns in the transposed DataFrame.
A distributed collection of data organized into named columns.
Remarks
DataFrame is the primary abstraction in Spark SQL for working with structured data. It provides a domain-specific language for distributed data manipulation and supports a wide variety of operations including selecting, filtering, joining, and aggregating.
DataFrames are lazy - operations are not executed until an action (like collect, count, or show) is called. This allows Spark to optimize the execution plan.
Example
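A minimal end-to-end sketch (the spark.sql entry point and promise-based actions are assumptions; the DataFrame methods used are the ones documented above):
// Build a DataFrame, add lazy transformations, then trigger execution with an action.
const df = await spark.sql('SELECT * FROM employees');
const result = df
  .filter('salary > 50000')
  .selectExpr('name', 'salary * 1.1 as adjusted_salary');
await result.show();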
Since
1.0.0