Class DataFrameWriter

Interface used to write a [[Dataset]] to external storage systems (e.g. file systems, key-value stores, etc). Use DataFrame.write() to access this.

Stable

Since

1.0.0

See

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/DataFrameWriter.html

Author

Kent Yao yao@apache.org

TODO: Some features are not implemented yet:

Index

Constructors

constructor

Properties

Methods

mode format option options partitionBy bucketBy sortBy clusterBy save insertInto saveAsTable jdbc json parquet orc text csv xml

Constructors

constructor

new DataFrameWriter(df: DataFrame): DataFrameWriter
Parameters
- df: DataFrame
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:48

Properties

df

df: DataFrame

Methods

mode

mode(saveMode: string | SaveMode): DataFrameWriter
Parameters
- saveMode: string | SaveMode
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:50

format

format(source: string): DataFrameWriter
Parameters
- source: string
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:71

option

option(key: string, value: string | number | boolean): DataFrameWriter
Parameters
- key: string
- value: string | number | boolean
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:76

options

options(opts: Record<string, string>): DataFrameWriter
Parameters
- opts: Record<string, string>
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:81

partitionBy

partitionBy(...cols: string[]): DataFrameWriter
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
Parameters
- ...cols: string[]
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:98

bucketBy

bucketBy(
    numBuckets: number,
    colName: string,
    ...colNames: string[],
): DataFrameWriter
Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive's bucketing scheme, but with a different bucket hash function and is not compatible with Hive's bucketing.
Parameters
- numBuckets: number
- colName: string
- ...colNames: string[]
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:109

sortBy

sortBy(colName: string, ...colNames: string[]): DataFrameWriter
Sorts the output in each bucket by the given columns.
Parameters
- colName: string
- ...colNames: string[]
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:119

clusterBy

clusterBy(colName: string, ...colNames: string[]): DataFrameWriter
Clusters the output by the given columns on the storage. The rows with matching values in the specified clustering columns will be consolidated within the same group.

For instance, if you cluster a dataset by date, the data sharing the same date will be stored together in a file. This arrangement improves query efficiency when you apply selective filters to these clustering columns, thanks to data skipping.
Parameters
- colName: string
- ...colNames: string[]
Returns DataFrameWriter
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:132

save

save(path?: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame as the specified table.
Parameters
- Optionalpath: string
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:141

insertInto

insertInto(tableName: string): Promise<ExecutePlanResponseHandler[]>
Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
Parameters
- tableName: string
Returns Promise<ExecutePlanResponseHandler[]>
Note
Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:

Note
SaveMode.ErrorIfExists and SaveMode.Ignore behave as SaveMode.Append in insertInto as insertInto is not a table creating operation.
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:160

saveAsTable

saveAsTable(tableName: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame as the specified table.

In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function (default to throwing an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.

When mode is Append, if there is an existing table, we will use the format and options of the existing table. The column order in the schema of the DataFrame doesn't need to be same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions. For example:

In this method, save mode is used to determine the behavior if the data source table exists in Spark catalog. We will always overwrite the underlying data of data source (e.g. a table in JDBC data source) if the table doesn't exist in Spark catalog, and will always append to the underlying data of data source if the table already exists.

When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
Parameters
- tableName: string
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:191

jdbc

jdbc(
    url: string,
    table: string,
    connectionProperties: Record<string, string>,
): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame to an external database table via JDBC. In the case the table already exists in the external database, behavior of this function depends on the save mode, specified by the mode function (default to throwing an exception).

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

JDBC-specific option and parameter documentation for storing tables via JDBC in Data Source Option in the version you use.
Parameters
- url: string
  JDBC database url of the form jdbc:subprotocol:subname.
- table: string
  Name of the table in the external database.
- connectionProperties: Record<string, string>
  JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "batchsize" can be used to control the number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED", "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard transaction isolation levels defined by JDBC's Connection object, with default of "READ_UNCOMMITTED".
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:221

json

json(path: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame in JSON format ( JSON Lines text format or newline-delimited JSON) at the specified path. This is equivalent to: {{{ format("json").save(path) }}}

You can find the JSON-specific options for writing JSON files in Data Source Option in the version you use.
Parameters
- path: string
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:245

parquet

parquet(path: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to: {{{ format("parquet").save(path) }}}

Parquet-specific option(s) for writing Parquet files can be found in Data Source Option in the version you use.
Parameters
- path: string
Returns Promise<ExecutePlanResponseHandler[]>
Since
1.0.0
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:262

orc

orc(path: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame in ORC format at the specified path. This is equivalent to: {{{ format("orc").save(path) }}}

ORC-specific option(s) for writing ORC files can be found in Data Source Option in the version you use.
Parameters
- path: string
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:277

text

text(path: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame in a text file at the specified path. The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file.

You can find the text-specific options for writing text files in Data Source Option in the version you use.
Parameters
- path: string
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:289

csv

csv(path: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame in CSV format at the specified path. This is equivalent to: {{{ format("csv").save(path) }}}

You can find the CSV-specific options for writing CSV files in Data Source Option in the version you use.
Parameters
- path: string
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:304

xml

xml(path: string): Promise<ExecutePlanResponseHandler[]>
Saves the content of the DataFrame in XML format at the specified path. This is equivalent to: {{{ format("xml").save(path) }}}

Note that writing a XML file from DataFrame having a field ArrayType with its element as ArrayType would have an additional nested field for the element. For example, the DataFrame having a field below,

{@code fieldA [[data1], [data2]]}

would produce a XML file below. {@code data1 data2 }

Namely, roundtrip in writing and reading can end up in different schema structure.

You can find the XML-specific options for writing XML files in Data Source Option in the version you use.
Parameters
- path: string
Returns Promise<ExecutePlanResponseHandler[]>
- Defined in src/org/apache/spark/sql/DataFrameWriter.ts:330

Class DataFrameWriter

Stable

Since

See

Author

Index

Constructors

Properties

Methods

Constructors

constructor

Parameters

Returns DataFrameWriter

Properties

df

Methods

mode

Parameters

Returns DataFrameWriter

format

Parameters

Returns DataFrameWriter

option

Parameters

Returns DataFrameWriter

options

Parameters

Returns DataFrameWriter

partitionBy

Parameters

Returns DataFrameWriter

bucketBy

Parameters

Returns DataFrameWriter

sortBy

Parameters

Returns DataFrameWriter

clusterBy

Parameters

Returns DataFrameWriter

save

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

insertInto

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

Note

Note

saveAsTable

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

jdbc

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

json

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

parquet

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

Since

orc

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

text

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

csv

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

xml

Parameters

Returns Promise<ExecutePlanResponseHandler[]>

Settings

On This Page