Class Dataset<T>
- All Implemented Interfaces:
Serializable
DataFrame
, which is a Dataset of Row
.
Operations available on Datasets are divided into transformations and actions. Transformations
are the ones that produce new Datasets, and actions are the ones that trigger computation and
return results. Example transformations include map, filter, select, and aggregate (groupBy
).
Example actions count, show, or writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally,
a Dataset represents a logical plan that describes the computation required to produce the data.
When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a
physical plan for efficient execution in a parallel and distributed manner. To explore the
logical plan as well as optimized physical plan, use the explain
function.
To efficiently support domain-specific objects, an Encoder
is required. The encoder maps
the domain specific type T
to Spark's internal type system. For example, given a class Person
with two fields, name
(string) and age
(int), an encoder is used to tell Spark to generate
code at runtime to serialize the Person
object into a binary structure. This binary structure
often has much lower memory footprint as well as are optimized for efficiency in data processing
(e.g. in a columnar format). To understand the internal binary representation for data, use the
schema
function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark
to some files on storage systems, using the read
function available on a SparkSession
.
val people = spark.read.parquet("...").as[Person] // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
val names = people.map(_.name) // in Scala; names is a Dataset[String]
Dataset<String> names = people.map(
(MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
Dataset operations can also be untyped, through various domain-specific-language (DSL)
functions defined in: Dataset (this class), Column
, and functions
. These operations
are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use apply
method in Scala and col
in Java.
val ageCol = people("age") // in Scala
Column ageCol = people.col("age"); // in Java
Note that the Column
type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), people("gender"))
.agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");
people.filter(people.col("age").gt(30))
.join(department, people.col("deptId").equalTo(department.col("id")))
.groupBy(department.col("name"), people.col("gender"))
.agg(avg(people.col("salary")), max(people.col("age")));
- Since:
- 1.6.0
- See Also:
-
Constructor Summary
ConstructorDescriptionDataset
(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder) Dataset
(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder) -
Method Summary
Modifier and TypeMethodDescription(Java-specific) Aggregates on the entire Dataset without groups.Aggregates on the entire Dataset without groups.Aggregates on the entire Dataset without groups.(Scala-specific) Aggregates on the entire Dataset without groups.agg
(scala.Tuple2<String, String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String, String>> aggExprs) (Scala-specific) Aggregates on the entire Dataset without groups.Returns a new Dataset with an alias set.alias
(scala.Symbol alias) (Scala-specific) Returns a new Dataset with an alias set.static <T> Dataset<T>
apply
(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> evidence$1) Returns a new Dataset with an alias set.<U> Dataset<U>
Returns a new Dataset where each record has been mapped on to the specified type.as
(scala.Symbol alias) (Scala-specific) Returns a new Dataset with an alias set.cache()
Persist this Dataset with the default storage level (MEMORY_AND_DISK
).Eagerly checkpoint a Dataset and return the new Dataset.checkpoint
(boolean eager) Returns a checkpointed version of this Dataset.coalesce
(int numPartitions) Returns a new Dataset that has exactlynumPartitions
partitions, when the fewer partitions are requested.Selects column based on the column name and returns it as aColumn
.static String
collect()
Returns an array that contains all rows in this Dataset.Returns a Java list that contains all rows in this Dataset.Selects column based on the column name specified as a regex and returns it asColumn
.long
count()
Returns the number of rows in the Dataset.Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.static AtomicLong
curId()
static String
static org.apache.spark.sql.catalyst.trees.TreeNodeTag<scala.collection.mutable.HashSet<Object>>
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.distinct()
Returns a new Dataset that contains only the unique rows from this Dataset.Returns a new Dataset with a column dropped.Returns a new Dataset with columns dropped.Returns a new Dataset with column dropped.Returns a new Dataset with columns dropped.Returns a new Dataset with columns dropped.Returns a new Dataset with columns dropped.Returns a new Dataset that contains only the unique rows from this Dataset.dropDuplicates
(String[] colNames) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.dropDuplicates
(String col1, String... cols) Returns a newDataset
with duplicate rows removed, considering only the subset of columns.dropDuplicates
(String col1, scala.collection.immutable.Seq<String> cols) Returns a newDataset
with duplicate rows removed, considering only the subset of columns.dropDuplicates
(scala.collection.immutable.Seq<String> colNames) (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.Returns a new Dataset with duplicates rows removed, within watermark.dropDuplicatesWithinWatermark
(String[] colNames) Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.dropDuplicatesWithinWatermark
(String col1, String... cols) Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.dropDuplicatesWithinWatermark
(String col1, scala.collection.immutable.Seq<String> cols) Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.dropDuplicatesWithinWatermark
(scala.collection.immutable.Seq<String> colNames) Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.encoder()
void
Prints the plans (logical and physical) with a format specified by a given explain mode.explode
(String inputColumn, String outputColumn, scala.Function1<A, scala.collection.IterableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$5) Deprecated.use flatMap() or select() with functions.explode() instead.explode
(scala.collection.immutable.Seq<Column> input, scala.Function1<Row, scala.collection.IterableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$4) Deprecated.use flatMap() or select() with functions.explode() instead.Filters rows using the given SQL expression.filter
(FilterFunction<T> func) (Java-specific) Returns a new Dataset that only contains elements wherefunc
returnstrue
.Filters rows using the given condition.(Scala-specific) Returns a new Dataset that only contains elements wherefunc
returnstrue
.<U> Dataset<U>
flatMap
(FlatMapFunction<T, U> f, Encoder<U> encoder) (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.<U> Dataset<U>
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.void
(Java-specific) Runsfunc
on each partition of this Dataset.void
foreachPartition
(scala.Function1<scala.collection.Iterator<T>, scala.runtime.BoxedUnit> f) Applies a functionf
to each partition of this Dataset.Groups the Dataset using the specified columns, so that we can run aggregation on them.Groups the Dataset using the specified columns, so that we can run aggregation on them.Groups the Dataset using the specified columns, so we can run aggregation on them.Groups the Dataset using the specified columns, so we can run aggregation on them.<K> KeyValueGroupedDataset<K,
T> groupByKey
(MapFunction<T, K> func, Encoder<K> encoder) (Java-specific) Returns aKeyValueGroupedDataset
where the data is grouped by the given keyfunc
.<K> KeyValueGroupedDataset<K,
T> groupByKey
(scala.Function1<T, K> func, Encoder<K> evidence$3) (Scala-specific) Returns aKeyValueGroupedDataset
where the data is grouped by the given keyfunc
.groupingSets
(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, Column... cols) Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them.groupingSets
(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, scala.collection.immutable.Seq<Column> cols) Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them.head
(int n) Returns the firstn
rows.Specifies some hint on the current Dataset.Specifies some hint on the current Dataset.String[]
Returns a best-effort snapshot of the files that compose this Dataset.intersectAll
(Dataset<T> other) boolean
isEmpty()
Returns true if theDataset
is empty.boolean
isLocal()
Returns true if thecollect
andtake
methods can be run locally (without any Spark executors).boolean
Returns true if this Dataset contains one or more sources that continuously return data as it arrives.boolean
javaRDD()
Returns the content of the Dataset as aJavaRDD
ofT
s.limit
(int n) Returns a new Dataset by taking the firstn
rows.Eagerly locally checkpoints a Dataset and return the new Dataset.localCheckpoint
(boolean eager) Locally checkpoints a Dataset and return the new Dataset.<U> Dataset<U>
map
(MapFunction<T, U> func, Encoder<U> encoder) (Java-specific) Returns a new Dataset that contains the result of applyingfunc
to each element.<U> Dataset<U>
(Scala-specific) Returns a new Dataset that contains the result of applyingfunc
to each element.<U> Dataset<U>
mapPartitions
(MapPartitionsFunction<T, U> f, Encoder<U> encoder) (Java-specific) Returns a new Dataset that contains the result of applyingf
to each partition.<U> Dataset<U>
mapPartitions
(scala.Function1<scala.collection.Iterator<T>, scala.collection.Iterator<U>> func, Encoder<U> evidence$7) (Scala-specific) Returns a new Dataset that contains the result of applyingfunc
to each partition.Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.Merges a set of updates, insertions, and deletions based on a source table into a target table.metadataColumn
(String colName) Selects a metadata column based on its logical column name, and returns it as aColumn
.na()
Returns aDataFrameNaFunctions
for working with missing data.Define (named) metrics to observe on the Dataset.Define (named) metrics to observe on the Dataset.observe
(Observation observation, Column expr, Column... exprs) Observe (named) metrics through anorg.apache.spark.sql.Observation
instance.observe
(Observation observation, Column expr, scala.collection.immutable.Seq<Column> exprs) Observe (named) metrics through anorg.apache.spark.sql.Observation
instance.offset
(int n) Returns a new Dataset by skipping the firstn
rows.ofRows
(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan) ofRows
(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, org.apache.spark.sql.catalyst.QueryPlanningTracker tracker, org.apache.spark.sql.execution.ShuffleCleanupMode shuffleCleanupMode) A variant of ofRows that allows passing in a tracker so we can track query parsing time.ofRows
(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, org.apache.spark.sql.execution.ShuffleCleanupMode shuffleCleanupMode) Returns a new Dataset sorted by the given expressions.Returns a new Dataset sorted by the given expressions.Returns a new Dataset sorted by the given expressions.Returns a new Dataset sorted by the given expressions.persist()
Persist this Dataset with the default storage level (MEMORY_AND_DISK
).persist
(StorageLevel newLevel) Persist this Dataset with the given storage level.org.apache.spark.sql.execution.QueryExecution
randomSplit
(double[] weights) Randomly splits this Dataset with the provided weights.randomSplit
(double[] weights, long seed) Randomly splits this Dataset with the provided weights.randomSplitAsList
(double[] weights, long seed) Returns a Java list that contains randomly split Dataset with the provided weights.rdd()
(Scala-specific) Reduces the elements of this Dataset using the specified binary function.repartition
(int numPartitions) Returns a new Dataset that has exactlynumPartitions
partitions.repartition
(int numPartitions, Column... partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
.repartition
(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
.repartition
(Column... partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions.repartition
(scala.collection.immutable.Seq<Column> partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions.repartitionByRange
(int numPartitions, Column... partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
.repartitionByRange
(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
.repartitionByRange
(Column... partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions.repartitionByRange
(scala.collection.immutable.Seq<Column> partitionExprs) Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions.Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.boolean
sameSemantics
(Dataset<T> other) sample
(boolean withReplacement, double fraction) Returns a newDataset
by sampling a fraction of rows, using a random seed.sample
(boolean withReplacement, double fraction, long seed) Returns a newDataset
by sampling a fraction of rows, using a user-supplied seed.sample
(double fraction) Returns a newDataset
by sampling a fraction of rows (without replacement), using a random seed.sample
(double fraction, long seed) Returns a newDataset
by sampling a fraction of rows (without replacement), using a user-supplied seed.schema()
Returns the schema of this Dataset.Selects a set of columns.Selects a set of columns.Selects a set of column based expressions.<U1> Dataset<U1>
select
(TypedColumn<T, U1> c1) Returns a new Dataset by computing the givenColumn
expression for each element.<U1,
U2> Dataset<scala.Tuple2<U1, U2>> select
(TypedColumn<T, U1> c1, TypedColumn<T, U2> c2) Returns a new Dataset by computing the givenColumn
expressions for each element.<U1,
U2, U3> Dataset<scala.Tuple3<U1, U2, U3>> select
(TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3) Returns a new Dataset by computing the givenColumn
expressions for each element.<U1,
U2, U3, U4>
Dataset<scala.Tuple4<U1,U2, U3, U4>> select
(TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3, TypedColumn<T, U4> c4) Returns a new Dataset by computing the givenColumn
expressions for each element.<U1,
U2, U3, U4, U5>
Dataset<scala.Tuple5<U1,U2, U3, U4, U5>> select
(TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3, TypedColumn<T, U4> c4, TypedColumn<T, U5> c5) Returns a new Dataset by computing the givenColumn
expressions for each element.Selects a set of column based expressions.selectExpr
(String... exprs) Selects a set of SQL expressions.selectExpr
(scala.collection.immutable.Seq<String> exprs) Selects a set of SQL expressions.int
Returns ahashCode
of the logical query plan against thisDataset
.void
show
(int numRows, boolean truncate) Displays the Dataset in a tabular form.void
show
(int numRows, int truncate, boolean vertical) Displays the Dataset in a tabular form.Returns a new Dataset sorted by the specified column, all in ascending order.Returns a new Dataset sorted by the specified column, all in ascending order.Returns a new Dataset sorted by the given expressions.Returns a new Dataset sorted by the given expressions.sortWithinPartitions
(String sortCol, String... sortCols) Returns a new Dataset with each partition sorted by the given expressions.sortWithinPartitions
(String sortCol, scala.collection.immutable.Seq<String> sortCols) Returns a new Dataset with each partition sorted by the given expressions.sortWithinPartitions
(Column... sortExprs) Returns a new Dataset with each partition sorted by the given expressions.sortWithinPartitions
(scala.collection.immutable.Seq<Column> sortExprs) Returns a new Dataset with each partition sorted by the given expressions.stat()
Returns aDataFrameStatFunctions
for working statistic functions support.Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.Computes specified statistics for numeric and string columns.Computes specified statistics for numeric and string columns.tail
(int n) Returns the lastn
rows in the Dataset.to
(StructType schema) Returns a new DataFrame where each row is reconciled to match the specified schema.toDF()
Converts this strongly typed collection of data to generic Dataframe.Converts this strongly typed collection of data to genericDataFrame
with columns renamed.Converts this strongly typed collection of data to genericDataFrame
with columns renamed.Returns the content of the Dataset as aJavaRDD
ofT
s.toJSON()
Returns the content of the Dataset as a Dataset of JSON strings.Returns an iterator that contains all rows in this Dataset.toString()
Transposes a DataFrame, switching rows to columns.Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.unionByName
(Dataset<T> other) unionByName
(Dataset<T> other, boolean allowMissingColumns) Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.unpersist
(boolean blocking) Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.Filters rows using the given SQL expression.Filters rows using the given condition.withColumn
(String colName, Column col) Returns a new Dataset by adding a column or replacing the existing column that has the same name.withColumnRenamed
(String existingName, String newName) Returns a new Dataset with a column renamed.withColumns
(Map<String, Column> colsMap) (Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.withColumns
(scala.collection.immutable.Map<String, Column> colsMap) (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.withColumnsRenamed
(Map<String, String> colsMap) (Java-specific) Returns a new Dataset with a columns renamed.withColumnsRenamed
(scala.collection.immutable.Map<String, String> colsMap) (Scala-specific) Returns a new Dataset with a columns renamed.withMetadata
(String columnName, Metadata metadata) Returns a new Dataset by updating an existing column with metadata.withWatermark
(String eventTime, String delayThreshold) Defines an event time watermark for thisDataset
.write()
Interface for saving the content of the non-streaming Dataset out into external storage.Interface for saving the content of the streaming Dataset out into external storage.Create a write configuration builder for v2 sources.Methods inherited from class org.apache.spark.sql.api.Dataset
apply, columns, createGlobalTempView, createOrReplaceGlobalTempView, createOrReplaceTempView, createTempView, crossJoin, dtypes, except, exceptAll, explain, explain, first, foreach, foreach, head, intersect, intersectAll, join, join, join, join, join, join, join, join, join, joinWith, joinWith, printSchema, printSchema, reduce, registerTempTable, sameSemantics, show, show, show, show, take, takeAsList, transform, union, unionAll, unionByName, unionByName
-
Constructor Details
-
Dataset
public Dataset(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder) -
Dataset
public Dataset(SQLContext sqlContext, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> encoder)
-
-
Method Details
-
curId
-
DATASET_ID_KEY
-
COL_POS_KEY
-
DATASET_ID_TAG
public static org.apache.spark.sql.catalyst.trees.TreeNodeTag<scala.collection.mutable.HashSet<Object>> DATASET_ID_TAG() -
apply
public static <T> Dataset<T> apply(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, Encoder<T> evidence$1) -
ofRows
public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan) -
ofRows
public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, org.apache.spark.sql.execution.ShuffleCleanupMode shuffleCleanupMode) -
ofRows
public static Dataset<Row> ofRows(SparkSession sparkSession, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan logicalPlan, org.apache.spark.sql.catalyst.QueryPlanningTracker tracker, org.apache.spark.sql.execution.ShuffleCleanupMode shuffleCleanupMode) A variant of ofRows that allows passing in a tracker so we can track query parsing time. -
toDF
Description copied from class:Dataset
Converts this strongly typed collection of data to genericDataFrame
with columns renamed. This can be quite convenient in conversion from an RDD of tuples into aDataFrame
with meaningful names. For example:val rdd: RDD[(Int, String)] = ... rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2` rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
-
hint
Description copied from class:Dataset
Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plan can be broadcasted:df1.join(df2.hint("broadcast"))
the following code specifies that this dataset could be rebalanced with given number of partitions:
df1.hint("rebalance", 10)
-
select
Description copied from class:Dataset
Selects a set of column based expressions.ds.select($"colA", $"colB" + 1)
-
groupBy
Description copied from class:Dataset
Groups the Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns grouped by department. ds.groupBy($"department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
rollup
Description copied from class:Dataset
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns rolled up by department and group. ds.rollup($"department", $"group").avg() // Compute the max age and average salary, rolled up by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
cube
Description copied from class:Dataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns cubed by department and group. ds.cube($"department", $"group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
groupingSets
public RelationalGroupedDataset groupingSets(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, Column... cols) Description copied from class:Dataset
Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns group by specific grouping sets. ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg() // Compute the max age and average salary, group by specific grouping sets. ds.groupingSets(Seq($"department", $"gender"), Seq()), $"department", $"group").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Overrides:
groupingSets
in classDataset<T>
- Parameters:
groupingSets
- (undocumented)cols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
observe
Description copied from class:Dataset
Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:- It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
- It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.
-
observe
Description copied from class:Dataset
Observe (named) metrics through anorg.apache.spark.sql.Observation
instance. This method does not support streaming datasets.A user can retrieve the metrics by accessing
org.apache.spark.sql.Observation.get
.// Observe row count (rows) and highest id (maxid) in the Dataset while writing it val observation = Observation("my_metrics") val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid")) observed_ds.write.parquet("ds.parquet") val metrics = observation.get
-
drop
Description copied from class:Dataset
Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
-
drop
Description copied from class:Dataset
Returns a new Dataset with columns dropped.This method can only be used to drop top level columns. This is a no-op if the Dataset doesn't have a columns with an equivalent expression.
-
summary
Description copied from class:Dataset
Computes specified statistics for numeric and string columns. Available statistics are:- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (e.g. 75%)
- count_distinct
- approx_count_distinct
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the
agg
function instead.ds.summary().show() // output: // summary age height // count 10.0 10.0 // mean 53.3 178.05 // stddev 11.6 15.7 // min 18.0 163.0 // 25% 24.0 176.0 // 50% 24.0 176.0 // 75% 32.0 180.0 // max 92.0 192.0
ds.summary("count", "min", "25%", "75%", "max").show() // output: // summary age height // count 10.0 10.0 // min 18.0 163.0 // 25% 24.0 176.0 // 75% 32.0 180.0 // max 92.0 192.0
To do a summary for specific columns first select them:
ds.select("age", "height").summary().show()
Specify statistics to output custom summaries:
ds.summary("count", "count_distinct").show()
The distinct count isn't included by default.
You can also run approximate distinct counts which are faster:
ds.summary("count", "approx_count_distinct").show()
See also
Dataset.describe(java.lang.String...)
for basic statistics. -
describe
Description copied from class:Dataset
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the
agg
function instead.ds.describe("age", "height").show() // output: // summary age height // count 10.0 10.0 // mean 53.3 178.05 // stddev 11.6 15.7 // min 18.0 163.0 // max 92.0 192.0
Use
Dataset.summary(java.lang.String...)
for expanded statistics and control over which statistics to compute. -
select
Description copied from class:Dataset
Selects a set of columns. This is a variant ofselect
that can only select existing columns using column names (i.e. cannot construct expressions).// The following two are equivalent: ds.select("colA", "colB") ds.select($"colA", $"colB")
-
sortWithinPartitions
Description copied from class:Dataset
Returns a new Dataset with each partition sorted by the given expressions.This is the same operation as "SORT BY" in SQL (Hive QL).
- Overrides:
sortWithinPartitions
in classDataset<T>
- Parameters:
sortCol
- (undocumented)sortCols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
sortWithinPartitions
Description copied from class:Dataset
Returns a new Dataset with each partition sorted by the given expressions.This is the same operation as "SORT BY" in SQL (Hive QL).
- Overrides:
sortWithinPartitions
in classDataset<T>
- Parameters:
sortExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
sort
Description copied from class:Dataset
Returns a new Dataset sorted by the specified column, all in ascending order.// The following 3 are equivalent ds.sort("sortcol") ds.sort($"sortcol") ds.sort($"sortcol".asc)
-
sort
Description copied from class:Dataset
Returns a new Dataset sorted by the given expressions. For example:ds.sort($"col1", $"col2".desc)
-
orderBy
Description copied from class:Dataset
Returns a new Dataset sorted by the given expressions. This is an alias of thesort
function. -
orderBy
Description copied from class:Dataset
Returns a new Dataset sorted by the given expressions. This is an alias of thesort
function. -
selectExpr
Description copied from class:Dataset
Selects a set of SQL expressions. This is a variant ofselect
that accepts SQL expressions.// The following are equivalent: ds.selectExpr("colA", "colB as newName", "abs(colC)") ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
- Overrides:
selectExpr
in classDataset<T>
- Parameters:
exprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicates
Description copied from class:Dataset
Returns a newDataset
with duplicate rows removed, considering only the subset of columns.For a static batch
Dataset
, it just drops duplicate rows. For a streamingDataset
, it will keep all data across triggers as intermediate state to drop duplicates rows. You can useDataset.withWatermark(java.lang.String,java.lang.String)
to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.- Overrides:
dropDuplicates
in classDataset<T>
- Parameters:
col1
- (undocumented)cols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicatesWithinWatermark
Description copied from class:Dataset
Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.This only works with streaming
Dataset
, and watermark for the inputDataset
must be set viaDataset.withWatermark(java.lang.String,java.lang.String)
.For a streaming
Dataset
, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.Note: too late data older than watermark will be dropped.
- Overrides:
dropDuplicatesWithinWatermark
in classDataset<T>
- Parameters:
col1
- (undocumented)cols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
repartition
Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
. The resulting Dataset is hash partitioned.This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Overrides:
repartition
in classDataset<T>
- Parameters:
numPartitions
- (undocumented)partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
repartition
Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is hash partitioned.This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Overrides:
repartition
in classDataset<T>
- Parameters:
partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
repartitionByRange
Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
. The resulting Dataset is range partitioned.At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition
.- Overrides:
repartitionByRange
in classDataset<T>
- Parameters:
numPartitions
- (undocumented)partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
repartitionByRange
Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is range partitioned.At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition
.- Overrides:
repartitionByRange
in classDataset<T>
- Parameters:
partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
groupBy
Description copied from class:Dataset
Groups the Dataset using the specified columns, so that we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department. ds.groupBy("department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
rollup
Description copied from class:Dataset
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolled up by department and group. ds.rollup("department", "group").avg() // Compute the max age and average salary, rolled up by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
cube
Description copied from class:Dataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group. ds.cube("department", "group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
agg
Description copied from class:Dataset
Aggregates on the entire Dataset without groups.// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(max($"age"), avg($"salary")) ds.groupBy().agg(max($"age"), avg($"salary"))
-
queryExecution
public org.apache.spark.sql.execution.QueryExecution queryExecution() -
encoder
-
sparkSession
- Specified by:
sparkSession
in classDataset<T>
-
sqlContext
-
toString
-
toDF
Description copied from class:Dataset
Converts this strongly typed collection of data to generic Dataframe. In contrast to the strongly typed objects that Dataset operations work on, a Dataframe returns genericRow
objects that allow fields to be accessed by ordinal or name. -
as
Description copied from class:Dataset
Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type ofU
:- When
U
is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined byspark.sql.caseSensitive
). - When
U
is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to_1
). - When
U
is a primitive type (i.e. String, Int, etc), then the first column of theDataFrame
will be used.
If the schema of the Dataset does not match the desired
U
type, you can useselect
along withalias
oras
to rearrange or rename as required.Note that
as[]
only changes the view of the data that is passed into typed operations, such asmap()
, and does not eagerly project away any columns that are not present in the specified class. - When
-
to
Description copied from class:Dataset
Returns a new DataFrame where each row is reconciled to match the specified schema. Spark will:- Reorder columns and/or inner fields by name to match the specified schema.
- Project away columns and/or inner fields that are not needed by the specified schema. Missing columns and/or inner fields (present in the specified schema but not input DataFrame) lead to failures.
- Cast the columns and/or inner fields to match the data types in the specified schema, if the types are compatible, e.g., numeric to numeric (error if overflows), but not string to int.
- Carry over the metadata from the specified schema, while the columns and/or inner fields still keep their own metadata if not overwritten by the specified schema.
- Fail if the nullability is not compatible. For example, the column and/or inner field is nullable but the specified schema requires them to be not nullable.
-
toDF
Description copied from class:Dataset
Converts this strongly typed collection of data to genericDataFrame
with columns renamed. This can be quite convenient in conversion from an RDD of tuples into aDataFrame
with meaningful names. For example:val rdd: RDD[(Int, String)] = ... rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2` rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
-
schema
Description copied from class:Dataset
Returns the schema of this Dataset. -
explain
Description copied from class:Dataset
Prints the plans (logical and physical) with a format specified by a given explain mode.- Specified by:
explain
in classDataset<T>
- Parameters:
mode
- specifies the expected output format of plans.simple
Print only a physical plan.extended
: Print both logical and physical plans.codegen
: Print a physical plan and generated codes if they are available.cost
: Print a logical plan and statistics if they are available.formatted
: Split explain output into two sections: a physical plan outline and node details.
- Inheritdoc:
-
isLocal
public boolean isLocal()Description copied from class:Dataset
Returns true if thecollect
andtake
methods can be run locally (without any Spark executors). -
isEmpty
public boolean isEmpty()Description copied from class:Dataset
Returns true if theDataset
is empty. -
isStreaming
public boolean isStreaming()Description copied from class:Dataset
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as aStreamingQuery
using thestart()
method inDataStreamWriter
. Methods that return a single answer, e.g.count()
orcollect()
, will throw anAnalysisException
when there is a streaming source present.- Specified by:
isStreaming
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
withWatermark
Description copied from class:Dataset
Defines an event time watermark for thisDataset
. A watermark tracks a point in time before which we assume no more late data is going to arrive.Spark will use this watermark for several purposes:
- To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
- To minimize the amount of state that we need to keep for on-going
aggregations,
mapGroupsWithState
anddropDuplicates
operators.
MAX(eventTime)
seen across all of the partitions in the query minus a user specifieddelayThreshold
. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at leastdelayThreshold
behind the actual event time. In some cases we may still process records that arrive more thandelayThreshold
late.- Specified by:
withWatermark
in classDataset<T>
- Parameters:
eventTime
- the name of the column that contains the event time of the row.delayThreshold
- the minimum delay to wait to data to arrive late, relative to the latest record that has been processed in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.- Returns:
- (undocumented)
- Inheritdoc:
-
show
public void show(int numRows, boolean truncate) Description copied from class:Dataset
Displays the Dataset in a tabular form. For example:year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
-
show
public void show(int numRows, int truncate, boolean vertical) Description copied from class:Dataset
Displays the Dataset in a tabular form. For example:year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
If
vertical
enabled, this command prints output rows vertically (one line per column value)?-RECORD 0------------------- year | 1980 month | 12 AVG('Adj Close) | 0.503218 AVG('Adj Close) | 0.595103 -RECORD 1------------------- year | 1981 month | 01 AVG('Adj Close) | 0.523289 AVG('Adj Close) | 0.570307 -RECORD 2------------------- year | 1982 month | 02 AVG('Adj Close) | 0.436504 AVG('Adj Close) | 0.475256 -RECORD 3------------------- year | 1983 month | 03 AVG('Adj Close) | 0.410516 AVG('Adj Close) | 0.442194 -RECORD 4------------------- year | 1984 month | 04 AVG('Adj Close) | 0.450090 AVG('Adj Close) | 0.483521
-
na
Description copied from class:Dataset
Returns aDataFrameNaFunctions
for working with missing data.// Dropping rows containing any null values. ds.na.drop()
-
stat
Description copied from class:Dataset
Returns aDataFrameStatFunctions
for working statistic functions support.// Finding frequent items in column with name 'a'. ds.stat.freqItems(Seq("a"))
-
join
- Inheritdoc:
-
join
public Dataset<Row> join(Dataset<?> right, scala.collection.immutable.Seq<String> usingColumns, String joinType) - Inheritdoc:
-
join
- Inheritdoc:
-
crossJoin
- Inheritdoc:
-
joinWith
- Inheritdoc:
-
hint
Description copied from class:Dataset
Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plan can be broadcasted:df1.join(df2.hint("broadcast"))
the following code specifies that this dataset could be rebalanced with given number of partitions:
df1.hint("rebalance", 10)
-
col
Description copied from class:Dataset
Selects column based on the column name and returns it as aColumn
. -
metadataColumn
Description copied from class:Dataset
Selects a metadata column based on its logical column name, and returns it as aColumn
.A metadata column can be accessed this way even if the underlying data source defines a data column with a conflicting name.
- Specified by:
metadataColumn
in classDataset<T>
- Parameters:
colName
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
colRegex
Description copied from class:Dataset
Selects column based on the column name specified as a regex and returns it asColumn
. -
as
Description copied from class:Dataset
Returns a new Dataset with an alias set. -
select
Description copied from class:Dataset
Selects a set of column based expressions.ds.select($"colA", $"colB" + 1)
-
select
Description copied from class:Dataset
Returns a new Dataset by computing the givenColumn
expression for each element.val ds = Seq(1, 2, 3).toDS() val newDS = ds.select(expr("value + 1").as[Int])
-
filter
Description copied from class:Dataset
Filters rows using the given condition.// The following are equivalent: peopleDs.filter($"age" > 15) peopleDs.where($"age" > 15)
-
groupBy
Description copied from class:Dataset
Groups the Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns grouped by department. ds.groupBy($"department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
rollup
Description copied from class:Dataset
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns rolled up by department and group. ds.rollup($"department", $"group").avg() // Compute the max age and average salary, rolled up by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
cube
Description copied from class:Dataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns cubed by department and group. ds.cube($"department", $"group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
groupingSets
public RelationalGroupedDataset groupingSets(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, scala.collection.immutable.Seq<Column> cols) Description copied from class:Dataset
Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.// Compute the average for all numeric columns group by specific grouping sets. ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg() // Compute the max age and average salary, group by specific grouping sets. ds.groupingSets(Seq($"department", $"gender"), Seq()), $"department", $"group").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Specified by:
groupingSets
in classDataset<T>
- Parameters:
groupingSets
- (undocumented)cols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
reduce
Description copied from class:Dataset
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The givenfunc
must be commutative and associative or the result may be non-deterministic. -
groupByKey
Description copied from class:Dataset
(Scala-specific) Returns aKeyValueGroupedDataset
where the data is grouped by the given keyfunc
.- Specified by:
groupByKey
in classDataset<T>
- Parameters:
func
- (undocumented)evidence$3
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
unpivot
public Dataset<Row> unpivot(Column[] ids, Column[] values, String variableColumnName, String valueColumnName) Description copied from class:Dataset
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse togroupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed.This function is useful to massage a DataFrame into a format where some columns are identifier columns ("ids"), while all other columns ("values") are "unpivoted" to the rows, leaving just two non-id columns, named as given by
variableColumnName
andvalueColumnName
.val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long") df.show() // output: // +---+---+----+ // | id|int|long| // +---+---+----+ // | 1| 11| 12| // | 2| 21| 22| // +---+---+----+ df.unpivot(Array($"id"), Array($"int", $"long"), "variable", "value").show() // output: // +---+--------+-----+ // | id|variable|value| // +---+--------+-----+ // | 1| int| 11| // | 1| long| 12| // | 2| int| 21| // | 2| long| 22| // +---+--------+-----+ // schema: //root // |-- id: integer (nullable = false) // |-- variable: string (nullable = false) // |-- value: long (nullable = true)
When no "id" columns are given, the unpivoted DataFrame consists of only the "variable" and "value" columns.
All "value" columns must share a least common data type. Unless they are the same data type, all "value" columns are cast to the nearest common data type. For instance, types
IntegerType
andLongType
are cast toLongType
, whileIntegerType
andStringType
do not have a common data type andunpivot
fails with anAnalysisException
. -
unpivot
Description copied from class:Dataset
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse togroupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed.- Specified by:
unpivot
in classDataset<T>
- Parameters:
ids
- Id columnsvariableColumnName
- Name of the variable columnvalueColumnName
- Name of the value column- Returns:
- (undocumented)
- See Also:
-
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
This is equivalent to calling
Dataset#unpivot(Array, Array, String, String)
wherevalues
is set to all non-id columns that exist in the DataFrame.
- Inheritdoc:
-
transpose
Description copied from class:Dataset
Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.Please note: - All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type. - The name of the column into which the original column names are transposed defaults to "key". - null values in the index column are excluded from the column names for the transposed table, which are ordered in ascending order.
val df = Seq(("A", 1, 2), ("B", 3, 4)).toDF("id", "val1", "val2") df.show() // output: // +---+----+----+ // | id|val1|val2| // +---+----+----+ // | A| 1| 2| // | B| 3| 4| // +---+----+----+ df.transpose($"id").show() // output: // +----+---+---+ // | key| A| B| // +----+---+---+ // |val1| 1| 3| // |val2| 2| 4| // +----+---+---+ // schema: // root // |-- key: string (nullable = false) // |-- A: integer (nullable = true) // |-- B: integer (nullable = true) df.transpose().show() // output: // +----+---+---+ // | key| A| B| // +----+---+---+ // |val1| 1| 3| // |val2| 2| 4| // +----+---+---+ // schema: // root // |-- key: string (nullable = false) // |-- A: integer (nullable = true) // |-- B: integer (nullable = true)
- Specified by:
transpose
in classDataset<T>
- Parameters:
indexColumn
- The single column that will be treated as the index for the transpose operation. This column will be used to pivot the data, transforming the DataFrame such that the values of the indexColumn become the new columns in the transposed DataFrame.- Returns:
- (undocumented)
- Inheritdoc:
-
transpose
Description copied from class:Dataset
Transposes a DataFrame, switching rows to columns. This function transforms the DataFrame such that the values in the first column become the new columns of the DataFrame.This is equivalent to calling
Dataset#transpose(Column)
whereindexColumn
is set to the first column.Please note: - All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type. - The name of the column into which the original column names are transposed defaults to "key". - Non-"key" column names for the transposed table are ordered in ascending order.
-
observe
Description copied from class:Dataset
Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:- It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
- It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.
-
observe
public Dataset<T> observe(Observation observation, Column expr, scala.collection.immutable.Seq<Column> exprs) Description copied from class:Dataset
Observe (named) metrics through anorg.apache.spark.sql.Observation
instance. This method does not support streaming datasets.A user can retrieve the metrics by accessing
org.apache.spark.sql.Observation.get
.// Observe row count (rows) and highest id (maxid) in the Dataset while writing it val observation = Observation("my_metrics") val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid")) observed_ds.write.parquet("ds.parquet") val metrics = observation.get
-
limit
Description copied from class:Dataset
Returns a new Dataset by taking the firstn
rows. The difference between this function andhead
is thathead
is an action and returns an array (by triggering query execution) whilelimit
returns a new Dataset. -
offset
Description copied from class:Dataset
Returns a new Dataset by skipping the firstn
rows. -
union
- Inheritdoc:
-
unionByName
- Inheritdoc:
-
intersect
- Inheritdoc:
-
intersectAll
- Inheritdoc:
-
except
- Inheritdoc:
-
exceptAll
- Inheritdoc:
-
sample
Description copied from class:Dataset
Returns a newDataset
by sampling a fraction of rows, using a user-supplied seed. -
randomSplit
Description copied from class:Dataset
Randomly splits this Dataset with the provided weights.- Specified by:
randomSplit
in classDataset<T>
- Parameters:
weights
- weights for splits, will be normalized if they don't sum to 1.seed
- Seed for sampling.For Java API, use
Dataset.randomSplitAsList(double[],long)
.- Returns:
- (undocumented)
- Inheritdoc:
-
randomSplit
Description copied from class:Dataset
Randomly splits this Dataset with the provided weights.- Specified by:
randomSplit
in classDataset<T>
- Parameters:
weights
- weights for splits, will be normalized if they don't sum to 1.- Returns:
- (undocumented)
- Inheritdoc:
-
randomSplitAsList
Description copied from class:Dataset
Returns a Java list that contains randomly split Dataset with the provided weights.- Specified by:
randomSplitAsList
in classDataset<T>
- Parameters:
weights
- weights for splits, will be normalized if they don't sum to 1.seed
- Seed for sampling.- Returns:
- (undocumented)
- Inheritdoc:
-
explode
public <A extends scala.Product> Dataset<Row> explode(scala.collection.immutable.Seq<Column> input, scala.Function1<Row, scala.collection.IterableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$4) Deprecated.use flatMap() or select() with functions.explode() instead. Since 2.0.0.Description copied from class:Dataset
(Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to aLATERAL VIEW
in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.Given that this is deprecated, as an alternative, you can explode columns either using
functions.explode()
orflatMap()
. The following example uses these alternatives to count the number of books that contain a given word:case class Book(title: String, words: String) val ds: Dataset[Book] val allWords = ds.select($"title", explode(split($"words", " ")).as("word")) val bookCountPerWord = allWords.groupBy("word").agg(count_distinct("title"))
Using
flatMap()
this can similarly be exploded as:ds.flatMap(_.words.split(" "))
-
explode
public <A,B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A, scala.collection.IterableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$5) Deprecated.use flatMap() or select() with functions.explode() instead. Since 2.0.0.Description copied from class:Dataset
(Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to aLATERAL VIEW
in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.Given that this is deprecated, as an alternative, you can explode columns either using
functions.explode()
:ds.select(explode(split($"words", " ")).as("word"))
or
flatMap()
:ds.flatMap(_.words.split(" "))
-
withMetadata
Description copied from class:Dataset
Returns a new Dataset by updating an existing column with metadata.- Specified by:
withMetadata
in classDataset<T>
- Parameters:
columnName
- (undocumented)metadata
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
drop
Description copied from class:Dataset
Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
-
drop
Description copied from class:Dataset
Returns a new Dataset with columns dropped.This method can only be used to drop top level columns. This is a no-op if the Dataset doesn't have a columns with an equivalent expression.
-
dropDuplicates
Description copied from class:Dataset
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias fordistinct
.For a static batch
Dataset
, it just drops duplicate rows. For a streamingDataset
, it will keep all data across triggers as intermediate state to drop duplicates rows. You can useDataset.withWatermark(java.lang.String,java.lang.String)
to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.- Specified by:
dropDuplicates
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicates
Description copied from class:Dataset
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.For a static batch
Dataset
, it just drops duplicate rows. For a streamingDataset
, it will keep all data across triggers as intermediate state to drop duplicates rows. You can useDataset.withWatermark(java.lang.String,java.lang.String)
to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.- Specified by:
dropDuplicates
in classDataset<T>
- Parameters:
colNames
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicatesWithinWatermark
Description copied from class:Dataset
Returns a new Dataset with duplicates rows removed, within watermark.This only works with streaming
Dataset
, and watermark for the inputDataset
must be set viaDataset.withWatermark(java.lang.String,java.lang.String)
.For a streaming
Dataset
, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.Note: too late data older than watermark will be dropped.
- Specified by:
dropDuplicatesWithinWatermark
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicatesWithinWatermark
Description copied from class:Dataset
Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.This only works with streaming
Dataset
, and watermark for the inputDataset
must be set viaDataset.withWatermark(java.lang.String,java.lang.String)
.For a streaming
Dataset
, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.Note: too late data older than watermark will be dropped.
- Specified by:
dropDuplicatesWithinWatermark
in classDataset<T>
- Parameters:
colNames
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
summary
Description copied from class:Dataset
Computes specified statistics for numeric and string columns. Available statistics are:- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (e.g. 75%)
- count_distinct
- approx_count_distinct
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the
agg
function instead.ds.summary().show() // output: // summary age height // count 10.0 10.0 // mean 53.3 178.05 // stddev 11.6 15.7 // min 18.0 163.0 // 25% 24.0 176.0 // 50% 24.0 176.0 // 75% 32.0 180.0 // max 92.0 192.0
ds.summary("count", "min", "25%", "75%", "max").show() // output: // summary age height // count 10.0 10.0 // min 18.0 163.0 // 25% 24.0 176.0 // 75% 32.0 180.0 // max 92.0 192.0
To do a summary for specific columns first select them:
ds.select("age", "height").summary().show()
Specify statistics to output custom summaries:
ds.summary("count", "count_distinct").show()
The distinct count isn't included by default.
You can also run approximate distinct counts which are faster:
ds.summary("count", "approx_count_distinct").show()
See also
Dataset.describe(java.lang.String...)
for basic statistics. -
describe
Description copied from class:Dataset
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the
agg
function instead.ds.describe("age", "height").show() // output: // summary age height // count 10.0 10.0 // mean 53.3 178.05 // stddev 11.6 15.7 // min 18.0 163.0 // max 92.0 192.0
Use
Dataset.summary(java.lang.String...)
for expanded statistics and control over which statistics to compute. -
head
Description copied from class:Dataset
Returns the firstn
rows. -
filter
Description copied from class:Dataset
(Scala-specific) Returns a new Dataset that only contains elements wherefunc
returnstrue
. -
filter
Description copied from class:Dataset
(Java-specific) Returns a new Dataset that only contains elements wherefunc
returnstrue
. -
map
Description copied from class:Dataset
(Scala-specific) Returns a new Dataset that contains the result of applyingfunc
to each element. -
map
Description copied from class:Dataset
(Java-specific) Returns a new Dataset that contains the result of applyingfunc
to each element. -
mapPartitions
public <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>, scala.collection.Iterator<U>> func, Encoder<U> evidence$7) Description copied from class:Dataset
(Scala-specific) Returns a new Dataset that contains the result of applyingfunc
to each partition.- Specified by:
mapPartitions
in classDataset<T>
- Parameters:
func
- (undocumented)evidence$7
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
foreachPartition
public void foreachPartition(scala.Function1<scala.collection.Iterator<T>, scala.runtime.BoxedUnit> f) Description copied from class:Dataset
Applies a functionf
to each partition of this Dataset.- Specified by:
foreachPartition
in classDataset<T>
- Parameters:
f
- (undocumented)- Inheritdoc:
-
tail
Description copied from class:Dataset
Returns the lastn
rows in the Dataset.Running tail requires moving data into the application's driver process, and doing so with a very large
n
can crash the driver process with OutOfMemoryError. -
collect
Description copied from class:Dataset
Returns an array that contains all rows in this Dataset.Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use
Dataset.collectAsList()
. -
collectAsList
Description copied from class:Dataset
Returns a Java list that contains all rows in this Dataset.Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
- Specified by:
collectAsList
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
toLocalIterator
Description copied from class:Dataset
Returns an iterator that contains all rows in this Dataset.The iterator will consume as much memory as the largest partition in this Dataset.
- Specified by:
toLocalIterator
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
count
public long count()Description copied from class:Dataset
Returns the number of rows in the Dataset. -
repartition
Description copied from class:Dataset
Returns a new Dataset that has exactlynumPartitions
partitions.- Specified by:
repartition
in classDataset<T>
- Parameters:
numPartitions
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
coalesce
Description copied from class:Dataset
Returns a new Dataset that has exactlynumPartitions
partitions, when the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. Similar to coalesce defined on anRDD
, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
-
persist
Description copied from class:Dataset
Persist this Dataset with the default storage level (MEMORY_AND_DISK
). -
cache
Description copied from class:Dataset
Persist this Dataset with the default storage level (MEMORY_AND_DISK
). -
persist
Description copied from class:Dataset
Persist this Dataset with the given storage level. -
storageLevel
Description copied from class:Dataset
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.- Specified by:
storageLevel
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
unpersist
Description copied from class:Dataset
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset. -
unpersist
Description copied from class:Dataset
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset. -
rdd
-
toJavaRDD
Returns the content of the Dataset as aJavaRDD
ofT
s.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
javaRDD
Returns the content of the Dataset as aJavaRDD
ofT
s.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
write
Description copied from class:Dataset
Interface for saving the content of the non-streaming Dataset out into external storage. -
writeTo
Description copied from class:Dataset
Create a write configuration builder for v2 sources.This builder is used to configure and execute write operations. For example, to append to an existing table, run:
df.writeTo("catalog.db.table").append()
This can also be used to create or replace existing tables:
df.writeTo("catalog.db.table").partitionedBy($"col").createOrReplace()
-
mergeInto
Description copied from class:Dataset
Merges a set of updates, insertions, and deletions based on a source table into a target table.Scala Examples:
spark.table("source") .mergeInto("target", $"source.id" === $"target.id") .whenMatched($"salary" === 100) .delete() .whenNotMatched() .insertAll() .whenNotMatchedBySource($"salary" === 100) .update(Map( "salary" -> lit(200) )) .merge()
-
writeStream
Interface for saving the content of the streaming Dataset out into external storage.- Returns:
- (undocumented)
- Since:
- 2.0.0
-
toJSON
Description copied from class:Dataset
Returns the content of the Dataset as a Dataset of JSON strings. -
inputFiles
Description copied from class:Dataset
Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.- Specified by:
inputFiles
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
sameSemantics
- Inheritdoc:
-
semanticHash
public int semanticHash()Description copied from class:Dataset
Returns ahashCode
of the logical query plan against thisDataset
.- Specified by:
semanticHash
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
drop
Description copied from class:Dataset
Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain column name.This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
Note:
drop(colName)
has different semantic withdrop(col(colName))
, for example: 1, multi column have the same colName:val df1 = spark.range(0, 2).withColumn("key1", lit(1)) val df2 = spark.range(0, 2).withColumn("key2", lit(2)) val df3 = df1.join(df2) df3.show // +---+----+---+----+ // | id|key1| id|key2| // +---+----+---+----+ // | 0| 1| 0| 2| // | 0| 1| 1| 2| // | 1| 1| 0| 2| // | 1| 1| 1| 2| // +---+----+---+----+ df3.drop("id").show() // output: the two 'id' columns are both dropped. // |key1|key2| // +----+----+ // | 1| 2| // | 1| 2| // | 1| 2| // | 1| 2| // +----+----+ df3.drop(col("id")).show() // ...AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous...
2, colName contains special characters, like dot.
val df = spark.range(0, 2).withColumn("a.b.c", lit(1)) df.show() // +---+-----+ // | id|a.b.c| // +---+-----+ // | 0| 1| // | 1| 1| // +---+-----+ df.drop("a.b.c").show() // +---+ // | id| // +---+ // | 0| // | 1| // +---+ df.drop(col("a.b.c")).show() // no column match the expression 'a.b.c' // +---+-----+ // | id|a.b.c| // +---+-----+ // | 0| 1| // | 1| 1| // +---+-----+
-
drop
Description copied from class:Dataset
Returns a new Dataset with column dropped.This method can only be used to drop top level column. This version of drop accepts a
Column
rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.Note:
drop(col(colName))
has different semantic withdrop(colName)
, please refer toDataset#drop(colName: String)
. -
join
- Inheritdoc:
-
join
- Inheritdoc:
-
join
- Inheritdoc:
-
join
- Inheritdoc:
-
join
- Inheritdoc:
-
join
- Inheritdoc:
-
select
Description copied from class:Dataset
Selects a set of columns. This is a variant ofselect
that can only select existing columns using column names (i.e. cannot construct expressions).// The following two are equivalent: ds.select("colA", "colB") ds.select($"colA", $"colB")
-
select
Description copied from class:Dataset
Returns a new Dataset by computing the givenColumn
expressions for each element. -
select
public <U1,U2, Dataset<scala.Tuple3<U1,U3> U2, selectU3>> (TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3) Description copied from class:Dataset
Returns a new Dataset by computing the givenColumn
expressions for each element. -
select
public <U1,U2, Dataset<scala.Tuple4<U1,U3, U4> U2, selectU3, U4>> (TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3, TypedColumn<T, U4> c4) Description copied from class:Dataset
Returns a new Dataset by computing the givenColumn
expressions for each element. -
select
public <U1,U2, Dataset<scala.Tuple5<U1,U3, U4, U5> U2, selectU3, U4, U5>> (TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3, TypedColumn<T, U4> c4, TypedColumn<T, U5> c5) Description copied from class:Dataset
Returns a new Dataset by computing the givenColumn
expressions for each element. -
melt
public Dataset<Row> melt(Column[] ids, Column[] values, String variableColumnName, String valueColumnName) Description copied from class:Dataset
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse togroupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed. This is an alias forunpivot
. -
melt
Description copied from class:Dataset
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse togroupBy(...).pivot(...).agg(...)
, except for the aggregation, which cannot be reversed. This is an alias forunpivot
.- Overrides:
melt
in classDataset<T>
- Parameters:
ids
- Id columnsvariableColumnName
- Name of the variable columnvalueColumnName
- Name of the value column- Returns:
- (undocumented)
- See Also:
-
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
This is equivalent to calling
Dataset#unpivot(Array, Array, String, String)
wherevalues
is set to all non-id columns that exist in the DataFrame.
- Inheritdoc:
-
withColumn
Description copied from class:Dataset
Returns a new Dataset by adding a column or replacing the existing column that has the same name.column
's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.- Overrides:
withColumn
in classDataset<T>
- Parameters:
colName
- (undocumented)col
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
withColumns
Description copied from class:Dataset
(Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.colsMap
is a map of column name and column, the column must only refer to attributes supplied by this Dataset. It is an error to add columns that refers to some other Dataset.- Overrides:
withColumns
in classDataset<T>
- Parameters:
colsMap
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
withColumns
Description copied from class:Dataset
(Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.colsMap
is a map of column name and column, the column must only refer to attribute supplied by this Dataset. It is an error to add columns that refers to some other Dataset.- Overrides:
withColumns
in classDataset<T>
- Parameters:
colsMap
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
withColumnRenamed
Description copied from class:Dataset
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.- Overrides:
withColumnRenamed
in classDataset<T>
- Parameters:
existingName
- (undocumented)newName
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
withColumnsRenamed
Description copied from class:Dataset
(Scala-specific) Returns a new Dataset with a columns renamed. This is a no-op if schema doesn't contain existingName.colsMap
is a map of existing column name and new column name.- Overrides:
withColumnsRenamed
in classDataset<T>
- Parameters:
colsMap
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
withColumnsRenamed
Description copied from class:Dataset
(Java-specific) Returns a new Dataset with a columns renamed. This is a no-op if schema doesn't contain existingName.colsMap
is a map of existing column name and new column name.- Overrides:
withColumnsRenamed
in classDataset<T>
- Parameters:
colsMap
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
checkpoint
Description copied from class:Dataset
Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set withSparkContext#setCheckpointDir
.- Overrides:
checkpoint
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
checkpoint
Description copied from class:Dataset
Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set withSparkContext#setCheckpointDir
.- Overrides:
checkpoint
in classDataset<T>
- Parameters:
eager
- Whether to checkpoint this dataframe immediately- Returns:
- (undocumented)
- Inheritdoc:
-
localCheckpoint
Description copied from class:Dataset
Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.- Overrides:
localCheckpoint
in classDataset<T>
- Returns:
- (undocumented)
- Inheritdoc:
-
localCheckpoint
Description copied from class:Dataset
Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.- Overrides:
localCheckpoint
in classDataset<T>
- Parameters:
eager
- Whether to checkpoint this dataframe immediately- Returns:
- (undocumented)
- Inheritdoc:
-
joinWith
- Inheritdoc:
-
sortWithinPartitions
public Dataset<T> sortWithinPartitions(String sortCol, scala.collection.immutable.Seq<String> sortCols) Description copied from class:Dataset
Returns a new Dataset with each partition sorted by the given expressions.This is the same operation as "SORT BY" in SQL (Hive QL).
- Overrides:
sortWithinPartitions
in classDataset<T>
- Parameters:
sortCol
- (undocumented)sortCols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
sortWithinPartitions
Description copied from class:Dataset
Returns a new Dataset with each partition sorted by the given expressions.This is the same operation as "SORT BY" in SQL (Hive QL).
- Overrides:
sortWithinPartitions
in classDataset<T>
- Parameters:
sortExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
sort
Description copied from class:Dataset
Returns a new Dataset sorted by the specified column, all in ascending order.// The following 3 are equivalent ds.sort("sortcol") ds.sort($"sortcol") ds.sort($"sortcol".asc)
-
sort
Description copied from class:Dataset
Returns a new Dataset sorted by the given expressions. For example:ds.sort($"col1", $"col2".desc)
-
orderBy
Description copied from class:Dataset
Returns a new Dataset sorted by the given expressions. This is an alias of thesort
function. -
orderBy
Description copied from class:Dataset
Returns a new Dataset sorted by the given expressions. This is an alias of thesort
function. -
as
Description copied from class:Dataset
(Scala-specific) Returns a new Dataset with an alias set. -
alias
Description copied from class:Dataset
Returns a new Dataset with an alias set. Same asas
. -
alias
Description copied from class:Dataset
(Scala-specific) Returns a new Dataset with an alias set. Same asas
. -
selectExpr
Description copied from class:Dataset
Selects a set of SQL expressions. This is a variant ofselect
that accepts SQL expressions.// The following are equivalent: ds.selectExpr("colA", "colB as newName", "abs(colC)") ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
- Overrides:
selectExpr
in classDataset<T>
- Parameters:
exprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
filter
Description copied from class:Dataset
Filters rows using the given SQL expression.peopleDs.filter("age > 15")
-
where
Description copied from class:Dataset
Filters rows using the given condition. This is an alias forfilter
.// The following are equivalent: peopleDs.filter($"age" > 15) peopleDs.where($"age" > 15)
-
where
Description copied from class:Dataset
Filters rows using the given SQL expression.peopleDs.where("age > 15")
-
unionAll
- Inheritdoc:
-
unionByName
- Inheritdoc:
-
sample
Description copied from class:Dataset
Returns a newDataset
by sampling a fraction of rows (without replacement), using a user-supplied seed. -
sample
Description copied from class:Dataset
Returns a newDataset
by sampling a fraction of rows (without replacement), using a random seed. -
sample
Description copied from class:Dataset
Returns a newDataset
by sampling a fraction of rows, using a random seed. -
dropDuplicates
Description copied from class:Dataset
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.For a static batch
Dataset
, it just drops duplicate rows. For a streamingDataset
, it will keep all data across triggers as intermediate state to drop duplicates rows. You can useDataset.withWatermark(java.lang.String,java.lang.String)
to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.- Overrides:
dropDuplicates
in classDataset<T>
- Parameters:
colNames
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicates
Description copied from class:Dataset
Returns a newDataset
with duplicate rows removed, considering only the subset of columns.For a static batch
Dataset
, it just drops duplicate rows. For a streamingDataset
, it will keep all data across triggers as intermediate state to drop duplicates rows. You can useDataset.withWatermark(java.lang.String,java.lang.String)
to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.- Overrides:
dropDuplicates
in classDataset<T>
- Parameters:
col1
- (undocumented)cols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicatesWithinWatermark
Description copied from class:Dataset
Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.This only works with streaming
Dataset
, and watermark for the inputDataset
must be set viaDataset.withWatermark(java.lang.String,java.lang.String)
.For a streaming
Dataset
, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.Note: too late data older than watermark will be dropped.
- Overrides:
dropDuplicatesWithinWatermark
in classDataset<T>
- Parameters:
colNames
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
dropDuplicatesWithinWatermark
public Dataset<T> dropDuplicatesWithinWatermark(String col1, scala.collection.immutable.Seq<String> cols) Description copied from class:Dataset
Returns a new Dataset with duplicates rows removed, considering only the subset of columns, within watermark.This only works with streaming
Dataset
, and watermark for the inputDataset
must be set viaDataset.withWatermark(java.lang.String,java.lang.String)
.For a streaming
Dataset
, this will keep all data across triggers as intermediate state to drop duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated as long as the time distance of earliest and latest events are smaller than the delay threshold of watermark." Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events.Note: too late data older than watermark will be dropped.
- Overrides:
dropDuplicatesWithinWatermark
in classDataset<T>
- Parameters:
col1
- (undocumented)cols
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
mapPartitions
Description copied from class:Dataset
(Java-specific) Returns a new Dataset that contains the result of applyingf
to each partition.- Overrides:
mapPartitions
in classDataset<T>
- Parameters:
f
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
flatMap
public <U> Dataset<U> flatMap(scala.Function1<T, scala.collection.IterableOnce<U>> func, Encoder<U> evidence$8) Description copied from class:Dataset
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results. -
flatMap
Description copied from class:Dataset
(Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results. -
foreachPartition
Description copied from class:Dataset
(Java-specific) Runsfunc
on each partition of this Dataset.- Overrides:
foreachPartition
in classDataset<T>
- Parameters:
func
- (undocumented)- Inheritdoc:
-
repartition
public Dataset<T> repartition(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs) Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
. The resulting Dataset is hash partitioned.This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Overrides:
repartition
in classDataset<T>
- Parameters:
numPartitions
- (undocumented)partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
repartition
Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is hash partitioned.This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Overrides:
repartition
in classDataset<T>
- Parameters:
partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
repartitionByRange
public Dataset<T> repartitionByRange(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs) Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions intonumPartitions
. The resulting Dataset is range partitioned.At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition
.- Overrides:
repartitionByRange
in classDataset<T>
- Parameters:
numPartitions
- (undocumented)partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
repartitionByRange
Description copied from class:Dataset
Returns a new Dataset partitioned by the given partitioning expressions, usingspark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is range partitioned.At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition
.- Overrides:
repartitionByRange
in classDataset<T>
- Parameters:
partitionExprs
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
distinct
Description copied from class:Dataset
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias fordropDuplicates
.Note that for a streaming
Dataset
, this method returns distinct rows only once regardless of the output mode, which the behavior may not be same withDISTINCT
in SQL against streamingDataset
. -
groupBy
Description copied from class:Dataset
Groups the Dataset using the specified columns, so that we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department. ds.groupBy("department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
rollup
Description copied from class:Dataset
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolled up by department and group. ds.rollup("department", "group").avg() // Compute the max age and average salary, rolled up by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
cube
Description copied from class:Dataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. SeeRelationalGroupedDataset
for all the available aggregate functions.This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group. ds.cube("department", "group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
-
agg
public Dataset<Row> agg(scala.Tuple2<String, String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String, String>> aggExprs) Description copied from class:Dataset
(Scala-specific) Aggregates on the entire Dataset without groups.// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg("age" -> "max", "salary" -> "avg") ds.groupBy().agg("age" -> "max", "salary" -> "avg")
-
agg
Description copied from class:Dataset
(Scala-specific) Aggregates on the entire Dataset without groups.// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(Map("age" -> "max", "salary" -> "avg")) ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
-
agg
Description copied from class:Dataset
(Java-specific) Aggregates on the entire Dataset without groups.// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(Map("age" -> "max", "salary" -> "avg")) ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
-
agg
Description copied from class:Dataset
Aggregates on the entire Dataset without groups.// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(max($"age"), avg($"salary")) ds.groupBy().agg(max($"age"), avg($"salary"))
-
groupByKey
Description copied from class:Dataset
(Java-specific) Returns aKeyValueGroupedDataset
where the data is grouped by the given keyfunc
.- Overrides:
groupByKey
in classDataset<T>
- Parameters:
func
- (undocumented)encoder
- (undocumented)- Returns:
- (undocumented)
- Inheritdoc:
-
isUnTyped
public boolean isUnTyped()
-