pyspark.sql.functions.collect_list

pyspark.sql.functions.collect_list(col)

Aggregate function: Collects the values of a column into a list, keeping duplicates, and returns that list.

New in version 1.6.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

col (Column or str): The target column on which the function is computed.

Returns

Column: A new Column object representing a list of collected values, with duplicate values included.

Notes

The function is non-deterministic because the order of the collected results depends on the order of the rows, which may itself be non-deterministic after a shuffle operation.

Examples

Example 1: Collect values from a DataFrame and sort the result in ascending order

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ('value',))
>>> df.select(sf.sort_array(sf.collect_list('value')).alias('sorted_list')).show()
+-----------+
|sorted_list|
+-----------+
|  [1, 2, 2]|
+-----------+

Example 2: Collect values from a DataFrame and sort the result in descending order

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df.select(sf.sort_array(sf.collect_list('age'), asc=False).alias('sorted_list')).show()
+-----------+
|sorted_list|
+-----------+
|  [5, 5, 2]|
+-----------+

Example 3: Collect values per group from a DataFrame with multiple columns and sort each result

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))
>>> df = df.groupBy("name").agg(sf.sort_array(sf.collect_list('id')).alias('sorted_list'))
>>> df.orderBy(sf.desc("name")).show()
+----+-----------+
|name|sorted_list|
+----+-----------+
|John|     [1, 2]|
| Ana|        [3]|
+----+-----------+