pyspark.sql.DataFrame.collect

DataFrame.collect()

Returns all the records in the DataFrame as a list of Row.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Returns
list

A list of Row objects, each representing a row in the DataFrame.

See also

DataFrame.take

Returns the first num rows as a list of Row.

DataFrame.head

Returns the first n rows; when n is omitted, returns a single Row (or None if the DataFrame is empty).

DataFrame.toPandas

Returns the data as a pandas DataFrame.

DataFrame.toArrow

Returns the data as a PyArrow Table.

Notes

This method should only be used if the resulting list is expected to be small, as all the data is loaded into the driver’s memory.

Examples

Example: Collecting all rows of a DataFrame

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.collect()
[Row(age=14, name='Tom'), Row(age=23, name='Alice'), Row(age=16, name='Bob')]

Example: Collecting all rows after filtering

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.filter(df.age > 15).collect()
[Row(age=23, name='Alice'), Row(age=16, name='Bob')]

Example: Collecting all rows after selecting specific columns

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.select("name").collect()
[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]

Example: Collecting all rows after applying a function to a column

>>> from pyspark.sql.functions import upper
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.select(upper(df.name)).collect()
[Row(upper(name)='TOM'), Row(upper(name)='ALICE'), Row(upper(name)='BOB')]

Example: Collecting all rows from a DataFrame and converting a specific column to a list

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> rows = df.collect()
>>> [row["name"] for row in rows]
['Tom', 'Alice', 'Bob']

Example: Collecting all rows from a DataFrame and converting to a list of dictionaries

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> rows = df.collect()
>>> [row.asDict() for row in rows]
[{'age': 14, 'name': 'Tom'}, {'age': 23, 'name': 'Alice'}, {'age': 16, 'name': 'Bob'}]