pyspark.sql.DataFrame.dropna

DataFrame.dropna(how='any', thresh=None, subset=None)

Returns a new DataFrame omitting rows with null or NaN values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.

New in version 1.3.1.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
how : str, optional, default ‘any’. The value can be ‘any’ or ‘all’.

If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.

thresh : int, optional, default None.

If specified, drop rows that have fewer than thresh non-null values. This overrides the how parameter.

subset : str, tuple or list, optional

List of column names to consider.

Returns
DataFrame

A new DataFrame with rows containing null or NaN values removed, according to the how, thresh, and subset parameters.

Examples

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([
...     Row(age=10, height=80.0, name="Alice"),
...     Row(age=5, height=float("nan"), name="Bob"),
...     Row(age=None, height=None, name="Tom"),
...     Row(age=None, height=float("nan"), name=None),
... ])

Example 1: Drop the row if it contains any null or NaN.

>>> df.na.drop().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|  80.0|Alice|
+---+------+-----+

Example 2: Drop the row only if all its values are null or NaN.

>>> df.na.drop(how='all').show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|  80.0|Alice|
|   5|   NaN|  Bob|
|NULL|  NULL|  Tom|
+----+------+-----+

Example 3: Drop rows that have fewer than thresh non-null and non-NaN values.

>>> df.na.drop(thresh=2).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|  80.0|Alice|
|  5|   NaN|  Bob|
+---+------+-----+

Example 4: Drop rows with null and NaN values in the specified columns.

>>> df.na.drop(subset=['age', 'name']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|  80.0|Alice|
|  5|   NaN|  Bob|
+---+------+-----+