PySpark: How to select rows where any column contains a null value
When performing exploratory data analysis in PySpark, it is often useful to find the rows that contain a null in any column. This recipe filters a DataFrame down to only the rows in which one or more columns are null. It works by generating an isNull() condition for each column and combining the conditions with a series of OR (|) operators.
from functools import reduce

from pyspark.sql.functions import col


def select_rows_with_nulls(from_df):
    """Return only the rows of from_df in which at least one column is null."""
    return from_df.where(
        # Build an isNull() check for each column, then OR them all together.
        reduce(
            lambda cond1, cond2: cond1 | cond2,
            [col(col_name).isNull() for col_name in from_df.columns],
        )
    )
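For intuition, here is what the reduce expands to for a hypothetical DataFrame with columns a, b, and c (the column names are placeholders, not part of the recipe):

# Equivalent explicit expression for columns a, b, c:
condition = col("a").isNull() | col("b").isNull() | col("c").isNull()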
Example usage
In the following example, df is replaced with a DataFrame that contains only rows in which at least one column holds a null value.
df = select_rows_with_nulls(from_df=df)
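As a minimal end-to-end sketch, the snippet below builds a small DataFrame with made-up data and applies the function. The SparkSession setup, column names, and sample values are assumptions for illustration, not part of the original recipe.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: the second and third rows each contain a null.
people = spark.createDataFrame(
    [("alice", 34), ("bob", None), (None, 29)],
    schema=["name", "age"],
)

select_rows_with_nulls(from_df=people).show()
# Only the ("bob", null) and (null, 29) rows are kept;
# the fully populated ("alice", 34) row is filtered out.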