PySpark - Split dataframe by column value
Last Updated: 28 Apr, 2025
A PySpark data frame is a distributed collection of data grouped into named columns.
There are various circumstances in which you need only particular rows from a data frame. For this, you can split the data frame according to a column value, using either the filter function or the where function. In this article, we will discuss both ways to split a data frame by column value.
Ways to split a PySpark data frame by column value:
- Using the filter function
- Using the where function
Method 1: Using the filter function
The filter function selects rows from the data frame based on a given condition or SQL expression. Here we apply the condition in filter() once with equal to and once with not equal to, and display both resulting data frames.
Syntax: data_frame.filter(condition)
Example:
In this example, we read a CSV file, student_data.csv, a 5×5 data set as follows:
Then, we split the data frame on column 'Age' using the filter function, once where its value is 18 and once where it is not. Finally, we display both data frames.
Python3
# PySpark - Split dataframe by
# column value using filter function
# Import the libraries SparkSession
from pyspark.sql import SparkSession
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
# Read the CSV file
df = spark_session.read.csv('student_data.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)
# Rows where age equals 18
df.filter(df.age == 18).show(truncate=False)
# Rows where age does not equal 18
df.filter(df.age != 18).show(truncate=False)
Output:
Method 2: Using the where function
The where function selects rows from the data frame based on a given SQL expression or condition; it is an alias of filter. Here we apply the condition in where() once with equal to and once with not equal to, and display both resulting data frames.
Syntax: data_frame.where(condition)
Example:
In this example, we again read the CSV file student_data.csv, a 5×5 data set as follows:
Then, we split the data frame on column 'Age' using the where function, once where its value is 18 and once where it is not. Finally, we display both data frames.
Python3
# PySpark - Split dataframe by column value using where function
# Import the libraries SparkSession
from pyspark.sql import SparkSession
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
# Read the CSV file
df = spark_session.read.csv('student_data.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)
# Rows where age equals 18
df.where(df.age == 18).show(truncate=False)
# Rows where age does not equal 18
df.where(df.age != 18).show(truncate=False)
Output: