PySpark - Split dataframe by column value
Last Updated: 28 Apr, 2025
A PySpark data frame is a distributed collection of data grouped into named columns.
There are various circumstances in which you need only particular rows from a data frame. For this, you can split the data frame according to a column value, using either the filter function or the where function. In this article, we will discuss both ways to split a data frame by column value.
Ways to split a PySpark data frame by column value:
- Using the filter function
- Using the where function
Method 1: Using the filter function
The filter function selects rows from the data frame based on a given condition or SQL expression. Here we apply the condition in filter() once with equal to and once with not equal to, and display both resulting data frames.
Syntax: data_frame.filter(condition)
Example:
In this example, we read a CSV file, student_data.csv, a 5×5 data set as follows:
Then, we split the data frame on column 'Age' using the filter function, once where its value is 18 and once where it is not. Finally, we display both data frames.
Python3
# PySpark - Split dataframe by
# column value using filter function
# Import the libraries SparkSession
from pyspark.sql import SparkSession
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
# Read the CSV file
df = spark_session.read.csv('student_data.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)
# Rows where age equals 18
df.filter(df.age == 18).show(truncate=False)
# Rows where age does not equal 18
df.filter(df.age != 18).show(truncate=False)
Output:
Method 2: Using the where function
The where function selects rows from the data frame based on a given SQL expression or condition; it is an alias of filter. Here we apply the condition in where() once with equal to and once with not equal to, and display both resulting data frames.
Syntax: data_frame.where(condition)
Example:
In this example, we again read the CSV file student_data.csv, a 5×5 data set as follows:
Then, we split the data frame on column 'Age' using the where function, once where its value is 18 and once where it is not. Finally, we display both data frames.
Python3
# PySpark - Split dataframe by column value using where function
# Import the libraries SparkSession
from pyspark.sql import SparkSession
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
# Read the CSV file
df = spark_session.read.csv('student_data.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)
# Rows where age equals 18
df.where(df.age == 18).show(truncate=False)
# Rows where age does not equal 18
df.where(df.age != 18).show(truncate=False)
Output: