Python PySpark pivot() Function
Last Updated: 24 Sep, 2024
The pivot() function in PySpark is a powerful method used to reshape a DataFrame by transforming unique values from one column into multiple columns in a new DataFrame, while aggregating data in the process. The function takes a set of unique values from a specified column and turns them into separate columns.
In this article, we will go through a detailed example of how to use the pivot() function in PySpark, covering its usage step by step.
Introduction to PySpark pivot()
In PySpark, the pivot() function is part of the DataFrame API. It allows us to convert rows into columns by specifying:
- A column whose values will become new columns.
- Optionally, an aggregation method to apply during the pivot process, e.g., sum(), avg(), etc.
The syntax of the pivot() function is:
df.groupBy(grouping_columns).pivot(pivot_column, values)
Where:
- grouping_columns: The column(s) to group by before pivoting; each group becomes a row in the result.
- pivot_column: The column whose unique values will become the new column headers.
- values: An optional list of values from pivot_column to include. If not specified, all unique values will be used.
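As a minimal sketch of this call pattern (the column names Category, Month, and Amount are placeholders for illustration, not columns from the example that follows):
Python
# Group by one column, pivot on another, and aggregate a numeric column.
# "Category", "Month", and "Amount" are placeholder column names used only for illustration.
pivoted = df.groupBy("Category").pivot("Month").sum("Amount")
pivoted.show()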
Example Data:
Let's consider the following DataFrame, which contains sales data of different products in various regions:
Product | Region | Sales
---|---|---
A | East | 100
A | West | 150
B | East | 200
B | West | 250
C | East | 300
C | West | 350
We will pivot this data so that each region becomes a column, with sales as the values.
Step-by-Step Implementation
1. Initial DataFrame Setup
First, let's create a PySpark DataFrame for the sales data:
Python
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("PySpark Pivot Example").getOrCreate()
# Sample data
data = [
("A", "East", 100),
("A", "West", 150),
("B", "East", 200),
("B", "West", 250),
("C", "East", 300),
("C", "West", 350)
]
# Create a DataFrame
columns = ["Product", "Region", "Sales"]
df = spark.createDataFrame(data, columns)
# Show the initial DataFrame
df.show()
Output:
Creating a PySpark DataFrame
2. Using pivot()
Now we will use the pivot() function to reorganize the data. We want to pivot on the Region column, so that East and West become separate columns. We will aggregate the sales by Product.
Python
# ...
# Pivot the DataFrame
pivot_df = df.groupBy("Product").pivot("Region").sum("Sales")
# Show the pivoted DataFrame
pivot_df.show()
Output:
Using PySpark Pivot Function
In this output:
- Each Product is in a single row.
- The East and West regions have become columns.
- The values in the new columns represent the Sales values.
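The same pivot can also be written with an explicit aggregation expression via agg(), which is handy when we want to choose or name the aggregation ourselves. A small sketch, assuming the same df created above:
Python
from pyspark.sql import functions as F
# Equivalent pivot expressed with an explicit aggregation expression
pivot_df_agg = df.groupBy("Product").pivot("Region").agg(F.sum("Sales"))
pivot_df_agg.show()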
3. Aggregating with pivot()
We can also apply other aggregation functions during the pivot process, such as avg(), min(), or max(). When there are multiple rows for the same Product and Region combination, the chosen aggregation determines how they are combined; below we first sum and then average the duplicate entries.
Python
# ...
# Sample data with multiple entries for the same product-region combination
data_with_duplicates = [
("A", "East", 100),
("A", "East", 50),
("A", "West", 150),
("B", "East", 200),
("B", "West", 250),
("B", "West", 100),
("C", "East", 300),
("C", "West", 350)
]
# Create a new DataFrame with duplicate data
df_with_duplicates = spark.createDataFrame(data_with_duplicates, columns)
# Pivot and sum the sales
pivot_df_with_aggregation_sum = df_with_duplicates.groupBy("Product").pivot("Region").sum("Sales")
# Show the result
pivot_df_with_aggregation_sum.show()
# Pivot and average the sales
pivot_df_with_aggregation_avg = df_with_duplicates.groupBy("Product").pivot("Region").avg("Sales")
# Show the result
pivot_df_with_aggregation_avg.show()
Output:
Aggregating with Pivot in PySpark
Here, in the summed result, the two East entries for Product A have been added together (100 + 50 = 150), and the two West entries for Product B have likewise been summed (250 + 100 = 350).
In the averaged result, Product A's East sales become (100 + 50) / 2 = 75.0, and Product B's West sales become (250 + 100) / 2 = 175.0.
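If both totals and averages are needed at once, pivot() can be combined with agg() and several aggregation expressions in a single pass. A hedged sketch, assuming the df_with_duplicates DataFrame from above:
Python
from pyspark.sql import functions as F
# Several aggregations in one pivot; each pivoted column is suffixed with its alias
multi_agg_df = (
    df_with_duplicates
    .groupBy("Product")
    .pivot("Region")
    .agg(F.sum("Sales").alias("total"), F.avg("Sales").alias("avg"))
)
# Produces columns such as East_total, East_avg, West_total, West_avg
multi_agg_df.show()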
Conclusion
The pivot() function in PySpark is a powerful tool for transforming data. It allows us to convert row-based data into column-based data by pivoting on a specific column's values. In this article, we demonstrated how to pivot data using PySpark, with a focus on sales data by region. Additionally, we showed how to apply aggregation methods like sum() and avg() during the pivot process. By utilizing pivot(), we can restructure our DataFrame to make it more suitable for further analysis or reporting.
When using pivot(), keep in mind:
- We should choose the correct column for pivoting based on the data structure.
- Ensure that the aggregation method used suits our needs, whether it be sum(), avg(), or another method.
- Call groupBy() before pivot(); pivot() is only available on a grouped DataFrame, and the grouping columns determine the rows of the result (see the sketch after this list).
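As a final sketch tying these points together (reusing the df DataFrame from the earlier steps), passing an explicit list of pivot values is optional but lets Spark skip the extra pass it otherwise needs to discover the distinct values:
Python
# Restrict the pivot to known Region values; Spark then does not have to compute them first
pivot_df_explicit = df.groupBy("Product").pivot("Region", ["East", "West"]).sum("Sales")
pivot_df_explicit.show()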