PySpark SQL
CHEAT SHEET

Initializing SparkSession
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
        .appName("PySpark SQL") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
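A quick way to sanity-check the new session, a minimal sketch using standard SparkSession attributes:
>>> spark.version -- Version of Spark the session is running on
>>> spark.sparkContext -- Underlying SparkContext, useful for RDD work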
Creating DataFrames
# import pyspark class Row from module sql, plus the schema types
>>> from pyspark.sql import *
>>> from pyspark.sql.types import *
• Infer Schema:
>>> sc = spark.sparkContext
>>> A = sc.textFile("mytable.txt")
>>> B = A.map(lambda x: x.split(","))
>>> C = B.map(lambda a: Row(col1=a[0], col2=int(a[1])))
>>> C_df = spark.createDataFrame(C)
• Specify Schema:
>>> C = B.map(lambda a: Row(col1=a[0], col2=int(a[1].strip())))
>>> schemaString = "MyTable"
>>> D = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> E = StructType(D)
>>> spark.createDataFrame(C, E).show()
    col1  col2
    row1  3
    row2  4
    row3  5
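DataFrames can also be created straight from local Python objects; a minimal sketch (the two Row values are made up for illustration):
>>> rows = [Row(col1="row1", col2=3), Row(col1="row2", col2=4)]
>>> spark.createDataFrame(rows).show()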
From Spark Data Sources
• JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
>>> df2 = spark.read.load("people.json", format="json")
• Parquet Files
>>> df3 = spark.read.load("users.parquet")
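CSV files go through the same generic loader; a sketch assuming a hypothetical example.csv with a header row:
>>> df4 = spark.read.load("example.csv", format="csv", header=True, inferSchema=True)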
Inspect Data
>>> df.dtypes -- Return df column names and data types
>>> df.show() -- Display the content of df
>>> df.head(n) -- Return the first n rows
>>> df.first() -- Return the first row
>>> df.schema -- Return the schema of df
>>> df.describe().show() -- Compute summary statistics
>>> df.columns -- Return the columns of df
>>> df.count() -- Count the number of rows in df
>>> df.distinct().count() -- Count the number of distinct rows in df
>>> df.printSchema() -- Print the schema of df
>>> df.explain() -- Print the (logical and physical) plans
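To bring a handful of rows back to the driver as Row objects instead of printing them, a minimal sketch:
>>> df.take(2) -- Return the first 2 rows as a list of Row objects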
SQL Queries
>>> from pyspark.sql import functions as f
• Select
>>> df.select("col1").show()
>>> df.select("col2", "col3").show()
• When
>>> df.select("col1", f.when(df.col2 > 30, 1).otherwise(0)).show()
>>> df[df.col1.isin("A", "B")].collect()
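Column expressions can be renamed or range-checked inline; a sketch reusing the example columns above:
>>> df.select(df.col1.alias("c1"), df.col2.between(10, 30)).show()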
Running SQL Queries Programmatically
• Registering DataFrames as Views:
>>> df.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
• Query Views
>>> df_one = spark.sql("SELECT * FROM customer").show()
>>> df_new = spark.sql("SELECT * FROM global_temp.people").show()
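Temporary views last for the lifetime of the session; a sketch for dropping the views registered above once they are no longer needed:
>>> spark.catalog.dropTempView("customer")
>>> spark.catalog.dropGlobalTempView("people")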
Column Operations
• Add
>>> df = df.withColumn('col1', df.table.col1) \
        .withColumn('col2', df.table.col2) \
        .withColumn('col3', df.table.col3) \
        .withColumn('col4', df.table.col4) \
        .withColumn('col5', f.explode(df.table.col5))
• Update
>>> df = df.withColumnRenamed('col1', 'column1')
• Remove
>>> df = df.drop("col3", "col4")
>>> df = df.drop(df.col3).drop(df.col4)
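Changing a column's data type follows the same withColumn pattern; a minimal sketch casting col2 to an integer:
>>> df = df.withColumn('col2', df.col2.cast("integer"))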
>>> [Link](["col1","col3"],ascending=[0,1])\ .collect()
• Missing & Replacing Values:
>>> [Link](20).show()
>>> [Link]().show()
>>> [Link] \ .replace(10, 20) \ .show()
• Repartitioning:
>>> [Link](10)\ df with 10 partitions .rdd \
.getNumPartitions()
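groupBy can also feed named aggregate functions; a sketch assuming the functions module imported as f above:
>>> df.groupBy("col1").agg(f.avg("col2"), f.max("col3")).show()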
Output Operations
• Data Structures:
>>> rdd_1 = df.rdd
>>> df.toJSON().first()
>>> df.toPandas()
• Write & Save to Files:
>>> df.select("Col1", "Col2").write.save("table_new.parquet")
>>> df.select("col3", "col5").write.save("table_new.json", format="json")
• Stopping SparkSession
>>> spark.stop()
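Putting the pieces above together, a minimal end-to-end sketch (file names are illustrative):
>>> spark = SparkSession.builder.appName("PySpark SQL").getOrCreate()
>>> df = spark.read.json("customer.json")
>>> df.filter(df["col2"] > 4).select("col1", "col2").write.save("filtered.parquet")
>>> spark.stop()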