PySpark SQL
CHEAT SHEET

Initializing SparkSession
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
        .appName("PySpark SQL") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
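A quick way to sanity-check the new session, a minimal sketch using standard SparkSession attributes:
>>> spark.version -- Version of Spark the session is running on
>>> spark.sparkContext -- Underlying SparkContext, useful for RDD work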
Creating DataFrames
# import pyspark class Row from module sql, plus the schema types
>>> from pyspark.sql import *
>>> from pyspark.sql.types import *
• Infer Schema:
>>> sc = spark.sparkContext
>>> A = sc.textFile("mytable.txt")
>>> B = A.map(lambda x: x.split(","))
>>> C = B.map(lambda a: Row(col1=a[0], col2=int(a[1])))
>>> C_df = spark.createDataFrame(C)
• Specify Schema:
>>> C = B.map(lambda a: Row(col1=a[0], col2=int(a[1].strip())))
>>> schemaString = "MyTable"
>>> D = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> E = StructType(D)
>>> spark.createDataFrame(C, E).show()
    col1  col2
    row1  3
    row2  4
    row3  5
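DataFrames can also be created straight from local Python objects; a minimal sketch (the two Row values are made up for illustration):
>>> rows = [Row(col1="row1", col2=3), Row(col1="row2", col2=4)]
>>> spark.createDataFrame(rows).show()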
From Spark Data Sources
• JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
>>> df2 = spark.read.load("people.json", format="json")
• Parquet Files
>>> df3 = spark.read.load("users.parquet")
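CSV files go through the same generic loader; a sketch assuming a hypothetical example.csv with a header row:
>>> df4 = spark.read.load("example.csv", format="csv", header=True, inferSchema=True)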
Inspect Data
>>> df.dtypes -- Return df column names and data types
>>> df.show() -- Display the content of df
>>> df.head(n) -- Return the first n rows
>>> df.first() -- Return the first row
>>> df.schema -- Return the schema of df
>>> df.describe().show() -- Compute summary statistics
>>> df.columns -- Return the columns of df
>>> df.count() -- Count the number of rows in df
>>> df.distinct().count() -- Count the number of distinct rows in df
>>> df.printSchema() -- Print the schema of df
>>> df.explain() -- Print the (logical and physical) plans
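To bring a handful of rows back to the driver as Row objects instead of printing them, a minimal sketch:
>>> df.take(2) -- Return the first 2 rows as a list of Row objects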
SQL Queries
>>> from pyspark.sql import functions as f
• Select
>>> df.select("col1").show()
>>> df.select("col2", "col3").show()
• When
>>> df.select("col1", f.when(df.col2 > 30, 1).otherwise(0)).show()
>>> df[df.col1.isin("A", "B")].collect()
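Column expressions can be renamed or range-checked inline; a sketch reusing the example columns above:
>>> df.select(df.col1.alias("c1"), df.col2.between(10, 30)).show()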
Running SQL Queries Programmatically
• Registering DataFrames as Views:
>>> df.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
• Query Views
>>> df_one = spark.sql("SELECT * FROM customer").show()
>>> df_new = spark.sql("SELECT * FROM global_temp.people").show()
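Temporary views last for the lifetime of the session; a sketch for dropping the views registered above once they are no longer needed:
>>> spark.catalog.dropTempView("customer")
>>> spark.catalog.dropGlobalTempView("people")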
Column Operations
• Add
>>> df = df.withColumn('col1', df.table.col1) \
        .withColumn('col2', df.table.col2) \
        .withColumn('col3', df.table.col3) \
        .withColumn('col4', df.table.col4) \
        .withColumn('col5', f.explode(df.table.col5))
• Update
>>> df = df.withColumnRenamed('col1', 'column1')
• Remove
>>> df = df.drop("col3", "col4")
>>> df = df.drop(df.col3).drop(df.col4)
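Changing a column's data type follows the same withColumn pattern; a minimal sketch casting col2 to an integer:
>>> df = df.withColumn('col2', df.col2.cast("integer"))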
>>> [Link](["col1","col3"],ascending=[0,1])\ .collect()
• Missing & Replacing Values:
>>> [Link](20).show()
>>> [Link]().show()
>>> [Link] \ .replace(10, 20) \ .show()
• Repartitioning:
>>> [Link](10)\ df with 10 partitions .rdd \
.getNumPartitions()
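groupBy can also feed named aggregate functions; a sketch assuming the functions module imported as f above:
>>> df.groupBy("col1").agg(f.avg("col2"), f.max("col3")).show()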
Output Operations
• Data Structures:
>>> rdd_1 = df.rdd
>>> df.toJSON().first()
>>> df.toPandas()
• Write & Save to Files:
>>> df.select("Col1", "Col2").write.save("table_new.parquet")
>>> df.select("col3", "col5").write.save("table_new.json", format="json")
• Stopping SparkSession
>>> spark.stop()
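Putting the pieces above together, a minimal end-to-end sketch (file names are illustrative):
>>> spark = SparkSession.builder.appName("PySpark SQL").getOrCreate()
>>> df = spark.read.json("customer.json")
>>> df.filter(df["col2"] > 4).select("col1", "col2").write.save("filtered.parquet")
>>> spark.stop()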