AWS Glue Interview Questions
AWS Glue is a managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize, clean, enrich, and move data reliably between various data stores and data streams. AWS Glue consists of the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a customizable scheduler that handles dependency resolution, job monitoring, and retries. Because AWS Glue is serverless, there is no infrastructure to set up or maintain.
The general workflow for using AWS Glue to build your Data Catalog and process ETL data flows is:
1. You define a crawler for your data store to populate the AWS Glue Data Catalog with metadata table entries. When you point the crawler at a data store, it creates table definitions in the Data Catalog. For streaming sources, you manually define Data Catalog tables and specify the data stream properties.
2. In addition to table definitions, the AWS Glue Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.
3. AWS Glue can generate a data transformation script for you, or you can provide your own script through the AWS Glue console or API.
4. You can run your job on demand, or set it to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
5. When your job runs, a script extracts data from your data source, transforms it, and loads it into your data target. The script runs in an Apache Spark environment managed by AWS Glue.
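This workflow can also be driven programmatically. Below is a minimal, hedged sketch using boto3 (the AWS SDK for Python); the crawler name, IAM role, database, S3 path, and job name are placeholders rather than values from this article.

import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 path and writes table definitions
# into the Data Catalog (all names, the role, and the path are hypothetical).
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")

# Run an existing ETL job on demand once the catalog is populated.
run = glue.start_job_run(JobName="sales-etl-job")
print(run["JobRunId"])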
Job Scheduler
Several jobs can be initiated simultaneously, and users can specify job dependencies.
Developer Endpoints
Development endpoints provide an environment where you can interactively develop, test, and debug your ETL scripts.
A Glue classifier is used when a crawler scans a data store to create metadata tables in the AWS Glue Data Catalog. You can configure your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether it recognizes the data. If the first classifier does not recognize the data or is not certain, the crawler moves on to the next classifier in the list to see if it can.
AWS Glue DataBrew allows users to clean and normalize data using a visual interface.
AWS Glue Elastic Views enables users to combine and replicate data across multiple data stores.
These capabilities let you spend more time analyzing your data by automating much of the undifferentiated work involved in data discovery, categorization, cleaning, enrichment, and movement.
AWS Glue natively supports the following data stores as sources and targets:
1. Amazon Aurora
2. Amazon RDS for MySQL
3. Amazon RDS for Oracle
4. Amazon RDS for PostgreSQL
5. Amazon RDS for SQL Server
6. Amazon Redshift
7. DynamoDB
8. Amazon S3
9. MySQL
10. Oracle
11. Microsoft SQL Server
The AWS Glue Data Catalog is your persistent metadata repository. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way an Apache Hive metastore does. Each AWS account has one AWS Glue Data Catalog per region. It provides a central location where disparate systems can store and find metadata, break down data silos, and use that metadata to query and transform the data. Access to the data sources managed by the AWS Glue Data Catalog can be controlled with AWS Identity and Access Management (IAM) policies.
9. Which AWS services and open-source projects use AWS Glue Data
Catalog?
The AWS Glue Data Catalog is used by AWS services such as Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, and by open-source projects that are compatible with the Apache Hive metastore (for example, via the AWS Glue Data Catalog client for Apache Hive).
The AWS Glue crawler is used to populate the AWS Glue Data Catalog with tables. It can crawl multiple data stores in a single run. When the crawler finishes, it creates or updates one or more tables in the Data Catalog. ETL jobs defined in AWS Glue use these Data Catalog tables as sources and targets: the job reads from and writes to the data stores that the source and target tables point to.
The AWS Glue Schema Registry lets you validate and control the evolution of streaming data using registered Apache Avro schemas, at no additional charge. Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda all integrate with the Schema Registry.
12. Why should we use AWS Glue Schema Registry?
Validate schemas: Schemas used for data production are checked against
schemas in a central registry when data streaming apps are linked with AWS
Glue Schema Registry, allowing you to regulate data quality centrally.
Save costs: Serializers transform data into a binary format that can be
compressed before transferring, lowering data transfer and storage costs.
AWS Batch lets you run any batch computing workload on AWS easily and efficiently, regardless of the nature of the job. AWS Batch creates and manages the compute resources in your AWS account, giving you full control over and visibility into the resources in use. AWS Glue is a fully managed ETL service that runs your ETL jobs in a serverless Apache Spark environment. For ETL use cases, AWS Glue is the recommended choice; AWS Batch may be a better fit for general batch-oriented workloads that fall outside typical ETL use cases.
14. What kinds of evolution rules does AWS Glue Schema Registry
support?
Backward, Backward All, Forward, Forward All, Full, Full All, None, and Disabled are the
compatibility modes accessible to regulate your schema evolution.
15. How does AWS Glue Schema Registry maintain high availability
for applications?
The Schema Registry storage and control plane are backed by the AWS Glue SLA, and the serializers and deserializers use best-practice caching strategies to maximize schema availability within clients.
The serializers and deserializers are Apache-licensed open-source components, but the
Glue Schema Registry storage is an AWS service.
17. How does AWS Glue relate to AWS Lake Formation?
AWS Lake Formation builds on AWS Glue's shared infrastructure, including its console controls, ETL code generation and job monitoring, shared Data Catalog, and serverless architecture. While AWS Glue remains focused on these types of workloads, Lake Formation adds capabilities on top of AWS Glue for building, securing, and managing data lakes.
The term "development endpoints" is used to describe the AWS Glue API's testing
capabilities when utilizing Custom DevEndpoint. A developer may debug the extract,
transform, and load ETL Scripts at the endpoint.
A tag is a label you apply to an Amazon Web Services resource. Each tag has a key and
an optional value, both of which are defined by you.
In AWS Glue, you may use tags to organize and identify your resources. Tags can be
used to generate cost accounting reports and limit resource access. You can restrict
which users in your AWS account have authority to create, update, or delete tags if you
use AWS Identity and Access Management.
You can assign tags to the following AWS Glue resources:
Crawler
Job
Trigger
Workflow
Development endpoint
Machine learning transform
20. What are the points to remember when using tags with AWS Glue?
The AWS Glue Data Catalog database is a container that houses tables. You utilize
databases to categorize your tables. When you run a crawler or manually add a table,
you establish a database. All of your databases are listed in the AWS Glue console's
database list.
22. What programming language is used to write ETL code for AWS
Glue?
You can write ETL code for AWS Glue in Scala or Python (PySpark); AWS Glue can also generate this code for you.
AWS Glue Jobs is a managed platform for orchestrating your ETL workflow. In AWS Glue, you can define jobs to automate the scripts that extract, transform, and move data to various destinations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.
The AWS Glue Data Catalog integrates with Amazon EMR, Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore, providing a consistent metadata repository across many data sources and data formats.
Advanced AWS Glue interview questions with answers
25. Does AWS Glue have a no-code interface for visual ETL?
Yes. AWS Glue Studio is a graphical tool for creating Glue jobs that process data. Once you define the flow of your data sources, transformations, and targets in the visual interface, AWS Glue Studio generates the Apache Spark code on your behalf.
AWS Glue metadata such as databases, tables, partitions, and columns may be queried
using Athena. Individual hive DDL commands can be used to extract metadata
information from Athena for specific databases, tables, views, partitions, and columns,
but the results are not tabular.
27. What is the general workflow for how a Crawler populates the
AWS Glue Data Catalog?
The usual workflow for populating the AWS Glue Data Catalog with a crawler is as follows:
1. To infer the format and schema of your data, the crawler runs any custom classifiers you specify. Custom classifiers are provided by you and run in the order you specify.
2. The first custom classifier that successfully recognizes your data structure is used to create the schema; lower-priority custom classifiers are skipped. (If no custom classifier matches, AWS Glue's built-in classifiers try to recognize the data.)
3. The crawler connects to the data store. Some data stores require connection properties for crawler access.
4. The crawler writes metadata to the Data Catalog. A table definition contains metadata that describes the data in your data store. The table is stored in a database, which is a container of tables in the Data Catalog. The table's classification attribute is the label created by the classifier that inferred the table schema.
The AWS Glue script recommendation engine generates Scala or Python ETL code. It uses Glue's ETL library to manage job execution and simplify access to data sources. You can write ETL code with AWS Glue's library, or write arbitrary Scala or Python code in the AWS Glue console's script editor and then download and edit it in your own IDE.
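For reference, a Glue-generated PySpark ETL script typically has a shape like the sketch below. This is a hedged, minimal example using the awsglue library: the database, table, column mappings, and S3 path are placeholders, and a real generated script also includes job-argument and bookmark boilerplate.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

# Set up the Glue and Spark contexts.
glueContext = GlueContext(SparkContext.getOrCreate())

# Read a table registered in the Data Catalog (placeholder names).
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_sales"
)

# Apply a simple column mapping/transformation.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("sale_id", "long", "sale_id", "long"),
              ("amount", "double", "amount", "double")],
)

# Write the result to S3 as Parquet (placeholder path).
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/sales/"},
    format="parquet",
)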
In addition to the ETL library and code generation, AWS Glue includes a robust set of orchestration features that let you manage dependencies between multiple jobs and build end-to-end ETL workflows. AWS Glue ETL jobs can be scheduled or triggered when other jobs finish; several jobs can be started in parallel or in sequence by triggering them on a job completion event.
AWS Glue uses triggers to handle dependencies among two or more activities or
external events. Triggers can both watch and invoke jobs. The three options are a
scheduled trigger, which runs jobs regularly, an on-demand trigger, or a job completion
trigger.
AWS Glue tracks job metrics and errors and pushes all notifications to Amazon CloudWatch. You can configure Amazon CloudWatch to perform various actions in response to these AWS Glue notifications; for example, you can trigger an AWS Lambda function when you receive an error or success notification from Glue. Glue also provides default retry behavior that retries failures three times before raising an error.
33. What data formats, client languages, and integrations does the AWS Glue Schema Registry support?
The Schema Registry supports Java client apps and Apache Avro and JSON Schema
data formats. We intend to keep adding support for non-Java clients and various data
types. The Schema Registry works with Apache Kafka, Amazon Managed Streaming for
Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis
Data Analytics for Apache Flink, and AWS Lambda applications.
34. How to get metadata into the AWS Glue Data Catalog?
The AWS Glue Data Catalog can be populated in a variety of ways. Crawlers in the Glue
Data Catalog search various data stores you own to infer schemas and partition
structure and populate the Glue Data Catalog with table definitions and statistics. You
can also run crawlers regularly to keep your metadata current and in line with the
underlying data. Users can also use the AWS Glue Console or the API to manually add
and change table information. Hive DDL statements can also be executed on an
Amazon EMR cluster via the Amazon Athena Console or a Hive client.
35. How to import data from the existing Apache Hive Metastore to the
AWS Glue Data Catalog?
Simply execute an ETL process that reads data from your Apache Hive Metastore,
exports it to Amazon S3, and imports it into the AWS Glue Data Catalog.
36. Do I need to maintain my Apache Hive Metastore if I store my metadata in the AWS Glue Data Catalog?
No. The AWS Glue Data Catalog is compatible with the Apache Hive Metastore; you can point your applications at the Glue Data Catalog endpoint and use it as a drop-in replacement for an Apache Hive Metastore.
37. When should we use AWS Glue Streaming, and when should I use
Amazon Kinesis Data Analytics?
Streaming data can be processed with AWS Glue and Amazon Kinesis Data Analytics.
AWS Glue is advised when your use cases are mostly ETL, and you wish to run tasks
on a serverless Apache Spark-based infrastructure. Amazon Kinesis Data Analytics is
recommended when your use cases are mostly analytics, and you want to run jobs on a
serverless Apache Flink-based platform.
AWS Glue's streaming ETL lets you run complex ETL on streaming data using the same serverless, pay-as-you-go infrastructure you use for batch jobs. AWS Glue generates customizable ETL code to prepare your data in flight and has built-in functionality to process streaming data that is semi-structured or has an evolving schema. Use Glue to apply its built-in and Spark-native transformations to data streams and load them into your data lake or data warehouse.
Amazon Kinesis Data Analytics lets you build sophisticated streaming applications that analyze data in real time. It provides a serverless Apache Flink runtime that scales without servers and durably saves application state. Use Amazon Kinesis Data Analytics for real-time analytics and more general stream data processing.
AWS Glue DataBrew is a visual data preparation tool that lets data analysts and data scientists prepare data without writing code, using an interactive, point-and-click graphical interface. With Glue DataBrew you can visualize, clean, and normalize terabytes, even petabytes, of data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS.
AWS Glue DataBrew is designed for users that need to clean and standardize data
before using it for analytics or machine learning. The most common users are data
analysts and data scientists. Business intelligence analysts, operations analysts, market
intelligence analysts, legal analysts, financial analysts, economists, quants, and
accountants are examples of employment functions for data analysts. Materials
scientists, bioanalytical scientists, and scientific researchers are all examples of
employment functions for data scientists.
You can combine, pivot, and transpose data using over 250 built-in transformations without writing code. AWS Glue DataBrew also automatically recommends transformations such as filtering anomalies; correcting invalid, incorrectly classified, or duplicate data; normalizing data to standard date and time values; or generating aggregates for analysis. Glue DataBrew supports transformations that use advanced machine learning techniques such as natural language processing (NLP), for example reducing words to a common base or root word. Multiple transformations can be grouped together, saved as recipes, and applied directly to new incoming data.
AWS Glue DataBrew accepts comma-separated values (.csv), JSON and nested JSON,
Apache Parquet and nested Apache Parquet, and Excel sheets as input data types.
Comma-separated values (.csv), JSON, Apache Parquet, Apache Avro, Apache ORC,
and XML are all supported as output data formats in AWS Glue DataBrew.
AWS Glue Elastic Views makes it simple to create materialized views that integrate and
replicate data across various data stores without writing proprietary code. AWS Glue
Elastic Views can quickly generate a virtual materialized view table from multiple source
data stores using familiar Structured Query Language (SQL). AWS Glue Elastic Views
moves data from each source data store to a destination datastore and generates a
duplicate of it. AWS Glue Elastic Views continuously monitors data in your source data
stores, and automatically updates materialized views in your target data stores, ensuring
that data accessed through the materialized view is always up-to-date.
Use AWS Glue Elastic Views to aggregate and continuously replicate data across
several data stores in near-real-time. This is frequently the case when implementing new
application functionality requiring data access from one or more existing data stores. For
example, a company might use a customer relationship management (CRM) application
to keep track of customer information and an e-commerce website to handle online
transactions. The data would be stored in these apps or more data stores. The firm is
now developing a new custom application that produces and displays special offers for
active website visitors.
1. Data lake build & consolidation: Glue can extract data from multiple
sources and load the data into a central data lake powered by
something like Amazon S3.
2. Data migration: For large migration and modernization initiatives, Glue
can help move data from a legacy data store to a modern data lake or
data warehouse.
3. Data transformation: Glue provides a visual workflow to transform data
using a comprehensive built-in transformation library or custom
transformation using PySpark
4. Data cataloging: Glue can assist data governance initiatives since it
supports automatic metadata cataloging across your data sources and
targets, making it easy to discover and understand data relationships.
When compared to other options for setting up data pipelines, such as
Apache NiFi or Apache Airflow, AWS Glue is typically a good choice if:
1. You want a fully managed solution: With Glue, you don’t have to
worry about setting up, patching, or maintaining any infrastructure.
2. Your data sources are primarily in AWS: Glue integrates natively with
many AWS services, such as S3, Redshift, and RDS.
3. You are constrained by programming skills availability: Glue's visual workflow makes it easy to create data pipelines in a no-code or low-code way.
4. You need flexibility and scalability: Glue can scale automatically to
meet demand and can handle petabyte-scale data.
AWS Glue is a fully managed ETL (extract, transform, and load) service
that makes it easy for customers to prepare and load their data for
analytics. AWS EMR, on the other hand, is a service that makes it easy to
process large amounts of data quickly and efficiently.
AWS Glue and EMR are both used for data processing, but they differ in how they process data and in their typical use cases.
AWS Glue can be easily used to process both structured as well as
unstructured data while AWS EMR is typically suited for processing
structured or semi-structured data.
AWS Glue can automatically discover and categorize the data. AWS EMR
does not have that capability.
AWS Glue can be used to process streaming data or data in near-real-
time, while AWS EMR is typically used for scheduled batch processing.
Usage of AWS Glue is charged per DPU hour while EMR is charged per
underlying EC2 instance hour.
AWS Glue is easier to get started than EMR as Glue does not require
developers to have prior knowledge of MapReduce or Hadoop.
Glue connections store the properties needed to reach your data stores; crawlers and jobs use them to move data from source to target. In addition to many AWS-native data stores, Glue connections also support external data sources, as long as those sources can be reached through a JDBC driver.
Glue makes it possible to aggregate logs from various sources into a common data lake, making these logs easy to access and maintain.
Using interactive sessions, you can author and test your scripts as Jupyter
notebooks. Glue supports a comprehensive set of Jupyter magics allowing
developers to develop rich data preparation or transformation scripts.
What are the two types of workflow views in
AWS Glue?
The two types of workflow views are static views and dynamic views.
Static view can be considered as the design view of the workflow, whereas the
dynamic view is the runtime view of the workflow that includes logs, status
and error details for the latest run of the workflow.
Static view is used mainly while defining the workflow, whereas dynamic view
is used when operating the workflow.
AWS Glue Data Catalog suits organizations heavily invested in the AWS
ecosystem, whereas Collibra Data Catalog is ideal for those prioritizing
advanced governance features and flexibility in connecting with various data
sources.
The Data Catalog integrates with other AWS services like Amazon Athena and
Amazon Redshift Spectrum, allowing direct querying of the data without
moving it. Additionally, it stores metadata related to ETL jobs, aiding in
automating data preparation for analysis. This approach creates a unified view
of all data, irrespective of its location or format.
What are AWS Glue jobs?
Answer: AWS Glue jobs are the core ETL operations that perform data transformations and move
data between different data stores. You can create, schedule, and manage Glue jobs using the
AWS Management Console, AWS SDKs, or AWS CLI.
5. What are some advantages of using AWS Glue over traditional ETL solutions?
Answer: a. Fully managed service with no infrastructure to manage. b. Automatic scaling to
handle varying workloads. c. Pay-as-you-go pricing model. d. Integration with other AWS
services. e. Support for various data formats and sources.
9. How does AWS Glue handle schema changes in the source data?
Answer: AWS Glue crawlers can automatically detect schema changes in the source data and
update the metadata in the Data Catalog. You can also configure the crawler to update the schema
in the Data Catalog with new columns or changes to the data type of existing columns.
What is AWS Glue Studio?
Answer: AWS Glue Studio is a visual interface for creating, managing, and monitoring AWS
Glue ETL jobs. It simplifies the ETL job creation process by providing a drag-and-drop interface
for defining sources, transformations, and targets, and generating the ETL code automatically.
Use-Case Scenario
Imagine you’re a data analyst working with a global company that receives
sales data from different regions around the world. The data you’re working
with includes the timestamp of each transaction, which is stored in UTC
time. However, for your analysis, you need to convert these timestamps
into local times to get a more accurate picture of customer behaviors during
their local hours. Here, the from_utc_timestamp function comes into play.
Detailed Examples
First, let’s start by creating a PySpark session:
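The code block for this step was not preserved in this copy, so here is a minimal sketch of the session setup it most likely contained (the application name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to PySpark.
spark = SparkSession.builder.appName("utc-to-local").getOrCreate()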
Let’s assume we have a data frame with sales data, which includes a
timestamp column with UTC times. We’ll use hardcoded values for
simplicity:
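The DataFrame-creation snippet is also missing; based on the sale_id and timestamp values visible in the output further below, it would have looked roughly like this:

from pyspark.sql import functions as F

data = [
    (1, "2023-01-01 13:30:00"),
    (2, "2023-02-01 14:00:00"),
    (3, "2023-03-01 15:00:00"),
]
df = spark.createDataFrame(data, ["sale_id", "timestamp"])

# Cast the string column to a proper timestamp type.
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))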
Now, our data frame has a ‘timestamp’ column with UTC times. Let’s
convert these to New York time using the from_utc_timestamp function:
from pyspark.sql.functions import from_utc_timestamp
df = df.withColumn("NY_time", from_utc_timestamp(df["timestamp"], "America/New_York"))
df.show(truncate=False)
Output
+-------+-------------------+-------------------+
|sale_id|timestamp |NY_time |
+-------+-------------------+-------------------+
|1 |2023-01-01 13:30:00|2023-01-01 08:30:00|
|2 |2023-02-01 14:00:00|2023-02-01 09:00:00|
|3 |2023-03-01 15:00:00|2023-03-01 10:00:00|
+-------+-------------------+-------------------+
To see the full list of timezone names that from_utc_timestamp accepts, you can list them with pytz:
import pytz

# Print every valid IANA timezone name (for example, "America/New_York").
for tz in pytz.all_timezones:
    print(tz)
How to resolve
First, you can try reinstalling the pyspark package with pip.
The issue occurs due to a compatibility problem with Python 3.7 or later versions
and PySpark with Spark 2.4.4. PySpark uses an outdated method to check for a file
type, which leads to this TypeError.
A quick fix for this issue is to downgrade your Python version to 3.6. However, if
you don’t want to downgrade your Python version, you can apply a patch to
PySpark’s codebase.
The patch involves modifying the pyspark/serializers.py file in your PySpark
directory:
1. Open the pyspark/serializers.py file in a text editor. The exact path depends on
your PySpark installation.
2. Find the following function definition (around line 377):
def _read_with_length(stream):
    length = read_int(stream)
    if length == SpecialLengths.END_OF_DATA_SECTION:
        return None
    return stream.read(length)
3. Replace the final return stream.read(length) line with the following, so that a truncated stream raises an EOFError instead of silently returning incomplete data:
    result = stream.read(length)
    if length and not result:
        raise EOFError
    return result
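The code block that the explanation below refers to did not survive in this copy; a minimal sketch of what it describes, using the same generic column names, is:

from pyspark.sql.functions import col

# Cast a decimal column to integer and store the result in a new column.
df = df.withColumn("integer_column", col("decimal_column").cast("integer"))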
In the above code:
df is your DataFrame.
integer_column is the new column with integer values.
decimal_column is the column you want to convert from decimal to integer.
Now, let’s illustrate this process with a practical example. We will first initialize a
PySpark session and create a DataFrame:
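The session and DataFrame code is missing here; reconstructed from the output shown below, it would have been approximately:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("decimal-to-int").getOrCreate()

# Hardcoded sample data matching the output below.
data = [("Sachin", 10.5), ("Ram", 20.8), ("Vinu", 30.3), (None, None)]
df = spark.createDataFrame(data, ["Name", "Score"])
df.show()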
+------+-----+
| Name|Score|
+------+-----+
|Sachin| 10.5|
| Ram| 20.8|
| Vinu| 30.3|
| null| null|
+------+-----+
df = df.withColumn("Score", col("Score").cast("integer"))
df.show()
+------+-----+
| Name|Score|
+------+-----+
|Sachin| 10|
| Ram| 20|
| Vinu| 30|
| null| null|
+------+-----+
The ‘Score’ column values are now converted into integers. The decimal parts
have been truncated, and not rounded. Also, observe how the NULL value
remained NULL after the conversion.
PySpark’s flexible and powerful data manipulation functions, like cast, make it a
highly capable tool for data analysis.
PySpark : A Comprehensive Guide to Converting Expressions to Fixed-Point Numbers in PySpark
Among PySpark’s numerous features, one that stands out is its ability to convert
input expressions into fixed-point numbers. This feature comes in handy when
dealing with data that requires a high level of precision or when we want to control
the decimal places of numbers to maintain consistency across datasets.
In this article, we will walk you through a detailed explanation of how to convert
input expressions to fixed-point numbers using PySpark. Note that PySpark’s
fixed-point function, when given a NULL input, will output NULL.
Understanding Fixed-Point Numbers
Before we get started, it’s essential to understand what fixed-point numbers are. A
fixed-point number has a specific number of digits before and after the decimal
point. Unlike floating-point numbers, where the decimal point can ‘float’, in fixed-
point numbers, the decimal point is ‘fixed’.
PySpark’s Fixed-Point Function
PySpark uses the cast function combined with the DecimalType function to
convert an expression to a fixed-point number. DecimalType allows you to specify
the total number of digits as well as the number of digits after the decimal point.
Here is the syntax for converting an expression to a fixed-point number:
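Both the syntax snippet and the sample-DataFrame code are missing from this copy; a sketch consistent with the surrounding text and the output below would be:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.appName("fixed-point").getOrCreate()

# General syntax: cast a column to a fixed-point (decimal) type with
# `precision` total digits and `scale` digits after the decimal point:
#   df.withColumn("col_name", col("col_name").cast(DecimalType(precision, scale)))

# Sample data matching the output below.
data = [("Sachin", 10.123456), ("James", 20.987654), ("Smitha", 30.111111), (None, None)]
df = spark.createDataFrame(data, ["Name", "Score"])
df.show()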
+-------+---------+
| Name| Score|
+-------+---------+
| Sachin|10.123456|
| James|20.987654|
|Smitha |30.111111|
| null| null|
+-------+---------+
Next, let’s convert the ‘Score’ column to a fixed-point number with a total of 5
digits, 2 of which are after the decimal point:
df = df.withColumn("Score", col("Score").cast(DecimalType(5, 2)))
df.show()
+-------+-----+
| Name|Score|
+-------+-----+
| Sachin|10.12|
| James|20.99|
|Smitha |30.11|
| null| null|
+-------+-----+
The score column values are now converted into fixed-point numbers. Notice how
the NULL value remained NULL after the conversion, which adheres to PySpark’s
rule of NULL input leading to NULL output.
Next, suppose we have a DataFrame with the following timestamps, and we want to compute the next business day for each one (skipping Sundays):
+-------------------+
|Timestamp |
+-------------------+
|2023-01-14 13:45:30|
|2023-02-25 08:20:00|
|2023-07-07 22:15:00|
|2023-07-08 22:15:00|
+-------------------+
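The transformation code is missing from this copy. Based on the explanation further below (date_add, date_format, and when), and assuming the SparkSession created earlier, a sketch that recreates the sample data above and reproduces the Next_Day column is:

from pyspark.sql import functions as F

# Recreate the timestamps shown above.
data = [("2023-01-14 13:45:30",), ("2023-02-25 08:20:00",),
        ("2023-07-07 22:15:00",), ("2023-07-08 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"]) \
          .withColumn("Timestamp", F.to_timestamp("Timestamp"))

# Add one day; if that lands on a Sunday, push it forward to Monday.
next_day_col = F.date_add(F.col("Timestamp"), 1)
df.withColumn(
    "Next_Day",
    F.when(F.date_format(next_day_col, "E") == "Sun", F.date_add(next_day_col, 1))
     .otherwise(next_day_col),
).show(truncate=False)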
Result
+-------------------+----------+
|Timestamp |Next_Day |
+-------------------+----------+
|2023-01-14 13:45:30|2023-01-16|
|2023-02-25 08:20:00|2023-02-27|
|2023-07-07 22:15:00|2023-07-08|
|2023-07-08 22:15:00|2023-07-10|
+-------------------+----------+
In the Next_Day column, you’ll see that if the next day would have been a Sunday,
it has been replaced with the following Monday.
The use of date_add, date_format, and conditional logic with when function
enables us to easily compute the next business day from a given date or timestamp,
while excluding non-working days like Sundays.
PySpark : Getting the Next and Previous Day from a Timestamp
In data processing and analysis, there can often arise situations where you
might need to compute the next day or the previous day from a given date
or timestamp. This article will guide you through the process of
accomplishing these tasks using PySpark, the Python library for Apache
Spark. Detailed examples will be provided to ensure a clear understanding
of these operations.
Setting Up the Environment
Firstly, we need to set up our PySpark environment. Assuming you have
properly installed Spark and PySpark, you can initialize a SparkSession as
follows:
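Both the session-initialization and DataFrame-creation snippets are missing from this copy; based on the output below, they would have been roughly:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize the SparkSession (the application name is arbitrary).
spark = SparkSession.builder.appName("next-previous-day").getOrCreate()

# Sample timestamps matching the output below.
data = [("2023-01-15 13:45:30",), ("2023-02-22 08:20:00",), ("2023-07-07 22:15:00",)]
df = spark.createDataFrame(data, ["Timestamp"]) \
          .withColumn("Timestamp", F.to_timestamp("Timestamp"))
df.show(truncate=False)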
+-------------------+
|Timestamp |
+-------------------+
|2023-01-15 13:45:30|
|2023-02-22 08:20:00|
|2023-07-07 22:15:00|
+-------------------+
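The next-day computation itself is missing; date_add with an increment of 1 reproduces the Next_Day column shown below:

# Add one calendar day to each timestamp.
df.withColumn("Next_Day", F.date_add(F.col("Timestamp"), 1)).show(truncate=False)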
+-------------------+----------+
|Timestamp |Next_Day |
+-------------------+----------+
|2023-01-15 13:45:30|2023-01-16|
|2023-02-22 08:20:00|2023-02-23|
|2023-07-07 22:15:00|2023-07-08|
+-------------------+----------+
The Next_Day column shows the date of the day after each timestamp.
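For the previous day, date_sub (equivalent to date_add with -1) gives the Previous_Day column shown below:

# Subtract one calendar day from each timestamp.
df.withColumn("Previous_Day", F.date_sub(F.col("Timestamp"), 1)).show(truncate=False)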
+-------------------+------------+
|Timestamp |Previous_Day|
+-------------------+------------+
|2023-01-15 13:45:30|2023-01-14 |
|2023-02-22 08:20:00|2023-02-21 |
|2023-07-07 22:15:00|2023-07-06 |
+-------------------+------------+
Now consider finding the last day of the month. Assume we have the following DataFrame of timestamps:
+-------------------+
|Timestamp |
+-------------------+
|2023-01-15 13:45:30|
|2023-02-22 08:20:00|
|2023-07-07 22:15:00|
+-------------------+
Now, we can use the last_day function to get the last day of the month for each
timestamp:
df.withColumn("Last_Day_of_Month", F.last_day(F.col("Timestamp"))).show(truncate=False)
+-------------------+-----------------+
|Timestamp |Last_Day_of_Month|
+-------------------+-----------------+
|2023-01-15 13:45:30|2023-01-31 |
|2023-02-22 08:20:00|2023-02-28 |
|2023-07-07 22:15:00|2023-07-31 |
+-------------------+-----------------+
The new Last_Day_of_Month column shows the last day of the month for each
corresponding timestamp.
Getting the Last Day of the Year
Determining the last day of the year is slightly more complex, as there isn’t a built-
in function for this in PySpark. However, we can accomplish it by combining the
year function with some string manipulation. Here’s how:
df.withColumn("Year", F.year(F.col("Timestamp")))\
.withColumn("Last_Day_of_Year", F.expr("make_date(Year, 12, 31)"))\
.show(truncate=False)
In the code above, we first extract the year from the timestamp using the year
function. Then, we construct a new date representing the last day of that year using
the make_date function. The make_date function creates a date from the year,
month, and day values.
While PySpark's last_day function makes it straightforward to determine the last day of the month for a given date or timestamp, finding the last day of the year requires a bit more creativity. By combining the year and make_date functions, however, you can achieve this with relative ease.
+-------------------+----+----------------+
|Timestamp |Year|Last_Day_of_Year|
+-------------------+----+----------------+
|2023-01-15 13:45:30|2023|2023-12-31 |
|2023-02-22 08:20:00|2023|2023-12-31 |
|2023-07-07 22:15:00|2023|2023-12-31 |
+-------------------+----+----------------+
This article will explain how to add or subtract a specific number of months
from a date or timestamp while preserving end-of-month information. This
is especially useful when dealing with financial, retail, or similar data, where
preserving the end-of-month status of a date is critical.
After initializing a SparkSession as before, assume we have the following DataFrame of dates:
+----------+
| Date|
+----------+
|2023-01-31|
|2023-02-28|
|2023-07-15|
+----------+
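The add_months call itself is missing; a sketch that recreates the dates shown above and reproduces the New_Date column below is:

from pyspark.sql import functions as F

# Recreate the sample dates shown above.
data = [("2023-01-31",), ("2023-02-28",), ("2023-07-15",)]
df = spark.createDataFrame(data, ["Date"]).withColumn("Date", F.to_date("Date"))

# Add two months to each date.
df.withColumn("New_Date", F.add_months(F.col("Date"), 2)).show()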
+----------+----------+
| Date| New_Date|
+----------+----------+
|2023-01-31|2023-03-31|
|2023-02-28|2023-04-28|
|2023-07-15|2023-09-15|
+----------+----------+
Note how the dates originally at the end of a month are still at the end of
the month in the New_Date column.
Subtracting Months
Subtracting months is as simple as adding months. We simply use a
negative number as the second parameter to the add_months function:
+----------+----------+
| Date| New_Date|
+----------+----------+
|2023-01-31|2022-11-30|
|2023-02-28|2022-12-28|
|2023-07-15|2023-05-15|
+----------+----------+
The same functions work on timestamp columns. Assume the following DataFrame of timestamps:
+-------------------+
|Timestamp |
+-------------------+
|2023-01-31 13:45:30|
|2023-02-28 08:20:00|
|2023-07-15 22:15:00|
+-------------------+
Adding two months to the Timestamp column gives:
+-------------------+-------------+
|Timestamp |New_Timestamp|
+-------------------+-------------+
|2023-01-31 13:45:30|2023-03-31 |
|2023-02-28 08:20:00|2023-04-28 |
|2023-07-15 22:15:00|2023-09-15 |
+-------------------+-------------+
And subtracting two months gives:
+-------------------+-------------+
|Timestamp |New_Timestamp|
+-------------------+-------------+
|2023-01-31 13:45:30|2022-11-30 |
|2023-02-28 08:20:00|2022-12-28 |
|2023-07-15 22:15:00|2023-05-15 |
+-------------------+-------------+
To illustrate these join operations, we will use two sample data frames –
‘freshers_personal_details’ and ‘freshers_academic_details’.
Sample Data
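The sample-data code was not preserved in this copy. Reconstructing it from the join outputs below (the truncated "Electrical Engine..." value is assumed to be "Electrical Engineering"), the two DataFrames would have been created roughly as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-examples").getOrCreate()

# Personal details for five freshers.
freshers_personal_details = spark.createDataFrame([
    ('1', 'Sachin', 'New York'),
    ('2', 'Shekar', 'Bangalore'),
    ('3', 'Antony', 'Chicago'),
    ('4', 'Sharat', 'Delhi'),
    ('5', 'Vijay', 'London'),
], ['Id', 'Name', 'City'])

# Academic details; Ids 6 and 7 have no matching personal record.
freshers_academic_details = spark.createDataFrame([
    ('1', 'Computer Science', 'MIT', 3.8),
    ('2', 'Electrical Engineering', 'Stanford', 3.5),
    ('3', 'Physics', 'Princeton', 3.9),
    ('6', 'Mathematics', 'Harvard', 3.7),
    ('7', 'Chemistry', 'Yale', 3.6),
], ['Id', 'Major', 'University', 'GPA'])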
We have ‘Id’ as a common column between the two data frames which we will use as a key for joining.
Inner Join
The inner join in PySpark returns rows from both data frames where key records of
the first data frame match the key records of the second data frame.
inner_join_df = freshers_personal_details.join(freshers_academic_details,
on=['Id'], how='inner')
inner_join_df.show()
Output
+---+------+---------+--------------------+----------+---+
| Id| Name| City| Major|University|GPA|
+---+------+---------+--------------------+----------+---+
| 1|Sachin| New York| Computer Science| MIT|3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford|3.5|
| 3|Antony| Chicago| Physics| Princeton|3.9|
+---+------+---------+--------------------+----------+---+
left_join_df = freshers_personal_details.join(freshers_academic_details,
on=['Id'], how='left')
left_join_df.show()
Output
+---+------+---------+--------------------+----------+----+
| Id| Name| City| Major|University| GPA|
+---+------+---------+--------------------+----------+----+
| 1|Sachin| New York| Computer Science| MIT| 3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford| 3.5|
| 3|Antony| Chicago| Physics| Princeton| 3.9|
| 5| Vijay| London| null| null|null|
| 4|Sharat| Delhi| null| null|null|
+---+------+---------+--------------------+----------+----+
right_join_df = freshers_personal_details.join(freshers_academic_details,
on=['Id'], how='right')
right_join_df.show()
Output
+---+------+---------+--------------------+----------+---+
| Id| Name| City| Major|University|GPA|
+---+------+---------+--------------------+----------+---+
| 1|Sachin| New York| Computer Science| MIT|3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford|3.5|
| 7| null| null| Chemistry| Yale|3.6|
| 3|Antony| Chicago| Physics| Princeton|3.9|
| 6| null| null| Mathematics| Harvard|3.7|
+---+------+---------+--------------------+----------+---+
full_outer_join_df = freshers_personal_details.join(freshers_academic_details,
on=['Id'], how='outer')
full_outer_join_df.show()
Output
+---+------+---------+--------------------+----------+----+
| Id| Name| City| Major|University| GPA|
+---+------+---------+--------------------+----------+----+
| 1|Sachin| New York| Computer Science| MIT| 3.8|
| 2|Shekar|Bangalore|Electrical Engine...| Stanford| 3.5|
| 3|Antony| Chicago| Physics| Princeton| 3.9|
| 4|Sharat| Delhi| null| null|null|
| 5| Vijay| London| null| null|null|
| 6| null| null| Mathematics| Harvard| 3.7|
| 7| null| null| Chemistry| Yale| 3.6|
+---+------+---------+--------------------+----------+----+
left_semi_join_df = freshers_personal_details.join(freshers_academic_details,
on=['Id'], how='leftsemi')
left_semi_join_df.show()
+---+------+---------+
| Id| Name| City|
+---+------+---------+
| 1|Sachin| New York|
| 2|Shekar|Bangalore|
| 3|Antony| Chicago|
+---+------+---------+
left_anti_join_df = freshers_personal_details.join(freshers_academic_details,
on=['Id'], how='leftanti')
left_anti_join_df.show()
Output
+---+------+------+
| Id| Name| City|
+---+------+------+
| 5| Vijay|London|
| 4|Sharat| Delhi|
+---+------+------+
freshers_additional_details = spark.createDataFrame([
('1', 'Sachin', 'Python'),
('2', 'Shekar', 'Java'),
('3', 'Sanjo', 'C++'),
('6', 'Rakesh', 'Scala'),
('7', 'Sorya', 'JavaScript'),
], ['Id', 'Name', 'Programming_Language'])
# Perform inner join based on multiple conditions
multi_condition_join_df = freshers_personal_details.join(
freshers_additional_details,
(freshers_personal_details['Id'] == freshers_additional_details['Id']) &
(freshers_personal_details['Name'] == freshers_additional_details['Name']),
how='inner'
)
multi_condition_join_df.show()
Output
+---+------+---------+---+------+--------------------+
| Id| Name| City| Id| Name|Programming_Language|
+---+------+---------+---+------+--------------------+
| 1|Sachin| New York| 1|Sachin| Python|
| 2|Shekar|Bangalore| 2|Shekar| Java|
+---+------+---------+---+------+--------------------+
Note: when working with larger datasets, the choice of join types and the order of operations can have a significant impact on the performance of the Spark application.
pyspark.sql.functions.reverse
Collection function: returns a reversed string or an array with reverse order of
elements.
In order to reverse the order of lists in a dataframe column, we can use the PySpark
function reverse() from pyspark.sql.functions. Here’s an example.
Let’s start by creating a sample dataframe with a list of strings.
Output
+-------+--------------------+
| Name| Techstack|
+-------+--------------------+
| Sachin| [Python, C, Go]|
|Renjith|[RedShift, Snowfl...|
| Ahamed|[Android, MacOS, ...|
+-------+--------------------+
Now, we can apply the reverse() function to the “Techstack” column to reverse the
order of the list.
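The snippet that applies reverse() was lost; assuming the DataFrame above is named df, it would have been essentially:

from pyspark.sql.functions import reverse, col

# Reverse the order of the elements in each Techstack array.
df = df.withColumn("Techstack", reverse(col("Techstack")))
df.show()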
Output
+-------+--------------------+
| Name| Techstack|
+-------+--------------------+
| Sachin| [Go, C, Python]|
|Renjith|[Oracle, Snowflak...|
| Ahamed|[Windows, MacOS, ...|
+-------+--------------------+
As you can see, the order of the elements in each list in the "Techstack" column has been reversed. The withColumn() function is used to add a new column or replace an existing column (with the same name) in the dataframe. Here, we are replacing the "Techstack" column with a new column where the lists have been reversed.
PySpark : Reversing the order of strings in a list using PySpark
We will use the built-in Python function reversed() inside a map operation
to reverse the order of each string. reversed() returns a reverse iterator, so
we have to join it back into a string with ”.join().
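The code block is missing here; below is a sketch of the map operation being described. The sample strings are hypothetical, since the original data was not preserved:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reverse-strings").getOrCreate()
sc = spark.sparkContext

# Hypothetical sample data.
rdd = sc.parallelize(["Spark", "PySpark", "Hadoop"])

# reversed() returns an iterator, so join it back into a string.
reversed_rdd = rdd.map(lambda x: ''.join(reversed(x)))

print(reversed_rdd.collect())   # ['krapS', 'krapSyP', 'poodaH']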
The lambda function here is a simple anonymous function that takes one
argument, x, and returns the reversed string. x is each element of the RDD
(each string in this case).
After this operation, we have a new RDD where each string from the
original RDD has been reversed. You can collect the results back to the
driver program using the collect() action.
As you can see, the order of characters in each string from the list has
been reversed. Note that Spark operations are lazily evaluated, meaning
the actual computations (like reversing the strings) only happen when an
action (like collect()) is called. This feature allows Spark to optimize the
overall data processing workflow.
Drawbacks:
1. Collisions: While the possibility is reduced, hash collisions can still occur where different inputs
produce the same hash output.
2. Not for Security: A plain hash value is not suitable for security-sensitive purposes; common inputs can often be recovered through brute force or precomputed lookup tables.
3. Data Loss: Hashing is a one-way function. Once data is hashed, it cannot be converted back to the
original input.
PySpark : Create an MD5 hash of a certain string column in PySpark
Introduction to MD5 Hash
MD5 (Message Digest Algorithm 5) is a widely used cryptographic hash function
that produces a 128-bit (16-byte) hash value. It is commonly used to check the
integrity of files. However, MD5 is not collision-resistant; as of 2021, it is possible
to find different inputs that hash to the same output, which makes it unsuitable for
functions such as SSL certificates or encryption that require a high degree of
security.
An MD5 hash is typically expressed as a 32-digit hexadecimal number.
Use of MD5 Hash in PySpark
Yes, you can use PySpark to generate a 32-character hex-encoded string containing the 128-bit MD5 message digest. Recent PySpark versions also ship a built-in md5 function in pyspark.sql.functions, but you can just as easily use Python's hashlib library in a User Defined Function (UDF), which is the approach shown here.
Here is how you can create an MD5 hash of a certain string column in PySpark.
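Only the final df_hashed.show(20,False) line of the original code block survived; a sketch of the UDF-based approach described in the paragraph below (the sample names match the output shown further down) is:

import hashlib

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("md5-hash").getOrCreate()
df = spark.createDataFrame([("John",), ("Jane",), ("Mike",)], ["Name"])

# Plain Python function returning the 32-character hex MD5 digest of a string.
def md5_hash(value):
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Wrap it as a UDF and apply it to the Name column.
md5_udf = udf(md5_hash, StringType())
df_hashed = df.withColumn("Name_hashed", md5_udf(df["Name"]))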
df_hashed.show(20,False)
In this example, we first create a Spark session and a DataFrame df with a single
column “Name”. Then, we define the function md5_hash to generate an MD5 hash
of an input string. After that, we create a user-defined function (UDF) md5_udf
using PySpark SQL functions. Finally, we apply this UDF to the column “Name”
in the DataFrame df and create a new DataFrame df_hashed with the MD5 hashed
values of the names.
Output
+----+--------------------------------+
|Name|Name_hashed |
+----+--------------------------------+
|John|61409aa1fd47d4a5332de23cbf59a36f|
|Jane|2b95993380f8be6bd4bd46bf44f98db9|
|Mike|1b83d5da74032b6a750ef12210642eea|
+----+--------------------------------+
import base64

def base64_encode(input):
    # Encode a UTF-8 string as Base64 text; return None if encoding fails.
    try:
        return base64.b64encode(input.encode('utf-8')).decode('utf-8')
    except Exception:
        return None
The BASE64_ENCODE function is a handy tool for preserving binary data integrity when it needs to be
stored and transferred over systems that are designed to handle text.
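The DataFrame and UDF-application code did not survive. The plain email addresses were also redacted in this copy, but they can be recovered by decoding the Base64 output below, so a sketch consistent with that output is:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("base64-encode").getOrCreate()

# Sample data; the Email values are decoded from the Base64 output below.
df = spark.createDataFrame([
    ("Sachin", "Tendulkar", "sachin.tendulkar@freshers.in"),
    ("Mahesh", "Babu", "mahesh.babu@freshers.in"),
    ("Mohan", "Lal", "mohan.lal@freshers.in"),
], ["First Name", "Last Name", "Email"])
df.show(truncate=False)

# Wrap base64_encode (defined above) as a UDF and add the encoded column.
base64_udf = udf(base64_encode, StringType())
df_encoded = df.withColumn("Encoded Email", base64_udf(df["Email"]))
df_encoded.show(truncate=False)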
Output
+----------+---------+----------------------------+
|First Name|Last Name|Email |
+----------+---------+----------------------------+
|Sachin |Tendulkar|[email protected]|
|Mahesh |Babu |[email protected] |
|Mohan |Lal |[email protected] |
+----------+---------+----------------------------+
+----------+---------+----------------------------+----------------------------------------+
|First Name|Last Name|Email                       |Encoded Email                           |
+----------+---------+----------------------------+----------------------------------------+
|Sachin    |Tendulkar|[email protected]           |c2FjaGluLnRlbmR1bGthckBmcmVzaGVycy5pbg==|
|Mahesh    |Babu     |[email protected]           |bWFoZXNoLmJhYnVAZnJlc2hlcnMuaW4=        |
|Mohan     |Lal      |[email protected]           |bW9oYW4ubGFsQGZyZXNoZXJzLmlu            |
+----------+---------+----------------------------+----------------------------------------+
In this script, we first create a SparkSession, which is the entry point to any
functionality in Spark. We then create a DataFrame with some sample
data.
The base64_encode function takes an input string and returns the Base64
encoded version of the string. We then create a user-defined function
(UDF) out of this, which can be applied to our DataFrame.
Finally, we create a new DataFrame, df_encoded, which includes a new
column ‘Encoded Email’. This column is the result of applying our UDF to
the ‘Email’ column of the original DataFrame.
When you run the df.show() and df_encoded.show(), it will display the
original and the base64 encoded DataFrames respectively.
Time series data often involves handling and manipulating dates. Apache Spark,
through its PySpark interface, provides an arsenal of date-time functions that
simplify this task. One such function is next_day(), a powerful function used to
find the next specified day of the week from a given date. This article will provide
an in-depth look into the usage and application of the next_day() function in
PySpark.
The next_day() function takes two arguments: a date and a day of the week. The
function returns the next specified day after the given date. For instance, if the
given date is a Monday and the specified day is ‘Thursday’, the function will return
the date of the coming Thursday.
The next_day() function recognizes the day of the week case-insensitively, and
both in full (like ‘Monday’) and abbreviated form (like ‘Mon’).
To begin with, let’s initialize a SparkSession, the entry point to any Spark
functionality.
Create a DataFrame with a single column date filled with some hardcoded date
values.
data = [("2023-07-04",),
("2023-12-31",),
("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()
Output
+----------+
| date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+
Given the dates are in string format, we need to convert them into date type using
the to_date function.
Use the next_day() function to find the next Sunday from the given date.
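Both the to_date conversion and the next_day call are missing from this copy; a sketch of those steps is:

from pyspark.sql import functions as F

# Convert the string column to a date, then find the next Sunday after each date.
df = df.withColumn("date", F.to_date(F.col("date")))
df.withColumn("next_sunday", F.next_day(F.col("date"), "Sunday")).show()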
Result DataFrame
+----------+-----------+
|      date|next_sunday|
+----------+-----------+
|2023-07-04| 2023-07-09|
|2023-12-31| 2024-01-07|
|2022-02-28| 2022-03-06|
+----------+-----------+
Create a DataFrame with a single column called date that contains some hard-
coded date values.
data = [("2023-07-04",),
("2023-12-31",),
("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()
Output
+----------+
| date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+
As our dates are in string format, we need to convert them into date type using
the to_date function.
Let’s use the month() function to extract the month from the date column.
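The conversion and extraction code is missing here; a sketch:

from pyspark.sql import functions as F

# Convert the string column to a date, then extract the month number.
df = df.withColumn("date", F.to_date(F.col("date")))
df = df.withColumn("month", F.month(F.col("date")))
df.show()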
Result
+----------+-----+
|      date|month|
+----------+-----+
|2023-07-04|    7|
|2023-12-31|   12|
|2022-02-28|    2|
+----------+-----+
As you can see, the month column contains the month part of the corresponding
date in the date column. The month() function in PySpark provides a simple and
effective way to retrieve the month part from a date, making it a valuable tool in a
data scientist’s arsenal. This function, along with other date-time functions in
PySpark, simplifies the process of handling date-time data.
PySpark : Calculating the Difference Between Dates with PySpark: The months_between Function
When working with time series data, it is often necessary to calculate the time
difference between two dates. Apache Spark provides an extensive collection of
functions to perform date-time manipulations, and months_between is one of
them. This function computes the number of months between two dates. If the first
date (date1) is later than the second one (date2), the result will be positive.
Notably, if both dates are on the same day of the month, the function will return a
precise whole number. This article will guide you on how to utilize this function in
PySpark.
Firstly, we need to create a SparkSession, which is the entry point to any
functionality in Spark.
Let’s create a DataFrame with hardcoded dates for illustration purposes. We’ll
create two columns, date1 and date2, which will contain our dates in string format.
Output
+----------+----------+
| date1| date2|
+----------+----------+
|2023-07-04|2022-07-04|
|2023-12-31|2022-01-01|
|2022-02-28|2021-02-28|
+----------+----------+
In this DataFrame, date1 is always later than date2. Now, we need to convert the
date strings to date type using the to_date function.
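The to_date conversion and the months_between call are missing; assuming the DataFrame shown above is named df, a sketch that produces the result below is:

from pyspark.sql import functions as F

# Convert both string columns to dates, then compute the month difference.
df = df.withColumn("date1", F.to_date("date1")).withColumn("date2", F.to_date("date2"))
df = df.withColumn("months_between", F.months_between(F.col("date1"), F.col("date2")))
df.show()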
Result
+----------+----------+--------------+
| date1| date2|months_between|
+----------+----------+--------------+
|2023-07-04|2022-07-04| 12.0|
|2023-12-31|2022-01-01| 23.96774194|
|2022-02-28|2021-02-28| 12.0|
+----------+----------+--------------+
# Initialize SparkSession and create a DataFrame with two array columns
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Sample rows (values taken from the output below)
data = [(["java", "c++", "python"], ["python", "java", "scala"]),
        (["javascript", "c#", "java"], ["java", "javascript", "php"]),
        (["ruby", "php", "c++"], ["c++", "ruby", "perl"])]
freshers_in = spark.createDataFrame(data, ["array1", "array2"])
freshers_in.show(truncate=False)
The show() function will display the DataFrame freshers_in, which should look
something like this:
+-------------------+-------------------+
|array1 |array2 |
+-------------------+-------------------+
|[java, c++, python]|[python, java, scala]|
|[javascript, c#, java]|[java, javascript, php]|
|[ruby, php, c++]|[c++, ruby, perl]|
+-------------------+-------------------+
To create a new array column containing unique elements from ‘array1’ and ‘array2’, we can utilize
the concat() function to merge the arrays and the array_distinct() function to extract the unique
elements.
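The transformation code is missing; a sketch using concat and array_distinct as described above:

from pyspark.sql.functions import concat, array_distinct, col

# Merge both arrays, then keep only the distinct elements.
freshers_in = freshers_in.withColumn(
    "unique_elements", array_distinct(concat(col("array1"), col("array2")))
)
freshers_in.show(truncate=False)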
Result
+-------------------+-------------------+-----------------------------------+
|array1 |array2 |unique_elements |
+-------------------+-------------------+-----------------------------------+
|[java, c++, python]|[python, java, scala]|[java, c++, python, scala] |
|[javascript, c#, java]|[java, javascript, php]|[javascript, c#, java, php] |
|[ruby, php, c++]|[c++, ruby, perl]|[ruby, php, c++, perl] |
+-------------------+-------------------+-----------------------------------+
In this article, we’ll walk you through how to extract an array containing the
distinct values from arrays in a column in PySpark. We will demonstrate
this process using some sample data, which you can execute directly.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
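The sample-data step is missing from the capture; a sketch reconstructed from the result shown below:
data = [("James", ["Java", "C++", "Python"]),
        ("Michael", ["Python", "Java", "C++", "Java"]),
        ("Robert", ["CSharp", "VB", "Python", "Java", "Python"])]
df = spark.createDataFrame(data, ["Name", "Languages"])
df.show(truncate=False)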
Result
+-------+----------------------------------+
|Name   |Languages                         |
+-------+----------------------------------+
|James  |[Java, C++, Python]               |
|Michael|[Python, Java, C++, Java]         |
|Robert |[CSharp, VB, Python, Java, Python]|
+-------+----------------------------------+
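The step that builds df2 is missing from the capture; a sketch consistent with the explanation further below (explode each language, then drop duplicate rows):
from pyspark.sql.functions import explode

df2 = df.select("Name", explode("Languages").alias("Languages")).dropDuplicates()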
df2.show(truncate=False)
Result
+-------+---------+
|Name |Languages|
+-------+---------+
|James |Python |
|James |Java |
|James |C++ |
|Michael|Java |
|Robert |Java |
|Robert |CSharp |
|Robert |Python |
|Robert |VB |
|Michael|C++ |
|Michael|Python |
+-------+---------+
Here, the explode function creates a new row for each element in the given
array or map column, and the dropDuplicates function eliminates duplicate
rows.
However, the result is not an array but rather individual rows. To get an
array of distinct values for each person, we can group the data by the
‘Name’ column and use the collect_list function:
from pyspark.sql.functions import collect_list
df3 = df2.groupBy("Name").agg(collect_list("Languages").alias("DistinctLanguages"))
df3.show(truncate=False)
Result
+-------+--------------------------+
|Name |DistinctLanguages |
+-------+--------------------------+
|James |[Python, Java, C++] |
|Michael|[Java, C++, Python] |
|Robert |[Java, CSharp, Python, VB]|
+-------+--------------------------+
If you want to get the list of all the Languages without duplicates, you can do the following:
df4 = df.select(explode(df["Languages"])).dropDuplicates(["col"])
df4.show(truncate=False)
+------+
|col |
+------+
|C++ |
|Python|
|Java |
|CSharp|
|VB |
+------+
This article will focus on a particular use case: returning an array that contains the
matching elements in two input arrays in PySpark. To illustrate this, we’ll use
PySpark’s built-in functions and DataFrame transformations.
PySpark does not provide a direct function to compare arrays and return the
matching elements. However, you can achieve this by utilizing some of its in-built
functions like explode, collect_list, and array_intersect.
Let’s assume we have a DataFrame that has two columns, both of which contain
arrays:
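The DataFrame creation is missing from the capture; a sketch matching the two rows shown in the output below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_intersect

spark = SparkSession.builder.getOrCreate()
data = [(1, ["apple", "banana", "cherry"], ["banana", "cherry", "date"]),
        (2, ["pear", "mango", "peach"], ["mango", "peach", "lemon"])]
df = spark.createDataFrame(data, ["id", "Array1", "Array2"])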
df_with_matching_elements = df.withColumn("MatchingElements",
array_intersect(df.Array1, df.Array2))
df_with_matching_elements.show(20,False)
+---+-----------------------+----------------------+----------------+
|id |Array1 |Array2 |MatchingElements|
+---+-----------------------+----------------------+----------------+
|1 |[apple, banana, cherry]|[banana, cherry, date]|[banana, cherry]|
|2 |[pear, mango, peach] |[mango, peach, lemon] |[mango, peach] |
+---+-----------------------+----------------------+----------------+
PySpark : Creating Ranges in PySpark DataFrame with
Custom Start, End, and Increment Values
PySpark does not ship a built-in function that creates an array sequence from an arbitrary start, end, and increment value. The range function is available, but it only works with integer values; for float increments PySpark doesn't provide such an option. As a workaround, we can apply a UDF (User-Defined Function) that builds a list from start_val to end_val in steps of increment_val. Here's how to do it:
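The code block did not survive the capture; a minimal sketch consistent with the result below (integer sample values; the UDF name is assumed):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10, 2), (3, 6, 1), (10, 20, 5)],
                           ["start_val", "end_val", "increment_val"])

# UDF that builds an inclusive list from start_val to end_val in steps of increment_val
@udf(returnType=ArrayType(IntegerType()))
def build_range(start, end, increment):
    values, current = [], start
    while current <= end:
        values.append(current)
        current += increment
    return values

df = df.withColumn("range", build_range(col("start_val"), col("end_val"), col("increment_val")))
df.show(truncate=False)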
This will create a new column called range in the DataFrame that contains a list
from start_val to end_val with increments of increment_val.
Result
+---------+-------+-------------+------------------+
|start_val|end_val|increment_val|range |
+---------+-------+-------------+------------------+
|1 |10 |2 |[1, 3, 5, 7, 9] |
|3 |6 |1 |[3, 4, 5, 6] |
|10 |20 |5 |[10, 15, 20] |
+---------+-------+-------------+------------------+
Remember that using Python UDFs might have a performance impact when
dealing with large volumes of data, as data needs to be moved from the JVM to
Python, which is an expensive operation. It is usually a good idea to profile your
Spark application and ensure the performance is acceptable.
A second option exists (it is not suggested and is mentioned just for your information).
If you want to prepend an element to the array only when the array contains a
specific word, you can achieve this with the help of PySpark’s when() and
otherwise() functions along with array_contains(). The when() function allows you
to specify a condition, the array_contains() function checks if an array contains a
certain value, and the otherwise() function allows you to specify what should
happen if the condition is not met.
Here is the example to prepend an element only when the array contains the word
“four”.
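The example code is missing from the capture; a sketch that reproduces the source data and output shown below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, array_contains, concat, array, lit, col

spark = SparkSession.builder.getOrCreate()
data = [("fruits", ["apple", "banana", "cherry", "date", "elderberry"]),
        ("numbers", ["one", "two", "three", "four", "five"]),
        ("colors", ["red", "blue", "green", "yellow", "pink"])]
df = spark.createDataFrame(data, ["Category", "Items"])

# Prepend "zero" only when the array contains the word "four"
df_out = df.withColumn(
    "Items",
    when(array_contains(col("Items"), "four"), concat(array(lit("zero")), col("Items")))
    .otherwise(col("Items"))
)
df_out.show(truncate=False)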
Source Data
+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+
Output
+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[zero, one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+
Let’s first create a PySpark DataFrame with an array column to use in the
demonstration:
+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+
The lit() function is used to create a column of literal value. The array()
function is used to create an array with the literal value, and the concat()
function is used to concatenate two arrays.
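A sketch of that unconditional prepend, assuming the df created earlier:
from pyspark.sql.functions import concat, array, lit

df.withColumn("Items", concat(array(lit("zero")), df["Items"])).show(truncate=False)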
+--------+-----------------------------------------------+
|Category|Items |
+--------+-----------------------------------------------+
|fruits |[zero, apple, banana, cherry, date, elderberry]|
|numbers |[zero, one, two, three, four, five] |
|colors |[zero, red, blue, green, yellow, pink] |
+--------+-----------------------------------------------+
PySpark : Finding the Index of the First Occurrence of an
Element in an Array in PySpark
This article will walk you through the steps on how to find the index of the first
occurrence of an element in an array in PySpark with a working example.
Installing PySpark
Before we get started, you’ll need to have PySpark installed. You can install it via
pip:
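The command is the usual pip install:
pip install pyspark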
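The DataFrame creation is missing from the capture; a sketch matching the source data below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("fruits", ["apple", "banana", "cherry", "date", "elderberry"]),
        ("numbers", ["one", "two", "three", "four", "five"]),
        ("colors", ["red", "blue", "green", "yellow", "pink"])]
df = spark.createDataFrame(data, ["Category", "Items"])
df.show(truncate=False)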
Source data
+--------+-----------------------------------------+
|Category|Items |
+--------+-----------------------------------------+
|fruits |[apple, banana, cherry, date, elderberry]|
|numbers |[one, two, three, four, five] |
|colors |[red, blue, green, yellow, pink] |
+--------+-----------------------------------------+
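The UDF definition is missing from the capture; a sketch consistent with the description below (the name find_index is assumed):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Hypothetical UDF: returns the index of item in arr, or None if it is not present
@udf(returnType=IntegerType())
def find_index(arr, item):
    try:
        return arr.index(item)
    except ValueError:
        return None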
This UDF takes two arguments: an array and an item. It tries to return the index of
the item in the array. If the item is not found, it returns None.
Applying the UDF
To pass a literal value to the UDF, you should use the lit function from
pyspark.sql.functions. Here’s how you should modify your code:
Finally, we’ll apply the UDF to our DataFrame to find the index of an element.
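A sketch of that step, using the hypothetical find_index UDF from above:
from pyspark.sql.functions import lit

df_with_index = df.withColumn("ItemIndex", find_index(df["Items"], lit("three")))
df_with_index.show(truncate=False)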
Final Output
+--------+-----------------------------------------+---------+
|Category|Items |ItemIndex|
+--------+-----------------------------------------+---------+
|fruits |[apple, banana, cherry, date, elderberry]|null |
|numbers |[one, two, three, four, five] |2 |
|colors |[red, blue, green, yellow, pink] |null |
+--------+-----------------------------------------+---------+
This will add a new column to the DataFrame, “ItemIndex”, that contains the index
of the first occurrence of “three” in the “Items” column. If “three” is not found in
an array, the corresponding entry in the “ItemIndex” column will be null.
lit(“three”) creates a Column of literal value “three”, which is then passed to the UDF. This ensures that
the UDF correctly interprets “three” as a string value, not a column name.
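The DataFrame creation did not survive the capture; a sketch matching the result below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "type1", "value1"), (1, "type2", "value2"),
        (2, "type1", "value3"), (2, "type2", "value4")]
df = spark.createDataFrame(data, ["id", "type", "value"])
df.show()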
Result
+---+-----+------+
| id| type| value|
+---+-----+------+
| 1|type1|value1|
| 1|type2|value2|
| 2|type1|value3|
| 2|type2|value4|
+---+-----+------+
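The pivot code is missing from the capture; a sketch consistent with the explanation below:
from pyspark.sql.functions import collect_list

pivoted = df.groupBy("id").pivot("type").agg(collect_list("value"))
pivoted.show(truncate=False)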
Final Output
In this example, groupBy(“id”) groups the DataFrame by ‘id’, pivot(“type”) pivots
the ‘type’ column, and agg(collect_list(“value”)) collects the ‘value’ column into
an array for each group. The resulting DataFrame will have one row for each
unique ‘id’, and a column for each unique ‘type’, with the values in these columns
being arrays of the corresponding ‘value’ entries.
‘collect_list’ collects all values including duplicates. If you want to collect only
unique values, use ‘collect_set’ instead.
PySpark : Extract values from JSON strings within a
DataFrame in PySpark [json_tuple]
pyspark.sql.functions.json_tuple
PySpark provides a powerful function called json_tuple that allows you to extract
values from JSON strings within a DataFrame. This function is particularly useful
when you’re working with JSON data and need to retrieve specific values or
attributes from the JSON structure. In this article, we will explore the json_tuple
function in PySpark and demonstrate its usage with an example.
Understanding json_tuple
The json_tuple function in PySpark extracts the values of specified attributes from
JSON strings within a DataFrame. It takes two or more arguments: the first
argument is the input column containing JSON strings, and the subsequent
arguments are the attribute names you want to extract from the JSON.
The json_tuple function returns a tuple of columns, where each column represents
the extracted value of the corresponding attribute from the JSON string.
Example Usage
Let’s dive into an example to understand how to use json_tuple in PySpark.
Consider the following sample data:
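The sample-data code is missing from the capture; a sketch matching the output below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [('{"name": "Sachin", "age": 30}',),
        ('{"name": "Narendra", "age": 25}',),
        ('{"name": "Jacky", "age": 40}',)]
df = spark.createDataFrame(data, ["json_data"])
df.show(truncate=False)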
Output:
+-------------------------------+
|json_data                      |
+-------------------------------+
|{"name": "Sachin", "age": 30}  |
|{"name": "Narendra", "age": 25}|
|{"name": "Jacky", "age": 40}   |
+-------------------------------+
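The extraction step is missing from the capture; a sketch consistent with the explanation below:
from pyspark.sql.functions import json_tuple

extracted_data = df.select(json_tuple(df["json_data"], "name", "age").alias("name", "age"))
extracted_data.show(truncate=False)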
Output
+--------+---+
|name    |age|
+--------+---+
|Sachin  |30 |
|Narendra|25 |
|Jacky   |40 |
+--------+---+
In the above code, we use the json_tuple function to extract the ‘name’ and ‘age’
attributes from the ‘json_data’ column. We specify the attribute names as
arguments to json_tuple (‘name’ and ‘age’), and use the alias method to assign
meaningful column names to the extracted attributes.
The resulting extracted_data DataFrame contains two columns: ‘name’ and ‘age’
with the extracted values from the JSON strings.
The json_tuple function in PySpark is a valuable tool for working with JSON data
in DataFrames. It allows you to extract specific attributes or values from JSON
strings efficiently. By leveraging the power of json_tuple, you can easily process
and analyze JSON data within your PySpark pipelines, gaining valuable insights
from structured JSON information.
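The code block did not survive the capture; a sketch consistent with the explanation and output below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import cbrt, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (8,), (27,), (64,)], ["value"])

# Apply the cube root transformation to the 'value' column
transformed_df = df.withColumn("cbrt_value", cbrt(col("value")))
transformed_df.show()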
Output
+-----+----------+
|value|cbrt_value|
+-----+----------+
| 1| 1.0|
| 8| 2.0|
| 27| 3.0|
| 64| 4.0|
+-----+----------+
We import the cbrt function from pyspark.sql.functions. Then, we use the cbrt()
function directly in the withColumn method to apply the cube root transformation
to the ‘value’ column. The col(‘value’) expression retrieves the column ‘value’,
and cbrt(col(‘value’)) computes the cube root of that column.
Now, the transformed_df DataFrame will contain the expected cube root values in
the ‘cbrt_value’ column.
pyspark.sql.functions.grouping_id(*cols)
This function is valuable when you need to identify the grouping level in data after
performing a group by operation with cube or rollup. In this article, we will delve
into the details of the grouping_id function and its usage with an example.
The grouping_id function signature in PySpark is as follows:
pyspark.sql.functions.grouping_id(*cols)
This function doesn’t require any argument, but it’s often used with columns in a
DataFrame.
The grouping_id function is used in conjunction with the cube or rollup operations,
and it provides an ID to indicate the level of grouping. The more columns the data
is grouped by, the smaller the grouping ID will be.
Example Usage
Let’s go through a simple example to understand the usage of the grouping_id
function.
Suppose we have a DataFrame named df containing three columns: ‘City’,
‘Product’, and ‘Sales’.
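The DataFrame creation is missing from the capture; a sketch matching the result below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("New York", "Apple", 100), ("Los Angeles", "Orange", 200),
        ("New York", "Banana", 150), ("Los Angeles", "Apple", 120),
        ("New York", "Orange", 75), ("Los Angeles", "Banana", 220)]
df = spark.createDataFrame(data, ["City", "Product", "Sales"])
df.show()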
Result : DataFrame
+-----------+-------+-----+
| City|Product|Sales|
+-----------+-------+-----+
| New York| Apple| 100|
|Los Angeles| Orange| 200|
| New York| Banana| 150|
|Los Angeles| Apple| 120|
| New York| Orange| 75|
|Los Angeles| Banana| 220|
+-----------+-------+-----+
Now, let’s perform a cube operation on the ‘City’ and ‘Product’ columns and
compute the total ‘Sales’ for each group. Also, let’s add a grouping_id column to
identify the level of grouping.
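The cube/aggregation code is missing from the capture; a sketch consistent with the output below:
from pyspark.sql.functions import sum as spark_sum, grouping_id

result = (df.cube("City", "Product")
            .agg(spark_sum("Sales").alias("TotalSales"), grouping_id().alias("GroupingID"))
            .orderBy("GroupingID"))
result.show()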
The orderBy function is used here to sort the result by the ‘GroupingID’ column.
The output will look something like this:
+-----------+-------+----------+----------+
| City|Product|TotalSales|GroupingID|
+-----------+-------+----------+----------+
| New York| Banana| 150| 0|
|Los Angeles| Orange| 200| 0|
|Los Angeles| Apple| 120| 0|
| New York| Apple| 100| 0|
| New York| Orange| 75| 0|
|Los Angeles| Banana| 220| 0|
| New York| null| 325| 1|
|Los Angeles| null| 540| 1|
| null| Apple| 220| 2|
| null| Banana| 370| 2|
| null| Orange| 275| 2|
| null| null| 865| 3|
+-----------+-------+----------+----------+
As you can see, the grouping_id function provides a numerical identifier that
describes the level of grouping in the DataFrame, with smaller values
corresponding to more columns being used for grouping.
The grouping_id function is a powerful tool for understanding the level of
grouping in your data when using cube or rollup operations in PySpark. It provides
valuable insights, especially when dealing with complex datasets with multiple
levels of aggregation.
pyspark.sql.functions.exp(col)
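The DataFrame creation is missing from the capture; a sketch matching the values below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)], ["col1"])
df.show()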
Result : DataFrame:
+----+
|col1|
+----+
| 1.0|
| 2.0|
| 3.0|
| 4.0|
| 5.0|
+----+
Now, we wish to compute the exponential of each value in the col1 column. We
can achieve this using the exp function:
from pyspark.sql.functions import exp
df_exp = df.withColumn("col1_exp", exp(df["col1"]))
df_exp.show()
In this code, the withColumn function is utilized to add a new column to the
DataFrame. This new column, col1_exp, will contain the exponential of each value
in the col1 column. The output will resemble the following:
+----+------------------+
|col1| col1_exp|
+----+------------------+
| 1.0|2.7182818284590455|
| 2.0| 7.38905609893065|
| 3.0|20.085536923187668|
| 4.0|54.598150033144236|
| 5.0| 148.4131591025766|
+----+------------------+
As you can see, the col1_exp column now holds the exponential of the values in
the col1 column.
PySpark’s exp function is a beneficial tool for computing the exponential of
numeric data. It is a must-have in the toolkit of data scientists and engineers
dealing with large datasets, as it empowers them to perform complex
transformations with ease.
Function Signature
The encode function signature in PySpark is as follows:
pyspark.sql.functions.encode(col, charset)
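The DataFrame creation is missing from the capture; a sketch matching the rows below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Hello",), ("World",)], ["col1"])
df.show()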
+-----+
|col1 |
+-----+
|Hello|
|World|
+-----+
Now, let’s say we want to encode these strings into a binary format using
the UTF-8 charset. We can do this using the encode function as follows:
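The encode step and its output did not survive the capture; a sketch of how it would look:
from pyspark.sql.functions import encode

df_encoded = df.withColumn("col1_encoded", encode(df["col1"], "UTF-8"))
df_encoded.show()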
PySpark’s encode function is a useful tool for converting string data into
binary format, and it’s incredibly flexible with its ability to support multiple
character sets. It’s a valuable tool for any data scientist or engineer who is
working with large datasets and needs to perform transformations at scale.
Syntax:
The syntax for using date_sub in PySpark is as follows:
date_sub(start_date, days)
Here, start_date represents the initial date from which we want to subtract
days, and days indicates the number of days to subtract.
Example Usage:
To illustrate the usage of date_sub in PySpark, let’s consider a scenario
where we have a dataset containing sales records. We want to analyze
sales data from the past 7 days.
Step 1: Importing the necessary libraries and creating a
SparkSession.
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_sub
# Create a SparkSession
spark = SparkSession.builder \
.appName("date_sub Example at Freshers.in") \
.getOrCreate()
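Step 2 (creating the sample sales DataFrame) is missing from the capture; a sketch matching the result below:
data = [("Product A", "2023-05-15", 100), ("Product B", "2023-05-16", 150),
        ("Product C", "2023-05-17", 200), ("Product D", "2023-05-18", 120),
        ("Product E", "2023-05-19", 90), ("Product F", "2023-05-20", 180),
        ("Product G", "2023-05-21", 210), ("Product H", "2023-05-22", 160)]
df = spark.createDataFrame(data, ["Product", "Date", "Sales"])
df.show()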
Result
+---------+----------+-----+
| Product | Date |Sales|
+---------+----------+-----+
|Product A|2023-05-15| 100|
|Product B|2023-05-16| 150|
|Product C|2023-05-17| 200|
|Product D|2023-05-18| 120|
|Product E|2023-05-19| 90|
|Product F|2023-05-20| 180|
|Product G|2023-05-21| 210|
|Product H|2023-05-22| 160|
+---------+----------+-----+
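The subtraction step is missing from the capture; a sketch matching the result below:
df_with_subtracted = df.withColumn("SubtractedDate", date_sub(df["Date"], 7))
df_with_subtracted.show()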
Result
+---------+----------+-----+--------------+
| Product| Date|Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product A|2023-05-15| 100| 2023-05-08|
|Product B|2023-05-16| 150| 2023-05-09|
|Product C|2023-05-17| 200| 2023-05-10|
|Product D|2023-05-18| 120| 2023-05-11|
|Product E|2023-05-19| 90| 2023-05-12|
|Product F|2023-05-20| 180| 2023-05-13|
|Product G|2023-05-21| 210| 2023-05-14|
|Product H|2023-05-22| 160| 2023-05-15|
+---------+----------+-----+--------------+
Result
+---------+----------+-----+--------------+
| Product | Date |Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product H|2023-05-22| 160| 2023-05-15|
+---------+----------+-----+--------------+
# Create a SparkSession
spark = SparkSession.builder \
.appName("Current Date and Timestamp Example at Freshers.in") \
.getOrCreate()
# Sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "OrderID"])
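The step that adds the two columns is missing from the capture; a sketch consistent with the output below (the name df_with_timestamp reappears later in the article):
from pyspark.sql.functions import current_date, current_timestamp

df_with_timestamp = df.withColumn("CurrentDate", current_date()) \
                      .withColumn("CurrentTimestamp", current_timestamp())
df_with_timestamp.show()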
Output
+-------+------+------------+--------------------+
| Name|OrderID|CurrentDate | CurrentTimestamp |
+-------+------+------------+--------------------+
| Alice| 1| 2023-05-22|2023-05-22 10:15:...|
| Bob| 2| 2023-05-22|2023-05-22 10:15:...|
|Charlie| 3| 2023-05-22|2023-05-22 10:15:...|
+-------+------+------------+--------------------+
As seen in the output, we added two new columns to the DataFrame:
“CurrentDate” and “CurrentTimestamp.” These columns contain the current date
and timestamp for each row in the DataFrame.
Step 3: Filtering data based on the current date.
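The filtering code is missing from the capture; one plausible sketch keeps rows whose CurrentDate equals today's date, which is why every row survives in the output below:
filtered_df = df_with_timestamp.filter(df_with_timestamp["CurrentDate"] == current_date())
filtered_df.show()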
Output:
+-------+------+------------+--------------------+
| Name|OrderID|CurrentDate | CurrentTimestamp |
+-------+------+------------+--------------------+
| Alice| 1| 2023-05-22|2023-05-22 10:15:...|
| Bob| 2| 2023-05-22|2023-05-22 10:15:...|
|Charlie| 3| 2023-05-22|2023-05-22 10:15:...|
+-------+------+------------+--------------------+
# Calculate the time difference between current timestamp and order placement time
df_with_timestamp = df_with_timestamp.withColumn("TimeElapsed",
current_timestamp() - df_with_timestamp.CurrentTimestamp)
Output
+-------+------+------------+--------------------+-------------------+
| Name|OrderID|CurrentDate | CurrentTimestamp | TimeElapsed |
+-------+------+------------+--------------------+-------------------+
| Alice| 1| 2023-05-22|2023-05-22 10:15:...| 00:01:23.456789 |
| Bob| 2| 2023-05-22|2023-05-22 10:15:...| 00:00:45.678912 |
|Charlie| 3| 2023-05-22|2023-05-22 10:15:...| 00:02:10.123456 |
+-------+------+------------+--------------------+-------------------+
In the above code snippet, we calculate the time elapsed between the current
timestamp and the order placement time for each row in the DataFrame. The
resulting column, “TimeElapsed,” shows the duration in the format
‘HH:mm:ss.sss’. This can be useful for analyzing time-based metrics and
understanding the timing patterns of the orders.
In this article, we explored the powerful PySpark functions current_date and
current_timestamp. These functions provide us with the current date and timestamp
within a Spark application, enabling us to perform time-based operations and gain
valuable insights from our data. By incorporating these functions into our PySpark
workflows, we can effectively handle time-related tasks and leverage temporal
information for various data processing and analysis tasks.
PySpark : Understanding the ‘take’ Action in PySpark with
Examples. [Retrieves a specified number of elements from the
beginning of an RDD or DataFrame]
In this article, we will focus on the ‘take’ action, which is commonly used in
PySpark operations. We’ll provide a brief explanation of the ‘take’ action,
followed by a simple example to help you understand its usage.
What is the ‘take’ Action in PySpark?
The ‘take’ action in PySpark retrieves a specified number of elements from the
beginning of an RDD (Resilient Distributed Dataset) or DataFrame. It is an action
operation, which means it triggers the execution of any previous transformations
on the data, returning the result to the driver program. This operation is particularly
useful for previewing the contents of an RDD or DataFrame without having to
collect all the elements, which can be time-consuming and memory-intensive for
large datasets.
Syntax:
take(num)
Where num is the number of elements to retrieve from the RDD or DataFrame.
Simple Example
Let’s go through a simple example using the ‘take’ action in PySpark. First, we’ll
create a PySpark RDD and then use the ‘take’ action to retrieve a specified number
of elements.
RDD Version
Step 1: Start a PySpark session
Before starting with the example, you’ll need to start a PySpark session:
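The session setup is missing from the capture; a minimal version:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()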
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = spark.sparkContext.parallelize(data)
first_five_elements = rdd.take(5)
print("The first five elements of the RDD are:", first_five_elements)
Output:
The first five elements of the RDD are: [1, 2, 3, 4, 5]
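The DataFrame version of the example did not survive the capture; a minimal sketch with hypothetical sample data:
df = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
first_two_rows = df.take(2)
print(first_two_rows)  # [Row(id=1, name='Alice'), Row(id=2, name='Bob')]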
We created a DataFrame with some sample data and used the ‘take’ action to retrieve a specified number
of rows. This operation is useful for previewing the contents of a DataFrame, especially when working
with large datasets.
PySpark : Exploring PySpark's joinByKey on RDD : A Comprehensive Guide
In PySpark, join operations are a fundamental technique for combining data from
two different RDDs based on a common key. Although there isn’t a specific
joinByKey function, PySpark provides several join functions that are applicable to
Key-Value pair RDDs. In this article, we will explore the different types of join
operations available in PySpark and provide a concrete example with hardcoded
values instead of reading from a file.
Types of Join Operations in PySpark
1. join: Performs an inner join between two RDDs based on matching keys.
2. leftOuterJoin: Performs a left outer join between two RDDs, retaining all keys from the left RDD
and matching keys from the right RDD.
3. rightOuterJoin: Performs a right outer join between two RDDs, retaining all keys from the right
RDD and matching keys from the left RDD.
4. fullOuterJoin: Performs a full outer join between two RDDs, retaining all keys from both RDDs.
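The example code and its output were lost in the capture; a minimal sketch of an inner join on Key-Value pair RDDs with hardcoded values:
from pyspark import SparkContext

sc = SparkContext("local", "join example")
rdd_a = sc.parallelize([("store1", 10), ("store2", 20), ("store3", 30)])
rdd_b = sc.parallelize([("store1", "NY"), ("store2", "LA")])

# Inner join: only keys present in both RDDs are kept
joined = rdd_a.join(rdd_b)
print(joined.collect())  # [('store1', (10, 'NY')), ('store2', (20, 'LA'))] (order may vary)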
In this article, we explored the different types of join operations in PySpark for
Key-Value pair RDDs. We provided a concrete example using hardcoded values
for an inner join between two RDDs based on a common key. By leveraging join
operations in PySpark, you can combine data from various sources, enabling more
comprehensive data analysis and insights.
PySpark : Unraveling PySpark's groupByKey: A Comprehensive Guide
Example
Let’s dive into an example to better understand the usage of groupByKey. Suppose
we have a dataset containing sales data for a chain of stores. The data includes
store ID, product ID, and the number of units sold. Our goal is to group the sales
data by store ID.
#Unraveling PySpark's groupByKey: A Comprehensive Guide @ Freshers.in
from pyspark import SparkContext
# Initialize the Spark context
sc = SparkContext("local", "groupByKey @ Freshers.in")
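The rest of the example did not survive the capture; a sketch that continues from the context above with hypothetical sales data:
# (store_id, (product_id, units_sold)) pairs
sales = sc.parallelize([("s1", ("p1", 5)), ("s1", ("p2", 3)), ("s2", ("p1", 7))])

# Group the sales records by store ID
grouped = sales.groupByKey().mapValues(list)
print(grouped.collect())  # e.g. [('s1', [('p1', 5), ('p2', 3)]), ('s2', [('p1', 7)])]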
PySpark : Mastering PySpark's reduceByKey: A Comprehensive Guide
where:
func: The function that will be used to aggregate the values for each key
Example
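The example itself is missing from the capture; a minimal reduceByKey sketch (sample values assumed, existing SparkContext sc reused):
sales = sc.parallelize([("s1", 5), ("s2", 7), ("s1", 3)])
totals = sales.reduceByKey(lambda a, b: a + b)  # func aggregates the values for each key
print(totals.collect())  # [('s1', 8), ('s2', 7)] (order may vary)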
where:
zeroValue: The initial value used for the aggregation (commonly known as the zero value)
func: The function that will be used to aggregate the values for each key
Example
Let’s dive into an example to better understand the usage of foldByKey. Suppose
we have a dataset containing sales data for a chain of stores. The data includes
store ID, product ID, and the number of units sold. Our goal is to calculate the total
units sold for each store.
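The code and output are missing from the capture; a minimal foldByKey sketch (sample values assumed, existing SparkContext sc reused):
# (store_id, units_sold) pairs
sales = sc.parallelize([("s1", 5), ("s1", 3), ("s2", 7)])
totals = sales.foldByKey(0, lambda a, b: a + b)  # 0 is the zeroValue
print(totals.collect())  # [('s1', 8), ('s2', 7)] (order may vary)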
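The RDD creation is missing from the capture; a sketch with hypothetical key-value pairs (an existing SparkContext sc is assumed):
rdd = sc.parallelize([("a", 10), ("a", 20), ("b", 30)])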
Using combineByKey
Now, let’s use the combineByKey method to compute the average value for each
key in the RDD:
def create_combiner(value):
return (value, 1)
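The remaining pieces described below are missing from the capture; a sketch consistent with that description:
def merge_value(acc, value):
    # Update the running (sum, count) accumulator with a new value
    return (acc[0] + value, acc[1] + 1)

def merge_combiners(acc1, acc2):
    # Merge two (sum, count) accumulators for the same key
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

combined = rdd.combineByKey(create_combiner, merge_value, merge_combiners)
averages = combined.mapValues(lambda acc: acc[0] / acc[1])
print(averages.collect())  # e.g. [('a', 15.0), ('b', 30.0)] with the sample data above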
In this example, we used the combineByKey method on the RDD, which requires
three functions as arguments:
1. A function that initializes the accumulator for each key. In our case, it creates a tuple with the value
and a count of 1.
2. merge_value: A function that updates the accumulator for a key with a new value. It takes the
current accumulator and the new value, then updates the sum and count.
3. merge_combiners: A function that merges two accumulators for the same key. It takes two
accumulators and combines their sums and counts.
We then use mapValues to compute the average value for each key by dividing the
sum by the count.
Notes:
Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C.
Here users can control the partitioning of the output RDD.
PySpark : How to convert a sequence of key-value pairs into a dictionary in PySpark
In this article, we will explore the use of collectAsMap in PySpark, a method that
retrieves the key-value pairs from an RDD as a dictionary. We will provide a
detailed example using hardcoded values as input.
First, let’s create a PySpark RDD:
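The RDD creation is missing from the capture; a sketch reconstructed from the dictionary shown below:
from pyspark import SparkContext

sc = SparkContext("local", "collectAsMap example")
rdd = sc.parallelize([("America", 1), ("Botswana", 2), ("Costa Rica", 3),
                      ("Denmark", 4), ("Egypt", 5)])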
Using collectAsMap
Now, let’s use the collectAsMap method to retrieve the key-value pairs from the
RDD as a dictionary:
result_map = rdd.collectAsMap()
print("Result as a Dictionary:")
for key, value in result_map.items():
print(f"{key}: {value}")
In this example, we used the collectAsMap method on the RDD, which returns a
dictionary containing the key-value pairs in the RDD. This can be useful when you
need to work with the RDD data as a native Python dictionary.
Output will be:
Result as a Dictionary:
America: 1
Botswana: 2
Costa Rica: 3
Denmark: 4
Egypt: 5
The resulting dictionary contains the key-value pairs from the RDD, which can
now be accessed and manipulated using standard Python dictionary operations.
Keep in mind that using collectAsMap can cause the driver to run out of memory
if the RDD has a large number of key-value pairs, as it collects all data to the
driver. Use this method judiciously and only when you are certain that the
resulting dictionary can fit into the driver’s memory.
Here, we explored the use of collectAsMap in PySpark, a method that retrieves the
key-value pairs from an RDD as a dictionary. We provided a detailed example
using hardcoded values as input, showcasing how to create an RDD with key-value
pairs, use the collectAsMap method, and interpret the results. collectAsMap can be
useful in various scenarios when you need to work with RDD data as a native
Python dictionary, but it’s important to be cautious about potential memory issues
when using this method on large RDDs.
PySpark : Remove any key-value pair that has a key present in another RDD [subtractByKey]
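The setup for this example is missing from the capture; a sketch grounded in the result shown further below (the values in data2 are arbitrary, since only its keys matter for subtractByKey):
from pyspark import SparkContext

sc = SparkContext("local", "subtractByKey example")
data1 = [("America", 1), ("Botswana", 2), ("Costa Rica", 3), ("Denmark", 4), ("Egypt", 5)]
data2 = [("Botswana", 10), ("Denmark", 20)]  # arbitrary values; only the keys are used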
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
Using subtractByKey
Now, let’s use the subtractByKey method to create a new RDD by removing key-
value pairs from rdd1 that have keys present in rdd2:
result_rdd = rdd1.subtractByKey(rdd2)
result_data = result_rdd.collect()
print("Result of subtractByKey:")
for element in result_data:
print(element)
In this example, we used the subtractByKey method on rdd1 and passed rdd2 as an
argument. The method returns a new RDD containing key-value pairs from rdd1
after removing any pair with a key present in rdd2. The collect method is then used
to retrieve the results.
Interpreting the Results
Result of subtractByKey:
('Costa Rica', 3)
('America', 1)
('Egypt', 5)
The resulting RDD contains key-value pairs from rdd1 with the key-value pairs
having keys “Botswana” and “Denmark” removed, as these keys are present in
rdd2.
In this article, we explored the use of subtractByKey in PySpark, a transformation
that returns an RDD consisting of key-value pairs from one RDD by removing any
pair that has a key present in another RDD. We provided a detailed example using
hardcoded values as input, showcasing how to create two RDDs with key-value
pairs, use the subtractByKey method, and interpret the results. subtractByKey can
be useful in various scenarios, such as filtering out unwanted data based on keys or
performing set-like operations on key-value pair RDDs.
PySpark : Assigning a unique identifier to each element in an
RDD [ zipWithUniqueId in PySpark]
In this article, we will explore the use of zipWithUniqueId in PySpark, a method
that assigns a unique identifier to each element in an RDD. We will provide a
detailed example using hardcoded values as input.
Prerequisites
Python 3.7 or higher
PySpark library
Java 8 or higher
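The RDD creation is missing from the capture; a sketch with hypothetical sample data:
from pyspark import SparkContext

sc = SparkContext("local", "zipWithUniqueId example")
rdd = sc.parallelize(["USA", "INDIA", "CHINA", "JAPAN", "CANADA"])  # hypothetical sample data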
Using zipWithUniqueId
Now, let’s use the zipWithUniqueId method to assign a unique identifier to each
element in the RDD:
unique_id_rdd = rdd.zipWithUniqueId()
unique_id_data = unique_id_rdd.collect()
print("Data with Unique IDs:")
for element in unique_id_data:
print(element)
In this example, we used the zipWithUniqueId method on the RDD, which creates
a new RDD containing tuples of the original elements and their corresponding
unique identifier. The collect method is then used to retrieve the results.
Interpreting the Results
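The setup for the checkpointing example is missing from the capture; a sketch (the checkpoint directory path is assumed):
from pyspark import SparkContext

sc = SparkContext("local", "checkpoint example")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical checkpoint directory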
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
Performing Transformations
Now, let’s apply several transformations to the RDD:
rdd1 = rdd.map(lambda x: x * 2)
rdd2 = rdd1.filter(lambda x: x > 2)
rdd3 = rdd2.map(lambda x: x * 3)
Applying Checkpoint
Next, let’s apply a checkpoint to rdd2:
rdd2.checkpoint()
result = rdd3.collect()
print("Result:", result)
Output
Result: [12, 18, 24, 30]
When executing the collect action on rdd3, PySpark will process the checkpoint for
rdd2. The lineage of rdd3 will now be based on the checkpointed data instead of
the full lineage from the original RDD.
Analyzing the Benefits of Checkpointing
Checkpointing can be helpful in situations where you have a long chain of
transformations, leading to a large lineage graph. A large lineage graph may result
in performance issues due to the overhead of tracking dependencies and can also
cause stack overflow errors during recursive operations.
By applying checkpoints, you can truncate the lineage, reducing the overhead of
tracking dependencies and mitigating the risk of stack overflow errors.
However, checkpointing comes at the cost of writing data to the checkpoint
directory, which can be a slow operation, especially when using distributed file
systems like HDFS. Therefore, it’s essential to use checkpointing judiciously and
only when necessary.
In this article, we explored checkpointing in PySpark, a feature that allows you to
truncate the lineage of RDDs. We provided a detailed example using hardcoded
values as input, showcasing how to create an RDD, apply transformations, set up
checkpointing, and execute an action that triggers the checkpoint. Checkpointing
can be beneficial when dealing with long chains of transformations that may cause
performance issues or stack overflow errors. However, it’s important to consider
the trade-offs and use checkpointing only when necessary, as it can introduce
additional overhead due to writing data to the checkpoint directory.
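The RDD creation is missing from the capture; a sketch reconstructed from the indexed output shown below:
from pyspark import SparkContext

sc = SparkContext("local", "zipWithIndex example")
rdd = sc.parallelize(["USA", "INDIA", "CHINA", "JAPAN", "CANADA"])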
Using zipWithIndex
Now, let’s use the zipWithIndex method to assign an index to each element
in the RDD:
indexed_rdd = rdd.zipWithIndex()
indexed_data = indexed_rdd.collect()
print("Indexed Data:")
for element in indexed_data:
print(element)
In this example, we used the zipWithIndex method on the RDD, which creates a
new RDD containing tuples of the original elements and their corresponding index.
The collect method is then used to retrieve the results.
Interpreting the Results
The output of the example will be:
Indexed Data:
('USA', 0)
('INDIA', 1)
('CHINA', 2)
('JAPAN', 3)
('CANADA', 4)
Each element in the RDD is now paired with an index, starting from 0. The
zipWithIndex method assigns the index based on the position of each element in
the RDD.
Keep in mind that zipWithIndex might cause a performance overhead since it
requires a full pass through the RDD to assign indices. Consider using alternatives
such as zipWithUniqueId if unique identifiers are sufficient for your use case, as it
avoids this performance overhead.
In this article, we explored the use of zipWithIndex in PySpark, a method that
assigns an index to each element in an RDD. We provided a detailed example
using hardcoded values as input, showcasing how to create an RDD, use the
zipWithIndex method, and interpret the results. zipWithIndex can be useful when
you need to associate an index with each element in an RDD, but be cautious about
the potential performance overhead it may introduce.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder \
    .appName("Covariance Analysis Example") \
    .getOrCreate()
data_schema = StructType([
StructField("name", StringType(), True),
StructField("variable1", DoubleType(), True),
StructField("variable2", DoubleType(), True),
])
data = spark.createDataFrame([
("A", 1.0, 2.0),
("B", 2.0, 3.0),
("C", 3.0, 4.0),
("D", 4.0, 5.0),
("E", 5.0, 6.0),
], data_schema)
data.show()
Output
+----+---------+---------+
|name|variable1|variable2|
+----+---------+---------+
| A| 1.0| 2.0|
| B| 2.0| 3.0|
| C| 3.0| 4.0|
| D| 4.0| 5.0|
| E| 5.0| 6.0|
+----+---------+---------+
Calculating Covariance
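The calculation step and its output are missing from the capture; a sketch of how it would look:
cov_value = data.stat.cov("variable1", "variable2")
print("Covariance between variable1 and variable2:", cov_value)  # 2.5 for the sample data above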
In this example, we used the cov function from the stat module of the
DataFrame API to calculate the covariance between the two variables.
It’s important to note that covariance values are not standardized, making
them difficult to interpret in isolation. For a standardized measure of the
relationship between two variables, you may consider using correlation
analysis instead.
PySpark : Correlation Analysis in PySpark with a detailed
example
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder \
    .appName("Correlation Analysis Example") \
    .getOrCreate()
data_schema = StructType([
StructField("name", StringType(), True),
StructField("variable1", DoubleType(), True),
StructField("variable2", DoubleType(), True),
])
data = spark.createDataFrame([
("A", 1.0, 2.0),
("B", 2.0, 3.0),
("C", 3.0, 4.0),
("D", 4.0, 5.0),
("E", 5.0, 6.0),
], data_schema)
data.show()
Output
+----+---------+---------+
|name|variable1|variable2|
+----+---------+---------+
| A| 1.0| 2.0|
| B| 2.0| 3.0|
| C| 3.0| 4.0|
| D| 4.0| 5.0|
| E| 5.0| 6.0|
+----+---------+---------+
Calculating Correlation
Now, let’s calculate the correlation between variable1 and variable2:
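The calculation code is missing from the capture; a sketch consistent with the explanation below (VectorAssembler plus pyspark.ml.stat.Correlation):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

assembler = VectorAssembler(inputCols=["variable1", "variable2"], outputCol="features")
vector_df = assembler.transform(data).select("features")

corr_matrix = Correlation.corr(vector_df, "features").head()[0]
print("Correlation between variable1 and variable2:", corr_matrix.toArray()[0][1])  # 1.0 for this data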
In this example, we used the VectorAssembler to combine the two variables into a single feature vector
column called features. Then, we used the Correlation module from pyspark.ml.stat to calculate the
correlation between the two variables. The corr function returns a correlation matrix, from which we can
extract the correlation value between variable1 and variable2.
In our example, the correlation value is 1.0, which indicates a perfect positive linear
relationship between variable1 and variable2. This means that
as variable1 increases, variable2 increases proportionally, and vice versa.
In this article, we explored correlation analysis in PySpark, a statistical technique
used to measure the strength and direction of the relationship between two
continuous variables. We provided a detailed example using hardcoded values as
input, showcasing how to create a DataFrame, calculate the correlation between
two variables, and interpret the results. Correlation analysis can be useful in
various fields, such as finance, economics, and social sciences, to understand the
relationships between variables and make data-driven decisions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder \
    .appName("Broadcast Join Example @ Freshers.in") \
    .getOrCreate()
orders_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("customer_id", IntegerType(), True),
StructField("product_id", IntegerType(), True),
])
orders_data = spark.createDataFrame([
(1, 101, 1001),
(2, 102, 1002),
(3, 103, 1001),
(4, 104, 1003),
(5, 105, 1002),
], orders_schema)
products_schema = StructType([
StructField("product_id", IntegerType(), True),
StructField("product_name", StringType(), True),
StructField("price", IntegerType(), True),
])
products_data = spark.createDataFrame([
(1001, "Product A", 50),
(1002, "Product B", 60),
(1003, "Product C", 70),
], products_schema)
orders_data.show()
products_data.show()
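The join step itself is missing from the capture; a sketch using the broadcast hint on the smaller DataFrame:
from pyspark.sql.functions import broadcast

# Broadcast the small products DataFrame to every executor and join on product_id
joined = orders_data.join(broadcast(products_data), on="product_id", how="inner")
joined.show()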
This DataFrame provides a combined view of the orders and products, allowing for
further analysis, such as calculating the total order value or finding the most
popular products.
In this article, we explored broadcast joins in PySpark, an optimization technique
for joining a large DataFrame with a smaller DataFrame. We provided a detailed
example using hardcoded values as input to create two DataFrames and perform a
broadcast join. This method can significantly improve performance by reducing
data shuffling and network overhead during join operations. However, it’s crucial
to use broadcast joins only with small DataFrames, as broadcasting large
DataFrames can cause memory issues.
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder \
    .appName("RandomSplit @ Freshers.in Example") \
    .getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("timestamp", TimestampType(), True)
])

data = spark.createDataFrame([
    ("Sachin", 30, datetime.strptime("2022-12-01 12:30:15.123", "%Y-%m-%d %H:%M:%S.%f")),
    ("Barry", 25, datetime.strptime("2023-01-10 16:45:35.789", "%Y-%m-%d %H:%M:%S.%f")),
    ("Charlie", 35, datetime.strptime("2023-02-07 09:15:30.246", "%Y-%m-%d %H:%M:%S.%f")),
    ("David", 28, datetime.strptime("2023-03-15 18:20:45.567", "%Y-%m-%d %H:%M:%S.%f")),
    ("Eva", 22, datetime.strptime("2023-04-21 10:34:25.890", "%Y-%m-%d %H:%M:%S.%f"))
], schema)

data.show(20, False)
Output
+-------+---+--------------------+
| name|age| timestamp|
+-------+---+--------------------+
| Sachin| 30|2022-12-01 12:30:...|
| Barry| 25|2023-01-10 16:45:...|
|Charlie| 35|2023-02-07 09:15:...|
| David| 28|2023-03-15 18:20:...|
| Eva| 22|2023-04-21 10:34:...|
+-------+---+--------------------+
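The split itself is missing from the capture; a sketch consistent with the description and the two result sets below (the names train_df and test_df are assumed):
train_df, test_df = data.randomSplit([0.7, 0.3], seed=42)
train_df.show(20, False)
test_df.show(20, False)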
Output
+------+---+-----------------------+
|name |age|timestamp |
+------+---+-----------------------+
|Barry |25 |2023-01-10 16:45:35.789|
|Sachin|30 |2022-12-01 12:30:15.123|
|David |28 |2023-03-15 18:20:45.567|
|Eva |22 |2023-04-21 10:34:25.89 |
+------+---+-----------------------+
+-------+---+-----------------------+
|name |age|timestamp |
+-------+---+-----------------------+
|Charlie|35 |2023-02-07 09:15:30.246|
+-------+---+-----------------------+
The randomSplit function accepts two arguments: a list of weights for each
DataFrame and a seed for reproducibility. In this example, we’ve used the weights
[0.7, 0.3] to allocate approximately 70% of the data to the training set and 30% to
the testing set. The seed value 42 ensures that the split will be the same every time
we run the code.
Please note that the actual number of rows in the resulting DataFrames might not
exactly match the specified weights due to the random nature of the function.
However, with a larger dataset, the split will be closer to the specified weights.
Here we demonstrated how to use the randomSplit function in PySpark to divide a
DataFrame into smaller DataFrames based on specified weights. This function is
particularly useful for creating training and testing sets for machine learning tasks.
We provided an example using hardcoded values as input, showcasing how to
create a DataFrame and perform the random split.
Input Data
First, let’s load the dataset into a PySpark DataFrame:
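The loading code did not survive the capture; a sketch that matches the schema and rows shown below (the original may have used an explicit schema):
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import (hour, minute, second, date_format, year, month,
                                   dayofmonth, weekofyear, quarter, from_utc_timestamp)

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([
    ("Sachin", 30, datetime.strptime("2022-12-01 12:30:15.123", "%Y-%m-%d %H:%M:%S.%f")),
    ("Wilson", 25, datetime.strptime("2023-01-10 16:45:35.789", "%Y-%m-%d %H:%M:%S.%f")),
    ("Johnson", 35, datetime.strptime("2023-02-07 09:15:30.246", "%Y-%m-%d %H:%M:%S.%f")),
], ["name", "age", "timestamp"])
data.printSchema()
data.show(20, False)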
Schema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- timestamp: timestamp (nullable = true)
+-------+---+-----------------------+
|name |age|timestamp |
+-------+---+-----------------------+
|Sachin |30 |2022-12-01 12:30:15.123|
|Wilson |25 |2023-01-10 16:45:35.789|
|Johnson|35 |2023-02-07 09:15:30.246|
+-------+---+-----------------------+
Now, we will extract various time components from the ‘timestamp’ column using PySpark SQL
functions:
# 1. Extract hour
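The code line is missing from the capture; presumably it mirrors the minute example further below:
data.withColumn("hour", hour("timestamp")).show(20, False)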
Output
+-------+---+-----------------------+----+
|name |age|timestamp |hour|
+-------+---+-----------------------+----+
|Alice |30 |2022-12-01 12:30:15.123|12 |
|Bob |25 |2023-01-10 16:45:35.789|16 |
|Charlie|35 |2023-02-07 09:15:30.246|9 |
+-------+---+-----------------------+----+
# 2. Extract minute
data.withColumn("minute", minute("timestamp")).show(20, False)
Output
+-------+---+-----------------------+------+
|name |age|timestamp |minute|
+-------+---+-----------------------+------+
|Alice |30 |2022-12-01 12:30:15.123|30 |
|Bob |25 |2023-01-10 16:45:35.789|45 |
|Charlie|35 |2023-02-07 09:15:30.246|15 |
+-------+---+-----------------------+------+
# 3. Extract second
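The code line is missing from the capture; presumably:
data.withColumn("second", second("timestamp")).show(20, False)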
Output
+-------+---+-----------------------+------+
|name |age|timestamp |second|
+-------+---+-----------------------+------+
|Alice |30 |2022-12-01 12:30:15.123|15 |
|Bob |25 |2023-01-10 16:45:35.789|35 |
|Charlie|35 |2023-02-07 09:15:30.246|30 |
+-------+---+-----------------------+------+
# 4. Extract millisecond
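PySpark has no dedicated millisecond() function; one plausible reconstruction uses date_format with the 'SSS' pattern (which returns the millisecond part as a string):
data.withColumn("millisecond", date_format("timestamp", "SSS")).show(20, False)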
Output
+-------+---+-----------------------+-----------+
|name |age|timestamp |millisecond|
+-------+---+-----------------------+-----------+
|Alice |30 |2022-12-01 12:30:15.123|123 |
|Bob |25 |2023-01-10 16:45:35.789|789 |
|Charlie|35 |2023-02-07 09:15:30.246|246 |
+-------+---+-----------------------+-----------+
# 5. Extract year
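The code line is missing from the capture; presumably:
data.withColumn("year", year("timestamp")).show(20, False)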
Output
+-------+---+-----------------------+----+
|name |age|timestamp |year|
+-------+---+-----------------------+----+
|Alice |30 |2022-12-01 12:30:15.123|2022|
|Bob |25 |2023-01-10 16:45:35.789|2023|
|Charlie|35 |2023-02-07 09:15:30.246|2023|
+-------+---+-----------------------+----+
# 6. Extract month
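The code line is missing from the capture; presumably:
data.withColumn("month", month("timestamp")).show(20, False)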
Output
+-------+---+-----------------------+-----+
|name |age|timestamp |month|
+-------+---+-----------------------+-----+
|Alice |30 |2022-12-01 12:30:15.123|12 |
|Bob |25 |2023-01-10 16:45:35.789|1 |
|Charlie|35 |2023-02-07 09:15:30.246|2 |
+-------+---+-----------------------+-----+
# 7. Extract day
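The code line is missing from the capture; presumably it uses dayofmonth:
data.withColumn("day", dayofmonth("timestamp")).show(20, False)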
Output
+-------+---+-----------------------+---+
|name |age|timestamp |day|
+-------+---+-----------------------+---+
|Alice |30 |2022-12-01 12:30:15.123|1 |
|Bob |25 |2023-01-10 16:45:35.789|10 |
|Charlie|35 |2023-02-07 09:15:30.246|7 |
+-------+---+-----------------------+---+
# 8. Extract week
data.withColumn("week", weekofyear("timestamp")).show(20, False)
Output
+-------+---+-----------------------+----+
|name |age|timestamp |week|
+-------+---+-----------------------+----+
|Alice |30 |2022-12-01 12:30:15.123|48 |
|Bob |25 |2023-01-10 16:45:35.789|2 |
|Charlie|35 |2023-02-07 09:15:30.246|6 |
+-------+---+-----------------------+----+
# 9. Extract quarter
data.withColumn("quarter", quarter("timestamp")).show(20, False)
Output
+-------+---+-----------------------+-------+
|name |age|timestamp |quarter|
+-------+---+-----------------------+-------+
|Alice |30 |2022-12-01 12:30:15.123|4 |
|Bob |25 |2023-01-10 16:45:35.789|1 |
|Charlie|35 |2023-02-07 09:15:30.246|1 |
+-------+---+-----------------------+-------+
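The conversion code for this step is missing from the source; a sketch that matches the timestamp_local column below (the target zone America/New_York is an assumption based on the -5 hour offset shown):
from pyspark.sql.functions import from_utc_timestamp

data.withColumn("timestamp_local", from_utc_timestamp("timestamp", "America/New_York")).show(20, False)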
Output
+-------+---+-----------------------+-----------------------+
|name |age|timestamp |timestamp_local |
+-------+---+-----------------------+-----------------------+
|Alice |30 |2022-12-01 12:30:15.123|2022-12-01 07:30:15.123|
|Bob |25 |2023-01-10 16:45:35.789|2023-01-10 11:45:35.789|
|Charlie|35 |2023-02-07 09:15:30.246|2023-02-07 04:15:30.246|
+-------+---+-----------------------+-----------------------+
pyspark.sql.functions.map_from_arrays(keys, values)
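The code that builds the sample DataFrame is not shown; a sketch based on the Keys and Values arrays visible in the output below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import map_from_arrays

spark = SparkSession.builder.appName("MapFromArraysExample").getOrCreate()

df = spark.createDataFrame(
    [(["a", "b", "c"], [1, 2, 3]),
     (["x", "y", "z"], [4, 5, 6])],
    ["Keys", "Values"])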
Now that we have our DataFrame, let’s apply the map_from_arrays function to it:
df.withColumn("Map", map_from_arrays("Keys", "Values")).show(truncate=False)
Output
+---------+---------+------------------------+
|Keys |Values |Map |
+---------+---------+------------------------+
|[a, b, c]|[1, 2, 3]|{a -> 1, b -> 2, c -> 3}|
|[x, y, z]|[4, 5, 6]|{x -> 4, y -> 5, z -> 6}|
+---------+---------+------------------------+
In this example, we created a PySpark DataFrame with two array columns, “Keys” and “Values”, and
applied the map_from_arrays function to combine them into a “Map” column. The output DataFrame
displays the original keys and values arrays, as well as the resulting map column.
The PySpark map_from_arrays function is a powerful and convenient tool for
working with array columns and transforming them into a map column. With the
help of the detailed example provided in this article, you should be able to
effectively use the map_from_arrays function in your own PySpark projects.
How to remove csv header using Spark (PySpark)
A common use case when dealing with CSV files is to remove the header from the source before doing data analysis. In PySpark this can be done as below.
Source code (PySpark with Python 3.6 and Spark 3; also compatible with Spark 2.2+ and Python 2.7):
from pyspark import SparkContext
import csv

sc = SparkContext()
# Read the CSV file as an RDD of text lines
readFile = sc.textFile("D:\\Users\\speedika\\PycharmProjects\\sparkprojects\\sample_csv_01.csv")
# Pair every line with its zero-based index
file_with_indx = readFile.zipWithIndex()
# Keep only the lines whose index is greater than 0, i.e. drop the header
rmHeader = file_with_indx.filter(lambda x : x[1] > 0).map(lambda x : x[0])
# Parse each remaining line with the csv module and print it
cleanse_data = rmHeader.map(lambda row: next(csv.reader([row])))
for rec in cleanse_data.collect():
    print(rec)
Code Explanation
file_with_indx = readFile.zipWithIndex()
The zipWithIndex() transformation pairs each element of the RDD with its index. Each row in the CSV gets an index attached, starting from 0.
rmHeader = file_with_indx.filter(lambda x : x[1] > 0).map(lambda x : x[0])
This keeps only the rows whose index is greater than 0, i.e. it removes the header row. If you want to skip the first 'n' rows instead, use the same code with the condition changed to x[1] >= n.
Note: the print statements are only there to show the result.
Sample data
Name,Country,Phone
TOM,USA,343-098-292
JACK,CHINA,783-098-232
CHARLIE,INDIA,873-984-123
SUSAN,JAPAN,898-231-987
MIKE,UK,987-989-121
Result
['TOM', 'USA', '343-098-292']
['JACK', 'CHINA', '783-098-232']
['CHARLIE', 'INDIA', '873-984-123']
['SUSAN', 'JAPAN', '898-231-987']
['MIKE', 'UK', '987-989-121']
PySpark : How do I read a parquet file in Spark
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ReadParquet").getOrCreate()
df = spark.read.format("parquet").load("hdfs://path/to/directory")
You can also read a parquet file with filtering using the where method
df = spark.read.parquet("freshers_path/to/freshers_in.parquet").where("column_name = 'value'")
In addition to reading a single Parquet file, you can also read a directory containing
multiple Parquet files by specifying the directory path instead of a file path, like
this:
df = spark.read.parquet("freshers_path/to/directory")
You can also specify the schema of the Parquet file explicitly with the schema method:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
df = spark.read.schema(schema).parquet("freshers_path/to/file.parquet")
By providing the schema, Spark will skip the expensive process of inferring the
schema from the parquet file, which can be useful when working with large
datasets.
In pyspark what is the difference between Spark spark.table()
and spark.read.table()
In PySpark, spark.table() is used to read a table from the Spark catalog, whereas
spark.read.table() is used to read a table from a structured data source, such as a
data lake or a database.
The spark.table() method requires that the table already exists in the Spark catalog, registered for example with spark.catalog.createTable(), saveAsTable(), or a CREATE TABLE SQL statement. Once a table has been registered in the catalog, you can use the spark.table() method to access it.
spark.read.table(), on the other hand, goes through the DataFrameReader and also returns a DataFrame; the DataFrameReader additionally lets you configure an external data source and its read options.
Here is an example of using the DataFrameReader to read a table from a database over JDBC:
df = spark.read.format("jdbc") \
.option("url", "jdbc:postgresql://localhost/mydatabase") \
.option("dbtable", "mytable") \
.option("user", "username") \
.option("password", "password") \
.load()
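For comparison, reading a table that is already registered in the catalog can be done with either call (the table name used here is hypothetical):
df1 = spark.table("my_database.my_table")
df2 = spark.read.table("my_database.my_table")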
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- hobbies: map (nullable = true)
| |-- key: string
| |-- value: integer
Result
+----+---+------------------------------+
|name|age|hobbies |
+----+---+------------------------------+
|John|30 |[reading -> 3, traveling -> 5]|
|Jane|25 |[painting -> 4, cooking -> 2] |
+----+---+------------------------------+
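The next fragment demonstrates to_date on a DataFrame called car_df; the code that builds it is not present in the source. A sketch that matches the schema and rows below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("ToDateExample").getOrCreate()

car_schema = StructType([
    StructField("si_no", IntegerType()),
    StructField("country_origin", StringType()),
    StructField("car_make_year", StringType())])
car_df = spark.createDataFrame(
    [(1, "Japan", "2023-01-11"), (2, "Italy", "2023-04-21"), (3, "France", "2023-05-22"),
     (4, "India", "2023-07-18"), (5, "USA", "2023-08-23")],
    car_schema)
car_df.printSchema()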
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- car_make_year: string (nullable = true)
car_df_updated = car_df.withColumn("car_make_year_dt",to_date("car_make_year"))
car_df_updated.show()
+-----+--------------+-------------+----------------+
|si_no|country_origin|car_make_year|car_make_year_dt|
+-----+--------------+-------------+----------------+
| 1| Japan| 2023-01-11| 2023-01-11|
| 2| Italy| 2023-04-21| 2023-04-21|
| 3| France| 2023-05-22| 2023-05-22|
| 4| India| 2023-07-18| 2023-07-18|
| 5| USA| 2023-08-23| 2023-08-23|
+-----+--------------+-------------+----------------+
Check the schema that is printed; you can see the date data type for the new column car_make_year_dt.
car_df_updated.printSchema()
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- car_make_year: string (nullable = true)
|-- car_make_year_dt: date (nullable = true)
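The query that produces the next output is also not shown; a sketch consistent with the column header to_date(car_table.`car_make_year`) (the temporary view name car_table is taken from that header):
car_df.createOrReplaceTempView("car_table")
car_sql_df = spark.sql(
    "SELECT si_no, country_origin, to_date(car_table.`car_make_year`) FROM car_table")
car_sql_df.show()
car_sql_df.printSchema()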
+-----+--------------+----------------------------------+
|si_no|country_origin|to_date(car_table.`car_make_year`)|
+-----+--------------+----------------------------------+
| 1| Japan| 2023-01-11|
| 2| Italy| 2023-04-21|
| 3| France| 2023-05-22|
| 4| India| 2023-07-18|
| 5| USA| 2023-08-23|
+-----+--------------+----------------------------------+
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- to_date(car_table.`car_make_year`): date (nullable = true)
One of the important concepts in PySpark is data encoding and decoding, which
refers to the process of converting data into a binary format and then converting it
back into a readable format.
In PySpark, encoding and decoding are performed using various methods that are
available in the library. The most commonly used methods are base64 encoding
and decoding, which is a standard encoding scheme that is used for converting
binary data into ASCII text. This method is used for transmitting binary data over
networks, where text data is preferred over binary data.
Another popular method for encoding and decoding in PySpark is the JSON
encoding and decoding. JSON is a lightweight data interchange format that is easy
to read and write. In PySpark, JSON encoding is used for storing and exchanging
data between systems, whereas JSON decoding is used for converting the encoded
data back into a readable format.
Additionally, PySpark also provides support for encoding and decoding data in the
Avro format. Avro is a data serialization system that is used for exchanging data
between systems. It is similar to JSON encoding and decoding, but it is more
compact and efficient. Avro encoding and decoding in PySpark is performed using
the Avro library.
To perform encoding and decoding in PySpark, one must first create a Spark
context and then import the necessary libraries. The data to be encoded or decoded
must then be loaded into the Spark context, and the appropriate encoding or
decoding method must be applied to the data. Once the encoding or decoding is
complete, the data can be stored or transmitted as needed.
In conclusion, encoding and decoding are important concepts in PySpark, as they
are used for storing and exchanging data between systems. PySpark provides
support for base64 encoding and decoding, JSON encoding and decoding, and
Avro encoding and decoding, making it a powerful tool for big data analysis.
Whether you are a data scientist or a software engineer, understanding the basics of
PySpark encoding and decoding is crucial for performing effective big data
analysis.
Here is a sample PySpark program that demonstrates how to perform base64
decoding using PySpark:
from pyspark import SparkContext
from pyspark.sql import SparkSession
import base64
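The lines that create the session and the sample DataFrame are missing from the listing; based on steps 2 and 3 of the explanation further below, they could look like this (the encoded values are taken from the output):
# Initialize the SparkContext ("local" master, "base64 decode example" app name) and a SparkSession
sc = SparkContext("local", "base64 decode example")
spark = SparkSession(sc)

# Sample data: a key plus a base64 encoded payload
encoded_df = spark.createDataFrame(
    [("data1", "ZGF0YTE="), ("data2", "ZGF0YTI=")],
    ["key", "encoded_data"])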
# Create a UDF (User Defined Function) for decoding base64 encoded data
decode_udf = spark.udf.register("decode", lambda x: base64.b64decode(x).decode("utf-8"))
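The lines that apply the UDF and display the result (steps 5 and 6 of the explanation) are likewise not shown; a minimal sketch:
from pyspark.sql.functions import col

decoded_df = encoded_df.withColumn("decoded_data", decode_udf(col("encoded_data")))
decoded_df.show()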
Output
+-----+------------+------------+
| key|encoded_data|decoded_data|
+-----+------------+------------+
|data1| ZGF0YTE=| data1|
|data2| ZGF0YTI=| data2|
+-----+------------+------------+
Explanation
1. The first step is to import the necessary
libraries, SparkContext and SparkSession from pyspark and base64 library.
2. Next, we initialize the SparkContext and SparkSession by creating an instance of SparkContext
with the name “local” and “base64 decode example” as the application name.
3. In the next step, we create a Spark dataframe with two columns, key and encoded_data, and
load some sample data into the dataframe.
4. Then, we create a UDF (User Defined Function) called decode which takes a base64 encoded
string as input and decodes it using the base64.b64decode method and returns the decoded
string. The .decode("utf-8") is used to convert the binary decoded data into a readable string
format.
5. After creating the UDF, we use the withColumn method to apply the UDF to
the encoded_data column of the dataframe and add a new column called decoded_data to
store the decoded data.
6. Finally, we display the decoded data using the show method.
from pyspark.sql import SparkSession
from pyspark.sql.functions import ceil
# Create a SparkSession
spark = SparkSession.builder.appName("Ceil Example").getOrCreate()
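The rest of the listing is missing from the source; a sketch consistent with the description and output below (the sample values 1.2, 2.7, 3.1 and 4.5 are assumptions chosen so that they round up to 2, 3, 4 and 5):
df = spark.createDataFrame([(1.2,), (2.7,), (3.1,), (4.5,)], ["num"])
df.withColumn("rounded_num", ceil("num")).select("rounded_num").show()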
This code creates a SparkSession and a DataFrame with a single column “num” containing some sample
decimal numbers. Then it uses the ceil() function to round these numbers up to the nearest integer and
create a new column “rounded_num” with the result. The DataFrame is then displayed and show the
rounded number.
The output of this code will be:
+-----------+
|rounded_num|
+-----------+
| 2|
| 3|
| 4|
| 5|
+-----------+
spark.sql.hive.convertMetastoreParquet
and
spark.sql.hive.metastorePartitionPruning
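A sketch of enabling these two properties on an active session (they can also be set with --conf at submit time or in spark-defaults.conf):
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")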
You can also enable predicate pushdown while creating a Dataframe using
the .filter() method in the following way:
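The code for this example is not present in the source; a minimal sketch of reading a Parquet table and applying a filter that Spark can push down to the storage layer (the path and column name are hypothetical):
df = (spark.read.parquet("s3://bucket/path/to/table")     # hypothetical path
          .filter("event_date >= '2023-01-01'"))          # filter on a plain column, so it can be pushed down
df.explain()  # the physical plan lists the PushedFilters for the Parquet scan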
It’s worth noting that for this technique to work, the data must be stored in a
format that supports predicate pushdown, such as Parquet or ORC.
Additionally, the optimization only works when the filter conditions are
expressed in terms of the columns of the table, not on the result of an
expression.
It is also worth noting that when using the Hive metastore, partition pruning should also be enabled by setting spark.sql.hive.metastorePartitionPruning to true, so that the filtering conditions are pushed down to the storage layer.
How to run dataframe as Spark SQL – PySpark
If you have a situation where you can easily get the result using SQL, or the SQL already exists, you can convert the DataFrame to a table and run the query on top of it. Converting a DataFrame to a table is done as below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Register the DataFrame (here called myDF) as a temporary table so it can be queried with SQL
myDF.registerTempTable("sql_df")  # on Spark 2+, createOrReplaceTempView("sql_df") is the preferred call
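The query that produces tot_salary is not shown in the source; a sketch that matches the output below (the column names department and salary in sql_df are assumptions):
tot_salary = spark.sql(
    "SELECT department, SUM(salary) AS total_salary FROM sql_df GROUP BY department")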
tot_salary.show(30,False)
+----------+------------+
|department|total_salary|
+----------+------------+
|Teacher |900 |
|Finance |1120 |
+----------+------------+
You can also try the below to get all the columns from the data frame:
tot_salary.selectExpr('*').show()
AWS Glue : Example on how to read a sample csv file with
PySpark
Here, assume that you have your CSV data in an AWS S3 bucket. The next step is to crawl the data in the S3 bucket. Once that is done, you will find that the crawler has created a metadata table for your CSV data.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# The built-in "csv" format can be used instead of "com.databricks.spark.csv" on Spark 2+
freshers_data = spark.read.format("com.databricks.spark.csv").option(
    "header", "true").option(
    "inferSchema", "true").load(
    's3://freshers_in_datasets/training/students/final_year.csv')
freshers_data.printSchema()
Result
root
PySpark : HiveContext in PySpark – A brief explanation
One of the key components of PySpark is the HiveContext, which provides a SQL-
like interface to work with data stored in Hive tables. The HiveContext provides a
way to interact with Hive from PySpark, allowing you to run SQL queries against
tables stored in Hive. Hive is a data warehousing system built on top of Hadoop,
and it provides a way to store and manage large datasets. By using the
HiveContext, you can take advantage of the power of Hive to query and analyze
data in PySpark.
The HiveContext is created using the SparkContext, which is the entry point for
PySpark. Once you have created a SparkContext, you can create a HiveContext as
follows:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sparkContext)
The HiveContext provides a way to create DataFrame objects from Hive tables,
which can be used to perform various operations on the data. For example, you can
use the select method to select specific columns from a table, and you can use
the filter method to filter rows based on certain conditions.
# create a DataFrame from a Hive table
df = hiveContext.table("my_table")
# select specific columns from the DataFrame
df.select("col1", "col2")
# filter rows based on a condition
df.filter(df.col1 > 10)
You can also create temporary tables in the HiveContext, which are not persisted
to disk but can be used in subsequent queries. To create a temporary table, you can
use the registerTempTable method:
# create a temporary table from a DataFrame
df.registerTempTable("my_temp_table")
# query the temporary table
hiveContext.sql("SELECT * FROM my_temp_table WHERE col1 > 10")
In addition to querying and analyzing data, the HiveContext also provides a way to
write data back to Hive tables. You can use the saveAsTable method to write a
DataFrame to a new or existing Hive table:
# write a DataFrame to a Hive table
df.write.saveAsTable("freshers_in_table")
The HiveContext in PySpark provides a powerful SQL-like interface for working with data stored in Hive. It allows you to easily query and analyze large datasets, and it provides a way to write data back to Hive tables. Note that in Spark 2.0 and later, HiveContext is deprecated in favor of a SparkSession created with enableHiveSupport(), but the same methods remain available. By using Hive support, you can take advantage of the power of Hive in your PySpark applications.
PySpark : How to decode in PySpark ?
pyspark.sql.functions.decode
The pyspark.sql.functions.decode Function in PySpark
PySpark is a popular library for processing big data using Apache Spark. One of its
many functions is the pyspark.sql.functions.decode function, which is used to
convert binary data into a string using a specified character set. The
pyspark.sql.functions.decode function takes two arguments: the first argument is
the binary data to be decoded, and the second argument is the character set to use
for decoding the binary data.
The pyspark.sql.functions.decode function in PySpark supports the following character sets: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16. The character set specified in the second argument must match one of these supported character sets in order to perform the decoding successfully.
Here’s a simple example to demonstrate the use of the
pyspark.sql.functions.decode function in PySpark:
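The example code itself is missing from the source; a sketch that produces the output below (the column is created as a string here and implicitly cast to binary by decode; the original may well have built a true binary column instead):
from pyspark.sql import SparkSession
from pyspark.sql.functions import decode, col

spark = SparkSession.builder.appName("DecodeExample").getOrCreate()

df = spark.createDataFrame([("Team",), ("Freshers.in",)], ["binary_data"])
df.withColumn("string_data", decode(col("binary_data"), "UTF-8")).show()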
Output
+-----------+-----------+
|binary_data|string_data|
+-----------+-----------+
| Team| Team|
|Freshers.in|Freshers.in|
+-----------+-----------+
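The next fragment demonstrates the expr function; its code is not included in the source. A sketch consistent with the schemas shown below (the sample values are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("ExprExample").getOrCreate()

df = spark.createDataFrame([(1, "100"), (2, "200"), (3, "300")], ["id", "value"])
df.printSchema()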
root
|-- id: long (nullable = true)
|-- value: string (nullable = true)
Use expr
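A sketch of the cast described in the text below, using expr:
df = df.withColumn("value", expr("CAST(value AS INT)"))
df.printSchema()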
root
|-- id: long (nullable = true)
|-- value: integer (nullable = true)
In this example, we create a Spark dataframe with two columns, id and value. The
value column is a string column, but we want to convert it to a numeric column. To
do this, we use the expr function to create a column expression that casts the value
column as an integer. The result is a new Spark dataframe with the value column
converted to a numeric column.
Another common use for expr is to perform operations on columns. For example,
you can use expr to create a new column that is the result of a calculation involving
multiple columns. Here is an example:
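The code for this calculation is not shown; a sketch that matches the result below:
df2 = spark.createDataFrame([(1, 100, 10), (2, 200, 20), (3, 300, 30)], ["id", "value1", "value2"])
df2.withColumn("sum", expr("value1 + value2")).show()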
Result
+---+------+------+---+
| id|value1|value2|sum|
+---+------+------+---+
| 1| 100| 10|110|
| 2| 200| 20|220|
| 3| 300| 30|330|
+---+------+------+---+
In this example, we create a Spark dataframe with three columns, id, value1, and
value2. We use the expr function to create a new column, sum, that is the result of
adding value1 and value2. The result is a new Spark dataframe with the sum
column containing the result of the calculation.
The expr module also provides a number of other functions that can be used to
perform operations on Spark dataframes. For example, you can use the coalesce
function to select the first non-null value from a set of columns, the ifnull function
to return a specified value if a column is null, and the case function to perform
conditional operations on columns.
In conclusion, the expr module in PySpark provides a convenient and flexible way
to perform operations on Spark dataframes. Whether you want to transform
columns, calculate new columns, or perform other operations, the expr module
provides the tools you need to do so.
Explain dense_rank. How to use dense_rank function in
PySpark ?
In PySpark, the dense_rank function is used to assign a rank to each row within a
result set, based on the values of one or more columns. It is a window function that
assigns a unique rank to each unique value within a result set, with no gaps in the
ranking values.
The dense_rank function is a window function that assigns a rank to each row within a result set, based on the values in one or more columns. The rank assigned is unique and dense, meaning that there are no gaps in the sequence of rank values. For example, if three rows have the same value in the column used for ranking, they are all assigned the same rank, and the next distinct value receives the very next rank (rank(), by contrast, would skip ahead by three). The dense_rank function is typically used in conjunction with an ORDER BY clause, and often a PARTITION BY clause, to define how the result set is ordered and grouped for ranking.
Here is an example of how to use the dense_rank function in PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dense_rank").getOrCreate()
data = [("Peter John", 25), ("Wisdon Mike", 30), ("Sarah Johns", 25), ("Bob Beliver", 22), ("Lucas Marget", 30)]
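The remainder of the example is not shown in the source; a sketch that reproduces the output below (partitioning by age and ordering by name, which is what the output implies):
from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

df = spark.createDataFrame(data, ["name", "age"])
window_spec = Window.partitionBy("age").orderBy("name")
df.withColumn("rank", dense_rank().over(window_spec)).orderBy("age", "name").show()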
In this example, the dense_rank function assigns a rank to each row within its “age” partition, ordered by the “name” column. The output will be:
+------------+---+----+
| name|age|rank|
+------------+---+----+
| Bob Beliver| 22| 1|
| Peter John| 25| 1|
| Sarah Johns| 25| 2|
|Lucas Marget| 30| 1|
| Wisdon Mike| 30| 2|
+------------+---+----+
This means that Peter John and Sarah Johns have the same age with Peter John
having 1st rank and Sarah Johns having 2nd rank.
PySpark : Combine two or more arrays into a single array of
tuple
pyspark.sql.functions.arrays_zip
In PySpark, the arrays_zip function can be used to combine two or more arrays into a single array of tuples. Each tuple in the resulting array contains the elements from the corresponding position in the input arrays. It returns a merged array of structs in which the N-th struct contains the N-th value of each input array.
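The DataFrame creation is not shown; a sketch based on the table below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip

spark = SparkSession.builder.appName("ArraysZipExample").getOrCreate()

df = spark.createDataFrame(
    [([1, 2, 3], ["Sam John", "Perter Walter", "Johns Mike"])],
    ["si_no", "name"])
df.show(20, False)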
+---------+-------------------------------------+
|si_no |name |
+---------+-------------------------------------+
|[1, 2, 3]|[Sam John, Perter Walter, Johns Mike]|
+---------+-------------------------------------+
zipped_array = df.select(arrays_zip(df.si_no,df.name))
zipped_array.show(20,False)
Result
+----------------------------------------------------+
|arrays_zip(si_no, name)                             |
+----------------------------------------------------+
|[[1, Sam John], [2, Perter Walter], [3, Johns Mike]]|
+----------------------------------------------------+
You can also use arrays_zip with more than two arrays as input. For example:
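A sketch with a third array added (the age values are taken from the result below):
df3 = spark.createDataFrame(
    [([1, 2, 3], ["Sam John", "Perter Walter", "Johns Mike"], [23, 43, 41])],
    ["si_no", "name", "age"])
df3.select(arrays_zip(df3.si_no, df3.name, df3.age)).show(20, False)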
Result
+----------------------------------------------------------------+
|arrays_zip(si_no, name, age) |
+----------------------------------------------------------------+
|[[1, Sam John, 23], [2, Perter Walter, 43], [3, Johns Mike, 41]]|
+----------------------------------------------------------------+
The LAG and LEAD window functions take the following parameters:
column: the column or expression to apply the LAG or LEAD function on.
offset: the number of rows to look behind (LAG) or ahead (LEAD) from the current row (default is 1).
default: the value to return when no previous or next row exists. If not specified, it returns NULL.
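The DataFrame used in this example is not shown; a sketch based on the output below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lag, lead
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("LagLeadExample").getOrCreate()

sales_df = spark.createDataFrame(
    [("2023-01-01", 100), ("2023-02-01", 200), ("2023-03-01", 300), ("2023-04-01", 400)],
    ["Date", "Sales"])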
Now that we have our DataFrame, let’s apply the LAG and LEAD functions
using a Window specification:
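A sketch of applying LAG and LEAD over a window ordered by Date (the new column names are taken from the output below):
window_spec = Window.orderBy("Date")
sales_df.withColumn("Previous Month Sales", lag("Sales", 1).over(window_spec)) \
        .withColumn("Next Month Sales", lead("Sales", 1).over(window_spec)) \
        .show()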
+----------+-----+--------------------+----------------+
| Date|Sales|Previous Month Sales|Next Month Sales|
+----------+-----+--------------------+----------------+
|2023-01-01| 100| null| 200|
|2023-02-01| 200| 100| 300|
|2023-03-01| 300| 200| 400|
|2023-04-01| 400| 300| null|
+----------+-----+--------------------+----------------+
In this example, we used the LAG function to obtain the sales from the previous month and the LEAD function to obtain the sales from the following month.
The last_day function is a part of the PySpark SQL library, which provides various
functions to work with dates and times. It is useful when you need to perform time-
based aggregations or calculations based on the end of the month.
Syntax:
pyspark.sql.functions.last_day(date)
To illustrate the usage of the last_day function, let’s create a PySpark DataFrame
containing date information and apply the function to it.
First, let’s import the necessary libraries and create a sample DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.functions import last_day, to_date
from pyspark.sql.types import StringType, DateType
# Sample data
data = [("2023-01-15",), ("2023-02-25",), ("2023-03-05",), ("2023-04-10",)]
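The DataFrame creation itself is missing from the listing; a sketch (the column name Date is taken from the output further below):
spark = SparkSession.builder.appName("LastDayExample").getOrCreate()
df = spark.createDataFrame(data, ["Date"]).withColumn("Date", to_date("Date"))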
Now that we have our DataFrame, let’s apply the last_day function to it:
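A sketch of applying last_day, consistent with the output below:
df.withColumn("Last Day of Month", last_day("Date")).show()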
Output
+----------+-----------------+
| Date|Last Day of Month|
+----------+-----------------+
|2023-01-15| 2023-01-31|
|2023-02-25| 2023-02-28|
|2023-03-05| 2023-03-31|
|2023-04-10| 2023-04-30|
+----------+-----------------+
In this example, we created a PySpark DataFrame with a date column and applied
the last_day function to calculate the last day of the month for each date. The
output DataFrame displays the original date along with the corresponding last day
of the month.
The PySpark last_day function is a powerful and convenient tool for working with
dates, particularly when you need to determine the last day of the month for a
given date. With the help of the detailed example provided in this article, you
should be able to effectively use the last_day function in your own PySpark
projects.
PySpark-What is map side join and How to perform map side
join in Pyspark
Map-side join is a method of joining two datasets in PySpark where one dataset is
broadcast to all executors, and then the join is performed in the same executor,
instead of shuffling and sorting the data across multiple executors. This can
significantly reduce the amount of data shuffling and improve performance for
large datasets.
To perform a map-side join in PySpark, you can use the broadcast() function to
broadcast one of the datasets, and then use the join() function to perform the join.
Here’s an example of how to perform a map-side join in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
# Create a SparkSession
spark = SparkSession.builder.appName("Map-side join example").getOrCreate()
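The DataFrames and the join itself are not shown in the source; a sketch that matches the output below (no join key is given, so this is effectively a broadcast cross join; the id and value data are assumptions read off the output):
df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
df2 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value"])

# Broadcast the smaller DataFrame; with no join condition this becomes a broadcast nested-loop (cross) join
# (on Spark 2.x you may need spark.sql.crossJoin.enabled=true for this to run)
joined = df1.join(broadcast(df2))
joined.show()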
In the above example, df2 is broadcasted and the join is performed in the same
executor where the broadcasted dataframe is present.
Output
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
| 1| a| 1| A|
| 1| a| 2| B|
| 1| a| 3| C|
| 2| b| 1| A|
| 2| b| 2| B|
| 2| b| 3| C|
| 3| c| 1| A|
| 3| c| 2| B|
| 3| c| 3| C|
+---+-----+---+-----+
It’s worth noting that a map-side (broadcast) join is only efficient when one of the datasets is small enough to fit in the memory of each executor; broadcasting a large dataset can cause out-of-memory errors on the executors. Keep the broadcast side small, otherwise the join will be slow or may fail.
Comparing PySpark with Map Reduce programming
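The code block is missing from the source; a minimal sketch that matches the description and output below:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
numbers = sc.parallelize([1, 2, 3, 4, 5])
doubled = numbers.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10]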
The output of this code will be [2, 4, 6, 8, 10]. The map operation takes a lambda
function (or any other function) that takes a single integer as input and returns its
double. The collect action is used to retrieve the elements of the RDD back to the
driver program as a list.
PySpark : How to create a map from a column of structs :
map_from_entries
pyspark.sql.functions.map_from_entries
map_from_entries(col) is a function in PySpark that creates a map from a column of structs, where each struct has two fields: a key and a value. It is a collection function that returns a map created from the given array of entries.
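The example code is not included in the source; a sketch that is consistent with the schema and result below (the entries column and the sample values are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import map_from_entries

spark = SparkSession.builder.appName("MapFromEntriesExample").getOrCreate()

# Each row carries an array of (key, value) structs
entries_df = spark.createDataFrame(
    [(1, "John", 25000, [("name", "John"), ("age", "25")]),
     (2, "Mike", 30000, [("name", "Mike"), ("age", "30")]),
     (3, "Sophia", 35000, [("name", "Sophia"), ("age", "35")])],
    ["id", "name", "salary", "entries"])
df = entries_df.withColumn("map_col", map_from_entries("entries")).drop("entries")
df.printSchema()
df.show(truncate=False)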
In this example, we first import the necessary functions and create a SparkSession. We then create a DataFrame with a column of (key, value) structs and use map_from_entries to turn it into a map column called “map_col”, as shown in the schema and result below.
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- salary: long (nullable = true)
|-- map_col: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Result
+---+------+------+---------------------------+
|id |name |salary|map_col |
+---+------+------+---------------------------+
|1 |John |25000 |[name -> John, age -> 25] |
|2 |Mike |30000 |[name -> Mike, age -> 30] |
|3 |Sophia|35000 |[name -> Sophia, age -> 35]|
+---+------+------+---------------------------+
In PySpark, creating a map column from entries allows you to convert existing
columns in a DataFrame into a map, where each row in the DataFrame becomes a
key-value pair in the map. This can be useful for organizing and structuring data in
a more readable and efficient way. Additionally, it can also be used to perform
operations such as filtering, aggregation and joining on the map column.
array_except
In PySpark, array_except returns an array of the elements that are in one column but not in the other, without duplicates.
Syntax :
array_except(array1, array2)
Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_except

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()
raw_data = [
    ("Berkshire", ["Alabama", "Alaska", "Arizona"],
     ["Alabama", "Alaska", "Arizona", "Arkansas"]),
    ("Allianz", ["California", "Connecticut", "Delaware"],
     ["California", "Colorado", "Connecticut", "Delaware"]),
    ("Zurich", ["Delaware", "Florida", "Georgia", "Hawaii", "Idaho"],
     ["Delaware", "Florida", "Georgia", "Hawaii", "Idaho"]),
    ("AIA", ["Iowa", "Kansas", "Kentucky"],
     ["Iowa", "Kansas", "Kentucky", "Louisiana"]),
    ("Munich", ["Hawaii", "Idaho", "Illinois", "Indiana"],
     ["Hawaii", "Illinois", "Indiana"])]
df = spark.createDataFrame(data=raw_data, schema=["Insurace_Provider", "Country_2022", "Country_2023"])
df.show(20, False)
df2 = df.select(array_except(df.Country_2023, df.Country_2022))
df2.show(20, False)
df3 = df.select(array_except(df.Country_2022, df.Country_2023))
df3.show(20, False)
df4 = df.withColumn("Insurance_Company", df.Insurace_Provider) \
    .withColumn("Newly_Introduced_Country", array_except(df.Country_2023, df.Country_2022)) \
    .withColumn("Operation_Closed_Country", array_except(df.Country_2022, df.Country_2023))
df4.show(20, False)