Redshift ETL with AWS Glue Dynamic Frames

This notebook walks through transforming raw data with AWS Glue and writing the transformed data to Amazon Redshift tables. The key steps are: 1. Create dynamic frames from the raw and reference tables in the Glue catalog and explore their schemas. 2. Define transformation functions that combine the partition columns and apply them to each record in the raw dynamic frame. 3. Apply the transformations, write the results to a new transformed dynamic frame, and load it into Redshift.


Redshift Lab
Take your time to read through the instructions provided in this notebook.

Learning Objectives
Understand how to interactively author Glue ETL scripts using Glue Dev Endpoints & SageMaker notebooks (this portion has already been covered in the "Transform Data with AWS Glue" module).
Use Glue to perform record-level transformations and write the results to Redshift tables.

Here are the steps we will perform:


[Diagram outlining the lab steps]

Execute the code blocks one cell at a time

Execute Code 🔻

In [1]:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

Starting Spark application

ID  YARN Application ID             Kind     State  Spark UI                                                                           Driver log
0   application_1670937024714_0001  pyspark  idle   http://ip-172-35-233-71.ec2.internal:20888/proxy/application_1670937024714_0001/  ...203.ec2.internal:8042/node/containerlogs/container_

SparkSession available as 'spark'.

Exploring your raw dataset


In this step you will:
Create a dynamic frame for your 'raw' table from the AWS Glue catalog
Explore the schema of the datasets
Count the rows in the raw table
View a sample of the data

Glue Dynamic Frames Basics


AWS Glue's dynamic frames are a powerful data structure.
They provide a precise representation of the underlying semi-structured data, especially when dealing with columns or fields with varying types.
They also provide powerful primitives to deal with nesting and unnesting.
A dynamic record is a self-describing record: each record encodes its columns and types, so every record can have a schema that is unique from all others in the dynamic frame.
For ETL, something more dynamic than a rigid-schema DataFrame was needed, hence the Glue DynamicFrame: an implementation of a DataFrame that relaxes the requirement of having a rigid schema, designed for semi-structured data.
It maintains a schema per record, making it easy to restructure, tag, and modify. A short sketch follows the link below.

Read more: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
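To make the per-record flexibility concrete, below is a minimal sketch (not part of the original lab; the toy data is invented, and it assumes the glueContext and spark session created in the cells that follow) showing conversion between a Spark DataFrame and a DynamicFrame, plus resolveChoice, one of the primitives for pinning down fields with varying types:

from awsglue.dynamicframe import DynamicFrame

# Toy DataFrame with 'value' stored as strings (illustrative data, not from the lab)
df = spark.createDataFrame([("a", "1"), ("b", "2")], ["id", "value"])

# Wrap it in a DynamicFrame; the name argument is just a label
dyf = DynamicFrame.fromDF(df, glueContext, "example")

# resolveChoice pins down fields with ambiguous or varying types,
# e.g. cast everything in 'value' to int
dyf_resolved = dyf.resolveChoice(specs=[("value", "cast:int")])

# Convert back to a DataFrame when full Spark SQL functionality is needed
dyf_resolved.toDF().show()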

Execute Code 🔻

In [3]:

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

Create dynamic frame from Glue catalog


In this block we use the GlueContext to create a new DynamicFrame from the Glue catalog.

Other ways to create DynamicFrames in Glue (see the sketch after this list):

create_dynamic_frame_from_rdd
create_dynamic_frame_from_catalog
create_dynamic_frame_from_options

Read more: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
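As a quick hedged sketch of the from_options variant (not a cell from the original notebook; the S3 path is illustrative), the raw JSON could be read directly from S3 without going through the catalog:

# Sketch: read JSON straight from S3 with create_dynamic_frame_from_options.
# The path below is illustrative, not taken from the lab.
raw_from_s3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://yourname-analytics-workshop-bucket/data/raw/"]},
    format="json"
)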

Execute Code 🔻

In [4]:

raw_data = glueContext.create_dynamic_frame.from_catalog(database="analyticsworkshopdb", table_name="raw")

reference_data = glueContext.create_dynamic_frame.from_catalog(database="analyticsworkshopdb", table_name="reference_data")

View schema
In this step we view the schema of the dynamic frame
printSchema() – Prints the schema of the underlying DataFrame.

Execute Code 🔻

In [5]:

raw_data.printSchema()

root
|-- uuid: string
|-- device_ts: string
|-- device_id: int
|-- device_temp: int
|-- track_id: int
|-- activity_type: string
|-- partition_0: string
|-- partition_1: string
|-- partition_2: string
|-- partition_3: string

In [6]:

reference_data.printSchema()

root
|-- track_id: string
|-- track_name: string
|-- artist_name: string

Count records
In this step we will count the number of records in the dataframe
count() – Returns the number of rows in the underlying DataFrame

Execute Code 🔻

In [7]:

print(f'raw_data (count) = {raw_data.count()}')
print(f'reference_data (count) = {reference_data.count()}')

raw_data (count) = 751500
reference_data (count) = 100

Show sample raw records


Use the show() method to display a sample of records in the frame.
Here we are showing the top 5 records in the DataFrame.

Execute Code 🔻

In [8]:

raw_data.toDF().show(5)

+--------------------+--------------------+---------+-----------+--------+-------------+-----------+-----------+-----------+-----------+
|                uuid|           device_ts|device_id|device_temp|track_id|activity_type|partition_0|partition_1|partition_2|partition_3|
+--------------------+--------------------+---------+-----------+--------+-------------+-----------+-----------+-----------+-----------+
|61dc201f-8842-4f2...|2022-12-11 16:00:...|       10|         28|      15|    Traveling|       2022|         12|         11|         16|
|29dc7f74-7745-48a...|2022-12-11 16:00:...|       30|         28|      30|    Traveling|       2022|         12|         11|         16|
|657528bd-5c85-433...|2022-12-11 16:00:...|       49|         40|      25|    Traveling|       2022|         12|         11|         16|
|1fafd84a-99ef-4fc...|2022-12-11 16:00:...|       41|         28|      22|    Traveling|       2022|         12|         11|         16|
|ea336a3b-167d-469...|2022-12-11 16:00:...|       41|         28|      15|    Traveling|       2022|         12|         11|         16|
+--------------------+--------------------+---------+-----------+--------+-------------+-----------+-----------+-----------+-----------+
only showing top 5 rows

Define Transformation Functions


You can define attribute-level transformation functions (load_time_fn here). load_time_fn combines the partition column values into a single attribute, "load_time", an integer in YYYYMMDDHH24 format.
Call all attribute-level transformation functions for each record of the dynamic frame from a record-level transformation function (transformRec here); an illustrative call follows the two cells below.

Execute Code 🔻

In [9]:

def load_time_fn(partition_0, partition_1, partition_2, partition_3):
    x = partition_0 + partition_1 + partition_2 + partition_3
    x = int(x)
    return x

In [10]:

def transformRec(rec):
    rec["load_time"] = load_time_fn(rec["partition_0"], rec["partition_1"], rec["partition_2"], rec["partition_3"])
    return rec
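As a quick illustration (this call is not a cell from the original notebook), the partition values arrive as strings, so load_time_fn concatenates them and casts the result to an integer:

# Illustrative check, using the string partition values seen in the sample above
load_time_fn("2022", "12", "11", "16")  # -> 2022121116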

Apply Transformations
Apply all transformations and store the result in a new dynamic frame, "raw_data_x".

Read more about AWS Glue transforms here: https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html

Execute Code 🔻

In [11]:

raw_data_x = Map.apply(frame=raw_data, f=transformRec)

Show sample transformed raw records


Use the show() method to display a sample of records in the frame.
Here we are showing the top 5 records in the DataFrame.

Read more about AWS Glue transforms here: https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html

Execute Code 🔻

In [12]:

raw_data_x.toDF().show(5)

+--------+----------+-----------+-------------+-----------+-----------+-----------+--------------------+---------+-----------+--------------------+
|track_id| load_time|partition_2|activity_type|partition_1|device_temp|partition_3|           device_ts|device_id|partition_0|                uuid|
+--------+----------+-----------+-------------+-----------+-----------+-----------+--------------------+---------+-----------+--------------------+
|      15|2022121116|         11|    Traveling|         12|         28|         16|2022-12-11 16:00:...|       10|       2022|61dc201f-8842-4f2...|
|      30|2022121116|         11|    Traveling|         12|         28|         16|2022-12-11 16:00:...|       30|       2022|29dc7f74-7745-48a...|
|      25|2022121116|         11|    Traveling|         12|         40|         16|2022-12-11 16:00:...|       49|       2022|657528bd-5c85-433...|
|      22|2022121116|         11|    Traveling|         12|         28|         16|2022-12-11 16:00:...|       41|       2022|1fafd84a-99ef-4fc...|
|      15|2022121116|         11|    Traveling|         12|         28|         16|2022-12-11 16:00:...|       41|       2022|ea336a3b-167d-469...|
+--------+----------+-----------+-------------+-----------+-----------+-----------+--------------------+---------+-----------+--------------------+
only showing top 5 rows

Drop fields
Once "load_time" attribute is generated, we will drop original partition columns using "drop_fields" method.
These were generated by firehose for placing the files in yyyy/mm/dd/hh directory structure in S3
We will use Glue's in-built DropFields transform to drop partition columns

Read more about AWS Glue transforms here: https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html

Execute Code 🔻

In [13]:

raw_data_clean = raw_data_x.drop_fields(['partition_0', 'partition_1', 'partition_2', 'partition_3'])
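For reference, here is a minimal equivalent sketch using the built-in DropFields transform class mentioned above (raw_data_clean_alt is a hypothetical name; run one form or the other, not both):

# Equivalent sketch: DropFields is already in scope via 'from awsglue.transforms import *'
raw_data_clean_alt = DropFields.apply(
    frame=raw_data_x,
    paths=['partition_0', 'partition_1', 'partition_2', 'partition_3']
)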

Show sample transformed cleaned raw records


Use the show() method to display a sample of records in the frame.
Here we are showing the top 5 records in the DataFrame.

Execute Code 🔻

In [14]:

raw_data_clean.toDF().show(5)

+--------+----------+-------------+-----------+--------------------+---------+--------------------+
|track_id| load_time|activity_type|device_temp| device_ts|device_id| uuid|
+--------+----------+-------------+-----------+--------------------+---------+--------------------+
| 15|2022121116| Traveling| 28|2022-12-11 16:00:...| 10|61dc201f-8842-4f2...|
| 30|2022121116| Traveling| 28|2022-12-11 16:00:...| 30|29dc7f74-7745-48a...|
| 25|2022121116| Traveling| 40|2022-12-11 16:00:...| 49|657528bd-5c85-433...|
| 22|2022121116| Traveling| 28|2022-12-11 16:00:...| 41|1fafd84a-99ef-4fc...|
| 15|2022121116| Traveling| 28|2022-12-11 16:00:...| 41|ea336a3b-167d-469...|
+--------+----------+-------------+-----------+--------------------+---------+--------------------+
only showing top 5 rows

Redshift Connection Parameters


We will use "analytics_workshop" Glue connection to connect to Redshift cluster.
We will create connection option for raw table consisting of schema name, table name and database name.
We will create a temp output directory for Glue to use as a staging area for loading data into Redshift.

Execute Code 🔻

In [15]:

connection_options_raw = {
    "dbtable": "redshift_lab.f_raw_1",
    "database": "dev"
}

output_dir_tmp = "s3://yourname-analytics-workshop-bucket/data"
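As an aside (not part of the lab), the Redshift connection options can also carry SQL to run around the load via "preactions" / "postactions"; for example, a hypothetical variant that truncates the target table before each run:

# Hypothetical variant of the options above; "preactions" runs SQL in Redshift
# before the load (and "postactions" after it)
connection_options_raw_truncate = {
    "dbtable": "redshift_lab.f_raw_1",
    "database": "dev",
    "preactions": "TRUNCATE TABLE redshift_lab.f_raw_1;"
}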

Cast columns into desired format


We will explicitly cast all columns into the desired data types.
If we don't perform this step, Redshift will create additional columns on a type mismatch and then load the data. For example, "device_ts" is defined as a timestamp in the Redshift raw table DDL. If we don't cast this column from string to timestamp, a new column named "device_ts_string" will be created in the Redshift "f_raw_1" table holding the device_ts values, while the original "device_ts" column, defined as timestamp, will stay blank.

Execute Code 🔻

In [16]:

raw_data_clean = ApplyMapping.apply(
    frame=raw_data_clean,
    mappings=[
        ("uuid", "string", "uuid", "string"),
        ("device_ts", "string", "device_ts", "timestamp"),
        ("device_id", "int", "device_id", "int"),
        ("device_temp", "int", "device_temp", "int"),
        ("track_id", "int", "track_id", "int"),
        ("activity_type", "string", "activity_type", "string"),
        ("load_time", "int", "load_time", "int")
    ]
)

Load raw data in Redshift


Finally, we will load the cleaned raw data dynamic frame into the Redshift table "redshift_lab.f_raw_1".
We will use the Glue DynamicFrameWriter class to perform this action.

Read more about the AWS Glue dynamic frame writer here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.html

Execute Code 🔻

In [*]:

try:
    print("INFO: Loading raw data into Amazon Redshift")
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=raw_data_clean,
        catalog_connection="analytics_workshop",
        connection_options=connection_options_raw,
        redshift_tmp_dir=output_dir_tmp + "/tmp/"
    )
    print("INFO: Raw data loading into Amazon Redshift complete")
except Exception as e:
    print(f"ERROR: An exception has occurred: {str(e)}")


Redshift Connection Parameters


We will use "analytics_workshop" Glue connection to connect to Redshift cluster.
We will create connection option for raw table consisting of schema name, table name and database name.

Execute Code 🔻

In [*]:

connection_options_rd = {
    "dbtable": "redshift_lab.d_ref_data_1",
    "database": "dev"
}

Cast columns into desired format


We will explicitly cast all columns into the desired data types.
If we don't perform this step, Redshift will create additional columns on a type mismatch and then load the data. For example, "track_id" is defined as an integer in the Redshift reference table DDL. If we don't cast this column from string to int, a new column named "track_id_string" will be created in the Redshift "d_ref_data_1" table holding the track_id values, while the original "track_id" column, defined as int, will stay blank.

Execute Code 🔻

In [*]:

reference_data_clean = ApplyMapping.apply(
    frame=reference_data,
    mappings=[
        ("track_id", "string", "track_id", "int"),
        ("track_name", "string", "track_name", "string"),
        ("artist_name", "string", "artist_name", "string")
    ]
)

Show sample transformed reference records


Use the show() method to display a sample of records in the frame.
Here we are showing the top 5 records in the DataFrame.

Execute Code 🔻

In [*]:

reference_data_clean.toDF().show(5)

Load reference data in Redshift


Finally, we will load the cleaned reference data dynamic frame into the Redshift table "redshift_lab.d_ref_data_1".
We will use the Glue DynamicFrameWriter class to perform this action.

Read more about the AWS Glue dynamic frame writer here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.html

Execute Code 🔻

In [*]:

try:
    print("INFO: Loading reference data into Amazon Redshift")
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=reference_data_clean,
        catalog_connection="analytics_workshop",
        connection_options=connection_options_rd,
        redshift_tmp_dir=output_dir_tmp + "/tmp/"
    )
    print("INFO: Reference data loading into Amazon Redshift complete")
except Exception as e:
    print(f"ERROR: An exception has occurred: {str(e)}")

😎
=========================

If you wish to take this notebook and its output back home, you can download / export it:
In Jupyter's menu bar click File:
Download As: Notebook (.ipynb) (you can re-import it as a Jupyter notebook in the future)
Download As: HTML (shows code + results in an easy-to-read format)

NEXT Steps: Go back to the lab guide


=========================

