Developer Training For Apache Spark and Hadoop: Hands-On Exercises
Table of Contents
General Notes
Hands-On Exercise: Starting the Exercise Environment
Hands-On Exercise: Querying Hadoop Data with Apache Impala
Hands-On Exercise: Accessing HDFS with the Command Line and Hue
Hands-On Exercise: Running and Monitoring a YARN Job
Hands-On Exercise: Exploring DataFrames Using the Apache Spark Shell
Hands-On Exercise: Working with DataFrames and Schemas
Hands-On Exercise: Analyzing Data with DataFrame Queries
Hands-On Exercise: Working With RDDs
Hands-On Exercise: Transforming Data Using RDDs
Hands-On Exercise: Joining Data Using Pair RDDs
Hands-On Exercise: Querying Tables and Views with SQL
Hands-On Exercise: Using Datasets in Scala
Hands-On Exercise: Writing, Configuring, and Running a Spark Application
Hands-On Exercise: Exploring Query Execution
Hands-On Exercise: Persisting Data
Hands-On Exercise: Implement an Iterative Algorithm with Apache Spark
Hands-On Exercise: Writing a Streaming Application
Hands-On Exercise: Processing Multiple Batches of Streaming Data
Hands-On Exercise: Processing Streaming Apache Kafka Messages
Appendix Hands-On Exercise: Producing and Consuming Apache Kafka Messages
Appendix Hands-On Exercise: Collecting Web Server Logs with Apache Flume
Appendix Hands-On Exercise: Sending Messages from Flume to Kafka
Appendix Hands-On Exercise: Import Data from MySQL Using Apache Sqoop
Appendix: Enabling Jupyter Notebook for PySpark
Appendix: Troubleshooting Tips
Appendix: Continuing Exercises After Class
General Notes
The course materials include the following directories:
• data—contains the data files used in all the exercises. Usually you will upload the
files to Hadoop’s distributed file system (HDFS) before working with them.
• examples—contains example code and data presented in the chapter slides in the
course.
• scripts—contains the course setup scripts and other scripts required to complete
the exercises.
The dollar sign ($) at the beginning of each line indicates the Linux shell
prompt. The actual prompt will include additional information (for example,
training@localhost:~/training_materials$) but this is omitted from
these instructions for brevity.
The backslash (\) at the end of a line signifies that the command is not complete
and continues on the next line. You can enter the code exactly as shown (on multiple
lines), or you can enter it on a single line. If you do the latter, you should not type in
the backslash.
• The command-line environment defines a few environment variables that are often
used in place of longer paths in the instructions. Since each variable is automatically
replaced with its corresponding values when you run commands in the terminal, this
makes it easier and faster for you to enter a command:
◦ $DEVDATA refers to the directory containing the data files used in the exercises.
◦ $DEVSH refers to the main course materials directory, which contains the scripts
and exercises directories.
Use the echo command to see the value of an environment variable:
$ echo $DEVSH
• Graphical editors
If you prefer a graphical text editor, you can use gedit. You can start gedit using an
icon from the remote desktop tool bar. (You can also use emacs if you prefer.)
Bonus Exercises
There are additional challenges for some of the hands-on exercises. If you finish the
main exercise, please attempt the additional steps.
Catch-Up Script
If you are unable to complete an exercise, there is a script to catch you up automatically.
Each exercise has instructions for running the catch-up script if the exercise depends on
completion of prior exercises.
$ $DEVSH/scripts/catchup.sh
The script will prompt you for the exercise that you are starting; it will then set up all
the required data as if you had completed all of the previous exercises.
Note: If you run the catch-up script, you may lose your work. For example, all exercise
data will be deleted from HDFS before uploading the required files.
Troubleshooting
If you have trouble or unexpected behavior in the exercise environment, refer to the
tips in Appendix: Troubleshooting Tips.
Hands-On Exercise: Starting the Exercise Environment
Exercise Instructions
Start Your Exercise Environment
1. In your local browser, open the URL provided by your instructor to view the
exercise environment portal.
2. The environment portal page displays a thumbnail image for your exercise
environment remote host. The host will initially be powered off, indicated by a
gray background. The icon background will change to green when the machine is
running. (The machine name and version may be different in your environment than the
one shown below.)
Click the host icon to open a new window showing the remote host machine. Start
the machine by clicking the play button (triangle icon). It will take a few minutes to
start.
3. When it is fully started, the remote host’s desktop will display. These exercises
refer to this as the “remote desktop” to distinguish it from your own local machine’s
desktop.
The remainder of the exercises in the course will be performed on the remote
desktop.
4. From the desktop, select Applications > Training > Start Cloudera Cluster to start
your cluster services.
5. Wait for about five minutes after the script finishes before continuing.
6. Start the Firefox browser using the icon on the remote desktop, then click the
Cloudera Manager bookmark.
7. Log in to Cloudera Manager with the username admin and password admin.
8. Verify that all the services in your cluster are healthy (indicated by a green dot), as
shown below. You may disregard yellow configuration warning icons.
Note: Some health warnings may appear as the cluster is starting. They will typically
resolve themselves within five minutes after the Start Cluster script finishes. Please be
patient. If health issues remain, refer to the tip entitled “Cloudera Manager displays
unhealthy status of services” in Appendix: Troubleshooting Tips.
Your cluster may not be running exactly the services shown. This is okay and will
not interfere with completing the exercises.
Optional: Download and View the Exercise Manual on Your Remote Desktop
In order to be able to copy and paste from the Exercise Manual (the document you are
currently viewing) to your remote desktop, you need to view the document on your
remote machine rather than on your local machine.
a. On your remote desktop, start the Firefox browser using the shortcut icon in
the menu bar. The default page will display the Cloudera University home page.
(You can return to this page at any time by clicking the home icon in Firefox or
by visiting https://university.cloudera.com/user/learning/enrollments.)
c. Select the course title, then click to download the Exercise Manual under
Materials. This will save the Exercise Manual PDF file in the Downloads folder
in the training user’s home directory.
a. Open a terminal window using the shortcut on the remote desktop, and then
start the Evince PDF viewer:
$ evince &
b. In Evince, select menu item File > Open and open the Exercise Manual PDF file
in the Downloads directory.
If your exercise environment has been shut down and you need to restart it:
a. In your local browser, return to the exercise environment portal using the URL
provided by your instructor.
Click the remote host thumbnail to open the remote desktop, then click the play
button to start the host.
b. From the desktop, select Applications > Training > Start Cloudera Cluster to
restart your cluster services.
c. Wait about five minutes after the script has completed, then verify that the
services in your cluster are running correctly by following the steps in
Verify Your Cluster Services.
Hands-On Exercise: Querying Hadoop Data with Apache Impala
In this exercise, you will use the Hue Impala Query Editor to explore data in a
Hadoop cluster.
This exercise is intended to let you begin to familiarize yourself with the course
exercise environment as well as Hue. You will also briefly explore the Impala Query
Editor.
Before starting this exercise, confirm that your cluster is running correctly by following
the steps in Verify Your Cluster Services.
Log in to Hue
1. Start Firefox on the remote desktop using the shortcut provided on the main menu
bar at the top of the screen.
3. Because this is the first time anyone has logged in to Hue on this server, you will be
prompted to create a new user account. Enter username training and password
training, and then click Create account. When prompted, click Remember.
Note: Make sure to use this exact username and password. Your exercise environment
is configured with a system user called training and your Hue username must
match. If you accidentally use the wrong username, refer to the instructions in the
Appendix: Troubleshooting Tips section at the end of the exercises.
4. The first time you log in to Hue, you may see a welcome message that offers a
tour of new features in the latest version of Hue. The tour is optional. However, it
is very short and you might find it valuable. To start the tour, click Next. To skip it,
close the welcome popup using the X in the upper right corner.
5. If a different data source is shown, click the SQL data source symbol, then use the
back arrow symbol in the navigation panel to navigate to the Impala > default data
source.
6. In the left panel under the default database, select the accounts table. This will
display the table’s column definitions.
7. Hover your pointer over the accounts table to reveal the associated Show details
icon (labeled i), as shown below, then click the icon to bring up the details popup.
8. Select the Sample tab. The tab will display the first several rows of data in the table.
When you are done viewing the data, click the X in the upper right corner of the
popup to close it.
10. In the query editor text box, enter a SQL query like the one below:
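A simple query against the accounts table works here; for example (any similar
SELECT is fine):
SELECT * FROM accounts LIMIT 10;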
11. Click the Execute button (labeled as a blue “play” symbol: ) to execute the
command.
12. View the returned data in the Results tab below the query area.
13. Optional: If you have extra time, continue exploring the Impala Query Editor on
your own. For instance, try selecting other tabs after viewing the results.
Hands-On Exercise: Accessing HDFS with the Command Line and Hue
In this exercise, you will practice working with HDFS, the Hadoop Distributed File
System.
You will use the HDFS command line tool and the Hue File Browser web-based interface
to manipulate files in HDFS.
1. Open a terminal window using the shortcut on the remote desktop menu bar.
2. In the new terminal session, use the HDFS command line to list the contents of the
HDFS root directory.
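The standard listing command is:
$ hdfs dfs -ls /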
There will be multiple entries, one of which is /user. Each user has a “home”
directory under this directory, named after their username; your username in this
course is training, therefore your home directory is /user/training.
Relative Paths
In HDFS, relative (non-absolute) paths are considered relative
to your home directory. There is no concept of a “current” or
“working” directory as there is in Linux and similar filesystems.
Note that the directory structure in HDFS is unrelated to the directory structure of
the local filesystem on the remote host; they are completely separate namespaces.
5. Start by creating a new top-level directory for exercises. You will use this directory
throughout the rest of the course.
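A sketch of the command, using the /loudacre directory name that the remaining
exercises assume:
$ hdfs dfs -mkdir /loudacre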
6. Change directories to the Linux local filesystem directory containing the sample
data we will be using in the course.
$ cd $DEVDATA
If you perform a regular Linux ls command in this directory, you will see several
files and directories that will be used in this course. One of the data directories
is kb. This directory holds Knowledge Base articles that are part of Loudacre’s
customer service website.
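The upload can be done with hdfs dfs -put; a sketch, assuming you are still in the
$DEVDATA directory:
$ hdfs dfs -put kb /loudacre/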
This copies the local kb directory and its contents into a remote HDFS directory
named /loudacre/kb.
You should see the KB articles that were in the local directory.
9. Practice uploading a directory, confirm the upload, and then remove it, as it is not
actually needed for the exercises.
This prints the first 20 lines of the article to your terminal. This command is handy
for viewing text data in HDFS. An individual file is often very large, making it
inconvenient to view the entire file in the terminal. For this reason, it is often a good
idea to pipe the output of the dfs -cat command into head, more, or less. You
can also use hdfs dfs -tail to more efficiently view the end of the file, rather
than piping the whole content.
11. To download a file to work with on the local filesystem use the
hdfs dfs -get command. This command takes two arguments: an HDFS path
and a local Linux path. It copies the HDFS contents into the local filesystem:
Enter the letter q to quit the less command after reviewing the downloaded file.
$ hdfs dfs
You see a help message describing all the filesystem commands provided by HDFS.
Try experimenting with a few of these commands if you like.
Use the Hue File Browser to Browse, View, and Manage Files
13. In the browser running on your remote desktop, visit Hue by clicking the Hue
bookmark, or going to URL http://master-1:8889/.
If your prior session has expired, log in again using the login credentials you created
earlier: username training and password training.
14. To access HDFS, open the triple bar menu in the upper left, then select Browsers >
Files.
15. By default, the contents of your HDFS home directory (/user/training) are
displayed. In the directory path name, click the leading slash (/) to view the HDFS
root directory.
16. The contents of the root directory are displayed, including the loudacre directory
you created earlier. Click that directory to see the contents.
17. Click the name of the kb directory to see the Knowledge Base articles you uploaded.
18. Click the checkbox next to any of the files, and then click the Actions button to see a
list of actions that can be performed on the selected file(s).
19. View the contents of one of the files by clicking on the name of the file.
• Note: In the file viewer, the contents of the file are displayed on the right. In
this case, the file is fairly small, but typical files in HDFS are very large, so rather
than displaying the entire contents on one screen, Hue provides buttons to move
between pages.
20. Return to the directory view by clicking View file location in the Actions panel on
the left of the file contents.
a. Click the Upload button on the right. You can choose to upload a plain file, or to
upload a zipped file (which will automatically be unzipped after upload). In this
case, select Files.
c. Browse to /home/training/training_materials/devsh/data,
choose base_stations.parquet, and click the Open button.
d. Confirm that the file was correctly uploaded into the current directory.
23. Optional: Explore the various file actions available. When you have finished, select
any additional files you have uploaded and click the Move to trash button to
delete. (Do not delete base_stations.parquet; that file will be used in later
exercises.)
Hands-On Exercise: Running and Monitoring a YARN Job
In this exercise, you will submit an application to the YARN cluster, and monitor
the application using both the Hue Job Browser and the YARN Web UI.
The application you will run is provided for you. It is a simple Spark application
written in Python that counts the occurrence of words in Loudacre’s customer service
Knowledge Base (which you uploaded in a previous exercise). The focus of this exercise
is not on what the application does, but on how YARN distributes tasks in a job across a
cluster, and how to monitor an application and view its log files.
Important: This exercise depends on a previous exercise: “Access HDFS with the
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Take note of the values in the Cluster Metrics section, which displays information
such as the number of applications running currently, previously run, or waiting to
run; the amount of memory used and available; and how many worker nodes are in
the cluster.
3. Click the Nodes link in the Cluster menu on the left. The bottom section will display
a list of worker nodes in the cluster.
7. In your terminal, run the example wordcount.py program on the YARN cluster to
count the frequency of words in the Knowledge Base file set:
$ spark2-submit \
$DEVSH/exercises/yarn/wordcount.py /loudacre/kb/*
9. The Job Browser displays a list of currently running and recently completed
applications. (If you do not see the application you just started, wait a few seconds
for the page to automatically reload; it can take some time for the application to be
accepted and start running.) Review the entry for the current job.
This page allows you to click the application ID to see details of the running
application, or to kill a running job. (Do not kill the job now, though!)
10. Reload the YARN RM page in Firefox. Notice that the application you just started is
displayed in the list of applications in the bottom section of the RM home page.
12. Select the node HTTP address link for localhost.localdomain to open the Node
Manager UI.
13. Now that an application is running, you can click List of Applications to see the
application you submitted.
This will display the containers the Resource Manager has allocated on the selected
node for the current application. (No containers will show if no applications are
running; if you missed it because the application completed, you can run the
application again. In the terminal window, use the up arrow key to recall previous
commands.)
Tip: Resize the terminal window to be as wide as possible to make it easier to read
the command output.
If your application is still running, you should see it listed, including the application
ID (such as application_1469799128160_0001), the application name
(PythonWordCount), the type (SPARK), and so on.
If there are no applications on the list, your application has probably finished
running. By default, only current applications are included. Use the -appStates
ALL option to include all applications in the list:
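The standard form of the command is:
$ yarn application -list -appStates ALL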
2. Log into Cloudera Manager with the username admin and password admin.
3. On the Cloudera Manager home page, open the Clusters menu and select YARN
Applications.
Applications that are currently running or have recently run are shown. Confirm
that the application you ran above is displayed in the list. (If your application has
completed, you can restart it to explore the CM Applications manager working with
a running application.)
Hands-On Exercise: Exploring DataFrames Using the Apache Spark Shell
In this exercise, you will use the Spark shell to work with DataFrames.
You will start by viewing and bookmarking the Spark documentation in your browser.
Then you will start the Spark shell and read a simple JSON file into a DataFrame.
Important: This exercise depends on a previous exercise: “Access HDFS with Command
Line and Hue.” If you did not complete that exercise, run the course catch-up script and
advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. From the Programming Guides menu, select the DataFrames, Datasets and SQL guide.
Briefly review the guide and bookmark the page for later review.
3. From the API Docs menu, select either Scala or Python, depending on your
language preference. Bookmark the API page for use during class. Later exercises
will refer to this documentation.
4. If you are viewing the Scala API, notice that the package names are displayed on
the left. Use the search box or scroll down to find the org.apache.spark.sql
package. This package contains most of the classes and objects you will be working
with in this course. In particular, note the Dataset class. Although this exercise
focuses on DataFrames, remember that DataFrames are simply an alias for Datasets
of Row objects. So all the DataFrame operations you will practice using in this
exercise are documented on the Dataset class.
5. If you are viewing the Python API, locate the pyspark.sql module. This module
contains most of the classes you will be working with in this course. At the top are
some of the key classes in the module. View the API for the DataFrame class; these
are the operations you will practice using in this exercise.
7. In the terminal window, start the Spark 2 shell. Start either the Python shell or the
Scala shell, not both.
To start the Python shell, use the pyspark2 command:
$ pyspark2
To start the Scala shell, use the spark2-shell command:
$ spark2-shell
You may get several WARN messages, which you can disregard.
8. Spark creates a SparkSession object for you called spark. Make sure the object
exists. Use the first command below if you are using Python, and the second one if
you are using Scala. (You only need to complete the exercises in Python or Scala.)
pyspark> spark
scala> spark
Python will display information about the spark object such as:
<pyspark.sql.session.SparkSession at address>
Scala will display similar information in a different format:
org.apache.spark.sql.SparkSession =
org.apache.spark.sql.SparkSession@address
Note: In subsequent instructions, both Python and Scala commands will be shown
but not noted explicitly; Python shell commands are in blue and preceded with
pyspark>, and Scala shell commands are in red and preceded with scala>.
9. Using command completion, you can see all the available Spark session methods:
type spark. (spark followed by a dot) and then the TAB key.
Note: You can exit the Scala shell by typing sys.exit. To exit the Python shell,
press Ctrl+D or type exit. However, stay in the shell for now to complete the
remainder of this exercise.
11. Review the simple text file you will be using: $DEVDATA/devices.json. You
can view the file either in gedit, or by starting a new terminal window then using
the less command. (Do not modify the file.) This file contains records for each of
Loudacre’s supported devices. For example:
{"devnum":1,"release_dt":"2008-10-21T00:00:00.000-07:00",
"make":"Sorrento","model":"F00L","dev_type":"phone"}
Notice the field names and types of values in the first few records.
13. In the Spark shell, create a new DataFrame based on the devices.json file in
HDFS.
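A Python sketch for this step, assuming the file was uploaded to
/loudacre/devices.json (the path referenced later in these exercises):
pyspark> devDF = spark.read.json("/loudacre/devices.json")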
14. Spark has not yet read the data in the file, but it has scanned the file to infer the
schema. View the schema, and note that the column names match the record field
names in the JSON file.
pyspark> devDF.printSchema()
scala> devDF.printSchema
15. Display the data in the DataFrame using the show function. If you don’t pass an
argument to show, Spark will display the first 20 rows in the DataFrame. For this
step, display the first five rows. Note that the data is displayed in tabular form,
using the column names defined in the schema.
> devDF.show(5)
Note: Like many Spark queries, this command is the same whether you are using
Scala or Python.
16. The show and printSchema operations are actions—that is, they return a value
from the distributed DataFrame to the Spark driver. Both functions display the data
in a nicely formatted table. These functions are intended for interactive use in the
shell, but do not allow you to actually work with the data that is returned. Try using
the take action instead, which returns an array (Scala) or list (Python) of Row
objects. You can display the data by iterating through the collection.
Query a DataFrame
17. Use the count action to return the number of items in the DataFrame.
> devDF.count()
18. Create a new DataFrame called makeModelDF by selecting only the make and model
columns, then display its schema. Note that only the selected columns are in the
schema.
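A minimal Python sketch for this step:
pyspark> makeModelDF = devDF.select("make","model")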
pyspark> makeModelDF.show()
scala> makeModelDF.show
pyspark> devDF.select("devnum","make","model"). \
where("make = 'Ronin'"). \
show()
scala> devDF.select("devnum","make","model").
where("make = 'Ronin'").
show
Hands-On Exercise: Working with DataFrames and Schemas
In this exercise, you will work with structured account and mobile device data
using DataFrames.
You will practice creating and saving DataFrames using different types of data sources,
and inferring and defining schemas.
Important: This exercise depends on a previous exercise: “Exploring DataFrames Using
the Spark Shell.” If you did not complete that exercise, run the course catch-up script
and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. If you do not have one already, open a terminal, and start the Spark 2 shell (either
Scala or Python, as you prefer).
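A Python sketch that creates the accountsDF DataFrame from the Hive accounts
table, as used in the following steps:
pyspark> accountsDF = spark.read.table("accounts")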
4. Print the schema and the first few rows of the DataFrame, and note that the schema
and data are the same as the Hive table.
5. Create a new DataFrame with rows from the accounts data where the zip code is
94913, and save the result to CSV files in the /loudacre/accounts_zip94913
HDFS directory. You can do this in a single command, as shown below, or with
multiple commands.
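One possible single-command Python version (the zipcode column name follows the
accounts schema used elsewhere in these exercises):
pyspark> accountsDF.where(accountsDF.zipcode == "94913"). \
           write.option("header","true"). \
           csv("/loudacre/accounts_zip94913")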
6. Use Hue or the command line (in a separate terminal window) to view the
/loudacre/accounts_zip94913 directory in HDFS and the data in one of the
saved files. Confirm that the CSV file includes a header line, and that only records
for the selected zip code are included.
7. Optional: Try creating a new DataFrame based on the CSV files you created above.
Compare the schema of the original accountsDF and the new DataFrame. What’s
different? Try again, this time setting the inferSchema option to true and
compare again.
9. Create a new DataFrame based on the devices.json file. (This command could
take several seconds while it infers the schema.)
10. View the schema of the devDF DataFrame. Note the column names and types that
Spark inferred from the JSON file. In particular, note that the release_dt column
is of type string, whereas the data in the column actually represents a timestamp.
11. Define a schema that correctly specifies the column types for this DataFrame. Start
by importing the package with the definitions of necessary classes and types.
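In Python, the classes used below are in the pyspark.sql.types module:
pyspark> from pyspark.sql.types import *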
pyspark> devColumns = [
StructField("devnum",LongType()),
StructField("make",StringType()),
StructField("model",StringType()),
StructField("release_dt",TimestampType()),
StructField("dev_type",StringType())]
13. Create a schema (a StructType object) using the column definition list.
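A Python sketch, using devSchema as the (assumed) variable name:
pyspark> devSchema = StructType(devColumns)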
14. Recreate the devDF DataFrame, this time using the new schema.
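A Python sketch, reusing the devSchema variable from the previous step:
pyspark> devDF = spark.read.schema(devSchema).json("/loudacre/devices.json")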
15. View the schema and data of the new DataFrame, and confirm that the
release_dt column type is now timestamp.
16. Now that the device data uses the correct schema, write the data in Parquet format,
which automatically embeds the schema. Save the Parquet data files into an HDFS
directory called /loudacre/devices_parquet.
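A minimal Python sketch for this step:
pyspark> devDF.write.parquet("/loudacre/devices_parquet")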
17. Optional: In a separate terminal window, use parquet-tools to view the schema
of the saved files.
$ parquet-tools schema \
hdfs://master-1/loudacre/devices_parquet/
Note that the type of the release_dt column is noted as int96; this is how Spark
denotes a timestamp type in Parquet.
For more information about parquet-tools, run parquet-tools --help.
18. Create a new DataFrame using the Parquet files you saved in devices_parquet
and view its schema. Note that Spark is able to correctly infer the timestamp type
of the release_dt column from Parquet’s embedded schema.
Hands-On Exercise: Analyzing Data with DataFrame Queries
In this exercise, you will analyze account and mobile device data using DataFrame
queries.
First, you will practice using column expressions in queries. You will analyze data in
DataFrames by grouping and aggregating data, and by joining two DataFrames. Then
you will query multiple sets of data to find out how many of each mobile device model
is used in active accounts.
Important: This exercise depends on a previous exercise: “Working with DataFrames
and Schemas.” If you did not complete that exercise, run the course catch-up script and
advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Start the Spark 2 shell in a terminal if you do not already have it running.
3. Create a new DataFrame called accountsDF based on the Hive accounts table.
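A Python sketch for this step:
pyspark> accountsDF = spark.read.table("accounts")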
4. Try a simple query with select, using both column reference syntaxes.
pyspark> accountsDF. \
select(accountsDF["first_name"]).show()
pyspark> accountsDF.select(accountsDF.first_name).show()
scala> accountsDF.
select(accountsDF("first_name")).show
scala> accountsDF.select($"first_name").show
5. To explore column expressions, create a column object to work with, based on the
first_name column in the accountsDF DataFrame.
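A Python sketch, using the fnCol variable name referenced in the next step:
pyspark> fnCol = accountsDF["first_name"]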
6. Note that the object type is Column. To see available methods and attributes, use
tab completion—that is, enter fnCol. followed by TAB.
7. New Column objects are created when you perform operations on existing
columns. Create a new Column object based on a column expression that identifies
users whose first name is Lucy using the equality operator on the fnCol object you
created above.
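A Python sketch, using lucyCol as the variable name referenced below:
pyspark> lucyCol = (fnCol == "Lucy")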
8. Select the first_name and last_name columns along with the new lucyCol column,
whose value will be true or false depending on the value of the first_name column.
Confirm that users named Lucy are identified with the value true.
pyspark> accountsDF. \
select(accountsDF.first_name, \
accountsDF.last_name,lucyCol). \
show()
scala> accountsDF.
select($"first_name",$"last_name",lucyCol).show
> accountsDF.where(lucyCol).show(5)
10. Column expressions do not need to be assigned to a variable. Try the same query
without using the lucyCol variable.
11. Column expressions are not limited to where operations like those above. They can
be used in any transformation for which a simple column could be used, such as a
select. Try selecting the city and state columns, and the first three characters
of the phone_number column (in the U.S., the first three digits of a phone number
are known as the area code). Use the substr operator on the phone_number
column to extract the area code.
pyspark> accountsDF. \
select("city", "state", \
accountsDF.phone_number.substr(1,3)). \
show(5)
scala> accountsDF.
select($"city", $"state",
$"phone_number".substr(1,3)).
show(5)
12. Notice that in the last step, the values returned by the query were correct, but the
column name was substring(phone_number, 1, 3), which is long and
hard to work with. Repeat the same query, using the alias operator to rename that
column as area_code.
pyspark> accountsDF. \
select("city", "state", \
accountsDF.phone_number. \
substr(1,3).alias("area_code")). \
show(5)
scala> accountsDF.
select($"city", $"state",
$"phone_number".substr(1,3).alias("area_code")).
show(5)
13. Perform a query that results in a DataFrame with just first_name and
last_name columns, and only includes users whose first and last names both
begin with the same two letters. (For example, the user Robert Roget would be
included, because both his first and last names begin with “Ro”.)
pyspark> accountsDF.groupBy("last_name").count().show(5)
scala> accountsDF.groupBy("last_name").count.show(5)
15. You can also group by multiple columns. Query accountsDF again, this time
counting the number of people who share the same last and first name.
pyspark> accountsDF. \
groupBy("last_name","first_name").count().show(5)
scala> accountsDF.
groupBy("last_name","first_name").count.show(5)
16. In a separate terminal window, use parquet-tools to review the schema and some
sample data of the base stations file in HDFS:
$ parquet-tools schema \
hdfs://master-1/loudacre/base_stations.parquet
$ parquet-tools head \
hdfs://master-1/loudacre/base_stations.parquet
17. In your Spark shell, create a new DataFrame called baseDF using the base stations
data. Review the baseDF schema and data to ensure it matches the data in the
Parquet file.
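A Python sketch, reading the Parquet file examined above:
pyspark> baseDF = spark.read.parquet("/loudacre/base_stations.parquet")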
18. Some account holders live in zip codes that have a base station. Join baseDF and
accountsDF to find those users, and for each, include their account ID, first name,
last name, and the ID and location data for the base station in their zip code.
pyspark> accountsDF. \
select("acct_num","first_name","last_name","zipcode"). \
join(baseDF, baseDF.zip == accountsDF.zipcode). \
show()
scala> accountsDF.
select("acct_num","first_name","last_name","zipcode").
join(baseDF,$"zip" === $"zipcode").show()
21. Use the account device data and the DataFrames you created previously in this
exercise to find the total number of each device model across all active accounts
(that is, accounts that have not been closed). The new DataFrame should be sorted
from most to least common model. Save the data as Parquet files in a directory
called /loudacre/top_devices with the following columns:
Column Name   Description                                                Example Value
device_id     The ID number of each known device (including those       18
              that might not be in use by any account)
make          The manufacturer name for the device                      Ronin
model         The model name for the device                             Novelty Note 2
active_num    The total number of the model used by active accounts     2092
Hints:
• Active accounts are those with a null value for acct_close_dt (account close
date) in the accounts table.
• The device_id column in the device accounts data corresponds to the devnum
column in the list of known devices in the /loudacre/devices.json file.
Hands-On Exercise: Working With RDDs
In this exercise, you will use the Spark shell to work with RDDs.
You will start reading a simple text file into a Resilient Distributed Dataset (RDD) and
displaying the contents. You will then create two new RDDs and use transformations to
union them and remove duplicates.
Important: This exercise depends on a previous exercise: “Accessing HDFS with
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
3. In a terminal window on your remote desktop, upload the text file to HDFS
directory /loudacre.
4. In the Spark shell, define an RDD based on the frostroad.txt text file.
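A Python sketch, assuming the file was uploaded to /loudacre/frostroad.txt in the
previous step:
pyspark> myRDD = sc.textFile("/loudacre/frostroad.txt")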
5. Using command completion, you can see all the available transformations and
actions you can perform on an RDD. Type myRDD. and then the TAB key.
6. Spark has not yet read the file. It will not do so until you perform an action on the
RDD. Try counting the number of elements in the RDD using the count action:
pyspark> myRDD.count()
scala> myRDD.count
The count operation causes the RDD to be materialized (created and populated).
The number of lines (23) should be displayed, for example:
Out[2]: 23 (Python) or
res1: Long = 23 (Scala)
7. Call the collect operation to return all data in the RDD to the Spark driver. Take
note of the type of the return value; in Python it will be a list of strings, and in Scala
it will be an array of strings.
Note: collect returns the entire set of data. This is convenient for very small
RDDs like this one, but be careful using collect for more typical large sets of data.
8. Display the contents of the collected data by looping through the collection.
12. Display the contents of the makes1RDD data using collect and then looping
through returned collection.
13. Repeat the previous steps to create and display an RDD called makes2RDD based
on the second file, /loudacre/makes2.txt.
14. Create a new RDD by appending the second RDD to the first using the union
transformation.
15. Collect and display the contents of the new allMakesRDD RDD.
17. Optional: Try performing different transformations on the RDDs you created above,
such as intersection, subtract, or zip. See the RDD API documentation for
details.
Hands-On Exercise: Transforming Data Using RDDs
Important: If you did not complete the previous exercises, run the course catch-up script
and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Copy the weblogs directory from the local filesystem to the /loudacre HDFS
directory.
3. In spark, create an RDD from the uploaded web logs data files in the /loudacre/
weblogs/ directory in HDFS.
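A Python sketch for this step:
pyspark> logsRDD = sc.textFile("/loudacre/weblogs/")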
4. Create an RDD containing only those lines that are requests for JPG files. Use the
filter operation with a transformation function that takes a string RDD element
and returns a boolean value.
pyspark> jpglogsRDD = \
logsRDD.filter(lambda line: ".jpg" in line)
5. Use take to return the first five lines of the data in jpglogsRDD. The return value
is a list of strings (in Python) or array of strings (in Scala).
scala> jpgLines.foreach(println)
7. Now try using the map transformation to define a new RDD. Start with a simple map
function that returns the length of each line in the log file. This results in an RDD of
integers.
pyspark> lineLengthsRDD = \
logsRDD.map(lambda line: len(line))
8. Loop through and display the first five elements (integers) in the RDD.
9. Calculating line lengths is not very useful. Instead, try mapping each string in
logsRDD by splitting the strings based on spaces. The result will be an RDD in
which each element is a list of strings (in Python) or an array of strings (in Scala).
Each string represents a “field” in the web log line.
pyspark> lineFieldsRDD = \
logsRDD.map(lambda line: line.split(' '))
10. Return the first five elements of lineFieldsRDD. The result will be a list of lists of
strings (in Python) or an array of arrays of strings (in Scala).
11. Display the contents of the return from take. Unlike in examples above, which
returned collections of simple values (strings and ints), this time you have a set of
compound values (arrays or lists containing strings). Therefore, to display them
properly, you will need to loop through the arrays/lists in lineFields, and then
loop through each string in the array/list. To make it easier to read the output, use
------- to separate each set of field values.
If you choose to copy and paste the Pyspark code below into the shell, it may not
automatically indent properly; be sure the indentation is correct before executing
the command.
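One possible version of that code:
pyspark> lineFields = lineFieldsRDD.take(5)
pyspark> for fields in lineFields:
             print('-------')
             for field in fields:
                 print(field)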
12. Now that you know how map works, create a new RDD containing just the IP
addresses from each line in the log file. (The IP address is the first space-delimited
field in each line.)
pyspark> ipsRDD = \
logsRDD.map(lambda line: line.split(' ')[0])
pyspark> for ip in ipsRDD.take(5): print ip
pyspark> ipsRDD.saveAsTextFile("/loudacre/iplist")
scala> ipsRDD.saveAsTextFile("/loudacre/iplist")
• Note: If you re-run this command, you will not be able to save to the same
directory because it already exists. Be sure to first delete the directory using
either the hdfs command (in a separate terminal window) or the Hue file
browser.
14. In a separate terminal window or the Hue file browser, list the contents of the
/loudacre/iplist folder. Review the contents of one of the files to confirm that
they were created correctly.
165.32.101.206,8
100.219.90.44,102
182.4.148.56,173
246.241.6.175,45395
175.223.172.207,4115
…
16. Now that the data is in CSV format, it can easily be used by Spark SQL. Load the new
CSV files in /loudacre/userips_csv created above into a DataFrame, then
view the data and schema.
18. Determine which delimiter to use (the 20th character—position 19—is the first use
of the delimiter).
19. Filter out any records which do not parse correctly (hint: each record should have
exactly 14 values).
20. Extract the date (first field), model (second field), device ID (third field), and
latitude and longitude (13th and 14th fields respectively).
21. The second field contains the device manufacturer and model name (such as Ronin
S2). Split this field by spaces to separate the manufacturer from the model (for
example, manufacturer Ronin, model S2). Keep just the manufacturer name.
23. Confirm that the data in the file(s) was saved correctly. The lines in the file should
all look similar to this, with all fields delimited by commas.
2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-
b500-28f2342679af,33.6894754264,-117.543308253
24. Review the data on the local Linux filesystem in the directory $DEVDATA/
activations. Each XML file contains data for all the devices activated by
customers during a specific month.
Sample input data:
<activations>
<activation timestamp="1225499258" type="phone">
<account-number>316</account-number>
<device-id>
d61b6971-33e1-42f0-bb15-aa2ae3cd8680
</device-id>
<phone-number>5108307062</phone-number>
<model>iFruit 1</model>
</activation>
…
</activations>
Follow the steps below to write code to go through a set of activation XML files and
extract the account number and device model for each activation, and save the list to a
file as account_number:model.
The output will look something like:
1234:iFruit 1
987:Sorrento F00L
4566:iFruit 1
…
26. Start with the ActivationModels stub script in the bonus exercise directory:
$DEVSH/exercises/transform-rdds/bonus-xml. (Stubs are provided for
Scala and Python; use whichever language you prefer.) Note that for convenience
you have been provided with functions to parse the XML, as that is not the focus of
this exercise. Copy the stub code into the Spark shell of your choice.
27. Use wholeTextFiles to create an RDD from the activations dataset. The
resulting RDD will consist of tuples, in which the first value is the name of the file,
and the second value is the contents of the file (XML) as a string.
28. Each XML file can contain many activation records; use flatMap to map the
contents of each file to a collection of XML records by calling the provided
getActivations function. getActivations takes an XML string, parses it, and
returns a collection of XML records; flatMap maps each record to a separate RDD
element.
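A Python sketch of these two steps (the HDFS path and variable names are
assumptions; getActivations is the provided helper):
pyspark> actXmlRDD = sc.wholeTextFiles("/loudacre/activations")
pyspark> activationsRDD = actXmlRDD.flatMap(lambda pair: getActivations(pair[1]))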
30. Save the formatted strings to a text file in the directory /loudacre/account-
models.
Hands-On Exercise: Joining Data Using Pair RDDs
In this exercise, you will explore the Loudacre web server log files, as well as the
Loudacre user account data, using key-value pair RDDs.
Important: This exercise depends on a previous exercise: “Transforming Data Using
RDDs.” If you did not complete that exercise, run the course catch-up script and advance
to the current exercise:
$ $DEVSH/scripts/catchup.sh
1. Using map-reduce logic, count the number of requests from each user.
a. Use map to create a pair RDD with the user ID as the key and the integer 1
as the value. (The user ID is the third field in each line.) Your data will look
something like this:
(userid,1)
(userid,1)
(userid,1)
…
b. Use reduceByKey to sum the values for each user ID. Your RDD data will be
similar to this:
(userid,5)
(userid,7)
(userid,2)
…
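A Python sketch of steps a and b combined, using the weblogs data uploaded earlier:
pyspark> userReqsRDD = sc.textFile("/loudacre/weblogs/"). \
             map(lambda line: (line.split(' ')[2], 1)). \
             reduceByKey(lambda v1, v2: v1 + v2)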
2. Use countByKey to determine how many users visited the site for each frequency.
That is, how many users visited once, twice, three times, and so on.
3. Create an RDD where the user ID is the key, and the value is the list of all the IP
addresses that user has connected from. (IP address is the first field in each request
line.)
(userid,[20.1.34.55, 74.125.239.98])
(userid,[75.175.32.10, 245.33.1.1, 66.79.233.99])
(userid,[65.50.196.141])
…
The accounts data is in CSV format; the first field in each record is the user (account)
ID, which corresponds to the user ID in the web server logs. The other fields include
account details such as creation date, first and last name, and so on.
4. Join the accounts data with the weblog data to produce a dataset keyed by user ID
which contains the user account information and the number of website hits for
that user.
b. Join the pair RDD with the set of user-id/hit-count pairs calculated in the first
step.
(9012,([9012,2008-11-24 10:04:08,\N,Cheryl,West, 4905 Olive
Street,San Francisco,CA,…],4))
(2312,([2312,2008-11-23 14:05:07,\N,Elizabeth,Kerns, 4703
Eva Pearl Street,Richmond,CA,…],8))
(1195,([1195,2008-11-02 17:12:12,2013-07-18
16:42:36,Melissa, Roman,3539 James Martin
Circle,Oakland,CA,…],1))
…
c. Display the user ID, hit count, and first name (4th value) and last name (5th
value) for the first five elements. The output should look similar to this:
Bonus Exercises
If you have more time, attempt the following extra bonus exercises:
1. Use keyBy to create an RDD of account data with the postal code (9th field in the
CSV file) as the key.
Tip: Assign this new RDD to a variable for use in the next bonus exercise.
2. Create a pair RDD with postal code as the key and a list of names (Last Name,First
Name) in that postal code as the value.
• Hint: First name and last name are the 4th and 5th fields, respectively.
--- 85003
Jenkins,Thad
Rick,Edward
Lindsay,Ivy
…
--- 85004
Morris,Eric
Reiser,Hazel
Gregg,Alicia
Preston,Elizabeth
…
Hands-On Exercise: Querying Tables and Views with SQL
In this exercise, you will use the Catalog API to explore Hive tables, and create
DataFrames by executing SQL queries.
Use the Catalog API to list the tables in the default Hive database, and view the schema
of the accounts table. Perform queries on the accounts table, and review the resulting
DataFrames. Create a temporary view based on the accountdevice CSV files, and use
SQL to join that table with the accounts table.
Important: This exercise depends on a previous exercise: “Analyzing Data with
DataFrame Queries.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
scala> spark.catalog.listTables.show
scala> spark.catalog.listColumns("accounts").show
3. Create a new DataFrame based on the accounts table, and confirm that its schema
matches that of the column list above.
5. Optional: Perform the equivalent query using the DataFrame API, and compare the
schema and data in the results to those of the query above.
8. Confirm the view was created correctly by listing the tables and views in the
default database as you did earlier. Notice that the account_dev table type is
TEMPORARY.
9. Using a SQL query, create a new DataFrame based on the first five rows of the
account_dev table, and display the results.
11. Save nameDevDF as a table called name_dev (with the file path as /loudacre/
name_dev).
12. Use the Catalog API to confirm that the table was created correctly with the right
schema.
13. Optional: If you are familiar with using Hive or Impala, verify that the name_dev
table now exists in the Hive metastore. If you use Impala, be sure to invalidate
Impala’s local store of the metastore using the INVALIDATE METADATA command
or the refresh icon in the Hue Impala Query Editor.
14. Optional: Exit and restart the shell and confirm that the temporary view is no longer
available.
Hands-On Exercise: Using Datasets in Scala
In this exercise, you will explore Datasets using web log data.
Create an RDD of account ID/IP address pairs, and then create a new Dataset of
products (case class objects) based on that RDD. Compare the results of typed and
untyped transformations to better understand the relationship between DataFrames
and Datasets.
Note: These exercises are in Scala only, because Datasets are not defined in Python.
Important: This exercise depends on a previous exercise: “Transforming Data Using
RDDs.” If you did not complete that exercise, run the course catch-up script and advance
to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Create an RDD of AccountIP objects by using the web log data in /loudacre/
weblogs. Split the data by spaces and use the first field as IP address and the third
as account ID.
6. Save the accountIPDS Dataset as a Parquet file, then read the file back into
a DataFrame. Note that the type of the original Dataset (AccountIP) is not
preserved, but the types of the columns are.
Bonus Exercises
1. Try creating a new Dataset of AccountIP objects based on the DataFrame you
created above.
2. Create a view on the accountIPDS Dataset, and perform a SQL query on the view.
What is the return type of the SQL query? Were column types preserved?
Hands-On Exercise: Writing, Configuring, and Running a Spark Application
In this exercise, you will write your own Spark application instead of using the
interactive Spark shell application.
Write a simple Spark application that takes a single argument, a state code (such as CA).
The program should read the data from the accounts Hive table and save the rows
whose state column value matches the specified state code. Write the results to
/loudacre/accounts_by_state/state-code (such as accounts_by_state/CA).
Depending on which programming language you are using, follow the appropriate set of
instructions below to write a Spark program.
Important: This exercise depends on a previous exercise: “Accessing HDFS with the
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. A simple stub file to get you started has been provided in the exercise directory:
$DEVSH/exercises/spark-application/python-stubs/accounts-
by-state.py. This stub imports the required Spark classes and sets up your main
code block. Open the stub file in an editor.
spark = SparkSession.builder.getOrCreate()
4. In the body of the program, load the accounts Hive table into a DataFrame. Select
accounts where the state column value matches the string provided as the first
argument to the application. Save the results to a directory called /loudacre/
accounts_by_state/state-code (where state-code is a string such
as CA.) Use overwrite mode when saving the file so that you can re-run the
application without needing to delete the directory.
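A minimal sketch of the application body described above, assuming the state code
arrives as the first command-line argument:
import sys

# state code passed on the command line (for example, CA)
state = sys.argv[1]

# load the Hive table, filter by state, and save (Parquet is the default format)
accountsDF = spark.read.table("accounts")
accountsDF.where(accountsDF.state == state). \
    write.mode("overwrite"). \
    save("/loudacre/accounts_by_state/" + state)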
spark.stop()
6. If you have a Spark shell running in any terminal, exit the shell before running your
application.
$ cd $DEVSH/exercises/spark-application/
$ spark2-submit python-stubs/accounts-by-state.py CA
8. After the program completes, use parquet-tools to verify that the file contents
are correct. For example, if you used the state code CA, you would use the command
below:
$ parquet-tools head \
hdfs://master-1/loudacre/accounts_by_state/CA
9. Skip the section below on writing and running a Spark application in Scala and
continue with View the Spark Application UI.
10. In a terminal window, change to the project directory. Be sure to enter the
command line shown below as a single line.
$ cd \
$DEVSH/exercises/spark-application/accounts-by-state_project
$ mvn package
Maven will begin to download the required Spark libraries. Next time you build the
project, Maven will use the libraries in its local cache. While Maven downloads the
libraries, continue with the exercise steps below.
14. In the body of the program, load the accounts Hive table into a DataFrame. Select
accounts where the state column value matches the string provided as the first
argument to the application. Save the results to a Parquet file called /loudacre/
accounts_by_state/state-code (where state-code is a string such
as CA). Use overwrite mode when saving the file so that you can re-run the
application without needing to delete the save directory.
15. At the end of the application, be sure to stop the Spark session:
spark.stop
16. Return to the terminal in which you ran Maven earlier in order to cache Spark
libraries locally. If the Maven command is still running, wait for it to finish. When
it finishes, rebuild the project, this time including the code you added above. This
time, the Maven command should take much less time.
$ mvn package
If the build is successful, Maven will generate a JAR file called accounts-by-
state-1.0.jar in the target directory.
17. If you have a Spark shell running in any terminal, exit the shell before running your
application.
18. Run the program in the compiled JAR file, passing the state code to select. For
example, to select accounts in California, use the following command:
$ spark2-submit \
--class stubs.AccountsByState \
target/accounts-by-state-1.0.jar CA
19. After the program completes, use parquet-tools to verify that the file contents
are correct. For example, if you used the state code CA, you would use the command
below:
$ parquet-tools head \
hdfs://master-1/loudacre/accounts_by_state/CA/
20. Open Firefox on your remote desktop and visit the YARN Resource Manager UI
using the provided RM bookmark (or go to URI http://master-1:8088/).
While the application is running, it appears in the list of applications something like
this:
After the application has completed, it will appear in the list like this:
To view the Spark Application UI if your application is still running, select the
ApplicationMaster link. To view the History Service UI if your application has
completed, select the History link.
22. Go back to the YARN RM UI in your browser, and confirm that the application name
was set correctly in the list of applications.
23. Follow the ApplicationMaster or History link. View the Environment tab. Take
note of the spark.* properties such as master and app.name.
24. You can set most of the common application properties using submit script
flags such as --name, but for others you need to use the --conf flag. Use --conf to
set the spark.default.parallelism property, which controls how many partitions
result after a "wide" RDD operation like reduceByKey.
25. View the application history for this application to confirm that the
spark.default.parallelism property was set correctly. (You will need to
view the YARN RM UI again to view the correct application’s history.)
27. Examine the extra output displayed when the application starts up.
a. The first section starts with Using properties file, and shows the
file name and the default property settings the application loaded from that
properties file.
• Does the list correctly include the value you set with --name?
• Which arguments (flags) have defaults set in the script and which do not?
c. Scroll down to the section that starts with System properties. This list
shows all the properties set—those loaded from the system properties file,
those you set using submit script arguments, and those you set using the conf
flag.
Bonus Exercises
1. Edit the Python or Scala application you wrote above, and use the builder function
appName to set the application name.
3. View the YARN UI to confirm that the application name was correctly set.
Hands-On Exercise: Exploring Query Execution
In this exercise, you will explore how Spark executes RDD and DataFrame/Dataset
queries.
First, you will explore RDD partitioning and lineage-based execution plans using the
Spark shell and the Spark Application UI. Then you will explore how Catalyst executes
DataFrame and Dataset queries.
Important: This exercise depends on a previous exercise: “Transforming Data Using
RDDs”. If you did not complete those exercises, run the course catch-up script and
advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. In the Spark shell, create an RDD called accountsRDD by reading the accounts
data, splitting it by commas, and keying it by account ID, which is the first field of
each line.
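A Python sketch, assuming the accounts data is available as comma-delimited text in
/loudacre/accounts (the exact path is an assumption):
pyspark> accountsRDD = sc.textFile("/loudacre/accounts"). \
             map(lambda line: line.split(',')). \
             map(lambda fields: (fields[0], fields))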
pyspark> accountsRDD.getNumPartitions()
scala> accountsRDD.getNumPartitions
scala> accountsRDD.toDebugString
6. In the browser, view the application in the YARN RM UI using the provided
bookmark (or http://master-1:8088) and click through to view the Spark
Application UI.
7. Make sure the Jobs tab is selected, and review the list of completed jobs. The most
recent job, which you triggered by calling count, should be at the top of the list.
(Note that the job description is usually based on the action that triggered the job
execution.) Confirm that the number of stages is correct, and the number of tasks
completed for the job matches the number of RDD partitions you noted when you
used toDebugString.
8. Click on the job description to view details of the job. This will list all the stages in
the job, which in this case is one.
9. Click on DAG Visualization to see a diagram of the execution plan based on the
RDD’s lineage. The main diagram displays only the stages, but if you click on a stage,
it will show you the tasks within that stage.
10. Optional: Explore the partitioning and DAG of a more complex query like the one
below. Before you view the execution plan or job details, try to figure out how many
stages the job will have.
This query loads Loudacre’s web log data, and calculates how many times each user
visited. Then it joins that user count data with account data for each user.
Note: If you execute the query multiple times, you may note that some tasks within
a stage are marked as “skipped.” This is because whenever a shuffle operation
is executed, Spark temporarily caches the data that was shuffled. Subsequent
executions of the same query re-use that data if it’s available to save some steps and
increase performance.
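A Python sketch of the DataFrame used below; active accounts are those with a null
acct_close_dt value, as in the earlier exercises:
pyspark> activeAccountsDF = spark.read.table("accounts"). \
             where("acct_close_dt IS NULL")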
12. View the full execution plan for the new DataFrame.
pyspark> activeAccountsDF.explain(True)
scala> activeAccountsDF.explain(true)
Can you locate the line in the physical plan corresponding to the command to load
the accounts table into a DataFrame?
How many stages do you think this query has?
14. View the Spark Application UI and choose the SQL tab. This displays a list of
DataFrame and Dataset queries you have executed, with the most recent query at
the top.
15. Click the description for the top query to see the visualization of the query’s
execution. You can also see the query’s full execution plan by opening the Details
panel below the visualization graph.
16. The first step in the execution is a HiveTableScan, which loaded the account data
into the DataFrame. Hover your mouse over the step to show the step’s execution
plan. Compare that to the physical plan for the query. Note that it is the same as the
last line in the physical execution plan, because it is the first step to execute. Did
you correctly identify this line in the execution plan as the one corresponding to the
spark.read.table operation?
17. The Succeeded Jobs label provides links to the jobs that executed as part of this
query execution. In this case, there is just a single job. Click the job’s ID (in the Jobs
column) to view the job details. This will display a list of stages that were completed
for the query.
How many stages executed? Is that the number of stages you predicted it would be?
18. Optional: Click the description of the stage to view metrics on the execution of the
stage and its tasks.
19. The previous query was very simple, involving just a single data source filtered
with where to return only active accounts. Try executing a more complex query
that joins data from two different data sources.
This query reads in the accountdevice data file, which maps account IDs to
associated device IDs. Then it joins that data with the DataFrame of active
accounts you created above. The result is a DataFrame consisting of all device IDs
in use by currently active accounts.
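A Python sketch of how such a query might be constructed; the file path, format options, and column names are assumptions.
pyspark> accountDeviceDF = spark.read.option("header","true"). \
    csv("/loudacre/accountdevice")
pyspark> activeDeviceIDsDF = activeAccountsDF. \
    join(accountDeviceDF, activeAccountsDF.acct_num == accountDeviceDF.account_id). \
    select("device_id")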
20. Review the full execution plan using explain, as you did with the previous
DataFrame.
Can you identify which lines in the execution plan load the two different data
sources?
How many stages do you think this query will execute?
21. Execute the query and review the execution visualization in the Spark UI.
What differences do you see between the execution of the earlier query and this
one?
How many stages executed? Is this what you expected?
22. Optional: Explore an even more complex query that involves multiple joins with
three data sources. You can use the last query in the solutions file for this exercise
(in the $DEVSH/exercises/query-execution/solution/ directory). That
query creates a list of device IDs, makes, and models, and the number of active
accounts that use that type of device, sorted in order from most popular device type
to least.
$ $DEVSH/scripts/catchup.sh
2. The query code you pasted above defines a new DataFrame called
accountsDevsDF, which joins account data and device data for all active
accounts. Try executing a query starting with the accountsDevsDF DataFrame
that displays the account number, first name, last name and device ID for each row.
pyspark> accountsDevsDF. \
select("acct_num","first_name","last_name","device_id"). \
show(5)
scala> accountsDevsDF.
select("acct_num","first_name","last_name","device_id").
show(5)
3. In your browser, go to the SQL tab of your application’s Spark UI, and view the
execution visualization of the query you just executed. Take note of the complexity
so that you can compare it to later executions when using persistence.
Remember that queries are listed in the SQL tab in the order they were executed,
starting with the most recent. The descriptions of multiple executions of the same
action will not distinguish one query from another, so make sure you choose the
correct one for the query you are looking at.
4. In your Spark shell, persist the accountsDevsDF DataFrame using the default
storage level.
pyspark> accountsDevsDF.persist()
scala> accountsDevsDF.persist
pyspark> accountsDevsDF. \
select("acct_num","first_name","last_name","device_id"). \
show(5)
scala> accountsDevsDF.
select("acct_num","first_name","last_name","device_id").
show(5)
6. In the browser, reload the Spark UI SQL tab, and view the execution diagram for
the query you just executed. Notice that it has far fewer steps. Instead of reading,
filtering, and joining the data from the two sources, it reads the persisted data from
memory. If you hover your mouse over the memory scan step, you will see that the
only operation it performs on the in-memory data is the last step of the query: the
select transformation, which is not part of the persisted data. Compare the diagram for this query with the
first one you executed above, before persisting.
7. The first time you execute a query on a persisted DataFrame, Dataset, or RDD, Spark
has to execute the full query in order to materialize the data that gets saved in
memory or on disk. To compare the first and second queries run after persisting,
re-execute the query one final time. Then use the Spark UI to compare both queries
executed after the persist operation, and consider these questions.
• Did one query take longer than the other? If so, which one, and why?
• How many partitions of the RDD were persisted and how much space do those
partitions take up in memory and on disk?
• Note that only a small percentage of the data is cached. Why is that? How could
you cache more of the data?
• Click the RDD name to view the storage details. Which executors are storing data
for this RDD?
9. Execute the same query as above using the write action instead of show.
pyspark> accountsDevsDF.write.mode("overwrite"). \
save("/loudacre/accounts_devices")
scala> accountsDevsDF.write.mode("overwrite").
save("/loudacre/accounts_devices")
• What percentage of the data is cached? Why? How does this compare to the last
time you persisted the data?
• How much memory is the data taking up? How much disk space?
> accountsDevsDF.unpersist()
12. View the Spark UI Storage tab to verify that the cache for accountsDevsDF has been
removed.
13. Repersist the same DataFrame, setting the storage level to save the data to files on
disk, replicated twice.
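One way to do this in Python is to pass an explicit storage level to persist (this is a sketch; the equivalent Scala class is org.apache.spark.storage.StorageLevel):
pyspark> from pyspark import StorageLevel
pyspark> accountsDevsDF.persist(StorageLevel.DISK_ONLY_2)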
15. Reload the Storage tab to confirm that the storage level for the RDD is set correctly.
Also consider these questions:
• How much memory is the data taking up? How much disk space?
2. Examine the data in the dataset. Note that the latitude and longitude are the 4th and
5th fields, respectively, as shown in the sample data below:
2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253
2014-03-15:10:10:20,MeeToo,ef8c7564-0a1a-4650-a655-c8bbd5f8f943,37.4321088904,-121.485029632
• addPoints: given two points, returns a point which is the sum of the two points—that is, (x1+x2, y1+y2)
• distanceSquared: given two points, returns the squared distance between the two—this is a common calculation required in graph analysis
Note that the stub code sets the variable K equal to 5—this is the number of
means to calculate.
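A minimal Python sketch of these provided helpers and the K setting (the stub code contains its own versions; the signatures shown here are assumptions):
def addPoints(p1, p2):
    return (p1[0] + p2[0], p1[1] + p2[1])

def distanceSquared(p1, p2):
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

K = 5  # the number of means to calculate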
4. The stub code also sets the variable convergeDist. This will be used to decide
when the k-means calculation is done—when the amount the locations of the
means changes between iterations is less than convergeDist. A “perfect”
solution would be 0; this number represents a “good enough” solution. For this
exercise, use a value of 0.1.
Or in Scala:
7. Iteratively calculate a new set of K means until the total distance between the
means calculated for this iteration and the last is smaller than convergeDist. For
each iteration:
a. For each coordinate point, use the provided closestPoint function to map
that point to the index in the kPoints array of the location closest to that
point. The resulting RDD should be keyed by the index, and the value should be
the pair: (point, 1). (The value 1 will later be used to count the number of
points closest to a given mean.) For example:
b. Reduce the result: for each center in the kPoints array, sum the latitudes and
longitudes, respectively, of all the points closest to that center, and also find the
number of closest points. For example:
(0, ((2638919.87653,-8895032.182481), 74693))
(1, ((3654635.24961,-12197518.55688), 101268))
(2, ((1863384.99784,-5839621.052003), 48620))
(3, ((4887181.82600,-14674125.94873), 126114))
(4, ((2866039.85637,-9608816.13682), 81162))
c. The reduced RDD should have (at most) K members. Map each to a new center
point by calculating the average latitude and longitude for each set of closest
points: that is, map (index, ((totalX, totalY), n)) to (index, (totalX/n,
totalY/n)).
d. Collect these new points into a local map or array keyed by index.
f. Copy the new center points to the kPoints array in preparation for the next
iteration.
8. When all iterations are complete, display the final K center points.
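A minimal PySpark sketch of steps 7 and 8 follows. It assumes data is an RDD of (latitude, longitude) float pairs, kPoints is a Python list of K starting points, and closestPoint, addPoints, and distanceSquared are the provided helpers (closestPoint(point, kPoints) is assumed to return the index of the closest center); the step that measures how far the centers moved is inferred.
tempDist = float("inf")
while tempDist > convergeDist:
    # a. key each point by the index of the closest current center
    closest = data.map(lambda point: (closestPoint(point, kPoints), (point, 1)))
    # b. sum the coordinates and count the points for each center
    pointStats = closest.reduceByKey(
        lambda v1, v2: (addPoints(v1[0], v2[0]), v1[1] + v2[1]))
    # c. compute the new center (the mean point) for each index
    newPoints = pointStats.map(
        lambda kv: (kv[0], (kv[1][0][0] / kv[1][1], kv[1][0][1] / kv[1][1])))
    # d. collect the new centers into a local dict keyed by index
    newPointsMap = dict(newPoints.collect())
    # measure the total movement of the centers in this iteration
    tempDist = sum(distanceSquared(kPoints[i], newPointsMap[i])
                   for i in newPointsMap)
    # f. copy the new centers into kPoints for the next iteration
    for i in newPointsMap:
        kPoints[i] = newPointsMap[i]

# step 8: display the final K center points
for point in kPoints:
    print(point)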
In this exercise, you will write a Spark Streaming application to count Knowledge
Base article requests.
This exercise has two parts. First, you will review the Spark Streaming documentation.
Then you will write and test a Spark Streaming application to read streaming web
server log data and count the number of requests for Knowledge Base articles.
• Follow the links at the top of the package page to view the DStream and
PairDStreamFunctions classes—these will show you the methods available
on a DStream of regular RDDs and pair RDDs respectively.
For Python:
2. You may also wish to view the Spark Streaming Programming Guide (select
Programming Guides > Spark Streaming on the Spark documentation main
page).
3. Stream the Loudacre web log files at a rate of 20 lines per second using the
provided test script.
This script will exit after the client disconnects, so you will need to restart the script
when you restart your Spark application.
Tip: This exercise involves using multiple terminal windows on your remote
desktop. To avoid confusion, set a different title for each one by selecting Set Title…
on the Terminal menu:
dstreams). To complete the exercise, start with the stub code in src/main/
scala/stubs/StreamingLogs.scala, which imports the necessary classes
for the application.
6. Create a DStream by reading the data from the host and port provided as input
parameters.
7. Filter the DStream to only include lines containing the string KBDOC.
8. To confirm that your application is correctly receiving the streaming web log data,
display the first five records in the filtered DStream for each one-second batch. (In
Scala, use the DStream print function; in Python, use pprint.)
9. For each RDD in the filtered DStream, display the number of items—that is, the
number of requests for KB articles.
Tip: Python does not allow calling print within a lambda function, so define a
named function to print.
10. Save the filtered logs to text files in HDFS. Use the base directory name
/loudacre/streamlog/kblogs.
11. Finally, start the Streaming context, and then call awaitTermination().
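Putting steps 6 through 11 together, one possible Python shape of the application is sketched below; the argument positions and context setup are assumptions based on the stub.
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# the stub may already create these; shown here to keep the sketch self-contained
sc = SparkContext()
ssc = StreamingContext(sc, 1)  # one-second batches

# 6. create a DStream from the host and port passed as arguments
logs = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))

# 7. keep only Knowledge Base requests
kblogs = logs.filter(lambda line: "KBDOC" in line)

# 8. display the first five filtered records of each batch
kblogs.pprint(5)

# 9. print the number of KB requests in each batch
def printCount(rdd):
    print("Number of KB requests: " + str(rdd.count()))
kblogs.foreachRDD(printCount)

# 10. save the filtered logs to HDFS
kblogs.saveAsTextFiles("/loudacre/streamlog/kblogs")

# 11. start the streaming context and wait for termination
ssc.start()
ssc.awaitTermination()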
12. Open a new terminal window, and change to the correct directory for the language
you are using for your application.
For Python, change to the exercise directory:
$ cd $DEVSH/exercises/streaming-dstreams
$ cd \
$DEVSH/exercises/streaming-dstreams/streaminglogs_project
13. If you are using Scala, build your application JAR file using Maven.
$ mvn package
Note: If this is your first time compiling a Spark Scala application, it may take
several minutes for Maven to download the required libraries to package the
application.
15. After a few moments, the application will connect to the test script’s simulated
stream of web server log output. Confirm that for every batch of data received
(every second), the application displays the first few Knowledge Base requests and
the count of requests in the batch. Review the HDFS files the application saved in
/loudacre/streamlog.
16. Return to the terminal window in which you started the streamtest.py test
script earlier. Stop the test script by typing Ctrl+C.
17. Return to the terminal window in which your application is running. Stop your
application by typing Ctrl+C. (You may see several error messages resulting from
the interruption of the job in Spark; you may disregard these.)
In this exercise, you will write a Spark Streaming application to count web page
requests over time.
1. Open a new terminal window. This exercise uses multiple terminal windows. To
avoid confusion, set a different title for the new window, such as “Test Stream.”
2. Stream the Loudacre Web log files at a rate of 20 lines per second using the
provided test script.
This script exits after the client disconnects, so you will need to restart the script
when you restart your Spark application.
5. Count the number of page requests over a window of five seconds. Print out the
updated five-second total every two seconds.
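A minimal Python sketch of this step, assuming the DStream of log lines is called logs, the StreamingContext ssc uses a one-second batch interval, and a checkpoint directory is set (required for window operations that use an inverse reduce, such as countByWindow):
ssc.checkpoint("checkpointdir")  # hypothetical checkpoint directory

# count page requests over a 5-second window, sliding every 2 seconds
countDStream = logs.countByWindow(5, 2)
countDStream.pprint()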
6. In a different terminal window than the one in which you started the
streamtest.py script, change to the correct directory for the language you are
using for your application. To avoid confusion, you might wish to set a different title
for the new window such as “Application”.
For Python, change to the exercise directory:
$ cd $DEVSH/exercises/streaming-multi
$ cd \
$DEVSH/exercises/streaming-multi/streaminglogsMB_project
7. If you are using Scala, build your application JAR file using the mvn package
command.
For Python:
9. After a few moments, the application should connect to the test script’s simulated
stream of web server log output. Confirm that for every batch of data received
(every second), the application displays the count of requests in the batch. Review
the files.
10. Return to the terminal window in which you started the streamtest.py test
script earlier. Stop the test script by typing Ctrl+C.
11. Return to the terminal window in which your application is running. Stop your
application by typing Ctrl+C. (You may see several error messages resulting from
the interruption of the job in Spark; you may disregard these.)
Bonus Exercise
Extend the application you wrote above to also count the total number of page requests
by user from the start of the application, and then display the top ten users with the
highest number of requests.
Follow the steps below to implement a solution for this bonus exercise:
1. Use map-reduce to count the number of times each user made a page request in
each batch (a hit-count).
2. Define a function called updateCount that takes an array (in Python) or sequence
(in Scala) of hit-counts and an existing hit-count for a user. The function should
return the sum of the new hit-counts plus the existing count.
• Hint: You will have to swap the key (user ID) with the value (hit-count) to sort.
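A Python sketch of one possible approach to the steps above; the position of the user ID field and the variable names are assumptions, and updateStateByKey requires a checkpoint directory.
ssc.checkpoint("checkpointdir")  # hypothetical checkpoint directory

# 1. count requests per user in each batch
userReqs = logs.map(lambda line: (line.split(' ')[2], 1)). \
    reduceByKey(lambda v1, v2: v1 + v2)

# 2. add each batch's hit-counts to the existing count for the user
def updateCount(newCounts, hitCount):
    return sum(newCounts) + (hitCount or 0)

totalUserReqs = userReqs.updateStateByKey(updateCount)

# swap key and value to sort by hit-count, then display the top ten users
topUsers = totalUserReqs.map(lambda pair: (pair[1], pair[0])). \
    transform(lambda rdd: rdd.sortByKey(False))
topUsers.pprint(10)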
Note: The solution files for this bonus exercise are in the bonus package in the exercise
Maven project directory (Scala) and in solution-python/bonus in the exercise
directory (Python).
In this exercise, you will write an Apache Spark Streaming application to handle
web logs received as messages on a Kafka topic.
• For Python, start with the stub file StreamingLogsKafka.py in the stubs-
python directory.
1. Your application should accept two input arguments that the user will set when
starting the application:
3. Kafka messages are in (key, value) form, but for this application, the key is null
and only the value is needed. (The value is the web log line.) Map the DStream to
remove the key and use only the value.
4. To verify that the DStream is correctly receiving messages, display the first 10
elements in each batch.
5. For each RDD in the DStream, display the number of items—that is, the number of
requests.
Tip: Python does not allow calling print within a lambda function, so define a
named function to print.
6. Save the filtered logs to text files in HDFS. Use the base directory name
/loudacre/streamlog/kafkalogs.
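A Python sketch of one possible approach, assuming the Kafka 0.8 direct-stream integration (pyspark.streaming.kafka) and an existing StreamingContext named ssc; the argument handling shown is an assumption.
import sys
from pyspark.streaming.kafka import KafkaUtils

topic = sys.argv[1]    # Kafka topic to read from
brokers = sys.argv[2]  # comma-separated list of broker host:port pairs

# create a DStream of (key, value) messages from the Kafka topic
kafkaStream = KafkaUtils.createDirectStream(
    ssc, [topic], {"metadata.broker.list": brokers})

# 3. keep only the message value (the web log line); the key is null
logs = kafkaStream.map(lambda pair: pair[1])

# 4. display the first 10 elements of each batch
logs.pprint(10)

# 5. print the number of requests in each batch
def printCount(rdd):
    print("Number of requests: " + str(rdd.count()))
logs.foreachRDD(printCount)

# 6. save the logs to HDFS
logs.saveAsTextFiles("/loudacre/streamlog/kafkalogs")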
$ cd \
$DEVSH/exercises/streaming-kafka/streaminglogskafka_project
8. Build your application JAR file using the mvn package command.
10. Use the kafka-topics script to create a Kafka topic called weblogs from which
your application will consume messages.
11. Confirm your topic was created correctly by listing topics. Make sure weblogs is
displayed.
$ $DEVSH/scripts/streamtest-kafka.sh \
weblogs worker-1:9092 20 $DEVDATA/weblogs/*
The script will begin displaying the messages it is sending to the weblogs Kafka
topic. (You may disregard any SLF4J messages.)
14. Change to the correct directory for the language you are using for your application.
$ cd $DEVSH/exercises/streaming-kafka
$ cd \
$DEVSH/exercises/streaming-kafka/streaminglogskafka_project
15. Use spark2-submit to run your application. Your application takes two
parameters: the name of the Kafka topic from which the DStream will read
messages, weblogs, and a comma-separated list of broker hosts and ports.
• For Python:
• For Scala:
16. Confirm that your application is correctly displaying the Kafka messages it receives,
as well as displaying the number of received messages, every second.
Note: It may take a few moments for your application to start receiving messages.
Occasionally you might find that after 30 seconds or so, it is still not receiving any
messages. If that happens, press Ctrl+C to stop the application, then restart it.
Clean Up
17. Stop the Spark application in the first terminal window by pressing Ctrl+C. (You
might see several error messages resulting from the interruption of the job in
Spark; you may disregard these.)
In this exercise, you will use Kafka’s command line tool to create a Kafka topic.
You will also use the command line producer and consumer clients to publish and
read messages.
$ kafka-topics --create \
--zookeeper master-1:2181 \
--replication-factor 1 \
--partitions 2 \
--topic weblogs
2. Display all Kafka topics to confirm that the new topic you just created is listed:
$ kafka-topics --list \
--zookeeper master-1:2181
$ kafka-console-producer \
--broker-list worker-1:9092 \
--topic weblogs
You will see a few SLF4J messages, at which point the producer is ready to accept
messages on the command line.
Tip: This exercise involves using multiple terminal windows. To avoid confusion,
set a different title for each one by selecting Set Title… on the Terminal menu:
5. Publish a test message to the weblogs topic by typing the message text and then
pressing Enter. For example:
6. Open a new terminal window and adjust it to fit on the window beneath the
producer window. Set the title for this window to “Kafka Consumer.”
7. In the new terminal window, start a Kafka consumer that will read from the
beginning of the weblogs topic:
$ kafka-console-consumer \
--zookeeper master-1:2181 \
--topic weblogs \
--from-beginning
After a few SLF4J messages, you should see the status message you sent using the
producer displayed on the consumer’s console, such as:
test weblog entry 1
8. Press Ctrl+C to stop the weblogs consumer, then restart it, this time omitting the
--from-beginning option. You should see that no messages are displayed.
9. Switch back to the producer window and type another test message into the
terminal, followed by the Enter key:
10. Return to the consumer window and verify that it now displays the message
you published from the producer in the previous step.
Cleaning Up
11. Press Ctrl+C in the consumer terminal window to end its process.
12. Press Ctrl+C in the producer terminal window to end its process.
In this exercise, you will run a Flume agent to ingest web log data from a local
directory to HDFS.
Apache web server logs are generally stored in files on the local machines running the
server. In this exercise, you will simulate an Apache server by placing provided web log
files into a local spool directory, and then use Flume to collect the data.
Both the local and HDFS directories must exist before using the spooling directory
source.
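For example, assuming the directory paths used later in this exercise, you might create them like this (ownership and permission details may differ in your environment):
$ sudo mkdir -p /flume/weblogs_spooldir
$ hdfs dfs -mkdir -p /loudacre/weblogs_flume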
Configure Flume
A Flume agent configuration file has been provided for you:
$DEVSH/exercises/flume/spooldir.conf.
Review the configuration file. You do not need to edit this file. Take note in particular of
the following:
• The source is a spooling directory source that pulls from the local
/flume/weblogs_spooldir directory.
4. Start the Flume agent using the configuration you just reviewed:
$ flume-ng agent \
--conf /etc/flume-ng/conf \
--conf-file $DEVSH/exercises/flume/spooldir.conf \
--name agent1 -Dflume.root.logger=INFO,console
5. Wait a few moments for the Flume agent to start up. You will see a message like:
Component type: SOURCE, name: webserver-log-source started
$ $DEVSH/scripts/copy-move-weblogs.sh \
/flume/weblogs_spooldir
This script will create a temporary copy of the web log files and move them to the
spooldir directory.
7. Return to the terminal that is running the Flume agent and watch the logging
output. The output will give information about the files Flume is putting into HDFS.
8. Once the Flume agent has finished, enter Ctrl+C to terminate the process.
9. Using the command line or Hue File Browser, list the files that were added by the
Flume agent in the HDFS directory /loudacre/weblogs_flume.
Note that the files that were imported are tagged with a Unix
timestamp corresponding to the time the file was imported, such as
FlumeData.1427214989392.
In this exercise, you will run a Flume agent that ingests web logs from a local
spool directory and sends each line as a message to a Kafka topic.
The Flume agent is configured to send messages to the weblogs topic you created
earlier.
Important: This exercise depends on two prior exercises: “Collect Web Server Logs
with Flume” and “Produce and Consume Kafka Messages.” If you did not complete both
of these exercises, run the catch-up script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
1. Review the configuration file. You do not need to edit this file. Take note in particular
of the following points:
• The source and channel configurations are identical to the ones in the “Collect
Web Server Logs with Flume” exercise: a spooling directory source that pulls
from the local /flume/weblogs_spooldir directory, and a memory channel.
• Instead of an HDFS sink, this configuration uses a Kafka sink that publishes
messages to the weblogs topic.
3. Wait a few moments for the Flume agent to start up. You will see a message like:
Component type: SINK, name: kafka-sink started
Tip: This exercise involves using multiple terminal windows. To avoid confusion,
set a different title for each window. Set the title of the current window to “Flume
Agent.”
$ kafka-console-consumer \
--zookeeper master-1:2181 \
--topic weblogs
5. In a separate new terminal window, run the script to place the web log files in the
/flume/weblogs_spooldir directory:
$ $DEVSH/scripts/copy-move-weblogs.sh \
/flume/weblogs_spooldir
Note: If you completed an earlier Flume exercise or ran catchup.sh, the script
will prompt you whether you want to clear out the spooldir directory. Be sure to
enter y when prompted.
6. In the terminal that is running the Flume agent, watch the logging output. The
output will give information about the files Flume is ingesting from the source
directory.
7. In the terminal that is running the Kafka consumer, confirm that the consumer tool
is displaying each message (that is, each line of the web log file Flume is ingesting).
8. Once the Flume agent has finished, enter Ctrl+C in both the Flume agent terminal
and the Kafka consumer terminal to end their respective processes.
In this exercise, you will import tables from MySQL into HDFS using Sqoop.
Important: This exercise depends on a previous exercise: “Accessing HDFS with
Command Line and Hue.” If you did not complete that exercise, run the course catch-up
script and advance to the current exercise:
$ $DEVSH/scripts/catchup.sh
2. Run the sqoop help command to familiarize yourself with the options in Sqoop:
$ sqoop help
$ sqoop list-tables \
--connect jdbc:mysql://gateway/loudacre \
--username training --password training
5. Use Sqoop to import the basestations table in the loudacre database and save
it in HDFS under /loudacre:
$ sqoop import \
--connect jdbc:mysql://gateway/loudacre \
--username training --password training \
--table basestations \
--target-dir /loudacre/basestations_import \
--null-non-string '\\N'
6. Optional: While the Sqoop job is running, try viewing it in the Hue Job Browser or
YARN Web UI.
8. Use either the Hue File Browser or the -tail option to the hdfs command to view
the last part of the file for each of the MapReduce partition files, for example:
$ sqoop import \
--connect jdbc:mysql://gateway/loudacre \
--username training --password training \
--table basestations \
--target-dir /loudacre/basestations_import_parquet \
--as-parquetfile
10. View the results of the import command by listing the contents of the
basestations_import_parquet directory in HDFS, using either Hue or the
hdfs command. Note that the Parquet files are each given unique names, such as
e8f3424e-230d-4101-abba-66b521bae8ef.parquet.
11. You can’t directly view the contents of the Parquet files because they are binary
files rather than text. Use the parquet-tools head command to view the first
few records in the set of data files imported by Sqoop.
$ parquet-tools head \
hdfs://master-1/loudacre/basestations_import_parquet/
4. Open a new terminal window. (It must be a new terminal so it reloads your edited
.bashrc file.)
The output should include the setting below. If not, the .bashrc file was not edited
or saved properly.
8. On the right hand side of the page select Python 2 from the New menu.
9. Enter some Spark code such as the following and use the play button to execute
your Spark code.
11. To stop the Spark notebook server, enter Ctrl+C in the terminal running Spark.
1. Use the Cloudera Manager bookmark on your remote desktop browser to view the
Cloudera Manager web UI, and log in using username admin with password admin.
2. If any of the cluster services you need for the exercises are shown with anything
other than a green dot (such as a gray or red dot), restart the service by clicking on
the dropdown menu next to the service name and selecting Restart.
This screenshot shows an example in which the HDFS-1 service is stopped, and
how you would restart it.
3. After restarting the service, you may find that other services that depend on the
restarted service also need restarting, which is indicated by an icon next to the
service name. For example, Hue depends on HDFS, so after restarting HDFS, you
would need to restart Hue following the same steps. The screenshot below shows
the icon indicating that a service restart is required.
• If you have Process Status issues where the Cloudera Manager agent is not
responding (as indicated by Hosts with unhealthy status), try restarting the cluster
services. From the desktop, select Applications > Training > Start Cluster to restart
your cluster services. Give the script time to complete.
• If you have any type of Canary issues, these typically clear up on their own, given
time.
• If any other issues still exist after solving any process status issues:
◦ In Cloudera Manager, note the name of one of the services reporting the issue
(such as HDFS).
If the username is incorrect, reset the password as shown below if you do not remember
it, then go to the following section to create a new user called training.
3. Reset the password for all users to training. The entire command must be
entered on a single line. Be sure to type the value of the password as shown. That is
the secret key to use for the training password.
2. Select the Manage Users item on the user menu in the upper-right hand corner of
the screen.
4. Next, click the Add User button on the right-hand side of the screen.
5. In the Step 1 tab, enter the correct credentials: Username: training and
Password: training. Uncheck the box labeled Create home directory.
6. Skip step 2 by clicking on the Step 3 tab. Check the box labeled Superuser status.
FAILED org.spark-project.jetty.server.Server@69419d59:
java.net.BindException: Address already in use
This is usually because you are attempting to run two instances of the Spark shell at the
same time.
To fix this issue, exit one of your two running Spark shells.
If you do not have a terminal window running a second Spark shell, you may have one
running in the background. View the applications running on the YARN cluster using
the Hue Job Browser. Check the start times to determine which application is the one
you want to keep running. Select the other one and click kill.
After a few seconds, you should be notified that the job status is now RUNNING. If the
ACCEPTED message keeps displaying and the application or query never executes, this
means that YARN has scheduled the job to run when cluster resources are available, but
none (or too few) resources are available.
Cause: This usually happens if you are running multiple Spark applications (such as
two Spark shells, or a shell and an application) at the same time. It can also mean that a
Spark application has crashed or exited without releasing its cluster resources.
Fix: Stop running one of the Spark applications. If you cannot find a running
application, use the Hue job browser to see what applications are running on the YARN
cluster, and kill the one you do not need.
2. Download the virtual machine file using the link in the Materials area for this
course in your Cloudera University account.
3. Unzip the VM file into a directory of your choice. On most systems, you do this by
simply double-clicking the file.
4. Run the VM. On most systems, you can do this by double-clicking on the .vmx
file (such as Cloudera-Training-DevSH-Single-Machine-5.13.0-
vmware.vmx) in the unzipped VM directory. You can also launch the VMware
player and load the same file using the File menu.
5. When the VM has started, you will automatically be logged in as the user
training, and you will see the VM’s desktop.
6. We recommend you copy this exercise manual onto your VM desktop and use
the VM’s PDF viewer (Evince) to view it; some PDF viewers do not allow you to
properly cut and paste multi-line commands.
8. Wait for about five minutes after the script finishes before continuing.
9. Start the Firefox browser using the icon on the VM desktop, then click the Cloudera
Manager bookmark.
11. Verify that all the services in your cluster are healthy (indicated by a green dot), as
shown below. You may disregard yellow configuration warning icons.
Note: Some health warnings may appear as the cluster is starting. They will typically
resolve themselves within five minutes after the Start Cluster script finishes. Please be
patient. If health issues remain, refer to the tip entitled “Cloudera Manager displays
unhealthy status of services” in Appendix: Troubleshooting Tips.
Your cluster may not be running exactly the services shown. This is okay and will
not interfere with completing the exercises.