Introduction to Databricks: A Beginner’s Guide
Mariusz Kujawski · 19 min read · Feb 26, 2024
In this guide, I’ll walk you through everything you need to know to get
started with Databricks, a powerful platform for data engineering, data
science, and machine learning. From setting up your environment to
understanding key features like data processing and orchestration, this
guide has got you covered. Let’s jump right in!
Databricks Environment
In a Databricks environment, we have four major components. One of the most important is the cluster: depending on your needs, you select a Databricks runtime version and the memory and CPU for the driver and worker nodes. This is also where you monitor your code and job executions and install the libraries required by your code.
Apache Spark
Apache Spark is an open-source, unified computing engine and a set of
libraries for parallel data processing on computer clusters. Spark supports
multiple programming languages (Python, Java, Scala, and R) and includes
libraries for diverse tasks ranging from SQL to streaming and machine
learning. You can run Spark on your local machine or scale up to thousands
of computers, making it suitable for big data processing.
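As a quick illustration (not from the original article, and the application name is arbitrary), this is roughly what starting Spark on a local machine looks like:

# A minimal local SparkSession for experimenting outside Databricks
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-demo")     # arbitrary application name
    .master("local[*]")        # use all local CPU cores
    .getOrCreate()
)
print(spark.range(5).count())  # prints 5

On Databricks you do not need to build this yourself; a SparkSession is pre-created and exposed as the spark object in every notebook, as described later in this guide.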
DataFrame
A DataFrame is a distributed collection of data organized into named columns, conceptually similar to a table in a relational database; it is the main structure you work with when processing data in Spark.
Lazy Evaluation
Spark’s lazy evaluation comes into play at this point. The logical execution
plan is not immediately executed, and Spark defers the computation until an
action is called.
Transformations
Transformations are the instructions you use to modify the DataFrame in the
way you want and are lazily executed. There are two types of
transformations: narrow and wide.
Actions
Actions are operations that trigger the data processing and return results or
write data to storage. Examples of actions include count, collect, write, show,
etc. When you call an action, Spark evaluates the entire logical execution
plan built through transformations and optimizes the execution plan before
executing it.
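To make this concrete, here is a small sketch (the column and bucket names are illustrative) showing that narrow and wide transformations only build up a plan, which runs when an action is called:

from pyspark.sql.functions import col

df = spark.range(1_000_000).withColumnRenamed("id", "value")

# Narrow transformation: each input partition maps to one output partition
filtered = df.filter(col("value") % 2 == 0)

# Wide transformation: requires a shuffle of data across partitions
grouped = filtered.groupBy((col("value") % 10).alias("bucket")).count()

# Nothing has executed yet; the action below triggers the optimized plan
grouped.show()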
Databricks Clusters
To create a cluster in a Databricks workspace, we need to go to the Compute tab. We can select a few types of clusters with different configurations and access modes. Each configuration has its benefits and limitations; I’ll discuss them briefly below.
Cluster Types
All Purpose Cluster
All Purpose Cluster: These clusters are primarily used for interactive data analysis in Databricks notebooks. Multiple users can share these clusters for collaborative analysis. Clusters without any activity are terminated after the time specified in the auto-termination field of the cluster configuration.
Access Mode
Cluster access mode is a security feature that determines who can use a
cluster and what data they can access via the cluster. When creating any
cluster in Azure Databricks, you must select an access mode. Considerations
include how you want to use a cluster, supported languages, whether you
need mounts, or Unity Catalog integration, etc.
Shared Mode: Can be used by multiple users with data isolation among
them, requiring a premium plan. Supports Unity Catalog and SQL,
Python, and Scala, but does not support RDD API and DBFS mounts.
When used with credential pass-through, Unity Catalog features are
disabled. Databricks Runtime ML is not supported.
In Databricks, we can access the Spark session using the spark object.
Notice: This method should be used for training purposes or specific use
cases. If you use Unity Catalog (description provided in the next paragraph),
it’s not a recommended way of accessing storage.
Mounting Storage:
storageAccount = "account"
mountpoint = "/mnt/bronze"
storageEndPoint = f"abfss://bronze@{storageAccount}.dfs.core.windows.net/"

# ClientId, TenantId, and Secret are for the application (ADLSGen2App) we have created
clientID = "xxx-xx-xxx-xxx"
tenantID = "xx-xx-xx-xxx"
clientSecret = "xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

# OAuth configuration for the service principal (standard ADLS Gen2 settings)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": clientID,
    "fs.azure.account.oauth2.client.secret": clientSecret,
    "fs.azure.account.oauth2.client.endpoint": oauth2Endpoint,
}

dbutils.fs.mount(
    source = storageEndPoint,
    mount_point = mountpoint,
    extra_configs = configs)

display(dbutils.fs.ls("dbfs:/mnt/bronze/Customer/"))
Alternatively, we can access the storage account directly by setting the OAuth configuration on the Spark session:

clientID = "xxx-xx-xxx-xxx"
tenantID = "xx-xx-xx-xxx"
clientSecret = "xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

spark.conf.set("fs.azure.account.auth.type.cookbookadlsgen2storage.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.cookbookadlsgen2storage.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.cookbookadlsgen2storage.dfs.core.windows.net", clientID)
spark.conf.set("fs.azure.account.oauth2.client.secret.cookbookadlsgen2storage.dfs.core.windows.net", clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.cookbookadlsgen2storage.dfs.core.windows.net", oauth2Endpoint)

df_direct = (
    spark.read
    .format("csv")
    .option("header", True)
    .load("abfss://bronze@cookbookadlsgen2storage.dfs.core.windows.net/Customer")
)
The read method in Spark allows us to import data from other databases
such as Oracle, PostgreSQL, and more.
CSV
df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .csv(path_to_file)
)
SQL
SELECT *
FROM csv.`/mnt/path/to/file`
While Spark can automatically detect schema for us, there is also an option
to declare it manually.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

cust_schema = StructType([
    StructField("id", IntegerType()),
    StructField("car", StringType()),
])

df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .schema(cust_schema)
    .csv(path_to_file)
)
Oracle
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:user/pass@//address:port/instance") \
.option("query", query) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
Saving data
(
    df.write
    .format("csv")
    .option("header", "true")
    .save(path_to_save)
)
Databricks tables
When we have data loaded into a DataFrame, we can transform it, save the DataFrame to storage, or create a table in a schema (database). To save data in a table, we first need to create a database, which we can do using Spark SQL:
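A minimal statement for this (the database name test_db is the one used in the query examples later in this guide):

# create the database (schema) that will hold our tables
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")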
Then, we can create and populate a table using PySpark.
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(table_name, path=destination)
)
Spark SQL provides the possibility to create a table using SQL DDL. Creating
tables using SQL could be simple for people who have just switched from
SQL Server, PostgreSQL, or Oracle. As mentioned earlier, tables in
Databricks are only metadata descriptions of files stored in the table
location. Optionally, we can specify the file format and its location; if we
don’t, Databricks will use the default location and format. Using the default
setting will result in the creation of a Spark-managed table.
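As an illustration (table names and the storage path are hypothetical), an external table with an explicit location versus a managed table using the defaults could look like this:

# External table: we point the table at files in a location we manage ourselves
spark.sql("""
    CREATE TABLE IF NOT EXISTS test_db.cars_ext (
        id INT,
        car STRING
    )
    USING DELTA
    LOCATION 'abfss://bronze@account.dfs.core.windows.net/cars'
""")

# Managed table: Databricks chooses the location and the default (Delta) format
spark.sql("""
    CREATE TABLE IF NOT EXISTS test_db.cars_managed (
        id INT,
        car STRING
    )
""")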
The main difference between them is that for an external table, Databricks manages only the metadata; when you drop the table, the underlying data files are not removed.
With Spark and SQL, we can execute a query against a table registered in the Databricks metastore to retrieve data from it.
df = spark.sql("""
select
*
from
test_db.cars""")
df = spark.read.table("test_db.cars")
%sql
select
*
from
test_db.cars
Delta Tables
Delta Lake is the default storage format for all operations on Databricks.
Unless otherwise specified, all tables on Databricks are Delta tables. Delta
Lake serves as the optimized storage layer that forms the foundation for
storing data and tables in the Databricks lakehouse. It is an open-source
technology that extends Parquet data files with a file-based transaction log
for ACID transactions and scalable metadata handling. Delta Lake
seamlessly integrates with Apache Spark, offering a range of important
features:
ACID transactions
Scalable metadata handling
Time travel (data versioning) for audits and rollbacks
Schema enforcement and schema evolution
DML operations such as MERGE, UPDATE, and DELETE
Unified batch and streaming processing

After saving a table, you should observe the same structure reflected in storage.
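For example, a brief sketch (the table name is taken from the earlier query examples) of how the transaction log lets you inspect and query previous versions of a Delta table:

# Show the table's transaction history recorded in the Delta log
display(spark.sql("DESCRIBE HISTORY test_db.cars"))

# Time travel: query the table as of an earlier version
display(spark.sql("SELECT * FROM test_db.cars VERSION AS OF 0"))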
DataFrame Transformations
Transformation is one of the most important aspects of data processing.
Spark, in this context, is very powerful with its capabilities to integrate data
from different sources and the flexibility to use Python, Scala, and SQL. In
this section, I’ll demonstrate how we can transform a DataFrame in the
most common scenarios.
To start working with a DataFrame, we need to read data from a file, table, or
we can use a list, as demonstrated in the example below to create a
DataFrame.
data = [
[1,"Alicia Kaiser","WA","001-677-774-4370","13714","Stacy Summit","54799","Lake
[2,"Donna Ellis","BR","001-631-995-8008","43599","Adam Trail","07204","Port Apri
[3,"Kenneth Smith","WA","001-592-849-6009x4173","649","Sherri Grove","14527","No
[4,"Danny Clark","WA","574.419.0221x348","285","Timothy Drive","41106","West Eri
[5,"Nicholas Thompson","CA","(259)268-8760x061","998","Russell Shoals","65647","
[6,"Frances Griffith","WA","7535316823","9559","Emily Branch","71422","Mcdanielh
[7,"Trevor Harrington","CA","742.224.9375","5960","Lisa Port","73881","Loganbury
[8,"Seth Mitchell","AA","(386)517-7589x04440","47352","Stafford Loop","01347","S
[9,"Patrick Caldwell","BR","001-307-225-9094","0170","Amanda Dam","24885","Port
[10,"Laura Hopkins","CA","9095819755","143","Lee Brook","23623","Jarvisland","Ha
]
schama = "client_number int,name string,branch string,phone_number string,buldin
df = spark.createDataFrame(data, schama)
Selecting columns
display(df.select("client_number","name", "amout"))
Spark allows us to use SQL to transform data. To utilize a DataFrame created
or imported within a SQL context, we need to create a temporary view:
df.createTempView("client")
There are various options to access columns in a DataFrame. Below, you can
find a few examples:
age_exp = "extract( year from current_date) - extract( year from birth_date) "
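# The chain below is a sketch of how df1 can be built with withColumn, following
# the description and the SQL version shown afterwards (the column names amout
# and birth_date come from the schema defined earlier).
from pyspark.sql.functions import expr, col, current_date

df1 = (
    df.withColumn("age", expr(age_exp))
      .withColumn("load_date", current_date())
      .withColumn("amount", col("amout").cast("decimal(10,2)"))
      .drop("amout")
)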
display(df1)
In this example, we utilize the withColumn function to calculate age, add load
date columns, and change the amount to decimal data type. Additionally, we
can drop columns with unnecessary data. The same result can be achieved
using SQL.
df.createOrReplaceTempView("client")
df1 = spark.sql("""
select
client_number,
name,
bulding_number,
street_name,
postcode,
city,
state,
cast(amout as decimal(10,2)) amount,
extract(year from current_date) - extract(year from birth_date) as age,
current_date() load_date
from
client
""")
display(df1)
The groupBy function aggregates records by one or more columns, allowing aggregate functions such as count or sum to be applied to each group. Order by, on the other hand, sorts records in a DataFrame based on specified sort conditions, allowing for the arrangement of data in ascending or descending order.

from pyspark.sql.functions import desc

# count by branch
display(df1.groupBy("branch").count().orderBy(desc("count")))
df.createOrReplaceTempView("client")
df1 = spark.sql("""
select
branch,
sum(amout) amout,
count(client_number) client_number
from
client
group by branch
""")
df1.show()
+------+------+-------------+
|branch| amout|client_number|
+------+------+-------------+
| WA|8000.0| 4|
| BR|3000.0| 2|
| CA|5100.0| 3|
| AA|3000.0| 1|
+------+------+-------------+
Alternatively, in PySpark, you can use the show command to display the
results of these operations directly in the notebook or console.
Joining DataFrames
Joining operations are crucial for various data processing tasks such as data
normalization, data modeling, and ensuring data quality. Spark supports
joins using DataFrame join and SQL joins.
import uuid

trans = [
    [str(uuid.uuid4()), 1, 100.00, '2024-02-01'],
    [str(uuid.uuid4()), 1, 200.00, '2024-02-03'],
    [str(uuid.uuid4()), 1, 130.00, '2024-02-04'],
    [str(uuid.uuid4()), 2, 110.00, '2024-02-05'],
    [str(uuid.uuid4()), 3, 200.00, '2024-02-01'],
    [str(uuid.uuid4()), 2, 300.00, '2024-02-02'],
    [str(uuid.uuid4()), 2, 50.00, '2024-02-03'],
]

# column names follow the SQL join below (client_id); the schema string itself is assumed
df_tran = spark.createDataFrame(trans, "id string, client_id int, amount double, date string")
The code below illustrates how to join two DataFrames. Spark supports
various join types, including:
Inner Join
Cross Join
Left Anti Join: Retrieves records from the left DataFrame that do not exist in the right DataFrame.
Left Semi Join: Retrieves records and columns from the left DataFrame where records match records in the right DataFrame.
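As a sketch, using the clients and transactions DataFrames created above (the join keys are taken from the SQL example that follows), a DataFrame-API join could look like this:

# Inner join: keep only clients that have at least one matching transaction
df_joined = df.join(df_tran, df.client_number == df_tran.client_id, "inner")
display(df_joined)

# Left anti join: clients with no transactions at all
display(df.join(df_tran, df.client_number == df_tran.client_id, "left_anti"))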
Naturally, Spark SQL allows us to use SQL-like joins in our code. We can join
existing tables or create views and join these views using SQL syntax.
df.createOrReplaceTempView("clients")
df_tran.createOrReplaceTempView("tran")
display(spark.sql("""
select *
from
clients a
inner join
tran b on a.client_number = b.client_id
"""))
Union DataFrames
Spark facilitates union operations on DataFrames in several ways. We can
union DataFrames using column names, as shown in example 1.
Alternatively, we can union DataFrames without checking column names, as
demonstrated in example 2. Additionally, Spark allows for merging two
DataFrames while allowing missing columns.
# example 1
display(df1.unionByName(df2))
# example 2
display(df1.union(df2))
# example 3
display(df1.unionByName(df2, allowMissingColumns=True))
When function
A useful function in PySpark is when , which is employed in cases where we
need to translate or map a value to another value based on specified
conditions.
from pyspark.sql.functions import expr, col, column, lit, exp, current_date, when
columns = ["name","surname","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
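The mapping step itself is shown here as a sketch; the sample rows and the "M"/"F" gender codes are illustrative assumptions matching the columns above:

from pyspark.sql.functions import when, col

# illustrative sample rows (hypothetical data)
sample_df = spark.createDataFrame(
    [("James", "Smith", "M", 3000), ("Anna", "Rose", "F", 4100)],
    ["name", "surname", "gender", "salary"],
)

display(
    sample_df.withColumn(
        "gender_full",
        when(col("gender") == "M", "Male")
        .when(col("gender") == "F", "Female")
        .otherwise("Unknown"),
    )
)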
Delta Live Tables is a great extension to use together with Auto Loader when implementing ETL (Extract, Transform, Load) processes. Auto Loader automatically stores
information about processed files, eliminating the need for additional
maintenance steps. In case of failure, it will resume processing from the last
successful step.
For instructions on how to configure Auto Loader, you can refer to another
post.
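As a brief sketch (the paths, schema location, and target table name are hypothetical), a typical Auto Loader stream that incrementally picks up new CSV files could look like this:

# Auto Loader ingests new files incrementally using the cloudFiles source
df_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/customer")  # hypothetical path
    .load("/mnt/bronze/Customer/")
)

(
    df_stream.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/customer")  # hypothetical path
    .trigger(availableNow=True)       # process all available files, then stop
    .toTable("test_db.customer_raw")  # hypothetical target table
)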
Databricks notebooks also make it easy to reuse code. One option is to keep helper functions in a separate notebook and include it with the %run magic command:

%run fun
print(add(1,2))
Another option is to keep shared functions in a Python module inside a repo, for example with a structure like this:

.
├── utils
│   ├── __init__.py
│   └── fun.py
└── test_notebook
# fun.py
def add(a, b):
    return a + b

print(add(1, 2))
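A sketch of how test_notebook could then use this module (assuming the repo root is on sys.path, which notebooks in Databricks Repos handle automatically):

# In test_notebook: import the helper from the utils package
from utils.fun import add

print(add(1, 2))  # 3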
You can install custom .whl files onto a cluster and then import them into a
notebook. For code that is frequently updated, this process might be
inconvenient and error-prone.
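For completeness, a hedged example of the notebook-scoped alternative (the wheel path is hypothetical):

# Install a wheel for the current notebook session only
%pip install /dbfs/FileStore/libs/my_package-0.1.0-py3-none-any.whl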
User-defined functions
If we can’t find a function for our use case, it’s possible to create our custom
function. However, to use it with DataFrames or SQL, we need to register it.
from pyspark.sql.types import LongType

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared, LongType())

df1.createOrReplaceTempView("df1")
SQL
%sql
select id, squaredWithPython(col1) as id_squared from df1
PySpark
from pyspark.sql.functions import udf

@udf("long")
def squared_udf(s):
    return s * s

display(df1.select("col1", squared_udf("col1").alias("col1_squared")))
"""
Args:
    param0:
    param1:

Returns:
    Describe the return value
"""
For example, let’s create a function that returns a date based on the number
of days from 1900–01–01. To ensure clarity and correctness, we can write the
function with a docstring and typing as shown above.
import datetime

def days_to_date(number_of_days: int) -> str:  # function name is illustrative
    """
    Args:
        number_of_days: number of days

    Returns:
        date: date in string format.
    """
    date_ = datetime.date(1900, 1, 1)
    return (date_ + datetime.timedelta(days=number_of_days)).isoformat()
Unity Catalog
Unity Catalog is Databricks’ unified governance solution, providing centralized access control, auditing, lineage, and data discovery across workspaces. Its key capabilities include:
Data Discovery: Unity Catalog lets you tag and document data assets and provides a search interface to help data consumers find data.
Data Lineage: Data lineage supports use cases such as tracking and
monitoring jobs, debugging failures, understanding complex workflows,
and tracing transformation rules. It presents the flow of data in user-
friendly diagrams.
Data Sharing: Unity Catalog gives businesses more control over how, why,
and what data is being shared with whom.
Orchestration
For process orchestration, several tools are available, including Azure Data Factory and Databricks Workflows. In Azure Data Factory, you can run Databricks notebooks as activities in a pipeline; to test a pipeline, you can click the “Debug” button and check the execution progress.
Databricks Workflows
Databricks Workspace provides “Workflows” functionality supporting job
orchestration and scheduling. If you prefer to work exclusively within the
Databricks environment or with cloud providers such as AWS or GCP, this
option is suitable. Workflows help create tasks and orchestrate steps in data
processing processes. For detailed configuration instructions, refer to the
documentation here.
Summary
In this post, I have compiled the most important information required to
start working with Databricks. While I have covered key aspects such as
environment setup, data processing, orchestration, and more, it’s important
to note that Databricks supports various powerful features beyond the scope
of this post. These include Delta Live Tables, MLflow, streaming processing,
maintenance, tuning, and monitoring.
If you found this article insightful, please click the ‘clap’ button and follow me on
Medium and LinkedIn. For any questions or advice, feel free to reach out to me on
LinkedIn.