
Introduction to Databricks: A Beginner's Guide

Mariusz Kujawski · 19 min read · Feb 26, 2024


In this guide, I’ll walk you through everything you need to know to get
started with Databricks, a powerful platform for data engineering, data
science, and machine learning. From setting up your environment to
understanding key features like data processing and orchestration, this
guide has got you covered. Let’s jump right in!

Databricks Environment
In a Databricks environment, we have four major components:

Workspace: A Databricks deployment in the cloud that functions as an environment for your Databricks assets. Here, you can create notebooks, jobs, and clusters, and manage access. You can create multiple workspaces for your teams.

Notebooks: Web-based interfaces where you can write code in Python, Scala, or R, execute it, check its results, and visualize it.

Cluster: Your compute resources and configuration on which you can run your notebooks and jobs. You can create different types of clusters depending on your needs, selecting a version of the Databricks runtime, memory and CPU for the driver and nodes, etc. Here you can also monitor your code and job executions and install the libraries required by your code.

Metastore: A place where you can find metadata information about the tables that were created. You can find all databases and tables here. Every Databricks deployment has a central Hive metastore accessible by all clusters for table metadata.

Apache Spark
Apache Spark is an open-source, unified computing engine and a set of
libraries for parallel data processing on computer clusters. Spark supports
multiple programming languages (Python, Java, Scala, and R) and includes
libraries for diverse tasks ranging from SQL to streaming and machine
learning. You can run Spark on your local machine or scale up to thousands
of computers, making it suitable for big data processing.

Below is an example of Spark application architecture. It's crucial to notice that operations on data are split across worker nodes and executors to speed up data processing. In a Spark cluster, each node is a separate machine. For test purposes or small datasets, we can use the single-node option with one machine.

Spark Application Architecture

DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Lazy Evaluation

Spark’s lazy evaluation comes into play at this point. The logical execution
plan is not immediately executed, and Spark defers the computation until an
action is called.

Transformations

Transformations are the instructions you use to modify the DataFrame in the
way you want and are lazily executed. There are two types of
transformations: narrow and wide.

Narrow Transformations: These are transformations for which each input partition will contribute to only one output partition. Examples include map, filter, and select.

Wide Transformations: These transformations will have input partitions contributing to many output partitions. Spark will perform a data shuffle operation, redistributing partitions across the cluster. Examples include groupBy and join.

Actions

Actions are operations that trigger the data processing and return results or
write data to storage. Examples of actions include count, collect, write, show,
etc. When you call an action, Spark evaluates the entire logical execution
plan built through transformations and optimizes the execution plan before
executing it.
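To make the distinction concrete, here is a minimal sketch (assuming a hypothetical DataFrame df with branch and amount columns): the where and groupBy calls only build the logical plan, and nothing runs until an action such as count or show is called.

from pyspark.sql.functions import col, sum as sum_

# Narrow transformation: filtering does not move data between partitions
filtered = df.where(col("branch") == "WA")

# Wide transformation: grouping triggers a shuffle when the plan is executed
aggregated = filtered.groupBy("branch").agg(sum_("amount").alias("total_amount"))

# Nothing has run yet; these actions trigger planning, optimization, and execution
print(aggregated.count())
aggregated.show()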

Databricks Clusters
To create a cluster in the Databricks workspace, we need to go to the Compute tab. We can select a few types of clusters with different configurations and access modes. Each configuration has its benefits and limitations; I'll discuss them briefly below.

Cluster Types

All-Purpose Cluster: These clusters are primarily used for interactive data analysis using Databricks notebooks. Multiple users can share these clusters for collaborative interactive analysis. Clusters without any activity are terminated after the time specified in the "Terminate after" field of the cluster configuration.

Job Cluster: Job Clusters in Databricks are transient clusters specifically created for running jobs. Running non-interactive workloads on job clusters is preferred for cost efficiency, as the compute is only used for the duration of the job, after which it is terminated.
SQL Warehouse: This type of cluster is for executing SQL queries against
tables. Access is provided through a SQL editor, and there are predefined
clusters available for configuration.

Access Mode

Cluster access mode is a security feature that determines who can use a cluster and what data they can access through it. When creating any cluster in Azure Databricks, you must select an access mode. Considerations include how you want to use the cluster, the supported languages, and whether you need mounts or Unity Catalog integration.

Single-User Mode: Supports Unity Catalog and is assigned to and used by only a single user. Supports Scala, Python, SQL, and R, as well as the RDD API, DBFS mounts, init scripts, and external libraries. Databricks Runtime ML is also supported.

Shared Mode: Can be used by multiple users with data isolation among them, requiring a premium plan. Supports Unity Catalog and SQL, Python, and Scala, but does not support the RDD API or DBFS mounts. When used with credential passthrough, Unity Catalog features are disabled. Databricks Runtime ML is not supported.

No Isolation Shared Mode: Can be used by multiple users with no data isolation among them. Does not support Unity Catalog but can be used with SQL, Scala, Python, and R. Supports the RDD API, DBFS mounts, init scripts, libraries, and Databricks Runtime ML.

Basics of Data Processing


Databricks allows us to use Scala, Python, and Spark SQL. In this post, I’ll
focus on Python and Spark SQL. PySpark is the Python API for Apache
Spark, enabling real-time and large-scale data processing in a distributed
environment. Spark SQL is Apache Spark’s module for working with
structured data, seamlessly integrating SQL queries with Spark programs.
With PySpark DataFrames, you can efficiently read, write, transform, and
analyze data using Python and SQL. Whether you choose Python or SQL, the
same underlying execution engine is used, ensuring that you can always
leverage the full power of Spark.

Basic DataFrame Using Python

In Databricks, we can access the Spark session using the spark object.

data = [[1, "VW"],


[2, "BMW"]]

columns = ["ID", "Car"]

dataframe = spark.createDataFrame(data, columns)

# show data frame


display(dataframe)

Reading and Writing from and to Azure Data Lake Gen2


To start working with Databricks, we need to configure external storage. In my case, I'll be using an Azure storage account. This storage will be used for reading and writing data. While we can create tables in Databricks, it's important to note that we still operate on files, making storage a crucial aspect of our setup. Databricks supports several options for integrating with storage, including mounting account storage in the Databricks file system (DBFS) or using a URL. To connect to account storage, we need to create a service principal and assign it the "Storage Blob Data Contributor" role. Alternatively, we can use the access key. It's considered a best practice to store sensitive information in a secret scope.
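For example, a secret stored in a secret scope can be read at runtime with dbutils.secrets; the scope and key names below are placeholders.

# Read the service principal secret from a Databricks secret scope
# (scope and key names are placeholders - adjust them to your setup)
clientSecret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")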

Notice: This method should be used for training purposes or specific use
cases. If you use Unity Catalog (description provided in the next paragraph),
it’s not a recommended way of accessing storage.

Mounting Storage:

storageAccount="account"
tenantID = ""
mountpoint = "/mnt/bronze"
storageEndPoint =f"abfss://bronze@{storageAccount}.dfs.core.windows.net/"

#ClientId, TenantId and Secret is for the Application(ADLSGen2App) was have crea
clientID ="xxx-xx-xxx-xxx"
tenantID ="xx-xx-xx-xxx"
clientSecret ="xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

configs = {"fs.azure.account.auth.type": "OAuth",


"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebf
"fs.azure.account.oauth2.client.id": clientID,
"fs.azure.account.oauth2.client.secret": clientSecret,
"fs.azure.account.oauth2.client.endpoint": oauth2Endpoint}

dbutils.fs.mount(
source = storageEndPoint,
mount_point = mountpoint,
extra_configs = configs)

To test the new mount, we can list objects from a directory:

display(dbutils.fs.ls("dbfs:/mnt/bronze/Customer/"))

Accessing files using a URI

clientID ="xxx-xx-xxx-xxx"
tenantID ="xx-xx-xx-xxx"
clientSecret ="xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

spark.conf.set("fs.azure.account.auth.type.cookbookadlsgen2storage.dfs.core.wind
spark.conf.set("fs.azure.account.oauth.provider.type.cookbookadlsgen2storage.dfs
spark.conf.set("fs.azure.account.oauth2.client.id.cookbookadlsgen2storage.dfs.co
spark.conf.set("fs.azure.account.oauth2.client.secret.cookbookadlsgen2storage.df
spark.conf.set("fs.azure.account.oauth2.client.endpoint.cookbookadlsgen2storage.

df_direct = (
spark.read
.format("csv")
.option("header",True)
.load("abfss://[email protected]/Customer")
)

Reading and Saving Data from Various Sources


Spark enables us to effortlessly read and save data from various sources,
including CSV files, Parquet, Avro, JSON, and more.

The read method in Spark allows us to import data from other databases
such as Oracle, PostgreSQL, and more.

CSV

df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .csv(path_to_file)
)

SQL

SELECT * FROM read_files(
  '/mnt/bronze/orders/orders.csv',
  format => 'csv',
  header => true,
  mode => 'FAILFAST')

-- OR

SELECT *
FROM
csv.`/mnt/path/to/file`

While Spark can automatically detect schema for us, there is also an option
to declare it manually.

from pyspark.sql.types import *

cust_schema = StructType([
    StructField("id", IntegerType()),
    StructField("car", StringType()),
])

df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .schema(cust_schema)
    .csv(path_to_file)
)

Oracle

query = "select 1 a from dueal"

df = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:user/pass@//address:port/instance") \
.option("query", query) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()

Saving data

We can save files directly to storage using the following code:

(
    df.write
    .format("csv")
    .option("header", "true")
    .save(path_to_save)
)

Databricks tables

When we have data loaded in our DataFrame, we can transform it using transformations, save the DataFrame to storage, or create a table in a schema (database). To save data in a table, we need to create a database, which we can do using Spark SQL:
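A minimal sketch, assuming the test_db database name used in the examples that follow:

# Create the target database (schema) if it does not exist yet
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")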
Then, we can create and populate a table using PySpark.

(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(table_name, path=destination)
)

As a result of the execution, we should see a table in the workspace in the Catalog tab.

Spark SQL provides the possibility to create a table using SQL DDL. Creating
tables using SQL could be simple for people who have just switched from
SQL Server, PostgreSQL, or Oracle. As mentioned earlier, tables in
Databricks are only metadata descriptions of files stored in the table
location. Optionally, we can specify the file format and its location; if we
don’t, Databricks will use the default location and format. Using the default
setting will result in the creation of a Spark-managed table.

-- Creates a Delta table
CREATE TABLE test_db.cars(
  id int,
  model STRING,
  year int
)

-- Use data from another table
CREATE TABLE test_db.cars
AS
SELECT * FROM load_cars

-- Use data from another table using its path
CREATE TABLE IF NOT EXISTS test_db.cars
AS SELECT *
FROM delta.`raw/cars`;

If we specify a location, it will result in the creation of a Spark unmanaged table (external table).

CREATE TABLE test_db.cars(
  id int,
  model STRING,
  year int
)
USING DELTA
LOCATION '/mnt/silver/cars';

The main difference between them is that Databricks only manages the
metadata; when you drop a table, you do not remove the underlying data.
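As a quick illustration (a sketch based on the external table above): dropping the table removes only the metastore entry, while the Delta files at the specified location remain and can still be read by path.

# Drop the external table: only the metadata entry is removed
spark.sql("DROP TABLE test_db.cars")

# The underlying Delta files are still in place and can be read directly
df = spark.read.format("delta").load("/mnt/silver/cars")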

Read Databricks tables

With Spark and SQL, we can execute queries against tables existing in the Databricks metastore to retrieve data from them.

df = spark.sql("""
select
*
from
test_db.cars""")

df = spark.read.table("test_db.cars")

Databricks notebooks allow us to query tables using Spark SQL directly. To use SQL, we need to switch the cell language to SQL or use the magic command %sql. The result will be displayed under the cell with the SQL command.

%sql

select
*
from
test_db.cars

Delta Tables

Delta Lake is the default storage format for all operations on Databricks.
Unless otherwise specified, all tables on Databricks are Delta tables. Delta
Lake serves as the optimized storage layer that forms the foundation for
storing data and tables in the Databricks lakehouse. It is an open-source
technology that extends Parquet data files with a file-based transaction log
for ACID transactions and scalable metadata handling. Delta Lake
seamlessly integrates with Apache Spark, offering a range of important
features:

Additional Operations: Delta Lake provides additional operations such as inserts, updates, and deletes. You can update data in a target using the SQL MERGE operation, improving load strategies for a lakehouse and supporting patterns like Slowly Changing Dimensions (SCD).

ACID Transactions: Delta Lake ensures ACID (Atomicity, Consistency, Isolation, Durability) transactions, guaranteeing data integrity and reliability.

Time Travel: Delta Lake offers time travel capabilities through its transaction log, enabling users to access and revert to earlier versions of data, facilitating data auditing and versioning.

Schema Evolution: Delta Lake supports schema evolution, allowing changes to a table schema to be applied automatically without the need to recreate files, ensuring flexibility and ease of maintenance.

Optimization: Delta Lake provides optimization features such as OPTIMIZE and ZORDER BY to reduce the number of files and physically organize data within files, improving query performance and efficiency.

Caching in Databricks: Delta Lake allows caching files on worker nodes and caching query results in memory, reducing the need to read from files and enhancing query performance.

After saving a table, you should observe the same structure reflected in
storage.
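Below is a short sketch of two of these features, assuming the test_db.cars table from earlier and a hypothetical cars_updates view holding new and changed rows:

# Upsert new and changed rows into the Delta table with SQL MERGE
spark.sql("""
    MERGE INTO test_db.cars AS target
    USING cars_updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: query the table as it looked at an earlier version
display(spark.sql("SELECT * FROM test_db.cars VERSION AS OF 0"))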

DataFrame Transformations
Transformation is one of the most important aspects of data processing.
Spark, in this context, is very powerful with its capabilities to integrate data
from different sources and the flexibility to use Python, Scala, and SQL. In
this paragraph, I’ll demonstrate how we can transform a DataFrame in the
most common scenarios.

To start working with a DataFrame, we need to read data from a file, table, or
we can use a list, as demonstrated in the example below to create a
DataFrame.

data = [
[1,"Alicia Kaiser","WA","001-677-774-4370","13714","Stacy Summit","54799","Lake
[2,"Donna Ellis","BR","001-631-995-8008","43599","Adam Trail","07204","Port Apri
[3,"Kenneth Smith","WA","001-592-849-6009x4173","649","Sherri Grove","14527","No
[4,"Danny Clark","WA","574.419.0221x348","285","Timothy Drive","41106","West Eri
[5,"Nicholas Thompson","CA","(259)268-8760x061","998","Russell Shoals","65647","
[6,"Frances Griffith","WA","7535316823","9559","Emily Branch","71422","Mcdanielh
[7,"Trevor Harrington","CA","742.224.9375","5960","Lisa Port","73881","Loganbury
[8,"Seth Mitchell","AA","(386)517-7589x04440","47352","Stafford Loop","01347","S
[9,"Patrick Caldwell","BR","001-307-225-9094","0170","Amanda Dam","24885","Port
[10,"Laura Hopkins","CA","9095819755","143","Lee Brook","23623","Jarvisland","Ha
]
schama = "client_number int,name string,branch string,phone_number string,buldin
df = spark.createDataFrame(data, schama)

The display command will present the content of the DataFrame.

Selecting columns

display(df.select("client_number","name", "amout"))
Spark allows us to use SQL to transform data. To utilize a DataFrame created
or imported within a SQL context, we need to create a temporary view:

df.createTempView("client")

There are various options to access columns in a DataFrame. Below, you can
find a few examples:

from pyspark.sql.functions import expr, col, column

display(df.select(
    expr("client_number as client_num"),
    col("client_number").alias("client_num"),
    column("client_number"))
    .limit(2))

During data transformation, common scenarios include changing data types, renaming columns, adding new columns, and deriving new columns based on values from others. Let's explore how we can achieve these tasks.

from pyspark.sql.functions import expr, col, column, lit, exp, current_date

age_exp = "extract(year from current_date) - extract(year from birth_date)"

df1 = df.withColumn("amout", col("amout").cast("decimal(10,2)")) \
    .withColumn("birth_date", col("birth_date").cast("date")) \
    .withColumn("age", expr(age_exp)) \
    .withColumn("load_date", lit(current_date()))

df1 = df1.drop("credit_card_number", "birth_date")

display(df1)

In this example, we utilize the withColumn function to calculate age, add a load date column, and change the amount to a decimal data type. Additionally, we can drop columns with unnecessary data. The same result can be achieved using SQL.
df.createOrReplaceTempView("client")

df1 = spark.sql("""
select
client_number,
name,
bulding_number,
street_name,
postcode,
city,
state,
cast(amout as decimal(10,2)) amount,
extract( year from current_date) - extract( year from birth_date
current_date() load_date
from
client
""")
display(df1)

Filtering Rows in a DataFrame

Filtering rows in a DataFrame involves creating a condition that separates the data you want to keep from the data you don't. This condition can be written as a simple expression or built from multiple comparisons. DataFrames offer two methods, where and filter, to achieve this filtering based on your chosen condition.

display(df1.where("age >= 85"))


display(df1.where(col("age") >= 85))

It's possible to build more complex expressions in DataFrame filtering using AND or OR conditions, allowing for greater flexibility in specifying conditions for row selection.

display(df1.where("age >= 85 or age <=30"))


display(df1.filter( (col("age") >= 85) | (col("age") <= 30)))

display(df1.where("age >=30 and branch != 'WA' "))


display(df1.filter( (col("age") >= 30) & (col("branch") != 'WA')))

Grouping and Sorting

Group by is a transformation operation in PySpark used to group the data in a Spark DataFrame based on specified columns. This operation is often followed by aggregating functions such as count(), sum(), avg(), etc., allowing for the summarization of grouped data.

Order by, on the other hand, sorts records in a DataFrame based on specified
sort conditions, allowing for the arrangement of data in ascending or
descending order.

from pyspark.sql.functions import desc, count, sum

# count by branch
display(df1.groupBy("branch").count().orderBy(desc("count")))

# sum, count by branch, ordered by client_number count
display(df1.groupBy("branch").agg(
    sum("amout").alias("amout"), count("client_number").alias("qt")
).orderBy(desc("qt")))
This SQL version mirrors the explanation provided for PySpark:

df.createOrReplaceTempView("client")
df1 = spark.sql("""
select
branch,
sum(amount) amount,
count(client_number) client_number

from
client
group by branch
""")

df1.show()

+------+------+-------------+
|branch| amout|client_number|
+------+------+-------------+
| WA|8000.0| 4|
| BR|3000.0| 2|
| CA|5100.0| 3|
| AA|3000.0| 1|
+------+------+-------------+

Alternatively, in PySpark, you can use the show command to display the
results of these operations directly in the notebook or console.

Joining DataFrames
Joining operations are crucial for various data processing tasks such as data
normalization, data modeling, and ensuring data quality. Spark supports
joins using DataFrame join and SQL joins.

To demonstrate the join operation, we need an additional DataFrame. I'll create it similarly so you can easily replicate my steps, or you can load data from a file or table.

import uuid

trans = [
    [str(uuid.uuid4()), 1, 100.00, '2024-02-01'],
    [str(uuid.uuid4()), 1, 200.00, '2024-02-03'],
    [str(uuid.uuid4()), 1, 130.00, '2024-02-04'],
    [str(uuid.uuid4()), 2, 110.00, '2024-02-05'],
    [str(uuid.uuid4()), 3, 200.00, '2024-02-01'],
    [str(uuid.uuid4()), 2, 300.00, '2024-02-02'],
    [str(uuid.uuid4()), 2, 50.00, '2024-02-03'],
]

schema = "id string, client_id int, value double, tran_date string"

df_tran = spark.createDataFrame(trans, schema)
display(df_tran)

The code below illustrates how to join two DataFrames. Spark supports
various join types, including:

Inner Join

Left / Left Outer Join

Right / Right Outer Join

Outer / Full Join

Cross Join

Left Anti Join

Left Semi Join

# column name index style
display(df.join(df_tran, df['client_number'] == df_tran['client_id'], 'inner'))

# column name property style
display(df.join(df_tran, df.client_number == df_tran.client_id, 'inner'))

These joins function similarly to their counterparts in SQL. Additionally, Spark provides a few types of joins that you may not have seen in SQL:

Left Anti Join: Retrieves records from the left DataFrame that do not exist in the right DataFrame.

display(df.join(df_tran, df['client_number'] == df_tran['client_id'], 'left_anti'))

Left Semi Join: Retrieves records and columns from the left DataFrame where records match records in the right DataFrame.

display(df.join(df_tran, df['client_number'] == df_tran['client_id'], 'leftsemi'))

Naturally, Spark SQL allows us to use SQL-like joins in our code. We can join
existing tables or create views and join these views using SQL syntax.

df.createOrReplaceTempView("clients")
df_tran.createOrReplaceTempView("tran")

display(spark.sql("""
select *
from
clients a
inner join
tran b on a.client_number = b.client_id
"""))

Union DataFrames
Spark facilitates union operations on DataFrames in several ways. We can
union DataFrames using column names, as shown in example 1.
Alternatively, we can union DataFrames without checking column names, as
demonstrated in example 2. Additionally, Spark allows for merging two
DataFrames while allowing missing columns.

df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])

df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])

# example 1
display(df1.unionByName(df2))

# example 2
display(df1.union(df2))

df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])

# example 3
display(df1.unionByName(df2, allowMissingColumns=True))

When function
A useful function in PySpark is when , which is employed in cases where we
need to translate or map a value to another value based on specified
conditions.

from pyspark.sql.functions import expr, col, column, lit, exp, current_date, when

data = [("Robert", "Smith", "M", 40000), ("Ana", "Novak", "M", 60000),
        ("Carl", "Xyz", None, 500000), ("Maria", "Free", "F", 500000),
        ("Johm", "Az", "", None), ("Steve", "Smith", "", 1000)]

columns = ["name", "surname", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

df2 = df.withColumn("new_gender", when(df.gender == "M", "Male")
                    .when(df.gender == "F", "Female")
                    .when(df.gender.isNull(), "")
                    .otherwise(df.gender))

Databricks Auto Loader

Auto Loader is a feature that enables the incremental and efficient processing of new files as they arrive in cloud storage. It can load data files from Azure storage accounts, AWS S3, and Google Cloud Storage (GCS).

Delta Live Tables are a great extension of Auto Loader for implementing ETL (Extract, Transform, Load) processes. Auto Loader automatically stores information about processed files, eliminating the need for additional maintenance steps. In case of failure, it will resume processing from the last successful step.
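A minimal sketch of an Auto Loader stream (the paths, table name, and checkpoint location below are placeholders):

# Incrementally ingest new CSV files as they land in the bronze container
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/orders")
    .load("/mnt/bronze/orders/")
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")
    .trigger(availableNow=True)  # process the files available now, then stop
    .toTable("test_db.orders_bronze")
)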
For instructions on how to configure Auto Loader, you can refer to another post:

Azure Databricks Auto Loader: Databricks Auto Loader is a feature that allows us to quickly ingest data from Azure Storage Account, AWS S3, or GCP… (medium.com)

Python Functions and Modules

In Databricks, we can organize our code using either notebooks or Python modules. When using notebooks, the process is straightforward. We create a notebook, for example named "Utils", where we define all the functions that we commonly use in our solution. Then, we can call the %run command to include the defined functions in other notebooks.

%run fun

print(add(1,2))

Alternatively, we can use a module by creating a folder named "utils" with the following structure:

.
├── utils
│ ├── __init__.py
│ └── fun.py
└── test_notebook

# fun.py
def add(a,b):
return a+b

We can import this module using well-known Python syntax:

from utils.fun import add

print(add(1,2))

You can install custom .whl files onto a cluster and then import them into a
notebook. For code that is frequently updated, this process might be
inconvenient and error-prone.
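As a lighter-weight alternative, a wheel can also be installed at notebook scope with the %pip magic command; the path below is just a placeholder.

%pip install /dbfs/FileStore/libs/my_utils-0.1.0-py3-none-any.whl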

User-defined functions
If we can’t find a function for our use case, it’s possible to create our custom
function. However, to use it with DataFrames or SQL, we need to register it.

from pyspark.sql.types import LongType

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared, LongType())
df1.createOrReplaceTempView("df1")

SQL

%sql
select col1, squaredWithPython(col1) as col1_squared from df1

PySpark

from pyspark.sql.functions import udf

@udf("long")
def squared_udf(s):
return s * s

df1.createOrReplaceTempView("df1")
display(df1.select("col1", squared_udf("col1").alias("col1_squared")))

Docstring and Typing

Using docstrings and typing in Python is crucial for well-documented code. Docstrings and typing provide the ability to document our classes, functions, etc., improving the readability and usability of the code. Information provided in docstrings can be utilized by code intelligence tools, the help() function, or accessed via the __doc__ attribute.

def example(param0: type, param1: type) -> type:
    """
    Does what

    Args:
        param0:
        param1:

    Returns:
        Describe the return value
    """

For example, let’s create a function that returns a date based on the number
of days from 1900–01–01. To ensure clarity and correctness, we can write the
function with a docstring and typing as shown above.

import datetime

def to_date(number_of_days: int) -> str:
    """
    Converts days from 1900-01-01 to a date.

    Args:
        number_of_days: number of days

    Returns:
        date: date in string format.
    """
    date_ = datetime.date(1900, 1, 1)
    end_date = date_ + datetime.timedelta(days=number_of_days)
    return end_date.strftime("%Y-%m-%d")

To access the documentation of this function, use the help() command or IntelliSense.
Unity Catalog
Databricks Unity Catalog is a unified governance solution that centralizes
access control, auditing, lineage, and data discovery capabilities across
Databricks workspaces. Key features of Unity Catalog include security, data
discovery, lineage, and data sharing.

Security: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.

Data Discovery: Unity Catalog lets you tag and document data assets and
provides a search interface to help data consumers find data.

Data Lineage: Data lineage supports use cases such as tracking and monitoring jobs, debugging failures, understanding complex workflows, and tracing transformation rules. It presents the flow of data in user-friendly diagrams.

Data Sharing: Unity Catalog gives businesses more control over how, why, and what data is being shared with whom.

The Unity Catalog Object Model

The Unity Catalog object model uses a three-level namespace to address various types of data assets in the catalog. The top-level metastore acts as a global container for assets such as catalogs, followed by schemas, which hold all your data entities like tables, views, models, and functions.
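A short sketch of the three-level namespace in practice (the catalog, schema, and table names below are placeholders):

# Fully qualified name: <catalog>.<schema>.<table>
spark.sql("CREATE CATALOG IF NOT EXISTS dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.sales")
df = spark.read.table("dev.sales.cars")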

Interesting Features in Preview

In the current Public Preview, several noteworthy features are worth mentioning:

Volumes: Provide access to non-tabular data stored in cloud object storage.

Lakehouse Federation: Provides direct access to relational database engines such as MySQL, PostgreSQL, Amazon Redshift, Snowflake, Microsoft SQL Server, Azure Synapse (SQL Data Warehouse), and Google BigQuery via Databricks.

Declaration of primary key and foreign key relationships.

Orchestration

For process orchestration, several tools are available, including Azure Data Factory, Synapse Pipelines, Databricks Workflows, and Airflow. If you are working with Azure services, it is recommended to use Azure Data Factory or Synapse Pipelines.

Azure Synapse and Data Factory

Azure Data Factory or Synapse pipelines allow you to incorporate Notebook activities, configure parameters, and establish dependencies between other activities. Below, you can see an example of how to configure a pipeline. It requires a linked service configuration for the Databricks cluster. While I won't delve into detailed explanations here, you can find more information in the documentation.
In the settings tab, you need to provide a path to the notebook and
parameters as shown in the screenshot below.

To test it, you can click the “Debug” button and check the execution
progress.

Databricks Workflows
The Databricks workspace provides "Workflows" functionality supporting job orchestration and scheduling. If you prefer to work exclusively within the Databricks environment or with cloud providers such as AWS or GCP, this option is suitable. Workflows help create tasks and orchestrate steps in data processing pipelines. For detailed configuration instructions, refer to the documentation.

Summary
In this post, I have compiled the most important information required to start working with Databricks. While I have covered key aspects such as environment setup, data processing, orchestration, and more, it's important to note that Databricks supports various powerful features beyond the scope of this post. These include Delta Live Tables, MLflow, streaming processing, maintenance, tuning, and monitoring.

I strongly recommend familiarizing yourself with these topics, or feel free to let me know in the comments which ones you would like me to cover in my next post. Your feedback is valuable in shaping future content.

If you found this article insightful, please click the ‘clap’ button and follow me on
Medium and LinkedIn. For any questions or advice, feel free to reach out to me on
LinkedIn.
