Introduction to Databricks: A Beginner’s Guide
Mariusz Kujawski · 19 min read · Feb 26, 2024
In this guide, I’ll walk you through everything you need to know to get
started with Databricks, a powerful platform for data engineering, data
science, and machine learning. From setting up your environment to
understanding key features like data processing and orchestration, this
guide has got you covered. Let’s jump right in!
Databricks Environment
In a Databricks environment, we have four major components. One of the most important is the cluster: depending on your needs, you select a Databricks runtime version and the memory and CPU for the driver and worker nodes. This is also where you monitor your code and job executions and install the libraries required by your code.
Apache Spark
Apache Spark is an open-source, unified computing engine and a set of
libraries for parallel data processing on computer clusters. Spark supports
multiple programming languages (Python, Java, Scala, and R) and includes
libraries for diverse tasks ranging from SQL to streaming and machine
learning. You can run Spark on your local machine or scale up to thousands
of computers, making it suitable for big data processing.
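As a quick illustration (not from the original article, and the application name is arbitrary), this is roughly what starting Spark on a local machine looks like:

# A minimal local SparkSession for experimenting outside Databricks
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-demo")     # arbitrary application name
    .master("local[*]")        # use all local CPU cores
    .getOrCreate()
)
print(spark.range(5).count())  # prints 5

On Databricks you do not need to build this yourself; a SparkSession is pre-created and exposed as the spark object in every notebook, as described later in this guide.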
DataFrame
A DataFrame is a distributed collection of data organized into named columns, conceptually similar to a table in a relational database; it is the main structure you work with when processing data in Spark.
Lazy Evaluation
Spark’s lazy evaluation comes into play at this point. The logical execution
plan is not immediately executed, and Spark defers the computation until an
action is called.
Transformations
Transformations are the instructions you use to modify the DataFrame in the
way you want and are lazily executed. There are two types of
transformations: narrow and wide.
Actions
Actions are operations that trigger the data processing and return results or
write data to storage. Examples of actions include count, collect, write, show,
etc. When you call an action, Spark evaluates the entire logical execution
plan built through transformations and optimizes the execution plan before
executing it.
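To make this concrete, here is a small sketch (the column and bucket names are illustrative) showing that narrow and wide transformations only build up a plan, which runs when an action is called:

from pyspark.sql.functions import col

df = spark.range(1_000_000).withColumnRenamed("id", "value")

# Narrow transformation: each input partition maps to one output partition
filtered = df.filter(col("value") % 2 == 0)

# Wide transformation: requires a shuffle of data across partitions
grouped = filtered.groupBy((col("value") % 10).alias("bucket")).count()

# Nothing has executed yet; the action below triggers the optimized plan
grouped.show()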
Databricks Clusters
To create a cluster in a Databricks workspace, we need to go to the Compute tab. We can select a few types of clusters with different configurations and access modes. Each configuration has its benefits and limitations; I’ll discuss them briefly below.
Cluster Types
All Purpose Cluster
All Purpose Cluster: These clusters are primarily used for interactive data analysis in Databricks notebooks. Multiple users can share these clusters for collaborative analysis. Clusters without any activity are terminated after the time specified in the auto-termination field of the cluster configuration.
Access Mode
Cluster access mode is a security feature that determines who can use a
cluster and what data they can access via the cluster. When creating any
cluster in Azure Databricks, you must select an access mode. Considerations
include how you want to use a cluster, supported languages, whether you
need mounts, or Unity Catalog integration, etc.
Shared Mode: Can be used by multiple users with data isolation among
them, requiring a premium plan. Supports Unity Catalog and SQL,
Python, and Scala, but does not support RDD API and DBFS mounts.
When used with credential pass-through, Unity Catalog features are
disabled. Databricks Runtime ML is not supported.
In Databricks, we can access the Spark session using the spark object.
Notice: This method should be used for training purposes or specific use
cases. If you use Unity Catalog (description provided in the next paragraph),
it’s not a recommended way of accessing storage.
Mounting Storage:
storageAccount = "account"
mountpoint = "/mnt/bronze"
storageEndPoint = f"abfss://bronze@{storageAccount}.dfs.core.windows.net/"

# ClientId, TenantId, and Secret are for the application (ADLSGen2App) we have created
clientID = "xxx-xx-xxx-xxx"
tenantID = "xx-xx-xx-xxx"
clientSecret = "xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

# OAuth configuration for the service principal (standard ADLS Gen2 settings)
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": clientID,
    "fs.azure.account.oauth2.client.secret": clientSecret,
    "fs.azure.account.oauth2.client.endpoint": oauth2Endpoint,
}

dbutils.fs.mount(
    source = storageEndPoint,
    mount_point = mountpoint,
    extra_configs = configs)

display(dbutils.fs.ls("dbfs:/mnt/bronze/Customer/"))
Alternatively, we can access the storage account directly by setting the OAuth configuration on the Spark session:

clientID = "xxx-xx-xxx-xxx"
tenantID = "xx-xx-xx-xxx"
clientSecret = "xxx-xx-xxxxx"
oauth2Endpoint = f"https://login.microsoftonline.com/{tenantID}/oauth2/token"

spark.conf.set("fs.azure.account.auth.type.cookbookadlsgen2storage.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.cookbookadlsgen2storage.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.cookbookadlsgen2storage.dfs.core.windows.net", clientID)
spark.conf.set("fs.azure.account.oauth2.client.secret.cookbookadlsgen2storage.dfs.core.windows.net", clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.cookbookadlsgen2storage.dfs.core.windows.net", oauth2Endpoint)

df_direct = (
    spark.read
    .format("csv")
    .option("header", True)
    .load("abfss://bronze@cookbookadlsgen2storage.dfs.core.windows.net/Customer")
)
The read method in Spark allows us to import data from other databases
such as Oracle, PostgreSQL, and more.
CSV
df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .csv(path_to_file)
)
SQL
SELECT *
FROM csv.`/mnt/path/to/file`
While Spark can automatically detect schema for us, there is also an option
to declare it manually.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

cust_schema = StructType([
    StructField("id", IntegerType()),
    StructField("car", StringType()),
])

df = (
    spark.read
    .option("delimiter", ",")
    .option("header", True)
    .schema(cust_schema)
    .csv(path_to_file)
)
Oracle
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:user/pass@//address:port/instance") \
.option("query", query) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
Saving data
(
    df.write
    .format("csv")
    .option("header", "true")
    .save(path_to_save)
)
Databricks tables
When we have data loaded into a DataFrame, we can transform it, save the DataFrame to storage, or create a table in a schema (database). To save data in a table, we first need to create a database, which we can do using Spark SQL:
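A minimal statement for this (the database name test_db is the one used in the query examples later in this guide):

# create the database (schema) that will hold our tables
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")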
Then, we can create and populate a table using PySpark.
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(table_name, path=destination)
)
Spark SQL provides the possibility to create a table using SQL DDL. Creating
tables using SQL could be simple for people who have just switched from
SQL Server, PostgreSQL, or Oracle. As mentioned earlier, tables in
Databricks are only metadata descriptions of files stored in the table
location. Optionally, we can specify the file format and its location; if we
don’t, Databricks will use the default location and format. Using the default
setting will result in the creation of a Spark-managed table.
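As an illustration (table names and the storage path are hypothetical), an external table with an explicit location versus a managed table using the defaults could look like this:

# External table: we point the table at files in a location we manage ourselves
spark.sql("""
    CREATE TABLE IF NOT EXISTS test_db.cars_ext (
        id INT,
        car STRING
    )
    USING DELTA
    LOCATION 'abfss://bronze@account.dfs.core.windows.net/cars'
""")

# Managed table: Databricks chooses the location and the default (Delta) format
spark.sql("""
    CREATE TABLE IF NOT EXISTS test_db.cars_managed (
        id INT,
        car STRING
    )
""")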
The main difference between them is that for an external table, Databricks manages only the metadata; when you drop the table, the underlying data files are not removed.
With Spark and SQL, we can execute a query against a table registered in the Databricks metastore to retrieve data from it.
df = spark.sql("""
select
*
from
test_db.cars""")
df = spark.read.table("test_db.cars")
%sql
select
*
from
test_db.cars
Delta Tables
Delta Lake is the default storage format for all operations on Databricks.
Unless otherwise specified, all tables on Databricks are Delta tables. Delta
Lake serves as the optimized storage layer that forms the foundation for
storing data and tables in the Databricks lakehouse. It is an open-source
technology that extends Parquet data files with a file-based transaction log
for ACID transactions and scalable metadata handling. Delta Lake
seamlessly integrates with Apache Spark, offering a range of important
features:
ACID transactions
Scalable metadata handling
Time travel (data versioning) for audits and rollbacks
Schema enforcement and schema evolution
DML operations such as MERGE, UPDATE, and DELETE
Unified batch and streaming processing

After saving a table, you should observe the same structure reflected in storage.
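For example, a brief sketch (the table name is taken from the earlier query examples) of how the transaction log lets you inspect and query previous versions of a Delta table:

# Show the table's transaction history recorded in the Delta log
display(spark.sql("DESCRIBE HISTORY test_db.cars"))

# Time travel: query the table as of an earlier version
display(spark.sql("SELECT * FROM test_db.cars VERSION AS OF 0"))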
DataFrame Transformations
Transformation is one of the most important aspects of data processing.
Spark, in this context, is very powerful with its capabilities to integrate data
from different sources and the flexibility to use Python, Scala, and SQL. In
this section, I’ll demonstrate how we can transform a DataFrame in the
most common scenarios.
To start working with a DataFrame, we need to read data from a file, table, or
we can use a list, as demonstrated in the example below to create a
DataFrame.
data = [
[1,"Alicia Kaiser","WA","001-677-774-4370","13714","Stacy Summit","54799","Lake
[2,"Donna Ellis","BR","001-631-995-8008","43599","Adam Trail","07204","Port Apri
[3,"Kenneth Smith","WA","001-592-849-6009x4173","649","Sherri Grove","14527","No
[4,"Danny Clark","WA","574.419.0221x348","285","Timothy Drive","41106","West Eri
[5,"Nicholas Thompson","CA","(259)268-8760x061","998","Russell Shoals","65647","
[6,"Frances Griffith","WA","7535316823","9559","Emily Branch","71422","Mcdanielh
[7,"Trevor Harrington","CA","742.224.9375","5960","Lisa Port","73881","Loganbury
[8,"Seth Mitchell","AA","(386)517-7589x04440","47352","Stafford Loop","01347","S
[9,"Patrick Caldwell","BR","001-307-225-9094","0170","Amanda Dam","24885","Port
[10,"Laura Hopkins","CA","9095819755","143","Lee Brook","23623","Jarvisland","Ha
]
schama = "client_number int,name string,branch string,phone_number string,buldin
df = spark.createDataFrame(data, schama)
Selecting columns
display(df.select("client_number","name", "amout"))
Spark allows us to use SQL to transform data. To utilize a DataFrame created
or imported within a SQL context, we need to create a temporary view:
df.createTempView("client")
There are various options to access columns in a DataFrame. Below, you can
find a few examples:
age_exp = "extract( year from current_date) - extract( year from birth_date) "
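# The chain below is a sketch of how df1 can be built with withColumn, following
# the description and the SQL version shown afterwards (the column names amout
# and birth_date come from the schema defined earlier).
from pyspark.sql.functions import expr, col, current_date

df1 = (
    df.withColumn("age", expr(age_exp))
      .withColumn("load_date", current_date())
      .withColumn("amount", col("amout").cast("decimal(10,2)"))
      .drop("amout")
)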
display(df1)
In this example, we utilize the withColumn function to calculate age, add load
date columns, and change the amount to decimal data type. Additionally, we
can drop columns with unnecessary data. The same result can be achieved
using SQL.
df.createOrReplaceTempView("client")
df1 = spark.sql("""
select
client_number,
name,
bulding_number,
street_name,
postcode,
city,
state,
cast(amout as decimal(10,2)) amount,
extract(year from current_date) - extract(year from birth_date) as age,
current_date() load_date
from
client
""")
display(df1)
The groupBy function aggregates records by one or more columns, allowing aggregate functions such as count or sum to be applied to each group. Order by, on the other hand, sorts records in a DataFrame based on specified sort conditions, allowing for the arrangement of data in ascending or descending order.

from pyspark.sql.functions import desc

# count by branch
display(df1.groupBy("branch").count().orderBy(desc("count")))
df.createOrReplaceTempView("client")
df1 = spark.sql("""
select
branch,
sum(amout) amout,
count(client_number) client_number
from
client
group by branch
""")
df1.show()
+------+------+-------------+
|branch| amout|client_number|
+------+------+-------------+
| WA|8000.0| 4|
| BR|3000.0| 2|
| CA|5100.0| 3|
| AA|3000.0| 1|
+------+------+-------------+
Alternatively, in PySpark, you can use the show command to display the
results of these operations directly in the notebook or console.
Joining DataFrames
Joining operations are crucial for various data processing tasks such as data
normalization, data modeling, and ensuring data quality. Spark supports
joins using DataFrame join and SQL joins.
import uuid

trans = [
    [str(uuid.uuid4()), 1, 100.00, '2024-02-01'],
    [str(uuid.uuid4()), 1, 200.00, '2024-02-03'],
    [str(uuid.uuid4()), 1, 130.00, '2024-02-04'],
    [str(uuid.uuid4()), 2, 110.00, '2024-02-05'],
    [str(uuid.uuid4()), 3, 200.00, '2024-02-01'],
    [str(uuid.uuid4()), 2, 300.00, '2024-02-02'],
    [str(uuid.uuid4()), 2, 50.00, '2024-02-03'],
]

# column names follow the SQL join below (client_id); the schema string itself is assumed
df_tran = spark.createDataFrame(trans, "id string, client_id int, amount double, date string")
The code below illustrates how to join two DataFrames. Spark supports
various join types, including:
Inner Join
Cross Join
Left Anti Join: Retrieves records from the left DataFrame that do not exist in the right DataFrame.
Left Semi Join: Retrieves records and columns from the left DataFrame where records match records in the right DataFrame.
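As a sketch, using the clients and transactions DataFrames created above (the join keys are taken from the SQL example that follows), a DataFrame-API join could look like this:

# Inner join: keep only clients that have at least one matching transaction
df_joined = df.join(df_tran, df.client_number == df_tran.client_id, "inner")
display(df_joined)

# Left anti join: clients with no transactions at all
display(df.join(df_tran, df.client_number == df_tran.client_id, "left_anti"))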
Naturally, Spark SQL allows us to use SQL-like joins in our code. We can join
existing tables or create views and join these views using SQL syntax.
df.createOrReplaceTempView("clients")
df_tran.createOrReplaceTempView("tran")
display(spark.sql("""
select *
from
clients a
inner join
tran b on a.client_number = b.client_id
"""))
Union DataFrames
Spark facilitates union operations on DataFrames in several ways. We can
union DataFrames using column names, as shown in example 1.
Alternatively, we can union DataFrames without checking column names, as
demonstrated in example 2. Additionally, Spark allows for merging two
DataFrames while allowing missing columns.
# example 1
display(df1.unionByName(df2))
# example 2
display(df1.union(df2))
# example 3
display(df1.unionByName(df2, allowMissingColumns=True))
When function
A useful function in PySpark is when , which is employed in cases where we
need to translate or map a value to another value based on specified
conditions.
from pyspark.sql.functions import expr, col, column, lit, exp, current_date, when
columns = ["name","surname","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
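The mapping step itself is shown here as a sketch; the sample rows and the "M"/"F" gender codes are illustrative assumptions matching the columns above:

from pyspark.sql.functions import when, col

# illustrative sample rows (hypothetical data)
sample_df = spark.createDataFrame(
    [("James", "Smith", "M", 3000), ("Anna", "Rose", "F", 4100)],
    ["name", "surname", "gender", "salary"],
)

display(
    sample_df.withColumn(
        "gender_full",
        when(col("gender") == "M", "Male")
        .when(col("gender") == "F", "Female")
        .otherwise("Unknown"),
    )
)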
Delta Live Tables is a great extension to use together with Auto Loader when implementing ETL (Extract, Transform, Load) processes. Auto Loader automatically stores
information about processed files, eliminating the need for additional
maintenance steps. In case of failure, it will resume processing from the last
successful step.
For instructions on how to configure Auto Loader, you can refer to another
post.
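As a brief sketch (the paths, schema location, and target table name are hypothetical), a typical Auto Loader stream that incrementally picks up new CSV files could look like this:

# Auto Loader ingests new files incrementally using the cloudFiles source
df_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/customer")  # hypothetical path
    .load("/mnt/bronze/Customer/")
)

(
    df_stream.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/customer")  # hypothetical path
    .trigger(availableNow=True)       # process all available files, then stop
    .toTable("test_db.customer_raw")  # hypothetical target table
)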
Databricks notebooks also make it easy to reuse code. One option is to keep helper functions in a separate notebook and include it with the %run magic command:

%run fun
print(add(1,2))
Another option is to keep shared functions in a Python module inside a repo, for example with a structure like this:

.
├── utils
│   ├── __init__.py
│   └── fun.py
└── test_notebook
# fun.py
def add(a, b):
    return a + b

print(add(1, 2))
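A sketch of how test_notebook could then use this module (assuming the repo root is on sys.path, which notebooks in Databricks Repos handle automatically):

# In test_notebook: import the helper from the utils package
from utils.fun import add

print(add(1, 2))  # 3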
You can install custom .whl files onto a cluster and then import them into a
notebook. For code that is frequently updated, this process might be
inconvenient and error-prone.
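For completeness, a hedged example of the notebook-scoped alternative (the wheel path is hypothetical):

# Install a wheel for the current notebook session only
%pip install /dbfs/FileStore/libs/my_package-0.1.0-py3-none-any.whl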
User-defined functions
If we can’t find a function for our use case, it’s possible to create our custom
function. However, to use it with DataFrames or SQL, we need to register it.
from pyspark.sql.types import LongType

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared, LongType())

df1.createOrReplaceTempView("df1")
SQL
%sql
select id, squaredWithPython(col1) as id_squared from df1
PySpark
from pyspark.sql.functions import udf

@udf("long")
def squared_udf(s):
    return s * s

display(df1.select("col1", squared_udf("col1").alias("col1_squared")))
"""
Args:
    param0:
    param1:

Returns:
    Describe the return value
"""
For example, let’s create a function that returns a date based on the number
of days from 1900–01–01. To ensure clarity and correctness, we can write the
function with a docstring and typing as shown above.
import datetime

def days_to_date(number_of_days: int) -> str:  # function name is illustrative
    """
    Args:
        number_of_days: number of days

    Returns:
        date: date in string format.
    """
    date_ = datetime.date(1900, 1, 1)
    return (date_ + datetime.timedelta(days=number_of_days)).isoformat()
Unity Catalog
Unity Catalog is Databricks’ unified governance solution, providing centralized access control, auditing, lineage, and data discovery across workspaces. Its key capabilities include:
Data Discovery: Unity Catalog lets you tag and document data assets and provides a search interface to help data consumers find data.
Data Lineage: Data lineage supports use cases such as tracking and
monitoring jobs, debugging failures, understanding complex workflows,
and tracing transformation rules. It presents the flow of data in user-
friendly diagrams.
Data Sharing: Unity Catalog gives businesses more control over how, why,
and what data is being shared with whom.
Orchestration
For process orchestration, several tools are available, including Azure Data Factory and Databricks Workflows. In Azure Data Factory, you can run Databricks notebooks as activities in a pipeline; to test a pipeline, you can click the “Debug” button and check the execution progress.
Databricks Workflows
Databricks Workspace provides “Workflows” functionality supporting job
orchestration and scheduling. If you prefer to work exclusively within the
Databricks environment or with cloud providers such as AWS or GCP, this
option is suitable. Workflows help create tasks and orchestrate steps in data
processing processes. For detailed configuration instructions, refer to the
documentation here.
Summary
In this post, I have compiled the most important information required to
start working with Databricks. While I have covered key aspects such as
environment setup, data processing, orchestration, and more, it’s important
to note that Databricks supports various powerful features beyond the scope
of this post. These include Delta Live Tables, MLflow, streaming processing,
maintenance, tuning, and monitoring.
If you found this article insightful, please click the ‘clap’ button and follow me on
Medium and LinkedIn. For any questions or advice, feel free to reach out to me on
LinkedIn.