Azure ETL 1741608374
Key Answers:
• Time travel is a Delta Lake feature that allows querying historical data
(snapshots).
• Examples: SELECT * FROM table_name VERSION AS OF 5; or SELECT * FROM table_name TIMESTAMP AS OF '2023-01-15'; (a PySpark equivalent is sketched below).
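• The same history can be read from PySpark; a minimal sketch, assuming a Delta table at a hypothetical path /mnt/delta/events:
# Read an older snapshot by version number (path is hypothetical)
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/delta/events")
# Or by timestamp
df_ts = spark.read.format("delta").option("timestampAsOf", "2023-01-15").load("/mnt/delta/events")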
• Example: ETL pipelines to ingest and transform raw data from Azure
Data Lake using Data Flows and Spark jobs.
• Mention specifics like copy data activities, data validation,
and orchestration of transformations.
• Partitioning: Divides the data into directories based on keys (e.g., year,
month).
• Bucketing: Hashes rows into a fixed number of buckets by key, which optimizes joins and aggregations (see the sketch below).
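• A minimal PySpark sketch of both, assuming a DataFrame df with year, month, and user_id columns (column names, path, and table name are illustrative):
# Partitioning: writes one directory per (year, month) combination
df.write.partitionBy("year", "month").mode("overwrite").parquet("/mnt/out/events_partitioned")
# Bucketing: hashes user_id into 8 buckets; requires saving as a table
df.write.bucketBy(8, "user_id").sortBy("user_id").mode("overwrite").saveAsTable("events_bucketed")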
6️⃣ Medallion Architecture:
• A layered lakehouse design: Bronze (raw ingested data), Silver (cleaned and conformed data), Gold (business-level aggregates ready for reporting).
• Unity Catalog: Centralized data governance and access control for all
your Databricks workspaces.
• Hive Metastore: Manages metadata for Hive and Spark tables, but lacks
robust access control.
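• One practical difference is the namespace: Unity Catalog addresses tables with three levels (catalog.schema.table), while the Hive Metastore uses two (database.table). The names below are hypothetical:
# Unity Catalog: three-level identifier
spark.sql("SELECT * FROM main.sales.orders")
# Hive Metastore: two-level identifier
spark.sql("SELECT * FROM sales.orders")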
• Use the ForEach activity with the Batch Count property set to control the degree of parallelism.
• Enable concurrent execution in pipeline settings.
• Use partitioned datasets for parallel reads/writes to optimize
execution.
• Narrow: Each output partition depends on a single input partition (e.g., map, filter). No shuffle is required.
• Wide: Data is shuffled across partitions (e.g., groupBy, join). Higher computational cost (see the sketch below).
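• A small PySpark illustration, assuming a DataFrame df with amount and customer_id columns (illustrative names):
from pyspark.sql import functions as F
# Narrow: filter runs partition-by-partition, no shuffle
narrow_df = df.filter(F.col("amount") > 0)
# Wide: groupBy shuffles rows so that all records with the same key land together
wide_df = df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))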
• groupByKey: Groups all key-value pairs by key and shuffles all data.
More memory-intensive.
• reduceByKey: Combines values within each partition before shuffling, reducing network traffic. Preferred for better performance (see the sketch below).
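• A minimal RDD sketch of the difference (illustrative data):
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
# groupByKey: every value is shuffled, aggregation happens only after the shuffle
group_sums = pairs.groupByKey().mapValues(sum)
# reduceByKey: values are combined within each partition first, so less data crosses the network
reduce_sums = pairs.reduceByKey(lambda x, y: x + y)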
2️⃣2️⃣ What is Delta Lake? Key Features and Creating Delta Tables:
• Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement and evolution, and time travel on top of Parquet files in a data lake.
• Delta tables are created by writing a DataFrame with format("delta") or with CREATE TABLE ... USING DELTA (see the sketch below).
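• A minimal sketch of creating a Delta table, assuming an existing DataFrame df and a hypothetical storage path:
# Write the DataFrame out in Delta format (path is hypothetical)
df.write.format("delta").mode("overwrite").save("/mnt/delta/events")
# Optionally register it in the metastore so it can be queried by name
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/mnt/delta/events'")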
• Serverless SQL Pool (Azure Synapse):
o Pay-per-query model.
o Used for ad-hoc queries on data lakes.
• Dedicated SQL Pool:
o Pre-provisioned resources with fixed cost.
o Designed for high-performance data warehousing.
• Accessing a secret from an Azure Key Vault-backed secret scope in Databricks:
secret_value = dbutils.secrets.get(scope="key_vault_scope", key="secret_name")
• MERGE (upsert) join condition (full statement sketched below):
o ON target.id = source.id
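• The ON clause above appears to come from a Delta Lake MERGE INTO (upsert) statement; a minimal sketch, assuming hypothetical target and source tables keyed by id:
spark.sql("""
    MERGE INTO target
    USING source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *      -- update rows that already exist
    WHEN NOT MATCHED THEN INSERT *      -- insert rows that are new
""")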
• Data Lake:
o Stores raw, unstructured data.
o Scalable, cost-effective.
o Example: Azure Data Lake.
• Data Warehouse:
o Stores structured, processed data for analytics.
o Schema-on-write.
o Example: Azure Synapse.
• RDD to DataFrame:
df = rdd.toDF(schema=["col1", "col2"])
• DataFrame to RDD:
rdd = df.rdd
• .option("cloudFiles.format", "json") \
.load("path")
• Using LIMIT/OFFSET (skips the first 3 rows and returns the next 4; a salary column is assumed):
SELECT * FROM employee ORDER BY salary DESC LIMIT 4 OFFSET 3;
• Using a ranking subquery (returns the row(s) ranked 4th):
SELECT * FROM (
    SELECT *, DENSE_RANK() OVER (ORDER BY salary DESC) AS rank
    FROM employee
) ranked
WHERE rank = 4;
• Other Formats:
o JSON: spark.read.json("path")
o Parquet: spark.read.parquet("path")
• Drop Nulls:
df = df.dropna()
• Fill Nulls:
df = df.fillna(0)
• Remove Duplicates:
df = df.dropDuplicates(['col1', 'col2'])
df = df.withColumn("exploded_col", explode("array_col"))
• Add a Column to an Existing Parquet File:
from pyspark.sql.functions import lit
df = spark.read.parquet("path/to/file.parquet")
df = df.withColumn("new_column", lit("value"))
df.write.parquet("path/to/updated_file.parquet")
• From a Collection:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
• From a File:
rdd = spark.sparkContext.textFile("path/to/file.txt")
• From an RDD:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=25), Row(name="Bob", age=30)])
df = rdd.toDF()
• From a File (e.g., CSV):
df = spark.read.csv("path/to/file.csv", header=True)
• From a List/Dictionary:
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["name", "age"])