Databricks Certified Data Analyst Associate – Study Notes
These are condensed study notes for the Databricks Certified Data Analyst Associate
exam, based on the official exam guide (Mar 1, 2025 version).
Study Notes: Exam Insights
Overall Difficulty and Format
The exam is entry-level, targeting data analysts comfortable with SQL and basic
analytics tasks.
Generally considered easier than the Data Engineer Associate exam, with a practical rather than theory-heavy focus.
45 multiple-choice questions in 90 minutes — generous time allowance.
Candidates with solid SQL knowledge and light Databricks familiarity often pass
with minimal prep.
Key Topics to Understand (with Details)
Databricks SQL Interface & Dashboarding
SQL Editor
The SQL Editor allows users to write, run, and save SQL queries.
Features include:
Schema browser to explore available tables and columns.
Query results can be visualized (e.g., bar charts, line charts, pivot tables).
Queries can be scheduled to run at regular intervals or to trigger alerts.
Dashboards
Dashboards are built by adding visualizations based on one or more queries.
A single query can have multiple visualizations, and all can be added to the same
dashboard.
Visualizations can include filters or parameters to allow interactive control for
dashboard users.
Parameters and Dashboard Interactivity
You can assign query parameters to a Dashboard Parameter.
This links that parameter across multiple visualizations so they stay in sync when
the user changes the value on the dashboard.
This is useful for filtering all visuals (e.g., by region or product) using a single
dropdown control.
Sharing Dashboards Securely
Dashboards can be shared without granting access to the underlying data or
workspace using:
PDF export
PNG export of visualizations
Screenshots
Email subscriptions (via refresh schedule with recipients added)
Do not share Personal Access Tokens (PATs). These provide full authenticated
access and violate security requirements.
Visualization Types
| Visualization | What It Is | When to Use |
| --- | --- | --- |
| Bar Chart | Vertical bars comparing values | Compare values across categories (e.g., sales by region, product types) |
| Bar Chart (inline) | Inline-style bar chart | Use in compact dashboards with limited space |
| Line Chart | Line connecting data points | Show trends over time (e.g., revenue over months) |
| Area Chart | Line chart with filled area | Emphasize total volume or magnitude over time |
| Pie Chart | Circle divided into segments | Show part-to-whole relationships with few categories (ideally fewer than 5) |
| Histogram | Distribution chart | Show frequency distribution of a numeric variable (e.g., age, purchase size) |
| Scatter Plot | Plot of two numeric axes | Show relationships or correlations between two numeric variables |
| Heatmap | Grid of values with color shading | Display intensity across two categorical dimensions (e.g., hour of day vs. day of week) |
| Pivot Table | Table with groupings and aggregation | Summarize and drill into data by multiple dimensions |
| Counter | Large number representing a metric | Display a single key metric (e.g., daily active users) |
| Sankey | Flow diagram with weighted connections | Visualize paths or flows (e.g., user journey through a website) |
| Word Cloud | Words sized by frequency | Show frequency of terms (e.g., top search queries) |
| Choropleth Map | Map with shaded regions | Show geographic distribution (e.g., sales by country or state) |
SQL and Advanced Querying in Databricks
Window Functions
Allow you to perform row-wise calculations across partitions of data (like moving
averages or rankings).
Example:
SELECT name, region, revenue,
RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank
FROM sales;
Common window functions: RANK() , DENSE_RANK() , ROW_NUMBER() , LAG() ,
LEAD() , SUM() OVER(...)
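A quick sketch of LAG() for period-over-period comparison (the monthly_sales table and its month/revenue columns are illustrative, not from the exam guide):

```sql
-- Month-over-month revenue change using LAG()
SELECT month,
       revenue,
       revenue - LAG(revenue) OVER (ORDER BY month) AS change_vs_prior_month
FROM monthly_sales;
```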
Aggregations
Know how to use GROUP BY with functions like COUNT() , AVG() , MIN() , MAX() ,
SUM() .
Understand differences between scalar aggregation and group-level aggregation.
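For example, using the sales table from later in these notes, the difference looks like this (a minimal sketch):

```sql
-- Group-level aggregation: one row per region
SELECT region, COUNT(*) AS orders, AVG(amount) AS avg_amount
FROM sales
GROUP BY region;

-- Scalar aggregation: a single summary row for the whole table
SELECT COUNT(*) AS orders, SUM(amount) AS total_amount
FROM sales;
```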
Subqueries
Queries nested inside SELECT , FROM , or WHERE .
Useful for filtering or isolating logic in complex queries.
Higher-Order Functions (Spark SQL Extensions)
Useful when working with arrays, maps, or nested fields.
Examples: transform , filter , exists , aggregate
SELECT transform(array(1, 2, 3), x -> x + 1);
-- returns [2, 3, 4]
Delta Lake & Table Management
Table Types
Managed Tables: Databricks stores both metadata and data; dropped table = data
deletion.
Unmanaged Tables: External data location; only metadata is deleted if table is
dropped.
Delta Lake Features
Delta Lake is Databricks' transaction layer on top of Parquet:
ACID Transactions: Ensures consistency.
Time Travel: Query older versions using VERSION AS OF or TIMESTAMP AS OF .
Schema Enforcement: Ensures data types match expectations.
Schema Evolution: Controlled changes to schema (with permissions).
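Time travel uses ordinary SELECT syntax; the version number and timestamp below are placeholders:

```sql
-- Read an earlier snapshot of a Delta table
SELECT * FROM sales VERSION AS OF 3;
SELECT * FROM sales TIMESTAMP AS OF '2024-01-01';
```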
Know key operations:
OPTIMIZE – Compacts small files into larger ones to improve read
performance.
ZORDER BY – Reorders data to improve performance on selective queries
(similar to indexing).
MERGE INTO – Performs upserts (insert/update) into a Delta table based on
conditions.
Visualization and Interactive Analytics
Visualization types supported:
Table, Counter, Pivot, Bar/Line charts
Visualizations are tied to query results, and can be reused across dashboards.
You can format visuals for readability (e.g., number precision, axis scaling).
Query parameters can drive dynamic filters, but only dropdowns are supported in
alerts.
Alerts and Scheduling
Alerts trigger when a query returns a value that meets a specified condition.
Example: Trigger an email if sales fall below $10,000.
Must be based on single-value numeric queries (e.g., SELECT COUNT(*) ).
Only dropdown parameters are supported — date pickers do not work with alerts.
Alerts can notify via email or webhook and are configured from the UI.
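A query shaped for alerting returns exactly one numeric value; this sketch assumes an illustrative sale_date column:

```sql
-- Single-value query an alert can evaluate (e.g., trigger if below 10000)
SELECT SUM(amount) AS todays_sales
FROM sales
WHERE sale_date = current_date();
```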
Query Scheduling:
Queries or dashboards can be set to auto-refresh at intervals.
Important: Ensure the warehouse used doesn't shut down before the refresh
triggers (align auto-stop and schedule).
Partner Connect and BI Integration
Partner Connect enables quick, UI-based connections to:
BI tools: Tableau, Power BI, Looker
Ingestion tools: Fivetran, Rivery, dbt
Reduces setup time and removes the need for manual config.
Also supports small file uploads (CSV) and external database connections (via
federation).
Security and Governance (Basic Analyst-Level)
Analysts can manage:
Sharing queries/dashboards
Transferring ownership of artifacts (e.g., if a colleague leaves).
Databricks supports fine-grained access control:
Viewers can refresh dashboards (if using owner’s credentials).
Editors can update or modify queries and visuals.
Know the difference between table/view/query access vs. dashboard-level
sharing.
Databricks SQL Language Study Guide
DDL – Data Definition Language
Create Tables
CREATE TABLE sales (
id INT,
region STRING,
amount DOUBLE
);
Creates a managed table stored by Databricks.
To create an external (unmanaged) table, use the LOCATION clause.
CREATE TABLE external_sales (
id INT,
region STRING,
amount DOUBLE
)
USING DELTA
LOCATION '/mnt/data/external_sales/';
Drop and Rename Tables
DROP TABLE IF EXISTS sales;
ALTER TABLE sales RENAME TO sales_2024;
DML – Data Manipulation Language
Insert, Update, Delete
INSERT INTO sales VALUES (1, 'East', 100.0);
UPDATE sales SET amount = amount * 1.1 WHERE region = 'East';
DELETE FROM sales WHERE region = 'West';
Views and Temporary Views
Permanent Views
Stored in the metastore and accessible across sessions.
CREATE OR REPLACE VIEW top_regions AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
Temporary Views
Session-scoped. Ideal for intermediate results or testing.
CREATE OR REPLACE TEMP VIEW temp_sales AS
SELECT * FROM sales WHERE amount > 100;
Temporary Tables (via Views)
Databricks supports temp views instead of temp tables. Use temp views to simulate
session-local temporary tables:
CREATE OR REPLACE TEMP VIEW temp_table AS
SELECT * FROM sales WHERE amount > 100;
Subqueries and CTEs
Subqueries
Used to filter or derive values inside another query.
SELECT * FROM sales
WHERE region IN (SELECT region FROM blacklist);
Common Table Expressions (CTEs)
Named, reusable subqueries defined with WITH . Good for readability and modular SQL
logic.
WITH high_sales AS (
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
HAVING total > 1000
)
SELECT * FROM high_sales;
Joins (Standard and Extended)
| Join Type | Description |
| --- | --- |
| INNER JOIN | Matches rows in both tables |
| LEFT OUTER JOIN | All rows from left + matched rows from right |
| RIGHT OUTER JOIN | All rows from right + matched rows from left |
| FULL OUTER JOIN | All rows from both sides |
| CROSS JOIN | All possible combinations |
| LEFT SEMI JOIN | Keeps left-side rows with matches in right (like a filter) |
| LEFT ANTI JOIN | Keeps left-side rows without matches in right |
ANTI JOIN Example
SELECT * FROM sales
LEFT ANTI JOIN blacklist ON sales.region = blacklist.region;
Returns only sales rows where region is NOT in the blacklist.
Window Functions
Enable ranking, row-wise calculations, and running totals within partitions of data.
SELECT id, region, amount,
RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank
FROM sales;
Common window functions:
ROW_NUMBER()
RANK() , DENSE_RANK()
LAG() , LEAD()
SUM() OVER (...)
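A running total with SUM() OVER (...), again using the sales table (ordering by id is arbitrary here):

```sql
-- Cumulative amount within each region
SELECT id, region, amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY id) AS running_total
FROM sales;
```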
CUBE and ROLLUP
Input Table: sales

| region | product | amount |
| --- | --- | --- |
| East | A | 100 |
| East | B | 200 |
| West | A | 150 |
| West | B | 150 |
ROLLUP
The ROLLUP operator creates subtotals that roll up from the most granular level to a
grand total, following the column order.
SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(region, product);
or
SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY region, product WITH ROLLUP;
Explanation:
Subtotals are calculated in a hierarchical manner.
Aggregation moves from (region, product) → (region) → () (grand total).
ROLLUP Output

| region | product | total_sales |
| --- | --- | --- |
| East | A | 100 |
| East | B | 200 |
| East | NULL | 300 |
| West | A | 150 |
| West | B | 150 |
| West | NULL | 300 |
| NULL | NULL | 600 |
CUBE
The CUBE operator generates all combinations of the specified grouping columns,
including subtotals across each dimension and the grand total.
SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE(region, product);
or
SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY region, product WITH CUBE;
Explanation:
Returns subtotals for:
(region, product)
(region)
(product)
() (grand total)
Useful for multi-dimensional analysis and pivot-style reporting.
CUBE Output

| region | product | total_sales |
| --- | --- | --- |
| East | A | 100 |
| East | B | 200 |
| East | NULL | 300 |
| West | A | 150 |
| West | B | 150 |
| West | NULL | 300 |
| NULL | A | 250 |
| NULL | B | 350 |
| NULL | NULL | 600 |
Summary
Use ROLLUP when you want hierarchical subtotals, moving from detailed to
summary levels (e.g., region → total). Returns fewer rows with only valid rollup
paths.
Use CUBE when you want all possible subtotals across every grouping combination
(like a pivot table). Returns more rows including all subtotal combinations.
Delta Lake Features
MERGE INTO (Upserts)
Performs update or insert based on match condition.
MERGE INTO target USING source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET amount = source.amount
WHEN NOT MATCHED THEN
INSERT (id, region, amount) VALUES (source.id, source.region, source.amount);
OPTIMIZE
Combines many small files into larger ones to improve query performance.
OPTIMIZE sales;
ZORDER
Sorts data to accelerate queries that filter on specific columns.
OPTIMIZE sales ZORDER BY (region);
Higher-Order and Array Functions
Databricks supports advanced functional programming-like syntax.
EXPLODE (unnest arrays into rows)
SELECT id, explode(items) AS item
FROM orders;
TRANSFORM (map over array)
SELECT transform(array(1, 2, 3), x -> x + 1);
-- Result: [2, 3, 4]
Other useful functions:
FILTER , AGGREGATE , EXISTS
ARRAY_CONTAINS , SIZE , MAP_KEYS
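A short sketch of a few of these array helpers on a literal array (no table required):

```sql
SELECT filter(array(1, 2, 3), x -> x > 1) AS filtered,     -- [2, 3]
       array_contains(array(1, 2, 3), 2)  AS has_two,      -- true
       size(array(1, 2, 3))               AS n_elements;   -- 3
```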
Inspecting Table Metadata
DESCRIBE and DESCRIBE EXTENDED
Use DESCRIBE to view a table or view’s column structure (name, type, and comment).
DESCRIBE sales;
Use DESCRIBE EXTENDED to retrieve both schema information and detailed metadata
such as:
Table location
Provider (e.g., Delta)
Table type (Managed/External)
Storage details
DESCRIBE EXTENDED sales;
Sample Output (DESCRIBE)

| col_name | data_type | comment |
| --- | --- | --- |
| id | int | |
| region | string | |
| amount | double | |
Sample Output (DESCRIBE EXTENDED excerpt)
After the schema rows, you’ll also see:
| col_name | data_type | comment |
| --- | --- | --- |
| # Detailed Table Information | | |
| Location | dbfs:/user/hive/warehouse/sales | |
| Provider | delta | |
| Table Type | MANAGED | |
| ... | ... | |
Rule of Thumb
Use DESCRIBE to check schema quickly.
Use DESCRIBE EXTENDED when you need to confirm storage details, table type, or
format (especially for Delta tables).
Section 1 – Databricks SQL
Audience and Usage
Primary audience: Data analysts who query and visualize data using Databricks
SQL.
Side audiences: Business users, engineers, and data scientists consuming
dashboards.
Dashboards can be viewed and run by a broad range of stakeholders.
Advantages of Using Databricks SQL
Enables in-Lakehouse querying, avoiding data movement to third-party tools.
Streamlines BI workflows and simplifies data access across roles.
Ideal for exploratory analysis, dashboarding, and embedded analytics.
Querying in Databricks SQL
SQL is written and executed inside the Query Editor.
The Schema Browser helps explore available databases, tables, columns, and data
types.
Supports all standard SQL operations: SELECT , JOIN , WHERE , GROUP BY , etc.
Dashboards
Databricks SQL dashboards aggregate the output of multiple queries into a single
interface.
Dashboards can be built by selecting visualizations for individual query outputs.
Dashboards support scheduled auto-refresh to stay current with underlying data.
SQL Endpoints / Warehouses vs. Clusters
SQL Endpoints (Warehouses)
Purpose-built for SQL queries, dashboards, and BI tool integrations.
Support Partner Connect integrations with tools like Fivetran, Tableau, Power BI,
and Looker.
Typically the recommended destination for tools like Fivetran, which use SQL to
ingest or visualize data.
Serverless SQL warehouses offer fast startup, cost efficiency, and are easy to
manage.
Choose based on concurrency and performance needs—larger or multi-cluster
endpoints support higher workloads but at higher cost.
Auto Stop: Automatically shuts down the warehouse after inactivity to save costs,
but may introduce startup delay for scheduled queries.
Clusters
Used for notebooks, ETL jobs, data engineering, and interactive Spark workloads.
Support Python, Scala, R, and SQL in a more general-purpose environment.
While some integrations (like Fivetran) allow clusters as a destination, Databricks
recommends SQL warehouses for most partner workflows due to better support
for SQL-based access.
Clusters are not typically used directly for powering dashboards or BI tools.
Cluster size affects performance and concurrency, not startup speed. Larger
clusters run queries faster but cost more.
Data Integration and Ingestion
Partner Connect simplifies integration with tools like Fivetran, requiring partner-
side setup.
Small-file upload supports importing files like CSVs for quick testing or lookups.
Supports ingesting data from object storage (S3, ADLS) including directories of
files of the same type.
Visualization Tool Integration
Native connectors available for Tableau, Power BI, and Looker.
Databricks SQL complements these tools by serving as a fast, scalable backend.
Can be used to manage transformations and deliver clean tables for visual
consumption.
Medallion Architecture
Data is structured into bronze (raw), silver (cleaned), and gold (aggregated) layers.
Gold layer is most relevant to analysts and dashboards.
Promotes clarity, reliability, and reusability of datasets across teams.
Streaming Data
The Lakehouse architecture supports combining batch and streaming data.
Benefits: Enables real-time dashboards and timely alerting.
Cautions:
Streaming requires proper handling of data latency and order.
More complex to design and maintain than batch pipelines.
Section 2 – Data Management
Delta Lake Overview
Delta Lake is a storage layer that provides ACID transactions, scalable metadata
handling, and unified batch/streaming data processing.
Manages table metadata (schema, versions, history) automatically.
Maintains history of table changes, enabling time travel and rollback features.
Benefits of Delta Lake in the Lakehouse
Combines the reliability of data warehouses with the scalability of data lakes.
Supports schema enforcement and evolution, making it easier to work with
changing data.
Enables efficient reads and writes through data versioning and indexing.
Table Persistence and Scope
Tables in Databricks can be managed or unmanaged:
Managed tables: Databricks manages both metadata and data location.
Unmanaged tables: Only metadata is managed; data lives in external storage.
Use the LOCATION keyword to specify a custom path for table data when creating a
database or table.
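A minimal sketch of LOCATION at the schema level (the schema name and path are illustrative; Unity Catalog workspaces may govern locations differently):

```sql
-- Tables created in this schema default to the external path below
CREATE SCHEMA IF NOT EXISTS analytics
LOCATION '/mnt/data/analytics/';
```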
Managed vs. Unmanaged Tables
Managed:
Data is deleted when the table or database is dropped.
Stored in the workspace-managed location.
Unmanaged:
Data remains in place even if the table is dropped.
Suitable for shared storage or external ingestion pipelines.
Use metadata inspection or tools like Data Explorer to determine table type.
Creating and Managing Database Objects
Use SQL commands or UI to:
CREATE , DROP , and RENAME databases, tables, and views.
Databricks supports:
Permanent views: Logical definitions that reference underlying tables.
Temporary views (temp views): Exist only for the session, not persisted to the
metastore.
Views vs. Temp Views
Views:
Persist across sessions.
Accessible by all users with appropriate permissions.
Temp views:
Exist only for the session and user that created them.
Useful for ad hoc or intermediate queries.
Data Explorer Capabilities
Used to:
Explore, preview, and secure data.
Identify the owner of a table.
Change access permissions for users and groups.
Supports access control via Unity Catalog (where available) or workspace-level
permissions.
Access Management
Table owners have full privileges to manage table metadata and data.
Responsibilities include managing access, updating schemas, and sharing data
appropriately.
Use Data Explorer or SQL commands to grant/revoke privileges.
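The SQL form of granting and revoking looks like this (the `analysts` group is a placeholder principal):

```sql
-- Give a group read access to a table, then take it away
GRANT SELECT ON TABLE sales TO `analysts`;
REVOKE SELECT ON TABLE sales FROM `analysts`;
```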
PII and Organizational Considerations
Personal Identifiable Information (PII) must be handled according to organizational
policies.
Data analysts should be aware of:
Data classification and masking policies.
Access control rules for sensitive data.
Audit and compliance requirements tied to PII usage.
Section 3 – SQL in the Lakehouse
Querying Fundamentals
Identify queries that retrieve data using SELECT with specific WHERE conditions.
Understand and interpret the output of a SELECT query based on its columns and
clauses.
Data Insertion and Merging
Know when to use:
MERGE INTO : For upserts (insert/update based on match conditions).
INSERT INTO : To append new rows into a table.
COPY INTO : For loading external files (e.g., CSV, JSON, Parquet) into Delta
tables.
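A hedged COPY INTO sketch (the source path and options are illustrative):

```sql
-- Incrementally load CSV files from cloud storage into an existing Delta table
COPY INTO sales
FROM '/mnt/raw/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');
```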
Query Simplification
Use subqueries (nested SELECTs) to simplify complex queries and improve
modularity.
Joins
Compare and contrast different types of joins:
INNER JOIN : Only matching records from both sides.
LEFT OUTER JOIN : All records from the left, and matched ones from the right.
RIGHT OUTER JOIN : All records from the right, and matched ones from the left.
FULL OUTER JOIN : All records when there is a match in either left or right.
CROSS JOIN : Cartesian product of both tables (all combinations).
Aggregations
Use aggregate functions like SUM() , AVG() , COUNT() , MIN() , MAX() to produce
summary metrics.
Group results using GROUP BY to structure the output.
Nested Data
Handle nested data formats (like structs or arrays) using dot notation and
functions like explode() .
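For instance, assuming an orders table with a customer struct and an items array (illustrative schema):

```sql
-- Dot notation reads struct fields; explode() turns array elements into rows
SELECT id,
       customer.name AS customer_name,
       explode(items) AS item
FROM orders;
```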
Cube and Roll-Up
Use ROLLUP to aggregate hierarchically from the most detailed to the grand total.
Use CUBE to compute all combinations of groupings.
Know the difference:
ROLLUP is hierarchical and includes subtotals.
CUBE includes all possible subtotal combinations.
Window Functions
Use windowing (analytic) functions to calculate metrics across time or partitions:
Examples: ROW_NUMBER() , RANK() , LAG() , LEAD() , SUM() OVER (...) .
ANSI SQL Benefits
Having ANSI SQL support ensures consistency, portability, and familiarity.
Reduces the learning curve and enables analysts to apply standard SQL practices.
Silver-Level Data
Silver data is cleaned and joined, ready for consumption or further enrichment.
Know how to identify, access, and clean silver-layer data for analysis.
Query Optimization
Use query history to review and improve past queries.
Caching frequently used datasets or results helps reduce query latency and
development time.
Spark SQL Performance Tuning
Use higher-order functions (e.g., transform , filter , aggregate ) for more
efficient operations on arrays and maps.
Optimize queries for performance using these functional programming-like
constructs.
User-Defined Functions (UDFs)
Create and apply UDFs when built-in SQL functions are insufficient.
UDFs allow for reusable custom logic in common or complex transformations.
Understand appropriate scenarios for UDFs, especially when scaling across large
datasets.
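A minimal SQL UDF sketch (the function name and tax rate are illustrative):

```sql
-- Reusable calculation wrapped as a SQL UDF
CREATE OR REPLACE FUNCTION with_tax(amount DOUBLE)
RETURNS DOUBLE
RETURN amount * 1.08;

SELECT id, with_tax(amount) AS amount_with_tax FROM sales;
```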
Section 4 – Data Visualization and Dashboarding
Creating Visualizations
Build basic visualizations directly from query results in Databricks SQL.
Visualizations are schema-specific (based on query output structure).
Supported visualization types include:
Table
Details
Counter
Pivot table
Formatting and Storytelling
Visualization formatting significantly impacts how insights are perceived.
Enhance clarity and impact through:
Proper labeling
Axis scaling
Color usage
Consistent formatting
Use formatting techniques to add visual appeal and guide interpretation.
Customize visualizations to support data storytelling, emphasizing key trends or
comparisons.
Dashboard Composition
Build dashboards by combining multiple existing visualizations from saved
Databricks SQL queries.
Adjust color schemes across visualizations for a consistent appearance.
Use query parameters to allow dynamic updates to dashboard content based on
user input.
Dashboard Parameters
Understand how parameters affect the underlying query output:
Example: A dropdown that filters data by region or date.
Use "Query-Based Dropdown List" to populate a parameter from the distinct
results of another query.
This enables dynamic, user-driven filtering across visualizations.
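The backing query for a Query-Based Dropdown List is simply a single-column query of distinct values, for example:

```sql
SELECT DISTINCT region FROM sales ORDER BY region;
```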
Dashboard Sharing and Refreshing
Share dashboards in multiple ways:
With edit or view access
Public or private links (workspace dependent)
Evaluate pros and cons of each sharing method:
Public links are easy to distribute but reduce control.
Private dashboards maintain control but require user access setup.
Credential Behavior
Dashboards can be refreshed using the owner's credentials, allowing access for
users without direct permissions to all underlying objects.
This supports safe and controlled dashboard sharing.
Refresh Scheduling
Configure automatic refresh intervals to keep dashboards updated.
Be aware of potential issues:
If a dashboard’s refresh interval is longer than the warehouse's Auto Stop timeout, the warehouse may shut down before the next refresh and add startup delay.
Consider aligning refresh schedules with warehouse lifecycle settings.
Alerts
Alerts monitor the result of a query and trigger notifications when a specified
condition is met.
Use cases include monitoring thresholds, anomalies, or status flags (e.g., "value
exceeds 100", "row count is zero").
Alerts are configured by:
Selecting a saved query
Defining a condition (e.g., value > X )
Choosing a target column and row (if applicable)
Specifying one or more notification channels (email, webhook, Slack, etc.)
Key Limitations
Alerts only work on queries that return a single numeric value (e.g., row count,
sum, or a calculated metric).
Alerts do not work with queries that return multiple rows or complex result sets.
Alerts are not compatible with date-type query parameters.
Alerts only support dropdown-based query parameters—these can be static lists
or populated from a query.
For example, you can filter by region or category using a dropdown, but you
cannot pass in a date picker.
Best Practices
Use alerts on queries specifically designed to return a single value for evaluation.
Avoid complex aggregations or queries with joins unless they simplify to a single
numeric result.
Use parameterized queries with dropdowns for dynamic alerting across categories
(e.g., per region or per status).
Align alert check frequency with your dashboard/warehouse refresh schedules.
Section 5 – Analytics Applications
Statistics and Distributions
Types of Variables
Discrete variables: Represent countable values, such as number of transactions or
logins. Typically whole numbers.
Continuous variables: Represent measurable values on a continuum, such as
temperature, revenue, or time. Can take on infinitely fine values.
Descriptive vs. Inferential Statistics
Descriptive statistics: Focus on summarizing and describing a dataset using
numerical metrics and visualizations.
Do not attempt to draw conclusions beyond the data.
Example: Calculating the average order value from a dataset.
Inferential statistics: Use sample data to infer or generalize about a population.
Includes hypothesis testing, confidence intervals, and regression.
Not heavily tested on the Databricks exam.
Rule of Thumb:
If it summarizes data without making predictions, it's descriptive statistics.
Measures in Descriptive Statistics
Central Tendency:
Mean: Average value.
Median: Middle value (less sensitive to outliers).
Mode: Most frequently occurring value.
Dispersion:
Standard deviation: How much values deviate from the mean.
Variance: Square of standard deviation.
Range: Difference between max and min.
Interquartile range (IQR): Spread of the middle 50% of the data (Q3 − Q1).
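These measures map directly onto built-in SQL aggregates; a sketch against the notes' sales table:

```sql
-- Central tendency and dispersion in one query
SELECT avg(amount)                                          AS mean_amount,
       percentile(amount, 0.5)                              AS median_amount,
       stddev(amount)                                       AS sd_amount,
       max(amount) - min(amount)                            AS range_amount,
       percentile(amount, 0.75) - percentile(amount, 0.25)  AS iqr_amount
FROM sales;
```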
Moments of a Distribution
| Moment | What It Describes | Use Case Example |
| --- | --- | --- |
| 1st | Mean – central location | Overall average |
| 2nd | Variance/standard deviation – spread | How tightly values cluster around the mean |
| 3rd | Skewness – asymmetry | Detecting left/right tilt in the data distribution |
| 4th | Kurtosis – tail weight and outlier risk | High kurtosis = heavy tails, more extreme values |
Kurtosis Explained:
Low kurtosis: Data have light tails; few outliers.
High kurtosis: Data have heavy tails; more prone to extreme values or outliers.
Kurtosis is about the "tailedness" of a distribution, not its peak.
Kurtosis measures how likely a distribution is to produce outliers — not how tall or
flat the peak is. Think: "How fat are the tails?"
Comparing Statistical Measures
| Concept | Description and Comparison |
| --- | --- |
| Mean vs. Median | Mean is affected by outliers; median is more robust. |
| Standard Deviation vs. IQR | SD uses all values; IQR focuses on the middle 50% (less sensitive to outliers). |
Data Enhancement
Data enhancement refers to enriching existing datasets by adding new attributes,
calculations, or contextual information.
This is a common step in analytics workflows to improve model accuracy or
business relevance.
Examples include:
Adding demographic features
Calculating customer lifetime value
Generating derived fields (e.g., revenue per user)
Identify scenarios where data enhancement is beneficial:
Improving dashboard insights
Supporting more granular segmentation
Enabling better forecasting or prediction
Data Blending
Data blending involves combining data from two or more source applications.
Typically used when joining internal and external datasets that are not in the same
system.
Scenarios where blending is useful:
Merging CRM data with support ticket logs
Joining product data from an ERP with marketing campaign performance
Last-Mile ETL
Last-mile ETL refers to project-specific transformations performed near the end of
a data pipeline.
Often involves:
Cleaning or reshaping gold-layer data
Formatting results for a specific dashboard or report
Applying business rules or mappings for final delivery
Supports the specific analytical needs of a team, stakeholder, or use case.