
Databricks Certified Data Analyst Associate – Study Notes
These are condensed study notes for the Databricks Certified Data Analyst Associate
exam, based on the official exam guide (Mar 1, 2025 version).

Study Notes: Exam Insights

Overall Difficulty and Format


The exam is entry-level, targeting data analysts comfortable with SQL and basic
analytics tasks.
Considered easier than the Data Engineer Associate exam, with questions that are more practical than theory-heavy.
45 multiple-choice questions in 90 minutes — generous time allowance.
Candidates with solid SQL knowledge and light Databricks familiarity often pass
with minimal prep.

Key Topics to Understand (with Details)

Databricks SQL Interface & Dashboarding

SQL Editor
The SQL Editor allows users to write, run, and save SQL queries.
Features include:
Schema browser to explore available tables and columns.
Query results can be visualized (e.g., bar charts, line charts, pivot tables).
Queries can be scheduled to run at regular intervals or to trigger alerts.

Dashboards
Dashboards are built by adding visualizations based on one or more queries.
A single query can have multiple visualizations, and all can be added to the same
dashboard.
Visualizations can include filters or parameters to allow interactive control for
dashboard users.

Parameters and Dashboard Interactivity


You can assign query parameters to a Dashboard Parameter.
This links that parameter across multiple visualizations so they stay in sync when
the user changes the value on the dashboard.
This is useful for filtering all visuals (e.g., by region or product) using a single
dropdown control.
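
As a sketch of how this fits together, a query can expose a parameter with the double-curly-brace syntax, and that parameter can then be mapped to a dashboard-level filter (the sales table and region parameter here are illustrative):

-- Query with a parameter that a dashboard filter can control
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE region = {{ region }}
GROUP BY region;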

Sharing Dashboards Securely


Dashboards can be shared without granting access to the underlying data or
workspace using:
PDF export
PNG export of visualizations
Screenshots
Email subscriptions (via refresh schedule with recipients added)
Do not share Personal Access Tokens (PATs). These provide full authenticated
access and violate security requirements.

Visualization Types

| Visualization | What It Is | When to Use |
| --- | --- | --- |
| Bar Chart | Vertical bars comparing values | Compare values across categories (e.g., sales by region, product types) |
| Bar Chart (inline) | Inline-style bar chart | Use in compact dashboards with limited space |
| Line Chart | Line connecting data points | Show trends over time (e.g., revenue over months) |
| Area Chart | Line chart with filled area | Emphasize total volume or magnitude over time |
| Pie Chart | Circle divided into segments | Show part-to-whole relationships with few categories (ideally less than 5) |
| Histogram | Distribution chart | Show frequency distribution of a numeric variable (e.g., age, purchase size) |
| Scatter Plot | Plot of two numeric axes | Show relationships or correlations between two numeric variables |
| Heatmap | Grid of values with color shading | Display intensity across two categorical dimensions (e.g., hour of day vs. day of week) |
| Pivot Table | Table with groupings and aggregation | Summarize and drill into data by multiple dimensions |
| Counter | Large number representing a metric | Display a single key metric (e.g., daily active users) |
| Sankey | Flow diagram with weighted connections | Visualize paths or flows (e.g., user journey through a website) |
| Word Cloud | Words sized by frequency | Show frequency of terms (e.g., top search queries) |
| Choropleth Map | Map with shaded regions | Show geographic distribution (e.g., sales by country or state) |

SQL and Advanced Querying in Databricks

Window Functions

Allow you to perform row-wise calculations across partitions of data (like moving
averages or rankings).
Example:

SELECT name, region, revenue,
       RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rank
FROM sales;

Common window functions: RANK(), DENSE_RANK(), ROW_NUMBER(), LAG(), LEAD(), SUM() OVER(...)

Aggregations

Know how to use GROUP BY with functions like COUNT() , AVG() , MIN() , MAX() ,
SUM() .

Understand differences between scalar aggregation and group-level aggregation.
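
A minimal illustration of the two, using the sales table defined later in these notes:

-- Scalar (table-level) aggregation: returns a single row
SELECT COUNT(*) AS total_rows, AVG(amount) AS avg_amount
FROM sales;

-- Group-level aggregation: returns one row per region
SELECT region, SUM(amount) AS total_amount, MAX(amount) AS max_amount
FROM sales
GROUP BY region;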

Subqueries

Queries nested inside SELECT , FROM , or WHERE .


Useful for filtering or isolating logic in complex queries.
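
For example, a subquery in the WHERE clause (same assumed sales table):

SELECT *
FROM sales
WHERE amount > (SELECT AVG(amount) FROM sales);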

Higher-Order Functions (Spark SQL Extensions)

Useful when working with arrays, maps, or nested fields.


Examples: transform , filter , exists , aggregate

SELECT transform(array(1, 2, 3), x -> x + 1);
-- returns [2, 3, 4]

Delta Lake & Table Management

Table Types

Managed Tables: Databricks stores both metadata and data; dropped table = data
deletion.
Unmanaged Tables: External data location; only metadata is deleted if table is
dropped.

Delta Lake Features

Delta Lake is Databricks' transaction layer on top of Parquet:

ACID Transactions: Ensures consistency.


Time Travel: Query older versions using VERSION AS OF or TIMESTAMP AS OF (see the sketch after this list).
Schema Enforcement: Ensures data types match expectations.
Schema Evolution: Controlled changes to schema (with permissions).
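
A quick time-travel sketch (table name, version number, and timestamp are illustrative):

SELECT * FROM sales VERSION AS OF 5;
SELECT * FROM sales TIMESTAMP AS OF '2024-01-01T00:00:00';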

Know key operations:


OPTIMIZE – Compacts small files into larger ones to improve read
performance.
ZORDER BY – Reorders data to improve performance on selective queries
(similar to indexing).
MERGE INTO – Performs upserts (insert/update) into a Delta table based on
conditions.

Visualization and Interactive Analytics


Visualization types supported:
Table, Counter, Pivot, Bar/Line charts
Visualizations are tied to query results, and can be reused across dashboards.
You can format visuals for readability (e.g., number precision, axis scaling).
Query parameters can drive dynamic filters, but only dropdowns are supported in
alerts.

Alerts and Scheduling


Alerts trigger when a query returns a value that meets a specified condition.
Example: Trigger an email if sales fall below $10,000.
Must be based on single-value numeric queries (e.g., SELECT COUNT(*)); see the example after this list.
Only dropdown parameters are supported — date pickers do not work with alerts.
Alerts can notify via email or webhook and are configured from the UI.
Query Scheduling:
Queries or dashboards can be set to auto-refresh at intervals.
Important: Ensure the warehouse used doesn't shut down before the refresh
triggers (align auto-stop and schedule).
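
An alert-friendly query, for example, reduces to a single numeric value (table and column names are illustrative):

-- One row, one numeric column: suitable for an alert condition such as value < 10000
SELECT SUM(amount) AS todays_sales
FROM sales
WHERE sale_date = current_date();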

Partner Connect and BI Integration


Partner Connect enables quick, UI-based connections to:
BI tools: Tableau, Power BI, Looker
Ingestion tools: Fivetran, Rivery, dbt
Reduces setup time and removes the need for manual config.
Also supports small file uploads (CSV) and external database connections (via
federation).
Security and Governance (Basic Analyst-Level)
Analysts can manage:
Sharing queries/dashboards
Transferring ownership of artifacts (e.g., if a colleague leaves).
Databricks supports fine-grained access control:
Viewers can refresh dashboards (if using owner’s credentials).
Editors can update or modify queries and visuals.
Know the difference between table/view/query access vs. dashboard-level
sharing.

Databricks SQL Language Study Guide

DDL – Data Definition Language

Create Tables

CREATE TABLE sales (
  id INT,
  region STRING,
  amount DOUBLE
);

Creates a managed table stored by Databricks.


To create an external (unmanaged) table, use the LOCATION clause.

CREATE TABLE external_sales (
  id INT,
  region STRING,
  amount DOUBLE
)
USING DELTA
LOCATION '/mnt/data/external_sales/';

Drop and Rename Tables

DROP TABLE IF EXISTS sales;
ALTER TABLE sales RENAME TO sales_2024;

DML – Data Manipulation Language

Insert, Update, Delete

INSERT INTO sales VALUES (1, 'East', 100.0);

UPDATE sales SET amount = amount * 1.1 WHERE region = 'East';

DELETE FROM sales WHERE region = 'West';

Views and Temporary Views

Permanent Views
Stored in the metastore and accessible across sessions.

CREATE OR REPLACE VIEW top_regions AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

Temporary Views
Session-scoped. Ideal for intermediate results or testing.

CREATE OR REPLACE TEMP VIEW temp_sales AS
SELECT * FROM sales WHERE amount > 100;

Temporary Tables (via Views)


Databricks supports temp views instead of temp tables. Use temp views to simulate
session-local temporary tables:

CREATE OR REPLACE TEMP VIEW temp_table AS
SELECT * FROM sales WHERE amount > 100;

Subqueries and CTEs

Subqueries
Used to filter or derive values inside another query.

SELECT * FROM sales
WHERE region IN (SELECT region FROM blacklist);

Common Table Expressions (CTEs)


Named, reusable subqueries defined with WITH . Good for readability and modular SQL
logic.

WITH high_sales AS (
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
HAVING total > 1000
)
SELECT * FROM high_sales;

Joins (Standard and Extended)

| Join Type | Description |
| --- | --- |
| INNER JOIN | Matches rows in both tables |
| LEFT OUTER JOIN | All rows from left + matched rows from right |
| RIGHT OUTER JOIN | All rows from right + matched rows from left |
| FULL OUTER JOIN | All rows from both sides |
| CROSS JOIN | All possible combinations |
| LEFT SEMI JOIN | Keeps left-side rows with matches in right (like a filter) |
| LEFT ANTI JOIN | Keeps left-side rows without matches in right |

ANTI JOIN Example

SELECT * FROM sales
LEFT ANTI JOIN blacklist ON sales.region = blacklist.region;

Returns only sales rows where region is NOT in the blacklist.

Window Functions
Enable ranking, row-wise calculations, and running totals within partitions of data.

SELECT id, region, amount,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank
FROM sales;

Common window functions:

ROW_NUMBER()

RANK() , DENSE_RANK()

LAG() , LEAD()

SUM() OVER (...)

CUBE and ROLLUP

Input Table: sales

| region | product | amount |
| --- | --- | --- |
| East | A | 100 |
| East | B | 200 |
| West | A | 150 |
| West | B | 150 |

ROLLUP
The ROLLUP operator creates subtotals that roll up from the most granular level to a
grand total, following the column order.

SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(region, product);

or

SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY region, product WITH ROLLUP;

Explanation:

Subtotals are calculated in a hierarchical manner.


Aggregation moves from (region, product) → (region) → () (grand total).

ROLLUP Output

| region | product | total_sales |
| --- | --- | --- |
| East | A | 100 |
| East | B | 200 |
| East | NULL | 300 |
| West | A | 150 |
| West | B | 150 |
| West | NULL | 300 |
| NULL | NULL | 600 |

CUBE
The CUBE operator generates all combinations of the specified grouping columns,
including subtotals across each dimension and the grand total.

SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE(region, product);

or

SELECT region, product, SUM(amount) AS total_sales
FROM sales
GROUP BY region, product WITH CUBE;

Explanation:

Returns subtotals for:


(region, product)

(region)

(product)

() (grand total)

Useful for multi-dimensional analysis and pivot-style reporting.

CUBE Output

| region | product | total_sales |
| --- | --- | --- |
| East | A | 100 |
| East | B | 200 |
| East | NULL | 300 |
| West | A | 150 |
| West | B | 150 |
| West | NULL | 300 |
| NULL | A | 250 |
| NULL | B | 350 |
| NULL | NULL | 600 |

Summary
Use ROLLUP when you want hierarchical subtotals, moving from detailed to
summary levels (e.g., region → total). Returns fewer rows with only valid rollup
paths.
Use CUBE when you want all possible subtotals across every grouping combination
(like a pivot table). Returns more rows including all subtotal combinations.

Delta Lake Features

MERGE INTO (Upserts)


Performs update or insert based on match condition.
MERGE INTO target USING source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET amount = source.amount
WHEN NOT MATCHED THEN
  INSERT (id, region, amount) VALUES (source.id, source.region, source.amount);

OPTIMIZE
Combines many small files into larger ones to improve query performance.

OPTIMIZE sales;

ZORDER
Sorts data to accelerate queries that filter on specific columns.

OPTIMIZE sales ZORDER BY (region);

Higher-Order and Array Functions


Databricks supports advanced functional programming-like syntax.

EXPLODE (unnest arrays into rows)

SELECT id, explode(items) AS item
FROM orders;

TRANSFORM (map over array)

SELECT transform(array(1, 2, 3), x -> x + 1);
-- Result: [2, 3, 4]

Other useful functions:

FILTER , AGGREGATE , EXISTS

ARRAY_CONTAINS , SIZE , MAP_KEYS
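
Quick illustrations of a few of these, using literal arrays so they run as-is:

-- Keep only the even elements
SELECT filter(array(1, 2, 3, 4), x -> x % 2 = 0);
-- [2, 4]

-- Fold the array into a sum, starting from 0
SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x);
-- 6

-- Membership test and array length
SELECT array_contains(array('a', 'b', 'c'), 'a'), size(array(1, 2, 3));
-- true, 3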


Inspecting Table Metadata

DESCRIBE and DESCRIBE EXTENDED


Use DESCRIBE to view a table or view’s column structure (name, type, and comment).

DESCRIBE sales;

Use DESCRIBE EXTENDED to retrieve both schema information and detailed metadata
such as:

Table location
Provider (e.g., Delta)
Table type (Managed/External)
Storage details

DESCRIBE EXTENDED sales;

Sample Output (DESCRIBE)

| col_name | data_type | comment |
| --- | --- | --- |
| id | int | |
| region | string | |
| amount | double | |

Sample Output (DESCRIBE EXTENDED excerpt)


After the schema rows, you’ll also see:

| col_name | data_type | comment |
| --- | --- | --- |
| # Detailed Table Information | | |
| Location | dbfs:/user/hive/warehouse/sales | |
| Provider | delta | |
| Table Type | MANAGED | |
| ... | ... | |

Rule of Thumb
Use DESCRIBE to check schema quickly.
Use DESCRIBE EXTENDED when you need to confirm storage details, table type, or
format (especially for Delta tables).

Section 1 – Databricks SQL

Audience and Usage


Primary audience: Data analysts who query and visualize data using Databricks
SQL.
Side audiences: Business users, engineers, and data scientists consuming
dashboards.
Dashboards can be viewed and run by a broad range of stakeholders.

Advantages of Using Databricks SQL


Enables in-Lakehouse querying, avoiding data movement to third-party tools.
Streamlines BI workflows and simplifies data access across roles.
Ideal for exploratory analysis, dashboarding, and embedded analytics.

Querying in Databricks SQL


SQL is written and executed inside the Query Editor.
The Schema Browser helps explore available databases, tables, columns, and data
types.
Supports all standard SQL operations: SELECT , JOIN , WHERE , GROUP BY , etc.

Dashboards
Databricks SQL dashboards aggregate the output of multiple queries into a single
interface.
Dashboards can be built by selecting visualizations for individual query outputs.
Dashboards support scheduled auto-refresh to stay current with underlying data.

SQL Endpoints / Warehouses vs. Clusters

SQL Endpoints (Warehouses)

Purpose-built for SQL queries, dashboards, and BI tool integrations.


Support Partner Connect integrations with tools like Fivetran, Tableau, Power BI,
and Looker.
Typically the recommended destination for tools like Fivetran, which use SQL to
ingest or visualize data.
Serverless SQL warehouses offer fast startup, cost efficiency, and are easy to
manage.
Choose based on concurrency and performance needs—larger or multi-cluster
endpoints support higher workloads but at higher cost.
Auto Stop: Automatically shuts down the warehouse after inactivity to save costs,
but may introduce startup delay for scheduled queries.

Clusters

Used for notebooks, ETL jobs, data engineering, and interactive Spark workloads.
Support Python, Scala, R, and SQL in a more general-purpose environment.
While some integrations (like Fivetran) allow clusters as a destination, Databricks
recommends SQL warehouses for most partner workflows due to better support
for SQL-based access.
Clusters are not typically used directly for powering dashboards or BI tools.
Cluster size affects performance and concurrency, not startup speed. Larger
clusters run queries faster but cost more.

Data Integration and Ingestion


Partner Connect simplifies integration with tools like Fivetran, requiring partner-
side setup.
Small-file upload supports importing files like CSVs for quick testing or lookups.
Supports ingesting data from object storage (S3, ADLS) including directories of
files of the same type.

Visualization Tool Integration


Native connectors available for Tableau, Power BI, and Looker.
Databricks SQL complements these tools by serving as a fast, scalable backend.
Can be used to manage transformations and deliver clean tables for visual
consumption.

Medallion Architecture
Data is structured into bronze (raw), silver (cleaned), and gold (aggregated) layers.
Gold layer is most relevant to analysts and dashboards.
Promotes clarity, reliability, and reusability of datasets across teams.

Streaming Data
The Lakehouse architecture supports combining batch and streaming data.
Benefits: Enables real-time dashboards and timely alerting.
Cautions:
Streaming requires proper handling of data latency and order.
More complex to design and maintain than batch pipelines.

Section 2 – Data Management

Delta Lake Overview


Delta Lake is a storage layer that provides ACID transactions, scalable metadata
handling, and unified batch/streaming data processing.
Manages table metadata (schema, versions, history) automatically.
Maintains history of table changes, enabling time travel and rollback features.

Benefits of Delta Lake in the Lakehouse


Combines the reliability of data warehouses with the scalability of data lakes.
Supports schema enforcement and evolution, making it easier to work with
changing data.
Enables efficient reads and writes through data versioning and indexing.

Table Persistence and Scope


Tables in Databricks can be managed or unmanaged:
Managed tables: Databricks manages both metadata and data location.
Unmanaged tables: Only metadata is managed; data lives in external storage.
Use the LOCATION keyword to specify a custom path for table data when creating a
database or table.

Managed vs. Unmanaged Tables


Managed:
Data is deleted when the table or database is dropped.
Stored in the workspace-managed location.
Unmanaged:
Data remains in place even if the table is dropped.
Suitable for shared storage or external ingestion pipelines.
Use metadata inspection or tools like Data Explorer to determine table type.

Creating and Managing Database Objects


Use SQL commands or UI to:
CREATE , DROP , and RENAME databases, tables, and views.

Databricks supports:
Permanent views: Logical definitions that reference underlying tables.
Temporary views (temp views): Exist only for the session, not persisted to the
metastore.

Views vs. Temp Views


Views:
Persist across sessions.
Accessible by all users with appropriate permissions.
Temp views:
Exist only for the session and user that created them.
Useful for ad hoc or intermediate queries.

Data Explorer Capabilities


Used to:
Explore, preview, and secure data.
Identify the owner of a table.
Change access permissions for users and groups.
Supports access control via Unity Catalog (where available) or workspace-level
permissions.

Access Management
Table owners have full privileges to manage table metadata and data.
Responsibilities include managing access, updating schemas, and sharing data
appropriately.
Use Data Explorer or SQL commands to grant/revoke privileges.
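
A minimal sketch of the SQL route (table and group names are illustrative; the exact privileges available depend on whether Unity Catalog or legacy table ACLs are in use):

-- Grant read access on a table to a group, then revoke it
GRANT SELECT ON TABLE sales TO `analysts`;
REVOKE SELECT ON TABLE sales FROM `analysts`;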

PII and Organizational Considerations


Personally Identifiable Information (PII) must be handled according to organizational
policies.
Data analysts should be aware of:
Data classification and masking policies.
Access control rules for sensitive data.
Audit and compliance requirements tied to PII usage.
Section 3 – SQL in the Lakehouse

Querying Fundamentals
Identify queries that retrieve data using SELECT with specific WHERE conditions.
Understand and interpret the output of a SELECT query based on its columns and
clauses.

Data Insertion and Merging


Know when to use:
MERGE INTO : For upserts (insert/update based on match conditions).

INSERT INTO : To append new rows into a table.

COPY INTO : For loading external files (e.g., CSV, JSON, Parquet) into Delta
tables.
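
A COPY INTO sketch (path, file format, and options are illustrative):

-- Incrementally load CSV files from a storage path into a Delta table
COPY INTO sales
FROM '/mnt/raw/sales_csv/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');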

Query Simplification
Use subqueries (nested SELECTs) to simplify complex queries and improve
modularity.

Joins
Compare and contrast different types of joins:
INNER JOIN : Only matching records from both sides.

LEFT OUTER JOIN : All records from the left, and matched ones from the right.

RIGHT OUTER JOIN : All records from the right, and matched ones from the left.

FULL OUTER JOIN : All records when there is a match in either left or right.

CROSS JOIN : Cartesian product of both tables (all combinations).

Aggregations
Use aggregate functions like SUM() , AVG() , COUNT() , MIN() , MAX() to produce
summary metrics.
Group results using GROUP BY to structure the output.

Nested Data
Handle nested data formats (like structs or arrays) using dot notation and
functions like explode() .
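
For instance, assuming an orders table with a customer struct column and an items array column:

-- Dot notation reaches into struct fields
SELECT customer.name, customer.city
FROM orders;

-- explode() unnests an array into one row per element
SELECT id, explode(items) AS item
FROM orders;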

Cube and Roll-Up


Use ROLLUP to aggregate hierarchically from the most detailed to the grand total.
Use CUBE to compute all combinations of groupings.
Know the difference:
ROLLUP is hierarchical and includes subtotals.
CUBE includes all possible subtotal combinations.

Window Functions
Use windowing (analytic) functions to calculate metrics across time or partitions:
Examples: ROW_NUMBER() , RANK() , LAG() , LEAD() , SUM() OVER (...) .

ANSI SQL Benefits


Having ANSI SQL support ensures consistency, portability, and familiarity.
Reduces the learning curve and enables analysts to apply standard SQL practices.

Silver-Level Data
Silver data is cleaned and joined, ready for consumption or further enrichment.
Know how to identify, access, and clean silver-layer data for analysis.

Query Optimization
Use query history to review and improve past queries.
Caching frequently used datasets or results helps reduce query latency and
development time.

Spark SQL Performance Tuning


Use higher-order functions (e.g., transform , filter , aggregate ) for more
efficient operations on arrays and maps.
Optimize queries for performance using these functional programming-like
constructs.

User-Defined Functions (UDFs)


Create and apply UDFs when built-in SQL functions are insufficient.
UDFs allow for reusable custom logic in common or complex transformations.
Understand appropriate scenarios for UDFs, especially when scaling across large
datasets.
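
A small SQL UDF sketch (function name and logic are illustrative):

-- Define a reusable calculation once, then call it like a built-in function
CREATE OR REPLACE FUNCTION revenue_per_user(revenue DOUBLE, users INT)
RETURNS DOUBLE
RETURN revenue / users;

SELECT revenue_per_user(1000.0, 40) AS rpu;
-- 25.0
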
Section 4 – Data Visualization and Dashboarding

Creating Visualizations
Build basic visualizations directly from query results in Databricks SQL.
Visualizations are schema-specific (based on query output structure).
Supported visualization types include:
Table
Details
Counter
Pivot table

Formatting and Storytelling


Visualization formatting significantly impacts how insights are perceived.
Enhance clarity and impact through:
Proper labeling
Axis scaling
Color usage
Consistent formatting
Use formatting techniques to add visual appeal and guide interpretation.
Customize visualizations to support data storytelling, emphasizing key trends or
comparisons.

Dashboard Composition
Build dashboards by combining multiple existing visualizations from saved
Databricks SQL queries.
Adjust color schemes across visualizations for a consistent appearance.
Use query parameters to allow dynamic updates to dashboard content based on
user input.

Dashboard Parameters
Understand how parameters affect the underlying query output:
Example: A dropdown that filters data by region or date.
Use "Query-Based Dropdown List" to populate a parameter from the distinct
results of another query.
This enables dynamic, user-driven filtering across visualizations.
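
The dropdown-source query for a Query-Based Dropdown List can be as simple as (table assumed):

SELECT DISTINCT region
FROM sales
ORDER BY region;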

Dashboard Sharing and Refreshing


Share dashboards in multiple ways:
With edit or view access
Public or private links (workspace dependent)
Evaluate pros and cons of each sharing method:
Public links are easy to distribute but reduce control.
Private dashboards maintain control but require user access setup.

Credential Behavior
Dashboards can be refreshed using the owner's credentials, allowing access for
users without direct permissions to all underlying objects.
This supports safe and controlled dashboard sharing.

Refresh Scheduling
Configure automatic refresh intervals to keep dashboards updated.
Be aware of potential issues:
If a dashboard’s refresh rate is shorter than the warehouse's Auto Stop
setting, the warehouse may shut down before a refresh occurs.
Consider aligning refresh schedules with warehouse lifecycle settings.

Alerts
Alerts monitor the result of a query and trigger notifications when a specified
condition is met.
Use cases include monitoring thresholds, anomalies, or status flags (e.g., "value
exceeds 100", "row count is zero").
Alerts are configured by:
Selecting a saved query
Defining a condition (e.g., value > X )
Choosing a target column and row (if applicable)
Specifying one or more notification channels (email, webhook, Slack, etc.)

Key Limitations

Alerts only work on queries that return a single numeric value (e.g., row count,
sum, or a calculated metric).
Alerts do not work with queries that return multiple rows or complex result sets.
Alerts are not compatible with date-type query parameters.
Alerts only support dropdown-based query parameters—these can be static lists
or populated from a query.
For example, you can filter by region or category using a dropdown, but you
cannot pass in a date picker.

Best Practices

Use alerts on queries specifically designed to return a single value for evaluation.
Avoid complex aggregations or queries with joins unless they simplify to a single
numeric result.
Use parameterized queries with dropdowns for dynamic alerting across categories
(e.g., per region or per status).
Align alert check frequency with your dashboard/warehouse refresh schedules.

Section 5 – Analytics Applications

Statistics and Distributions

Types of Variables
Discrete variables: Represent countable values, such as number of transactions or
logins. Typically whole numbers.
Continuous variables: Represent measurable values on a continuum, such as
temperature, revenue, or time. Can take on infinitely fine values.

Descriptive vs. Inferential Statistics


Descriptive statistics: Focus on summarizing and describing a dataset using
numerical metrics and visualizations.

Do not attempt to draw conclusions beyond the data.


Example: Calculating the average order value from a dataset.

Inferential statistics: Use sample data to infer or generalize about a population.

Includes hypothesis testing, confidence intervals, and regression.


Not heavily tested on the Databricks exam.

Rule of Thumb:

If it summarizes data without making predictions, it's descriptive statistics.


Measures in Descriptive Statistics

Central Tendency:

Mean: Average value.


Median: Middle value (less sensitive to outliers).
Mode: Most frequently occurring value.

Dispersion:

Standard deviation: How much values deviate from the mean.


Variance: Square of standard deviation.
Range: Difference between max and min.
Interquartile range (IQR): Spread of the middle 50% of the data (Q3 − Q1).
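
Most of these measures can be computed directly in Databricks SQL; a sketch against the assumed sales table (percentile() is exact, percentile_approx() is the faster option on large data):

SELECT
  AVG(amount)                                          AS mean_amount,
  PERCENTILE(amount, 0.5)                              AS median_amount,
  STDDEV(amount)                                       AS stddev_amount,
  MAX(amount) - MIN(amount)                            AS range_amount,
  PERCENTILE(amount, 0.75) - PERCENTILE(amount, 0.25)  AS iqr_amount
FROM sales;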

Moments of a Distribution

| Moment | What It Describes | Use Case Example |
| --- | --- | --- |
| 1st | Mean – central location | Overall average |
| 2nd | Variance/Standard Deviation – spread | How tightly values cluster around the mean |
| 3rd | Skewness – asymmetry | Detecting left/right tilt in the data distribution |
| 4th | Kurtosis – tail weight and outlier risk | High kurtosis = heavy tails, more extreme values |

Kurtosis Explained:

Low kurtosis: Data have light tails; few outliers.


High kurtosis: Data have heavy tails; more prone to extreme values or outliers.
Kurtosis is about the "tailedness" of a distribution, not its peak.
Kurtosis measures how likely a distribution is to produce outliers — not how tall or
flat the peak is. Think: "How fat are the tails?"

Comparing Statistical Measures

| Concept | Description and Comparison |
| --- | --- |
| Mean vs. Median | Mean is affected by outliers; median is more robust. |
| Standard Deviation vs. IQR | SD uses all values; IQR focuses on the middle 50% (less sensitive to outliers). |

Data Enhancement
Data enhancement refers to enriching existing datasets by adding new attributes,
calculations, or contextual information.
This is a common step in analytics workflows to improve model accuracy or
business relevance.
Examples include:
Adding demographic features
Calculating customer lifetime value
Generating derived fields (e.g., revenue per user)
Identify scenarios where data enhancement is beneficial:
Improving dashboard insights
Supporting more granular segmentation
Enabling better forecasting or prediction

Data Blending
Data blending involves combining data from two or more source applications.
Typically used when joining internal and external datasets that are not in the same
system.
Scenarios where blending is useful:
Merging CRM data with support ticket logs
Joining product data from an ERP with marketing campaign performance

Last-Mile ETL
Last-mile ETL refers to project-specific transformations performed near the end of
a data pipeline.
Often involves:
Cleaning or reshaping gold-layer data
Formatting results for a specific dashboard or report
Applying business rules or mappings for final delivery
Supports the specific analytical needs of a team, stakeholder, or use case.
