DataUnderstandingAndPreparation DOM304

The document outlines a generic approach to problem-solving using analytics, including steps such as defining the problem, collecting and understanding data, and communicating results. It discusses various types of data sets and data objects, as well as statistical methods for analyzing data, including measures of central tendency and dispersion. Additionally, it presents an in-class activity focused on estimating the price of second-hand cars using existing transaction data.

Problem Solving using Analytics

Analytics Problem Solving – Generic Approach
• Define the problem
• Collect data
• Understand data
• Clean data
• Build reports / dashboards / models
• Communicate the results
• Deploy
• Monitor
• Update
Understanding Data
• Data objects and attribute types
• Statistical summaries of data
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
• Spatial, image and multimedia:
– Spatial data: maps
– Image data:
– Video data:
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Types:
– Nominal
– Ordinal
– Binary
– Numeric
– Dates
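The attribute types above can be sketched in code. This is a minimal illustration only: the `Car` record, its field names, and the two sample rows are hypothetical and do not come from the slides. Each object is one row, and each field is one attribute of a particular type.

```python
from dataclasses import dataclass, fields
from datetime import date
from enum import Enum, IntEnum


class Fuel(Enum):
    """Nominal attribute: categories with no inherent order (hypothetical)."""
    PETROL = "petrol"
    DIESEL = "diesel"


class Condition(IntEnum):
    """Ordinal attribute: categories with a meaningful order (hypothetical)."""
    POOR = 1
    FAIR = 2
    GOOD = 3


@dataclass
class Car:
    """One data object (one database row); every field is an attribute."""
    car_id: str            # nominal identifier
    fuel: Fuel             # nominal
    condition: Condition   # ordinal
    certified: bool        # binary
    price: float           # numeric
    sold_on: date          # date


# Two made-up data objects, for illustration only.
cars = [
    Car("C001", Fuel.PETROL, Condition.GOOD, True, 5200.0, date(2023, 4, 2)),
    Car("C002", Fuel.DIESEL, Condition.FAIR, False, 3900.0, date(2023, 5, 17)),
]

# Columns -> attributes.
attribute_names = [f.name for f in fields(Car)]
print(attribute_names)

# Ordinal values can be ranked; nominal values can only be tested for equality.
print(cars[0].condition > cars[1].condition)
print(cars[0].fuel == cars[1].fuel)
```

The attribute type decides which summaries make sense: a mean is meaningful for `price`, while for `fuel` only counts and the mode are.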
Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency, variation and spread

• Key Concepts
– Central tendency
– Dispersion
– Distribution and skewness
– Correlation

Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$
Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
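A short sketch of these measures using Python's standard `statistics` module. The numbers are illustrative, not course data. The helpers `weighted_mean`, `trimmed_mean`, and `grouped_median` are hypothetical names. For the grouped median, the sketch assumes the usual reading of the symbols: L1 is the lower boundary of the median interval, (Σ freq)l is the total frequency of all intervals below it, freq_median is the frequency of the median interval, and width is that interval's width.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def grouped_median(lower, width, n, freq_below, freq_median):
    """Interpolated median for grouped data: L1 + ((n/2 - freq_below) / freq_median) * width."""
    return lower + (n / 2 - freq_below) / freq_median * width


# Illustrative values only (one large value pulls the mean upward).
prices = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)   # even count: average of the middle two
modes = statistics.multimode(prices)       # two modes here, so the data is bimodal

print("mean:", mean_price)
print("median:", median_price)
print("modes:", modes)
print("trimmed mean (k=1):", trimmed_mean(prices, 1))
print("weighted mean:", weighted_mean([10, 20, 30], [1, 1, 2]))
print("grouped median:", grouped_median(lower=20, width=10, n=100,
                                        freq_below=40, freq_median=25))
```

The median sits below the mean in this sample because the single large value affects the mean but not the middle of the ordering; trimming one value from each end moves the mean back toward the median.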
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: distribution curves for symmetric, positively skewed, and negatively skewed data]

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)

– Inter-quartile range: IQR = Q3 – Q1

– Five number summary: min, Q1, median, Q3, max

– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Variance (algebraic, scalable computation):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
– Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
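The quartile and variance definitions can be checked numerically. The sketch below uses illustrative values and the standard `statistics` module. Quartile conventions differ between tools; the `method="inclusive"` choice here is an assumption, so other software may report slightly different Q1 and Q3 for the same data.

```python
import math
import statistics

# Illustrative values only.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(values)

# Quartiles and inter-quartile range.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Sample variance, definition form: squared deviations from the mean.
mean = sum(values) / n
s2_definition = sum((x - mean) ** 2 for x in values) / (n - 1)

# Sample variance, one-pass form: only sum(x) and sum(x^2) are needed.
sum_x = sum(values)
sum_x2 = sum(x * x for x in values)
s2_one_pass = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Population variance: divide by N instead of n - 1.
sigma2 = sum_x2 / n - mean ** 2

s = math.sqrt(s2_definition)

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("sample variance (both forms):", s2_definition, s2_one_pass)
print("population variance:", sigma2)
print("sample standard deviation:", s)
```

The one-pass form is what makes the computation scalable: the two running sums can be accumulated over chunks of a large data set and combined at the end. It can lose precision when the mean is large relative to the spread, so the definition form is the safer choice for small data.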

Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to
Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually
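The numbers behind a boxplot can be computed without drawing it. The sketch below makes two assumptions that the slide does not state: the outlier threshold is the common 1.5 × IQR rule, and when outliers exist the whiskers stop at the most extreme values that are still inside the threshold. `boxplot_stats` is a hypothetical helper name, and the data is illustrative.

```python
import statistics


def boxplot_stats(values, k=1.5):
    """Return the five-number summary, whisker ends, and outliers.

    Points farther than k * IQR below Q1 or above Q3 are treated as outliers
    (k = 1.5 is a common convention, assumed here).
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "five_number": (data[0], q1, median, q3, data[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Illustrative values only.
stats = boxplot_stats([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
print(stats["five_number"])
print(stats["whiskers"])
print(stats["outliers"])
```

With no outliers the whisker ends coincide with the minimum and maximum, which matches the description above.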

Boxplot
[Figure: example boxplot]
Histogram Analysis
• Histogram: Graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
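The area-versus-height point can be made concrete with bins of unequal width. In the sketch below the values and bin edges are hypothetical, and `histogram` is an illustrative helper, not a library function. The bar height is the count divided by the bin width, so the bar's area recovers the count.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Count values per bin [edges[i], edges[i+1]); the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        if x < edges[0] or x > edges[-1]:
            continue  # outside the covered range
        i = min(bisect_right(edges, x) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


# Hypothetical values and non-uniform, adjacent, non-overlapping bins.
prices = [12000, 18000, 25000, 27000, 31000, 45000, 52000, 88000]
edges = [10000, 30000, 50000, 90000]

counts = histogram(prices, edges)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
heights = [c / w for c, w in zip(counts, widths)]   # density per unit of the variable
areas = [h * w for h, w in zip(heights, widths)]    # area gives back the count

print("counts:", counts)
print("heights:", heights)
print("areas:", areas)
```

The last two bins hold the same number of values, but the wider bin gets a bar half as tall. Plotting raw counts as heights would make the two bins look equally dense.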

Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data
• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data
[Figure: scatter plots of uncorrelated data]
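The three cases (positive, negative, and no correlation) can be quantified with a correlation coefficient. The slides do not name a specific measure; the sketch below assumes Pearson's correlation coefficient, and all variable names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Hypothetical paired observations.
age = [1, 2, 3, 4, 5]
mileage = [12, 25, 33, 48, 60]       # rises with age
price = [9.0, 8.1, 6.9, 6.2, 5.0]    # falls with age
noise = [2, 5, 1, 5, 2]              # no linear relation to age

r_positive = pearson(age, mileage)
r_negative = pearson(age, price)
r_none = pearson(age, noise)

print("age vs mileage:", round(r_positive, 3))
print("age vs price:", round(r_negative, 3))
print("age vs noise:", round(r_none, 3))
```

Values near +1 or -1 correspond to points lying close to a rising or falling line in the scatter plot, and values near 0 to a cloud with no linear trend. A coefficient near 0 does not rule out a non-linear relationship, so the scatter plot is still worth inspecting.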
Summary
• Data understanding is about getting a “feel” of the data
• Key points to observe:
– Size and type of data
– Meaning / definitions of different columns
– Missing values
– Mean / Median, Dispersion
– Distribution, skewness (if relevant)
– Bi-variate relationships – Correlation

In-class activity – Second-hand cars (EDA)
Gregory has been hired as a data scientist at GoldenSeconds, an upcoming startup that intends to be a marketplace for second-hand cars. He has been asked to build a tool that helps estimate the price of a second-hand car. GoldenSeconds wants to integrate the tool with its website so that interested sellers can get a quick and fair estimate of their car's price. Thanks to a previous market research initiative, the company already has details of ~10,000 second-hand car transactions. Gregory starts by taking a closer look at the data.
Put yourself in Gregory’s place and answer the following:
1. What is the average price of a car sold? What is the median
price?
2. What percentage of values in the variable ‘cert’ are null values?
How should they be handled?
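One possible way to start on these two questions is sketched below. The actual transaction file is not shown in these slides, so the five-row sample is invented. The column name `price` and the encoding of a null `cert` as an empty field are also assumptions; only the variable name `cert` comes from the activity text.

```python
import csv
import io
import statistics

# Invented stand-in for the real transaction data.
SAMPLE = """price,cert
5200,yes
3900,
7400,no
6100,
4800,yes
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Question 1: average and median price.
prices = [float(r["price"]) for r in rows if r["price"]]
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# Question 2: share of null values in 'cert' (an empty field is treated as null here).
null_cert = sum(1 for r in rows if not r["cert"].strip())
pct_null_cert = 100 * null_cert / len(rows)

print("mean price:", mean_price)
print("median price:", median_price)
print("percent null in cert:", pct_null_cert)
```

How to handle the nulls depends on what a missing `cert` means in the data. If it simply means the car was not certified, the nulls can be recoded as their own category. If it means the information was never recorded, an explicit "unknown" category is safer than dropping the rows.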
