DataUnderstandingAndPreparation DOM304

The document outlines a generic approach to problem-solving using analytics, including steps such as defining the problem, collecting and understanding data, and communicating results. It discusses various types of data sets and data objects, as well as statistical methods for analyzing data, including measures of central tendency and dispersion. Additionally, it presents an in-class activity focused on estimating the price of second-hand cars using existing transaction data.

Uploaded by

bqqj5qbt8n

Problem Solving using Analytics

Analytics Problem Solving – Generic Approach
• Define the problem
• Collect data
• Understand data
• Clean data
• Build reports / dashboards / models
• Communicate the results
• Deploy
• Monitor
• Update
Understanding Data
• Data objects and attribute types
• Statistical summaries of data
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
• Spatial, image and multimedia:
– Spatial data: maps
– Image data:
– Video data:
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects,
tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
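As a small illustration of rows as data objects and columns as attributes, the sketch below models the customer example in Python. The `Customer` class and its sample values are hypothetical and are not taken from any real database.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    """One data object (a database row); each field is an attribute (a column)."""
    customer_id: int
    name: str
    address: str


# Hypothetical rows, for illustration only
customers = [
    Customer(1, "Asha", "12 Lake Road"),
    Customer(2, "Ben", "4 Hill Street"),
]

attribute_names = [f.name for f in fields(Customer)]
print(len(customers), "data objects with attributes", attribute_names)
```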
Attributes
• Attribute (or dimensions, features,
variables): a data field, representing a
characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Types:
– Nominal
– Ordinal
– Binary
– Numeric
– Dates
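The attribute type decides which operations are meaningful. The sketch below uses three hypothetical second-hand car records (all field names and values are assumptions made for illustration) to show one sensible operation per type.

```python
from collections import Counter
from datetime import date

# Ordinal attribute: the levels have a rank, listed here from worst to best
CONDITION_ORDER = ["poor", "fair", "good", "excellent"]

# Hypothetical records, for illustration only
cars = [
    {"colour": "red", "condition": "good", "certified": True,
     "price": 5200.0, "sold_on": date(2023, 5, 14)},
    {"colour": "blue", "condition": "excellent", "certified": False,
     "price": 7400.0, "sold_on": date(2023, 1, 30)},
    {"colour": "red", "condition": "poor", "certified": False,
     "price": 1900.0, "sold_on": date(2022, 11, 2)},
]

# Nominal: only equality and counting make sense
colour_counts = Counter(car["colour"] for car in cars)

# Ordinal: sort by rank, not alphabetically
ranked = sorted((car["condition"] for car in cars), key=CONDITION_ORDER.index)

# Binary: the share of True values
share_certified = sum(car["certified"] for car in cars) / len(cars)

# Numeric: arithmetic such as the mean is meaningful
mean_price = sum(car["price"] for car in cars) / len(cars)

# Dates: differences give durations
span_days = (max(c["sold_on"] for c in cars) - min(c["sold_on"] for c in cars)).days

print(colour_counts, ranked, share_certified, mean_price, span_days)
```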
Understanding Data
• Data objects and attribute types
• Statistical summaries of data
Basic Statistical Descriptions of
Data
• Motivation
– To better understand the data: central tendency, variation
and spread

• Key Concepts
– Central tendency
– Dispersion
– Distribution and skewness
– Correlation

Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Note: n is sample size and N is population size.
– Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average of the
middle two values otherwise
– Estimated by interpolation (for grouped data):
$$\text{median} = L_1 + \left(\frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
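The measures above can be sketched in Python with the standard `statistics` module. The helper functions `weighted_mean`, `trimmed_mean` and `grouped_median` and all sample values are illustrative assumptions. In `grouped_median`, the lower boundary of the median interval plays the role of L1 and the running total of earlier frequencies plays the role of (Σ freq)l; equal-width intervals and a non-empty data set are assumed.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Drop `fraction` of the values from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def grouped_median(lower, width, freqs):
    """Interpolated median for equal-width intervals starting at `lower`."""
    n = sum(freqs)
    cumulative = 0  # sum of frequencies below the current interval
    for i, freq in enumerate(freqs):
        if cumulative + freq >= n / 2:
            l1 = lower + i * width  # lower boundary of the median interval
            return l1 + (n / 2 - cumulative) / freq * width
        cumulative += freq


# Hypothetical values, for illustration only
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(values))
print("median:", statistics.median(values))
print("modes:", statistics.multimode(values))  # two modes, so bimodal
print("10% trimmed mean:", trimmed_mean(values, 0.10))
print("weighted mean:", weighted_mean([10, 20, 30], [1, 1, 2]))
print("grouped median:", grouped_median(0, 10, [2, 3, 5, 6, 4]))
```

In this sample the single large value (110) pulls the mean above the median, while the trimmed mean sits between the two.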
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and
negatively skewed data
[Figure: three distribution curves, labelled symmetric, positively skewed and negatively skewed]
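One rough way to see skew without a plot is to compare the mean with the median. The sketch below uses three tiny hypothetical samples; the gap between mean and median is only an informal indicator, not a formal skewness statistic.

```python
import statistics

# Hypothetical samples, for illustration only
samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "positively skewed": [1, 2, 3, 4, 20],   # long tail to the right
    "negatively skewed": [1, 17, 18, 19, 20],  # long tail to the left
}

gaps = {}
for label, values in samples.items():
    # A positive gap means the mean is pulled above the median
    gaps[label] = statistics.mean(values) - statistics.median(values)
    print(f"{label}: mean - median = {gaps[label]}")
```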

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)

– Inter-quartile range: IQR = Q3 – Q1

– Five number summary: min, Q1, median, Q3, max


– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually
– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Variance (algebraic, scalable computation):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
– Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
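The two forms of the sample variance give the same number; the second needs only running sums of x and x squared, which is what makes it scalable. A minimal sketch, with hypothetical data and function names:

```python
import math
import statistics


def sample_variance(xs):
    """Definition: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_from_sums(xs):
    """Equivalent form: [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def population_variance(xs):
    """Population form: mean of x^2 minus the squared mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu


# Hypothetical values, for illustration only
xs = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(xs)
print("sample variance:", s2, "std dev:", math.sqrt(s2))
print("from running sums:", sample_variance_from_sums(xs))
print("population variance:", population_variance(xs))
```

With floating-point data of large magnitude, the running-sum form can lose precision because it subtracts two large, nearly equal numbers, so the definition form is the safer default when all the data fits in memory.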


Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to
Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually
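The numbers behind a boxplot can be computed directly. The sketch below assumes the 1.5 x IQR rule as the outlier threshold and the "inclusive" quartile method of Python's `statistics.quantiles`; other tools use slightly different quartile conventions, so their Q1 and Q3 may differ a little. Data and function names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical values, for illustration only
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = five_number_summary(data)
outliers = iqr_outliers(data)
print("five-number summary:", summary)
print("outliers:", outliers)
```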

Boxplot

Histogram Analysis
• Histogram: Graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
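To make the area-versus-height point concrete, the sketch below bins hypothetical prices into intervals of unequal width and computes a bar height (density) such that each bar's area equals the proportion of cases in that interval. The function name, the bin edges and the prices are assumptions made for illustration; at least one value is assumed to fall inside the edges.

```python
def histogram_density(values, edges):
    """Count values per bin and convert counts to densities.

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    Density = count / (total count * bin width), so bar area = proportion.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    n = sum(counts)
    densities = [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    return counts, densities


# Hypothetical prices and bin edges, for illustration only
prices = [12000, 18000, 25000, 27000, 31000, 45000, 52000, 88000]
edges = [10000, 20000, 30000, 50000, 90000]

counts, densities = histogram_density(prices, edges)
areas = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities)]
print("counts:", counts)
print("bar heights:", densities)
print("bar areas:", areas)
```

Every interval here holds the same number of cars, yet the wider intervals get shorter bars; only the areas are equal.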

Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted
as points in the plane

Positively and Negatively Correlated Data
• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data

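Correlation can also be put into a number. The slides do not name a specific coefficient; the sketch below assumes the Pearson correlation coefficient, which ranges from -1 (perfectly negative) through 0 (no linear relationship) to +1 (perfectly positive). All variable names and values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values, for illustration only
age_years = [1, 2, 3, 4, 5]
mileage = [10, 22, 29, 41, 52]   # rises with age
price = [9, 8, 6, 5, 2]          # falls with age
unrelated = [2, 5, 1, 5, 2]      # no linear pattern with age

r_positive = pearson_r(age_years, mileage)
r_negative = pearson_r(age_years, price)
r_none = pearson_r(age_years, unrelated)
print(r_positive, r_negative, r_none)
```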
Summary
• Data understanding is about getting a “feel” for the data
• Key points to observe:
– Size and type of data
– Meaning / definitions of different columns
– Missing values
– Mean / Median, Dispersion
– Distribution, skewness (if relevant)
– Bivariate relationships – Correlation

In-class activity – Second-hand cars
(EDA)
Gregory has been hired as a data scientist at GoldenSeconds, an
upcoming startup that intends to be a marketplace for second-hand
cars. He has been asked to build a tool that would help
estimate the price of a second-hand car. GoldenSeconds wants to
integrate the tool with its website, so that interested sellers can
get a quick and fair estimate of their car’s price. Thanks to a
previous market research initiative, the company already has
details of ~10,000 second-hand car transactions. Gregory starts
by taking a closer look at the data.
Put yourself in Gregory’s place and answer the following:
1. What is the average price of a car sold? What is the median
price?
2. What percentage of values in the variable ‘cert’ are null values?
How should they be handled?
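A possible starting point for both questions is sketched below. The transaction data itself is not part of these slides, so the five records are hypothetical stand-ins; the column name `price` is an assumption, while `cert` comes from the question. Nulls are represented as `None`.

```python
import statistics

# Hypothetical stand-in for the ~10,000 real transactions
transactions = [
    {"price": 4200, "cert": "yes"},
    {"price": 5100, "cert": None},
    {"price": 3900, "cert": "no"},
    {"price": 15800, "cert": None},
    {"price": 4600, "cert": "yes"},
]

prices = [t["price"] for t in transactions]

# Question 1: average and median price
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# Question 2: percentage of null values in 'cert'
null_count = sum(t["cert"] is None for t in transactions)
null_pct = 100 * null_count / len(transactions)

print("mean price:", mean_price)
print("median price:", median_price)
print("null % in cert:", null_pct)
```

Comparing the mean with the median is worth doing on the real data too, since a few expensive cars can pull the mean well above the median, as they do in this toy sample.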
