Problem Solving using Analytics
Analytics Problem Solving – Generic Approach
• Define the problem
• Collect data
• Understand data
• Clean data
• Build reports / dashboards / models
• Communicate the results
• Deploy
• Monitor
• Update
Understanding Data
• Data objects and attribute types
• Statistical summaries of data
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
• Spatial, image and multimedia
– Spatial data: maps
– Image data
– Video data
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (or dimensions, features,
variables): a data field, representing a
characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Types:
– Nominal
– Ordinal
– Binary
– Numeric
– Dates
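To make the row/column picture concrete, here is a minimal sketch in which each record is a data object and each field is an attribute. The `Customer` class, its field names, the segment ordering and all values are hypothetical and chosen only to show one attribute of each type listed above.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Customer:
    """One data object; every field is one attribute (all names hypothetical)."""
    customer_id: str    # nominal: a label with no meaningful order
    segment: str        # ordinal: bronze < silver < gold
    is_member: bool     # binary
    total_spend: float  # numeric
    signup_date: date   # date


# The order of an ordinal attribute has to be stated explicitly.
SEGMENT_ORDER = ["bronze", "silver", "gold"]

# A tiny data set: three data objects (rows).
customers = [
    Customer("C001", "silver", True, 120.50, date(2023, 1, 15)),
    Customer("C002", "gold", False, 980.00, date(2022, 11, 3)),
    Customer("C003", "bronze", True, 45.25, date(2023, 6, 30)),
]

# The attributes (columns) of the data set.
attribute_names = [f.name for f in fields(Customer)]

# Ordinal values can be ranked; nominal values such as customer_id cannot.
ranked = sorted(customers, key=lambda c: SEGMENT_ORDER.index(c.segment))
```

The attribute type decides which summaries make sense: a mean is meaningful for `total_spend`, while for `segment` only counts and rank-based summaries are.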
Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency, variation
and spread
• Key Concepts
– Central tendency
– Dispersion
– Distribution and skewness
– Correlation
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
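The measures on this slide can be tried out with Python's standard `statistics` module. This is a minimal sketch: the helper names, the sample values and the grouped-data numbers are made up for illustration. In `grouped_median`, `lower_bound` is taken to be the lower boundary of the interval containing the median, `freq_below` the total frequency of all intervals below it, and `freq_median` the frequency of that interval.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every value")
    return statistics.mean(ordered[k:len(ordered) - k])


def grouped_median(lower_bound, n, freq_below, freq_median, width):
    """Interpolated median for grouped (binned) data."""
    return lower_bound + ((n / 2 - freq_below) / freq_median) * width


# Made-up values (say, salaries in thousands) with one extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)   # average of the two middle values
modes = statistics.multimode(salaries)        # two modes, so the data is bimodal
trimmed = trimmed_mean(salaries, k=1)         # drops 30 and 110

exam_score = weighted_mean([70, 80, 90], weights=[1, 1, 2])

# 100 observations in bins of width 10; the median falls in the bin starting at 20.
approx_median = grouped_median(lower_bound=20, n=100, freq_below=30,
                               freq_median=40, width=10)
```

Comparing `mean_salary` with `median_salary` and `trimmed` shows how much a single extreme value pulls the plain mean.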
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively skewed and negatively skewed data
[Figure: symmetric, positively skewed and negatively skewed distributions]
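One way to put a number on skew is the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). The slide does not name a specific measure, so this choice is an assumption, and the data values below are made up. A value near zero suggests symmetry, a positive value a long right tail, and a negative value a long left tail.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; skewness is undefined")
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # long right tail
left_skewed = [-x for x in right_skewed]      # mirror image: long left tail

# In the right-skewed sample the tail pulls the mean above the median.
mean_right = statistics.mean(right_skewed)
median_right = statistics.median(right_skewed)
```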
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually
– Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
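– Variance (algebraic, scalable computation):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
– Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)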
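The two forms of the sample variance can be checked against each other in a few lines. This is a sketch with made-up data and hypothetical function names. The second form needs only the running totals of x and x², which is why it can be computed in a single pass over the data; with floating-point numbers it can lose precision when the mean is large relative to the spread.

```python
import math
import statistics


def sample_variance_two_pass(xs):
    """s^2 = sum((x_i - mean)^2) / (n - 1); needs the mean first."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_one_pass(xs):
    """s^2 = [sum(x_i^2) - (sum(x_i))^2 / n] / (n - 1); one pass, two totals."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total ** 2 / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum(x_i^2) / N - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values

s2 = sample_variance_two_pass(data)
s = math.sqrt(s2)                          # sample standard deviation
sigma = math.sqrt(population_variance(data))  # population standard deviation
```

The standard library's `statistics.variance` and `statistics.pvariance` compute the same two quantities and can serve as a cross-check.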
Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to the minimum and maximum (when outliers
are plotted individually, to the most extreme values that are not outliers)
– Outliers: points beyond a specified outlier
threshold, plotted individually
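The numbers behind a boxplot can be computed directly. The sketch below uses `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 for the same data. The function names and the sample values are made up, and the factor 1.5 is the usual outlier threshold.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


# Made-up values with one unusually large observation.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = five_number_summary(salaries)
outliers = iqr_outliers(salaries)
```

Here the box would span Q1 to Q3, the line inside it would sit at the median, and the values returned by `iqr_outliers` are the points a boxplot would draw individually.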
Boxplot
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories
are not of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
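The two ideas on this slide, counting values into adjacent non-overlapping intervals and letting bar area rather than height carry the count, can be sketched as follows. The function names and the sample values are made up, and each bin is assumed to be half-open: it includes its lower edge and excludes its upper edge.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values into n_bins adjacent intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:      # values outside the covered range are ignored
            counts[index] += 1
    return counts


def bar_heights(counts, widths):
    """Height = count / width, so that each bar's area equals its count."""
    return [c / w for c, w in zip(counts, widths)]


# Equal-width bins: height and area tell the same story.
prices = [12000, 15000, 31000, 35000, 38000, 52000, 71000, 89000]
counts = histogram_counts(prices, start=10000, width=20000, n_bins=4)

# Unequal-width bins: two bins with the same count but widths 10 and 40.
heights = bar_heights([4, 4], widths=[10, 40])
areas = [h * w for h, w in zip(heights, [10, 40])]
```

In the unequal-width case the wider bin gets a lower bar even though both bins hold the same number of values, which is the distinction from a bar chart that the slide points out.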
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted
as points in the plane
Positively and Negatively Correlated Data
• The left half fragment is positively correlated
• The right half is negatively correlated
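What a scatter plot shows visually can be summarised with a correlation coefficient. The slides do not name a specific measure; the sketch below assumes Pearson's r, which ranges from -1 (perfectly negative linear relationship) through 0 (no linear relationship) to +1 (perfectly positive). The function name and the data are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
ys_positive = [2, 4, 6, 8, 10]      # rises with x
ys_negative = [10, 8, 6, 4, 2]      # falls as x rises
ys_unrelated = [2, 1, 3, 1, 2]      # no linear trend

r_positive = pearson_r(xs, ys_positive)
r_negative = pearson_r(xs, ys_negative)
r_unrelated = pearson_r(xs, ys_unrelated)
```

A value of r near zero only rules out a linear relationship, so the scatter plot is still worth looking at.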
Uncorrelated Data
Summary
• Data understanding is about getting a “feel” of the data
• Key points to observe:
– Size and type of data
– Meaning / definitions of different columns
– Missing values
– Mean / Median, Dispersion
– Distribution, skewness (if relevant)
– Bi-variate relationships – Correlation
In-class activity – Second-hand cars (EDA)
Gregory has been hired as a data scientist at GoldenSeconds, an
upcoming startup that intends to be a marketplace for second-hand
cars. He has been asked to build a tool that would help
estimate the price of a second-hand car. GoldenSeconds wants to
integrate the tool with its website, so that interested sellers can
get a quick and fair estimate of their car’s price. Thanks to a
previous market research initiative, the company already has
details of ~10,000 second-hand car transactions. Gregory starts
by taking a closer look at the data.
Put yourself in Gregory’s place and answer the following:
1. What is the average price of a car sold? What is the median
price?
2. What percentage of values in the variable ‘cert’ are null values?
How should they be handled?
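A possible starting point for the two questions, using only the standard library. The real transaction file is not shown here, so the sketch runs on a few made-up rows: the column name `price`, the values, and the convention that an empty field means null are all assumptions, and only the column name `cert` comes from the exercise.

```python
import csv
import io
import statistics

# Toy stand-in for the real data set (made-up rows, hypothetical 'price' column).
TOY_CSV = """price,cert
5000,yes
7500,
12000,no
3000,
9500,yes
"""

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))

# Question 1: average and median price.
prices = [float(row["price"]) for row in rows]
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# Question 2: share of null values in 'cert' (empty field treated as null).
null_cert = sum(1 for row in rows if not row["cert"].strip())
pct_null_cert = 100 * null_cert / len(rows)
```

On the real file, `io.StringIO(TOY_CSV)` would be replaced by an opened file handle. How to handle the nulls (drop the rows, fill in a value, or keep "missing" as its own category) depends on what `cert` means and how often it is missing, which is the discussion the exercise asks for.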