Data Mining: Data
Dr. Lov Kumar
Assistant Professor, BITS Pilani, Hyderabad Campus
NIT Kurukshetra
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance
• Example: in the table below, each row is an object and each column an attribute

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned
to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
Types of Attributes
There are different types of attributes
– Nominal: Data are neither measured nor ordered but subjects are merely
allocated to distinct categories
Examples: ID numbers, eye color, zip codes
– Ordinal: Ordinal data is categorical data where the variables have natural, ordered
categories and the distances between the categories are not known.
Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium, short}
– Interval: In interval measurement the distance between attributes does have
meaning.
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio: both differences and ratios of values are meaningful; there is a true zero point.
Examples: temperature in Kelvin, length, time, counts
The type of an attribute depends on which of the following
properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, −
– Multiplication: *, /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
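As a small illustration, an ordinal attribute can be represented in pandas as an ordered categorical (a minimal sketch; the height values are made up):

import pandas as pd

# Ordinal attribute: ordered categories, but distances between
# categories are not defined
height = pd.Categorical(["tall", "short", "medium", "tall"],
                        categories=["short", "medium", "tall"],
                        ordered=True)
print(height.min(), height.max())  # order is meaningful: short tall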
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there
are m rows, one for each object, and n columns, one for each
attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
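A minimal NumPy sketch of this data matrix (values taken from the table above):

import numpy as np

# m-by-n data matrix: m = 2 objects (rows), n = 5 attributes (columns)
X = np.array([
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
])
print(X.shape)  # (2, 5)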
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding
term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0     2      6     0     2      0        2
Document 2    0     7     0     2     1      0     0     3      0        0
Document 3    0     1     0     0     1      2     2     0      3        0
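A minimal scikit-learn sketch of building such term vectors (the example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the team lost the game after a timeout",
        "the coach wants a higher score this season"]
vec = CountVectorizer()
X = vec.fit_transform(docs)          # document-by-term count matrix
print(vec.get_feature_names_out())   # one column per term
print(X.toarray())                   # term counts per document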
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a
customer during one shopping trip constitutes a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
[Figure: a small example graph with labeled nodes and weighted edges]
Molecular Structures: Chemical Data
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions, where each element of the sequence is a set of items]
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen
[Figure: two sine waves, and the same two sine waves with noise added]
• Outliers are data objects with characteristics that are
considerably different than most of the other data
objects in the data set
• Outliers can be detected using the standard deviation (σ), which measures how
much the members of a group differ from the mean value for the group:
  σ = sqrt( (1/N) Σ (xi − μ)² )
• A value whose distance from the mean is large relative to σ (e.g., more than
2–3 standard deviations) can be flagged as an outlier.
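A minimal NumPy sketch of this idea (the data values and the 2σ threshold are assumptions):

import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])

z = (x - x.mean()) / x.std()   # z-score: distance from the mean in units of sigma
outliers = x[np.abs(z) > 2]    # flag values more than 2 sigma from the mean
print(outliers)                # [25.]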
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
A sketch using scikit-learn's SimpleImputer; supported strategies include
'mean', 'median', and 'most_frequent':

import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imp_mean.fit_transform([[1, 2], [np.nan, 3], [7, 6]])
# the NaN is replaced by the column mean: [[1. 2.] [4. 3.] [7. 6.]]
Duplicate Data
• Data set may include data objects that are
duplicates, or almost duplicates of one another
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single
attribute (or object)
Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
Reducing the possible values for date from 365 days to 12 months.
This type of aggregation is commonly used in Online Analytical Processing
(OLAP).
Common aggregate statistics:
– Arithmetic mean: x̄ = (1/n) Σ xi
– Standard deviation: σ = sqrt( (1/n) Σ (xi − x̄)² ), how much the members of a
group differ from the mean value for the group.
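A minimal pandas sketch of the day-to-month aggregation described above (column names and values are hypothetical):

import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-11"]),
    "amount": [120.0, 80.0, 95.0],
})

# Roll daily records up to months, as in an OLAP roll-up
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].agg(["mean", "std"])
print(monthly)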
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation of the data
and the final data analysis.
• Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item.
• Sampling without replacement
– As each item is selected, it is removed from the population.
• Sampling with replacement
– Objects are not removed from the population as they are
selected for the sample.
• In sampling with replacement, the same object can be picked up more than
once
• Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
[Figure: the same data set drawn with 8000, 2000, and 500 sampled points]
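A minimal NumPy sketch of simple random sampling with and without replacement:

import numpy as np

rng = np.random.default_rng(seed=0)
population = np.arange(8000)

no_repl = rng.choice(population, size=2000, replace=False)   # each item at most once
with_repl = rng.choice(population, size=2000, replace=True)  # items may repeat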
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining
algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques:
– Principal Component Analysis (PCA)
– Others: supervised and non-linear techniques
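A minimal scikit-learn sketch of PCA (the synthetic data and shapes are assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 objects with 10 attributes

pca = PCA(n_components=2)        # keep the top 2 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (100, 2)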
Feature Subset Selection
• Redundant features
– duplicate much or all of the information contained in one or more
other attributes
– Example: purchase price of a product and the amount of sales
tax paid
• Irrelevant features
– contain no information that is useful for the data mining task at
hand
– Example: students' ID is often irrelevant to the task of predicting
students' GPA
• Techniques:
– Brute-force approaches:
• Try all possible feature subsets as input to data mining
algorithm
– Filter approaches:
• Features are selected before data mining algorithm is
run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find
best subset of attributes
Feature Subset Selection
• Filter approaches
  – Pearson's correlation is used as a measure for quantifying the linear
dependence between two continuous variables X and Y. Its value varies from
−1 to +1 and is given as:
    r = Σ (xi − x̄)(yi − ȳ) / ( sqrt(Σ (xi − x̄)²) × sqrt(Σ (yi − ȳ)²) )
• Wrapper approaches
  – Train a model using a candidate subset of features and use its
performance to evaluate that subset.
• Sequential Forward Selection (SFS)
  – Start with the empty set, X = ∅
  – Repeatedly add the most significant feature with respect to X (see the
sketch below)
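A minimal sketch of forward selection with scikit-learn's SequentialFeatureSelector (the estimator and the iris data set are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection: start from the empty set and greedily add the
# feature that most improves cross-validated performance
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features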
Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than
the original attributes.
• Three general methodologies:
– Feature Extraction
– Mapping Data to New Space
– Feature Construction
Attribute Transformation
• A function that maps the entire set of values of a given
attribute to a new set of replacement values such that
each old value can be identified with one of the new
values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization
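A minimal NumPy sketch of standardization (z-score) and min-max normalization (the values are made up):

import numpy as np

x = np.array([120.0, 135.0, 150.0, 160.0, 175.0])

x_standardized = (x - x.mean()) / x.std()           # zero mean, unit variance
x_normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]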
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data
objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Similarity/Dissimilarity for Simple Attributes
Common definitions for a single attribute, where p and q are the attribute
values for two data objects (d = dissimilarity, s = similarity):

Attribute Type   Dissimilarity                       Similarity
Nominal          d = 0 if p = q, 1 if p ≠ q          s = 1 if p = q, 0 if p ≠ q
Ordinal          d = |p − q| / (n − 1), with the     s = 1 − d
                 values mapped to integers 0..n−1
Interval/Ratio   d = |p − q|                         s = −d, or s = 1/(1 + d)
Euclidean Distance
• Euclidean Distance

  dist(p, q) = sqrt( Σk (pk − qk)² ),  k = 1, ..., n

• Where n is the number of dimensions (attributes) and pk and qk are,
respectively, the kth attributes (components) of data objects p and q.
Euclidean Distance
[Figure: the four points plotted in the x-y plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
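A minimal NumPy sketch that reproduces the distance matrix above:

import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # p1..p4 from the table

# Pairwise Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 3))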
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some
well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
• where d(p, q) is the distance (dissimilarity) between
points (data objects), p and q.
• Measures that satisfy all three properties are known as
metrics.
Non-metric Dissimilarities:
• Define d(A, B) = size(A − B), where size is a function returning the number
of elements in a set.
• Example: A = {1, 2, 3, 4} and B = {2, 3, 4}
  A − B = {1}, so d(A, B) = 1
  B − A = ∅, so d(B, A) = 0
• This dissimilarity violates symmetry, so it is not a metric; it can be
repaired by defining d(A, B) = size(A − B) + size(B − A).
Common Properties of a Similarity
Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data
objects), p and q.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attributes values
= (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
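A minimal NumPy sketch of the same computation:

import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))
m10 = np.sum((p == 1) & (q == 0))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # 0.7
jac = m11 / (m01 + m10 + m11)                # 0.0
print(smc, jac)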
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.3150
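A minimal NumPy sketch of the same computation:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))  # 0.315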
Extended Jaccard Coefficient (Tanimoto)
• The extended Jaccard coefficient can be used for document data and reduces
to the Jaccard coefficient in the case of binary attributes:
  EJ(d1, d2) = (d1 • d2) / ( ||d1||² + ||d2||² − d1 • d2 )
Pearson's Correlation
• Correlation measures the linear relationship between
objects
• To compute correlation, we standardize the data objects p and q, and then
take their dot product:
  p'k = (pk − mean(p)) / std(p)
  q'k = (qk − mean(q)) / std(q)
  correlation(p, q) = p' • q'
Visually Evaluating Correlation
[Figure: scatter plots of data pairs with correlations ranging from −1 to 1]
Perfect Correlation
• Correlation is always in the range −1 to 1.
• A correlation of 1 (−1) means that x and y have a
perfect positive (negative) linear relationship.
• Perfect negative correlation (y = −x/3):
  x: (−3, 6, 0, 3, −6)
  y: ( 1, −2, 0, −1, 2)
• Perfect positive correlation (y = x/3):
  x: ( 3, 6, 0, 3, 6)
  y: ( 1, 2, 0, 1, 2)
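A quick NumPy check of the two examples above:

import numpy as np

x1, y1 = np.array([-3, 6, 0, 3, -6]), np.array([1, -2, 0, -1, 2])
x2, y2 = np.array([3, 6, 0, 3, 6]), np.array([1, 2, 0, 1, 2])

print(np.corrcoef(x1, y1)[0, 1])  # -1.0: perfect negative correlation
print(np.corrcoef(x2, y2)[0, 1])  #  1.0: perfect positive correlation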
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an
overall similarity is needed.
Using Weights to Combine Similarities
• May not want to treat all attributes the same.
  – Use weights wk which are between 0 and 1 and sum to 1.
  – Overall similarity: similarity(p, q) = Σk wk sk(p, q), where sk is the
similarity on the kth attribute (see the sketch below).
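A minimal NumPy sketch of this weighted combination (the per-attribute similarities and weights are made-up values):

import numpy as np

s = np.array([0.8, 0.5, 1.0])  # per-attribute similarities s_k
w = np.array([0.5, 0.3, 0.2])  # weights in [0, 1] that sum to 1

overall = np.sum(w * s)        # 0.4 + 0.15 + 0.2 = 0.75
print(overall)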