Data Mining: Data
Dr. Lov Kumar
Assistant Professor, BITS Pilani, Hyderabad Campus
NIT Kurukshetra
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance
• Example: in the table below, each row is an object and each column an attribute

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned
to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
Types of Attributes
There are different types of attributes
– Nominal: Data are neither measured nor ordered but subjects are merely
allocated to distinct categories
Examples: ID numbers, eye color, zip codes
– Ordinal: Ordinal data is categorical data where the variables have natural, ordered
categories and the distances between the categories are not known.
Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium, short}
– Interval: In interval measurement the distance between attributes does have
meaning.
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio: both differences and ratios of values are meaningful; there is a true zero point.
Examples: temperature in Kelvin, length, time, counts
The type of an attribute depends on which of the following
properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, −
– Multiplication: *, /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
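As a small illustration, an ordinal attribute can be represented in pandas as an ordered categorical (a minimal sketch; the height values are made up):

import pandas as pd

# Ordinal attribute: ordered categories, but distances between
# categories are not defined
height = pd.Categorical(["tall", "short", "medium", "tall"],
                        categories=["short", "medium", "tall"],
                        ordered=True)
print(height.min(), height.max())  # order is meaningful: short tall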
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there
are m rows, one for each object, and n columns, one for each
attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
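A minimal NumPy sketch of this data matrix (values taken from the table above):

import numpy as np

# m-by-n data matrix: m = 2 objects (rows), n = 5 attributes (columns)
X = np.array([
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
])
print(X.shape)  # (2, 5)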
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding
term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0     2      6     0     2      0        2
Document 2    0     7     0     2     1      0     0     3      0        0
Document 3    0     1     0     0     1      2     2     0      3        0
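A minimal scikit-learn sketch of building such term vectors (the example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the team lost the game after a timeout",
        "the coach wants a higher score this season"]
vec = CountVectorizer()
X = vec.fit_transform(docs)          # document-by-term count matrix
print(vec.get_feature_names_out())   # one column per term
print(X.toarray())                   # term counts per document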
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a
customer during one shopping trip constitutes a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
[Figure: a small example graph with labeled nodes and weighted edges]
Molecular Structures: Chemical Data
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions, where each element of the sequence is a set of items]
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen
[Figure: two sine waves, and the same two sine waves with noise added]
• Outliers are data objects with characteristics that are
considerably different than most of the other data
objects in the data set
• Outliers can be detected using the standard deviation (σ), which measures how
much the members of a group differ from the mean value for the group:
  σ = sqrt( (1/N) Σ (xi − μ)² )
• A value whose distance from the mean is large relative to σ (e.g., more than
2–3 standard deviations) can be flagged as an outlier.
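A minimal NumPy sketch of this idea (the data values and the 2σ threshold are assumptions):

import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])

z = (x - x.mean()) / x.std()   # z-score: distance from the mean in units of sigma
outliers = x[np.abs(z) > 2]    # flag values more than 2 sigma from the mean
print(outliers)                # [25.]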
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
A sketch using scikit-learn's SimpleImputer; supported strategies include
'mean', 'median', and 'most_frequent':

import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imp_mean.fit_transform([[1, 2], [np.nan, 3], [7, 6]])
# the NaN is replaced by the column mean: [[1. 2.] [4. 3.] [7. 6.]]
Duplicate Data
• Data set may include data objects that are
duplicates, or almost duplicates of one another
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single
attribute (or object)
Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
Reducing the possible values for date from 365 days to 12 months.
This type of aggregation is commonly used in Online Analytical Processing
(OLAP).
Common aggregate statistics:
– Arithmetic mean: x̄ = (1/n) Σ xi
– Standard deviation: σ = sqrt( (1/n) Σ (xi − x̄)² ), how much the members of a
group differ from the mean value for the group.
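A minimal pandas sketch of the day-to-month aggregation described above (column names and values are hypothetical):

import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-11"]),
    "amount": [120.0, 80.0, 95.0],
})

# Roll daily records up to months, as in an OLAP roll-up
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].agg(["mean", "std"])
print(monthly)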
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation of the data
and the final data analysis.
• Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item.
• Sampling without replacement
– As each item is selected, it is removed from the population.
• Sampling with replacement
– Objects are not removed from the population as they are
selected for the sample.
• In sampling with replacement, the same object can be picked up more than
once
• Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
[Figure: the same data set drawn with 8000, 2000, and 500 sampled points]
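A minimal NumPy sketch of simple random sampling with and without replacement:

import numpy as np

rng = np.random.default_rng(seed=0)
population = np.arange(8000)

no_repl = rng.choice(population, size=2000, replace=False)   # each item at most once
with_repl = rng.choice(population, size=2000, replace=True)  # items may repeat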
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining
algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques:
– Principal Component Analysis (PCA)
– Others: supervised and non-linear techniques
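A minimal scikit-learn sketch of PCA (the synthetic data and shapes are assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 objects with 10 attributes

pca = PCA(n_components=2)        # keep the top 2 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (100, 2)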
Feature Subset Selection
• Redundant features
– duplicate much or all of the information contained in one or more
other attributes
– Example: purchase price of a product and the amount of sales
tax paid
• Irrelevant features
– contain no information that is useful for the data mining task at
hand
– Example: students' ID is often irrelevant to the task of predicting
students' GPA
• Techniques:
– Brute-force approaches:
• Try all possible feature subsets as input to data mining
algorithm
– Filter approaches:
• Features are selected before data mining algorithm is
run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find
best subset of attributes
Feature Subset Selection
• Filter approaches
  – Pearson's correlation is used as a measure for quantifying the linear
dependence between two continuous variables X and Y. Its value varies from
−1 to +1 and is given as:
    r = Σ (xi − x̄)(yi − ȳ) / ( sqrt(Σ (xi − x̄)²) × sqrt(Σ (yi − ȳ)²) )
• Wrapper approaches
  – Train a model using a candidate subset of features and use its
performance to evaluate that subset.
• Sequential Forward Selection (SFS)
  – Start with the empty set, X = ∅
  – Repeatedly add the most significant feature with respect to X (see the
sketch below)
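A minimal sketch of forward selection with scikit-learn's SequentialFeatureSelector (the estimator and the iris data set are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection: start from the empty set and greedily add the
# feature that most improves cross-validated performance
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features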
Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than
the original attributes.
• Three general methodologies:
– Feature Extraction
– Mapping Data to New Space
– Feature Construction
Attribute Transformation
• A function that maps the entire set of values of a given
attribute to a new set of replacement values such that
each old value can be identified with one of the new
values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization
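A minimal NumPy sketch of standardization (z-score) and min-max normalization (the values are made up):

import numpy as np

x = np.array([120.0, 135.0, 150.0, 160.0, 175.0])

x_standardized = (x - x.mean()) / x.std()           # zero mean, unit variance
x_normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]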
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data
objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Similarity/Dissimilarity for Simple Attributes
Common definitions for a single attribute, where p and q are the attribute
values for two data objects (d = dissimilarity, s = similarity):

Attribute Type   Dissimilarity                       Similarity
Nominal          d = 0 if p = q, 1 if p ≠ q          s = 1 if p = q, 0 if p ≠ q
Ordinal          d = |p − q| / (n − 1), with the     s = 1 − d
                 values mapped to integers 0..n−1
Interval/Ratio   d = |p − q|                         s = −d, or s = 1/(1 + d)
Euclidean Distance
• Euclidean Distance

  dist(p, q) = sqrt( Σk (pk − qk)² ),  k = 1, ..., n

• Where n is the number of dimensions (attributes) and pk and qk are,
respectively, the kth attributes (components) of data objects p and q.
Euclidean Distance
[Figure: the four points plotted in the x-y plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
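A minimal NumPy sketch that reproduces the distance matrix above:

import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # p1..p4 from the table

# Pairwise Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 3))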
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some
well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
• where d(p, q) is the distance (dissimilarity) between
points (data objects), p and q.
• Measures that satisfy all three properties are known as
metrics.
Non-metric Dissimilarities:
• Define d(A, B) = size(A − B), where size is a function returning the number
of elements in a set.
• Example: A = {1, 2, 3, 4} and B = {2, 3, 4}
  A − B = {1}, so d(A, B) = 1
  B − A = ∅, so d(B, A) = 0
• This dissimilarity violates symmetry, so it is not a metric; it can be
repaired by defining d(A, B) = size(A − B) + size(B − A).
Common Properties of a Similarity
Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data
objects), p and q.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attributes values
= (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
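A minimal NumPy sketch of the same computation:

import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))
m10 = np.sum((p == 1) & (q == 0))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # 0.7
jac = m11 / (m01 + m10 + m11)                # 0.0
print(smc, jac)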
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.3150
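A minimal NumPy sketch of the same computation:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))  # 0.315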
Extended Jaccard Coefficient (Tanimoto)
• The extended Jaccard coefficient can be used for document data and reduces
to the Jaccard coefficient in the case of binary attributes:
  EJ(d1, d2) = (d1 • d2) / ( ||d1||² + ||d2||² − d1 • d2 )
Pearson's Correlation
• Correlation measures the linear relationship between
objects
• To compute correlation, we standardize the data objects p and q, and then
take their dot product:
  p'k = (pk − mean(p)) / std(p)
  q'k = (qk − mean(q)) / std(q)
  correlation(p, q) = p' • q'
Visually Evaluating Correlation
[Figure: scatter plots of data pairs with correlations ranging from −1 to 1]
Perfect Correlation
• Correlation is always in the range −1 to 1.
• A correlation of 1 (−1) means that x and y have a
perfect positive (negative) linear relationship.
• Perfect negative correlation (y = −x/3):
  x: (−3, 6, 0, 3, −6)
  y: ( 1, −2, 0, −1, 2)
• Perfect positive correlation (y = x/3):
  x: ( 3, 6, 0, 3, 6)
  y: ( 1, 2, 0, 1, 2)
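A quick NumPy check of the two examples above:

import numpy as np

x1, y1 = np.array([-3, 6, 0, 3, -6]), np.array([1, -2, 0, -1, 2])
x2, y2 = np.array([3, 6, 0, 3, 6]), np.array([1, 2, 0, 1, 2])

print(np.corrcoef(x1, y1)[0, 1])  # -1.0: perfect negative correlation
print(np.corrcoef(x2, y2)[0, 1])  #  1.0: perfect positive correlation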
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an
overall similarity is needed.
Using Weights to Combine Similarities
• May not want to treat all attributes the same.
  – Use weights wk which are between 0 and 1 and sum to 1.
  – Overall similarity: similarity(p, q) = Σk wk sk(p, q), where sk is the
similarity on the kth attribute (see the sketch below).
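A minimal NumPy sketch of this weighted combination (the per-attribute similarities and weights are made-up values):

import numpy as np

s = np.array([0.8, 0.5, 1.0])  # per-attribute similarities s_k
w = np.array([0.5, 0.3, 0.2])  # weights in [0, 1] that sum to 1

overall = np.sum(w * s)        # 0.4 + 0.15 + 0.2 = 0.75
print(overall)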