Chapter 2.1 2.2

The presentation covers various types of data and attributes, including nominal, ordinal, interval, and ratio attributes, along with their properties and operations. It discusses different datasets such as record data, graph data, and ordered data, and highlights the importance of data quality, addressing issues like measurement errors, missing values, and duplicates. Additionally, it emphasizes the significance of timeliness, relevance, and documentation in ensuring high-quality data for analysis.

Uploaded by

kun85060pal

Data Mining

Presentation
Presented by Group No: 15
Aditya Sinha(2002004)
Mehak Dixit(2002038)
Prashant Yadav(2002047)
Types of Data
• A dataset is a collection of data objects (also called
records, events, cases, samples, observations, or entities).
Attributes are the basic characteristics of an object
(also known as fields, features, or dimensions). For
example, in a dataset of students, each row
corresponds to a student and each column
describes some aspect of a student, such as
student ID, name, address, or CGPA.
Properties of Numeric Attributes

• Distinctness
• Order
• Addition / Subtraction
• Multiplication / Division
On the basis of these properties,
we have four types of attributes:
• Nominal
• Ordinal
• Interval
• Ratio
Nominal Attribute
The values of a nominal attribute are just different
names; i.e., nominal values provide only enough
information to distinguish one object from another. (=, ≠)
Eg.. zip codes, employee ID numbers, eye color, gender
Operations : mode, entropy, contingency correlation, χ² test
Ordinal Attribute
The values of an ordinal attribute provide enough
information to order objects. (<, >)
Eg.. hardness of minerals, {good, better, best},
grades, street numbers
Operations : median, percentiles, rank correlation,
run tests, sign tests
Interval Attribute
For interval attributes, the differences between
values are meaningful, i.e., a unit of measurement
exists. (+ , - )
Eg.. calendar dates, temperature in Celsius or
Fahrenheit
Operations : mean, standard deviation, Pearson’s
correlation, t and F tests
Ratio Attribute
For ratio attributes, both differences and ratios are
meaningful. (*, /)
Eg.. temperature in Kelvin, monetary quantities,
counts, age, mass, length, electrical current
Operations : geometric mean, harmonic mean,
percent variation
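The four attribute types (nominal, ordinal, interval, ratio) can be illustrated with a minimal Python sketch that uses only the standard library. All sample values and variable names are assumptions invented for illustration; the point is only which summary statistic is meaningful for which type.

```python
from statistics import mode, median, mean, geometric_mean

# Hypothetical sample values, invented for illustration only.
eye_color = ["brown", "blue", "brown", "green"]   # nominal
grade_rank = [1, 3, 2, 3, 2]                      # ordinal (1 = good, 2 = better, 3 = best)
temp_celsius = [10.0, 20.0, 30.0]                 # interval
mass_kg = [2.0, 8.0]                              # ratio

most_common_color = mode(eye_color)      # nominal: only equality (=, ≠) is meaningful
middle_grade = median(grade_rank)        # ordinal: order (<, >) is meaningful
average_temp = mean(temp_celsius)        # interval: differences (+, -) are meaningful
typical_mass = geometric_mean(mass_kg)   # ratio: ratios (*, /) are meaningful

# Ratios are not meaningful on an interval scale: 20 °C is not "twice as hot"
# as 10 °C, because the ratio changes when the same temperatures are
# expressed in Kelvin, which is a ratio scale.
ratio_celsius = 20.0 / 10.0
ratio_kelvin = (20.0 + 273.15) / (10.0 + 273.15)
```

The Celsius ratio is 2, while the Kelvin ratio for the same two temperatures is only slightly above 1, which is why temperature in Celsius is treated as an interval attribute rather than a ratio attribute.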
Describing attributes on the basis of
number of values
• Discrete: It has a finite or countably infinite set of values, e.g., categorical
attributes such as zip codes or ID numbers, and counts. Binary attributes are a special
case of discrete attributes that take only two values, e.g., TRUE or FALSE.
• Continuous: Its values are real numbers and are typically
represented by floating-point variables.
• Eg.. temperature, height, weight, etc.
General Characteristics of Datasets
1. Dimensionality: the number of attributes that the objects in a dataset possess. Data
with a small number of dimensions tends to be qualitatively different from moderate- or
high-dimensional data. The difficulties associated with analyzing high-dimensional
data are sometimes referred to as the curse of dimensionality.
2. Sparsity: In some data sets, such as those with asymmetric features, most attribute values of an
object are 0. In practical terms, sparsity is an advantage because only the non-zero
values need to be stored and manipulated, which saves computation time and storage.
3. Resolution: The properties of data are different at different resolutions. E.g., the
surface of the Earth seems very uneven at a resolution of a few meters but is
relatively smooth at a resolution of a few kilometers. If the resolution is too fine, a
pattern may not be visible or may be buried in noise; if the resolution is too
coarse, the pattern may disappear.
Types of Datasets

1. Record Data
   i. Transaction / Market Basket Data
   ii. Data Matrix
   iii. Sparse Data Matrix
2. Graph Data
3. Ordered Data
   i. Sequential Data
   ii. Sequence Data
   iii. Time Series Data
   iv. Spatial Data
Record Data

A record data set is a collection of
records (data objects), each of which consists
of a fixed set of data fields (attributes).
Record data is usually stored either in flat
files or in relational databases.
1.Transaction or Market
Basket Data
Transaction data is a special type of record data,
where each record (transaction) involves a set of
items. Consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitutes a transaction, while the individual products
that were purchased are the items. This type of data is
called market basket data because the items in each
record are the products in a person’s “market basket.”
2. The Data Matrix

If all data objects have the same fixed set of numeric
attributes, the set of objects can be interpreted as an
m by n matrix, where there are m rows, one for
each object, and n columns, one for each attribute.
(A representation that has data objects as columns
and attributes as rows is also fine.) This matrix is
called a data matrix or a pattern matrix.
3. The Sparse Data Matrix

A sparse data matrix is a special case of a data
matrix in which the attributes are of the same type
and are asymmetric; i.e., only non-zero values are
important.
Transaction data is an example of a sparse data
matrix that has only 0–1 entries. Only the non-zero
entries of sparse data matrices are stored.
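The relationship between transaction data, a 0–1 data matrix, and its sparse form can be sketched in a few lines of Python. The transactions and item names below are assumptions invented for illustration, and the set-of-positions storage is just one simple way to keep only the non-zero entries.

```python
# Hypothetical grocery transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk"},
]

# Fix a column order: one column per distinct item.
items = sorted(set().union(*transactions))

# Dense 0-1 data matrix: m rows (transactions) by n columns (items).
dense = [[1 if item in t else 0 for item in items] for t in transactions]

# Sparse form: keep only the (row, column) positions of the non-zero entries.
sparse = {(r, c) for r, row in enumerate(dense) for c, v in enumerate(row) if v}
```

Here the dense matrix has 12 cells but only 6 of them are non-zero, so the sparse form stores half as many entries; with realistic numbers of items the saving is usually far larger.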
Graph-Based Data

A graph can sometimes be a convenient and
powerful representation for data. We
consider two specific cases: (1) the graph
captures relationships among data objects
and (2) the data objects themselves are
represented as graphs.
1. Data with Relationships
among Objects

The relationships among objects frequently
convey important information. In such cases,
the data is often represented as a graph. In
particular, the data objects are mapped to
nodes of the graph, while the relationships
among objects are captured by the links
between objects and link properties, such as
direction and weight. Consider Web pages
on the World Wide Web, which contain both
text and links to other pages.
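As a minimal sketch of the Web page example, the pages can be stored as nodes and the hyperlinks as directed links in an adjacency list. The page names and links below are assumptions invented for illustration.

```python
# Hypothetical pages and hyperlinks, invented for illustration.
links = [
    ("home", "about"),
    ("home", "blog"),
    ("blog", "home"),
]

# Adjacency list: each page (node) maps to the pages it links to.
graph = {}
for source, target in links:
    graph.setdefault(source, []).append(target)
    graph.setdefault(target, [])  # make sure every page appears as a node

# In-degree: how many pages link to each page (uses the link direction).
in_degree = {page: 0 for page in graph}
for targets in graph.values():
    for target in targets:
        in_degree[target] += 1
```

A link property such as weight could be added by storing (target, weight) pairs instead of bare targets.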
2. Data with Objects That
Are Graphs

If objects have structure, that is, the
objects contain subobjects that have
relationships, then such objects are
frequently represented as graphs.
For example, the structure of
chemical compounds can be
represented by a graph, where the
nodes are atoms and the links
between nodes are chemical bonds.
Ordered Data

For some types of data, the attributes have
relationships that involve order in time or
space. Such data often arises when measurements or events are
recorded over time, but the order can also be positional or spatial.
1. Sequential Data

Sequential data, also referred to as temporal data,
can be thought of as an extension of record data,
where each record has a time associated with it.
An example is a retail transaction data set that also
stores the time at which each transaction took place.
2. Sequence Data

Sequence data consists of a data set that is a
sequence of individual entities, such as a sequence
of words or letters. It is quite similar to sequential
data, except that there are no time stamps; instead,
there are positions in an ordered sequence.
3. Time Series Data

Time series data is a special type of sequential data
in which each record is a time series, i.e., a series
of measurements taken over time.
For example, a financial data set might contain
objects that are time series of the daily prices of
various stocks.
4. Spatial Data

Some objects have spatial attributes, such as
positions or areas, as well as other types of
attributes.
An example of spatial data is weather data
(precipitation, temperature, pressure) that is
collected for a variety of geographical locations.
Handling Non Record Data

Record-oriented techniques can be applied to non-record
data by extracting features from the data
objects and using these features to create a record
corresponding to each object. For example, given a set of
common substructures, each chemical compound can be
represented as a record with binary attributes that
indicate whether the compound contains a specific
substructure.
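The compound example can be sketched as follows. The substructure names and the two compounds are hypothetical placeholders, and each compound is assumed to be already described by the set of substructures it contains (finding those substructures in a real molecular graph is a separate problem not shown here).

```python
# Hypothetical substructures and compounds, invented for illustration.
substructures = ["benzene_ring", "hydroxyl", "carboxyl"]
compounds = {
    "compound_a": {"benzene_ring", "hydroxyl"},
    "compound_b": {"carboxyl"},
}

# One record per compound; one binary attribute per substructure.
records = {
    name: [1 if s in present else 0 for s in substructures]
    for name, present in compounds.items()
}
```

The resulting records form an ordinary 0–1 data matrix, so any record-oriented technique can be applied to them.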
DATA QUALITY
• Data quality refers to the overall utility of a dataset as a function of its ability to be easily processed and analyzed for other uses.
• Data mining focuses on:
• (1) the detection and correction of data quality problems and
• (2) the use of algorithms that can tolerate poor data quality.

• In these slides, the focus is on measurement and data collection issues and some application-related issues.
1. Measurement and Data
Collection Issues
The data is never perfect. There may be problems due to:-
a) Human error.
b) Limitations of measuring devices.
c) Flaws in data collection process.
d) Values or data objects may be missing.
e) Spurious or duplicate objects.
Measurement and Data Collection Errors

Measurement errors
A measurement error is any problem resulting from the measurement process. A common
problem is that the value recorded differs from the true value to some extent.
Note: the numerical difference between the measured value and the true value is called
the error.

Data collection errors
These are errors such as omitting data objects or attribute values, or
inappropriately including a data object.
Noise and Artifacts
Noise is the random component of a measurement error. It may involve the
distortion of a value or the addition of spurious objects. The term is often used in connection
with data that has a spatial or temporal component.
NOTE: the elimination of noise is frequently difficult, and much work in data
mining focuses on devising robust algorithms that produce acceptable
results even when noise is present.

Data errors may also be the result of a more deterministic phenomenon. Such
deterministic distortions of the data are often referred to as artifacts.
Precision, Bias, and Accuracy
Precision: The closeness of repeated measurements (of the same quantity) to
one another.
Bias: A systematic variation of measurements from the quantity being
measured.
Precision is often measured by the standard deviation of a set of values, while
bias is measured by taking the difference between the mean of the set of values
and the known value of the quantity being measured. Bias can only be
determined for objects whose measured quantity is known by means external to
the current situation.
Accuracy: The closeness of measurements to the true
value of the quantity being measured.
Accuracy depends on precision and bias, but since it is a general concept, there
is no specific formula for accuracy. NOTE: one important aspect of accuracy is
the use of significant digits.
Issues such as significant digits, precision, bias, and accuracy are sometimes
overlooked, but without some understanding of these aspects of the data, an analyst
runs the risk of committing serious data analysis blunders.
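Precision and bias, as defined on this slide, can be computed directly. The sketch below assumes a reference mass whose true value of 1.000 g is known from an external source and five illustrative repeated weighings; the numbers are assumptions chosen for the example.

```python
from statistics import mean, stdev

# Illustrative repeated weighings of a reference mass assumed to be 1.000 g.
true_value = 1.000
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

precision = stdev(measurements)          # spread of the repeated measurements
bias = mean(measurements) - true_value   # systematic offset from the known value
```

For these values the mean is 1.001, so the bias is about 0.001 g, while the sample standard deviation (the precision) is about 0.013 g.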
Outliers

Outliers are either
•data objects that have characteristics that are different from most of the other
data objects in the data set,
or
•values of an attribute that are unusual with respect to the typical values for
that attribute.

Outliers can be legitimate data objects or values, and they may
sometimes be of interest.
Missing Values
It is not unusual for an object to be missing one or more attribute values.
Some reasons for missing values:
a)The information was not collected.
b)Some attributes are not applicable to all objects.

There are several strategies for dealing with missing data, each of which is
appropriate in certain circumstances.
a) Eliminate data objects or attributes.
b) Estimate missing values.
c) Ignore the missing value during analysis.
Eliminate Data Objects or Attributes
Advantages
•A simple and effective strategy is to eliminate objects with missing values.
•A related strategy is to eliminate attributes that have missing values.
Disadvantages
•If many objects have missing values, then a reliable analysis can be difficult or
impossible.
•Sometimes the eliminated attributes may be the ones that are critical to the
analysis.

Estimate Missing Values

Sometimes missing data can be reliably estimated.
Eg:- consider a time series that changes in a reasonably smooth fashion, but
has a few, widely scattered missing values. In such cases, the missing values
can be estimated (interpolated) by using the remaining values.
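A minimal sketch of this idea is linear interpolation between the nearest known neighbours. The function name `interpolate_missing`, the use of `None` to mark a missing value, and the sample series are all assumptions made for illustration; other estimation methods are equally possible.

```python
def interpolate_missing(series):
    """Fill None entries by linear interpolation between known neighbours."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # leading or trailing gaps are left as None

# Hypothetical smooth series with a few scattered gaps.
series = [10.0, None, 14.0, 15.0, None, None, 18.0]
filled = interpolate_missing(series)
```

This only makes sense when the series really does change smoothly; for noisy or abruptly changing data the interpolated values can be misleading.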
Ignore the Missing Value during Analysis

Many data mining approaches can be modified to ignore missing values.

Eg:- Suppose that objects are being clustered and the similarity between pairs
of data objects needs to be calculated. If one or both objects of a pair have
missing values for some attributes, then the similarity can be calculated by
using only the attributes that do not have missing values.
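The idea can be sketched with a Euclidean distance (a dissimilarity, used here in place of a similarity) that skips any attribute missing in either object. The function name, the `None` convention for missing values, and the sample objects are assumptions for illustration.

```python
import math

def distance_ignoring_missing(a, b):
    """Euclidean distance over attributes present (not None) in both objects."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no shared attributes, so no distance can be computed
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

# Hypothetical objects; None marks a missing attribute value.
obj1 = [1.0, None, 4.0, 2.0]
obj2 = [4.0, 7.0, None, 6.0]
d = distance_ignoring_missing(obj1, obj2)
```

Only the first and last attributes are present in both objects, so the distance is computed from those two alone. Distances computed over different numbers of attributes are not directly comparable, so some rescaling may be needed in practice.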
Inconsistent values
Data can contain inconsistent values.
Eg:- Consider an address field, where both a zip code and city are listed, but the
specified zip code area is not contained in that city. Regardless of the cause of the
inconsistent values, it is important to detect and, if possible, correct such problems.

Some types of inconsistencies are easy to detect. For instance, a person’s height
should not be negative. In other cases, it can be necessary to consult an external
source of information.

Once an inconsistency has been detected, it is sometimes possible to correct the
data. The correction of an inconsistency requires additional or redundant
information.
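Both kinds of check mentioned above, a simple range check and a check against an external source, can be sketched as follows. The lookup table `ZIP_TO_CITY`, the field names, and the sample record are assumptions for illustration; a real application would consult a complete postal reference.

```python
# Hypothetical external reference table mapping zip codes to cities.
ZIP_TO_CITY = {"10001": "New York", "60601": "Chicago"}

def find_inconsistencies(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if record["height_cm"] < 0:  # easy to detect: no external source needed
        problems.append("negative height")
    expected_city = ZIP_TO_CITY.get(record["zip"])
    if expected_city is not None and expected_city != record["city"]:
        problems.append("zip code does not match city")
    return problems

# Hypothetical record containing both kinds of inconsistency.
record = {"height_cm": -170, "zip": "10001", "city": "Chicago"}
problems = find_inconsistencies(record)
```

Detection is the easy part; deciding whether the zip code or the city is the wrong value still requires additional information.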
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one
another.
To detect and eliminate duplicates, two main issues must be addressed:
1. If two objects actually represent a single object, then the values of
corresponding attributes may differ, and these inconsistent values must be resolved.
2. Care needs to be taken to avoid accidentally combining data objects that are
similar, but not duplicates.

NOTE: The term deduplication is often used to refer to the process of dealing with
these issues.
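A very simple deduplication pass can be sketched by grouping records on a normalized key. The choice of key (name plus e-mail address), the field names, and the sample records are assumptions for illustration; the choice of key is exactly where the two issues above show up.

```python
def normalize(record):
    """Key used to decide whether two records describe the same person."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

# Hypothetical customer records, invented for illustration.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "city": "Pune"},
    {"name": " ann lee", "email": "ANN@example.com", "city": "Mumbai"},
    {"name": "Ann Lee", "email": "ann.lee@example.com", "city": "Pune"},
]

groups = {}
for r in records:
    groups.setdefault(normalize(r), []).append(r)

# Groups with more than one record are duplicate candidates whose
# differing attribute values (here, city) still need to be resolved.
duplicates = [g for g in groups.values() if len(g) > 1]
```

The first two records are grouped together even though their cities differ, while the third, which is similar but has a different e-mail address, is kept separate.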
2. Issues Related to
Applications
A few issues related to applications are:
a)Timeliness
b)Relevance
c)Knowledge about the data

NOTE: “data is of high quality if it is suitable for its intended use”.

Timeliness
•Some data starts to age as soon as it has been collected.
•If the data provides a snapshot of some ongoing phenomenon or process, then
this snapshot represents reality for only a limited time.
•If the data is out of date, then so are the models and patterns that are based
on it.
Relevance
The available data must contain the information necessary for the application.
Eg:- Consider the task of building a model that predicts the accident rate for
drivers. If information about the age and gender of the driver is omitted, then
it is likely that the model will have limited accuracy unless this information is
indirectly available through other attributes.
A common problem is sampling bias, which occurs when a sample does not
contain different types of objects in proportion to their actual occurrence in the
population. Sampling bias will result in an erroneous analysis because the
results of a data analysis can reflect only the data that is present.
Knowledge about the data

Ideally, data sets are accompanied by documentation that describes different aspects
of the data. The quality of this documentation can either aid or hinder the
subsequent analysis.
If the documentation is poor and fails to provide the required information, then
our analysis of the data may be faulty.
Other important characteristics are the precision of the data, the type of
features (nominal, ordinal, interval, ratio), the scale of measurement (e.g.,
meters or feet for length), and the origin of the data.
