Chapter 2.1 2.2

The presentation covers various types of data and attributes, including nominal, ordinal, interval, and ratio attributes, along with their properties and operations. It discusses different datasets such as record data, graph data, and ordered data, and highlights the importance of data quality, addressing issues like measurement errors, missing values, and duplicates. Additionally, it emphasizes the significance of timeliness, relevance, and documentation in ensuring high-quality data for analysis.

Uploaded by

kun85060pal
Copyright
© © All Rights Reserved

Data Mining

Presentation
Presented by Group No: 15
Aditya Sinha(2002004)
Mehak Dixit(2002038)
Prashant Yadav(2002047)
Types of Data
• A dataset is a collection of data objects (also called
records, events, cases, samples, observations, or entities).
Attributes are basic characteristics of an object
(also known as fields, features, or dimensions). For
example, in a dataset of students, each row
corresponds to a student and each column
describes some aspect of a student, such as
student ID, name, address, or CGPA.
Properties of Numeric Attributes

• Distinctness
• Order
• Addition / Subtraction
• Multiplication / Division
On the basis of these properties
we have 4 types of attributes
• Nominal
• Ordinal
• Interval
• Ratio
Nominal Attribute
The values of a nominal attribute are just different
names; i.e., nominal values provide only enough
information to distinguish one object from another. (=,
≠)
E.g., zip codes, employee ID numbers, eye color, gender
Operations: mode, entropy, contingency correlation, χ²
test
Ordinal Attribute
The values of an ordinal attribute provide enough
information to order objects. (<, >)
E.g., hardness of minerals, {good, better, best},
grades, street numbers
Operations: median, percentiles, rank correlation,
run tests, sign tests
Interval Attribute
For interval attributes, the differences between
values are meaningful, i.e., a unit of measurement
exists. (+, -)
E.g., calendar dates, temperature in Celsius or
Fahrenheit
Operations: mean, standard deviation, Pearson’s
correlation, t and F tests
Ratio Attribute
For ratio attributes, both differences and ratios are
meaningful. (*, /)
E.g., temperature in Kelvin, monetary quantities,
counts, age, mass,
length, electrical current
Operations: geometric mean, harmonic mean,
percent variation
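The operations listed for each attribute type can be tried on small examples. The following is only a sketch: the sample values and variable names are made up for illustration, and each statistic is applied to the attribute type whose properties permit it.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]        # nominal
grades = ["good", "better", "best", "good", "better"]  # ordinal
temps_celsius = [20.0, 25.0, 30.0]                     # interval
masses_kg = [2.0, 8.0]                                 # ratio

# Nominal: only equality is meaningful, so the mode is a valid summary.
most_common_color = statistics.mode(eye_color)

# Ordinal: order is meaningful, so map labels to ranks and take the median.
rank = {"good": 0, "better": 1, "best": 2}
labels = sorted(rank, key=rank.get)
median_grade = labels[statistics.median_low(rank[g] for g in grades)]

# Interval: differences are meaningful, so the mean applies.
mean_temp = statistics.mean(temps_celsius)

# Ratios of interval values depend on the arbitrary zero point:
# the same two temperatures give different ratios in Celsius and Kelvin.
ratio_celsius = 30.0 / 20.0
ratio_kelvin = (30.0 + 273.15) / (20.0 + 273.15)

# Ratio: ratios are meaningful, so the geometric mean applies.
geo_mean_mass = statistics.geometric_mean(masses_kg)
```

Only the Kelvin ratio has a physical interpretation, which is why Kelvin temperature is listed as a ratio attribute and Celsius as an interval attribute.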
Describing attributes on the basis of
number of values
• Discrete: It has a finite or countably infinite set of values, e.g., categorical
attributes such as zip codes or ID numbers, or counts. Binary attributes are a special
case of discrete attributes that take only 2 values, e.g., TRUE or FALSE.
• Continuous: Its values are real numbers and are typically
represented by floating-point variables.
• E.g., temperature, height, weight
General Characteristics of Datasets
1. Dimensionality: the number of attributes that the objects in a dataset possess. Data
with a small number of dimensions tends to be qualitatively different from moderate- or
high-dimensional data. The difficulties associated with analyzing high-dimensional
data are sometimes referred to as the curse of dimensionality.
2. Sparsity: In some datasets, such as those with asymmetric features, most attribute values of an
object are 0. In practice this saves computation time and storage, because
only the non-zero values need to be stored and manipulated.
3. Resolution: The properties of data are different at different resolutions. E.g., the
surface of the Earth seems very uneven at a resolution of a few meters but is
relatively smooth at a resolution of a few kilometers. If the resolution is too fine, a
pattern may not be visible or may be buried in noise; if the resolution is too
coarse, the pattern may disappear.
Types of Datasets

1. Record Data
i. Transaction / Market Basket Data
ii. Data Matrix
iii. Sparse Data Matrix
2. Graph Data
3. Ordered Data
i. Sequential Data
ii. Sequence Data
iii. Time Series Data
iv. Spatial Data
Record Data

A record data set is a collection of
records (data objects), each of which consists
of a fixed set of data fields (attributes).
Record data is usually stored either in flat
files or in relational databases.
1. Transaction or Market Basket Data
Transaction data is a special type of record data,
where each record (transaction) involves a set of
items. Consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitutes a transaction, while the individual products
that were purchased are the items. This type of data is
called market basket data because the items in each
record are the products in a person’s “market basket.”
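A minimal sketch of how such transactions might be held in memory, assuming each transaction is simply a set of item names. The items below are made up for illustration.

```python
from collections import Counter

# Hypothetical shopping trips; each transaction is the set of items bought.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# Count in how many transactions each item appears.
item_counts = Counter(item for t in transactions for item in t)
```

Note that the records have different lengths, which is what distinguishes transaction data from records with a fixed set of fields.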
2. The Data Matrix

If all the data objects in a collection have the same fixed set of
numeric attributes, the set of data objects can be interpreted as an
m by n matrix, where there are m rows, one for
each object, and n columns, one for each attribute.
(A representation that has data objects as columns
and attributes as rows is also fine.) This matrix is
called a data matrix or a pattern matrix.
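As an illustration, a small data matrix can be written as a list of rows. The attribute names and values here are hypothetical, and a plain nested list stands in for a real matrix type.

```python
# Hypothetical data: m = 3 objects (rows), n = 2 numeric attributes (columns).
attribute_names = ["height_cm", "weight_kg"]
data_matrix = [
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
]
m, n = len(data_matrix), len(data_matrix[0])

# The alternative layout: attributes as rows, objects as columns.
transposed = [list(column) for column in zip(*data_matrix)]

# A typical matrix-style operation: the mean of each attribute.
column_means = [sum(column) / m for column in transposed]
```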
3. The Sparse Data Matrix

A sparse data matrix is a special case of a data
matrix in which the attributes are of the same type
and are asymmetric; i.e., only non-zero values are
important.
Transaction data is an example of a sparse data
matrix that has only 0–1 entries. Only the non-zero
entries of sparse data matrices are stored.
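The storage saving can be sketched as follows. The transactions are hypothetical, and storing the coordinates of the non-zero cells in a set is just one simple sparse layout, chosen here for brevity.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [{"bread", "milk"}, {"bread", "beer"}, {"milk"}]

# One column per distinct item, in a fixed (sorted) order.
items = sorted(set().union(*transactions))

# Dense 0-1 matrix: one row per transaction, one column per item.
dense = [[1 if item in t else 0 for item in items] for t in transactions]

# Sparse form: keep only the (row, column) positions of the non-zero entries.
sparse = {
    (r, c)
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

stored_fraction = len(sparse) / (len(dense) * len(items))
```

With realistic data the matrix has thousands of item columns and each transaction touches only a handful of them, so the stored fraction is far smaller than in this toy case.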
Graph-Based Data

A graph can sometimes be a convenient and
powerful representation for data. We
consider two specific cases: (1) the graph
captures relationships among data objects
and (2) the data objects themselves are
represented as graphs.
1. Data with Relationships among Objects

The relationships among objects frequently
convey important information. In such cases,
the data is often represented as a graph. In
particular, the data objects are mapped to
nodes of the graph, while the relationships
among objects are captured by the links
between objects and link properties, such as
direction and weight. Consider Web pages
on the World Wide Web, which contain both
text and links to other pages.
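The Web page example can be sketched as a directed graph stored as an adjacency list. The page names are hypothetical; the in-degree computed at the end is just one simple link property.

```python
# Hypothetical pages; links[p] lists the pages that p links to (directed edges).
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}

# Count incoming links per page by walking every edge once.
in_degree = {page: 0 for page in links}
for source, targets in links.items():
    for target in targets:
        in_degree[target] += 1
```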
2. Data with Objects That Are Graphs

If objects have structure, that is, the
objects contain subobjects that have
relationships, then such objects are
frequently represented as graphs.
For example, the structure of
chemical compounds can be
represented by a graph, where the
nodes are atoms and the links
between nodes are chemical bonds.
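For instance, a water molecule can be written as a tiny graph. This is a sketch only: the node numbering is arbitrary and bond properties such as bond type are left out.

```python
# Nodes are atoms (index -> element symbol); edges are chemical bonds.
atoms = {0: "O", 1: "H", 2: "H"}
bonds = [(0, 1), (0, 2)]

# Build an undirected adjacency list from the bond list.
neighbors = {index: [] for index in atoms}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)

# Elements bonded to the oxygen atom.
bonded_to_oxygen = [atoms[i] for i in neighbors[0]]
```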
Ordered Data

For some types of data, the attributes have
relationships that involve order in time or
space. Data collected over time is a common
example of ordered data.
1. Sequential Data

Sequential data, also referred to as temporal data,
can be thought of as an extension of record data,
where each record has a time associated with it.
For example, a retail transaction data set may also
store the time at which each transaction took place.
2. Sequence Data

Sequence data consists of a data set that is a
sequence of individual entities, such as a sequence
of words or letters. It is quite similar to sequential
data, except that there are no time stamps; instead,
there are positions in an ordered sequence.
3. Time Series Data

Time series data is a special type of sequential data
in which each record is a time series, i.e., a series
of measurements taken over time.
For example, a financial data set might contain
objects that are time series of the daily prices of
various stocks.
4. Spatial Data

Some objects have spatial attributes, such as
positions or areas, as well as other types of
attributes.
An example of spatial data is weather data
(precipitation, temperature, pressure) that is
collected for a variety of geographical locations.
Handling Non-Record Data

Record-oriented techniques can be applied to non-record
data by extracting features from the data
objects and using these features to create a record
corresponding to each object. E.g., for chemical structure data, given a set of
common substructures, each compound can be
represented as a record with binary attributes that
indicate whether the compound contains a specific
substructure.
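The feature extraction described above can be sketched like this. The compound and substructure names are placeholders, and the step that actually finds substructures inside a molecular graph is assumed to have been done already.

```python
# Hypothetical set of common substructures (the record's attribute names).
substructures = ["S1", "S2", "S3"]

# Hypothetical compounds, each given as the set of substructures it contains.
compounds = {
    "compound_A": {"S1", "S2"},
    "compound_B": {"S3"},
    "compound_C": {"S1", "S3"},
}

# One record per compound: a binary attribute for each substructure.
records = {
    name: [1 if s in present else 0 for s in substructures]
    for name, present in compounds.items()
}
```

Each compound is now a fixed-length record, so any technique that works on a data matrix can be applied to it.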
DATA QUALITY
• Data quality refers to the overall utility of a dataset as a function of its ability to be easily processed and analyzed for other uses.
• Data mining focuses on:
• (1) the detection and correction of data quality problems and
• (2) the use of algorithms that can tolerate poor data quality.

• These slides focus on measurement and data collection issues and on some application-related issues.
1. Measurement and Data Collection Issues
Data is never perfect. Problems may arise from:
a) Human error.
b) Limitations of measuring devices.
c) Flaws in the data collection process.
d) Missing values or data objects.
e) Spurious or duplicate objects.
Measurement and Data Collection Errors

Measurement errors
This refers to any problem resulting from the measurement process. A common
problem is that the value recorded differs from the true value to some extent.
Note: the numerical difference between the measured value and the true value is called
the error.

Data collection errors
This refers to errors such as omitting data objects or attribute values, or
inappropriately including a data object.
Noise and Artifacts
Noise is the random component of a measurement error. It may involve the
distortion of a value or the addition of spurious objects. The term is often used in connection
with data that has a spatial or temporal component.
NOTE: the elimination of noise is frequently difficult, and much work in data
mining focuses on devising robust algorithms that produce acceptable
results even when noise is present.

Data errors may also be the result of a more deterministic phenomenon. Such
deterministic distortions of the data are often referred to as artifacts.
Precision, Bias, and Accuracy
Precision: The closeness of repeated measurements (of the same quantity) to
one another.
Bias: A systematic variation of measurements from the quantity being
measured.
Precision is often measured by the standard deviation of a set of values, while
bias is measured by taking the difference between the mean of the set of values
and the known value of the quantity being measured. Bias can only be
determined for objects whose measured quantity is known by means external to
the current situation.
Accuracy: The closeness of measurements to the true
value of the quantity being measured.
Accuracy depends on precision and bias, but since it is a general concept, there
is no specific formula for accuracy. NOTE: one important aspect of accuracy is
the use of significant digits.
Issues such as significant digits, precision, bias, and accuracy are sometimes
overlooked. Without some understanding of these aspects of the data, an analyst
risks making serious data analysis errors.
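Precision and bias, as defined above, can be computed for a set of repeated measurements. The five readings and the 1 g reference mass below are illustrative numbers, not data from the slides.

```python
import statistics

# Illustrative repeated weighings (in grams) of a reference mass known to be 1 g.
true_value = 1.000
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

# Bias: difference between the mean of the measurements and the known value.
bias = statistics.mean(measurements) - true_value

# Precision: sample standard deviation of the repeated measurements.
precision = statistics.stdev(measurements)
```

Here the bias is about 0.001 g and the precision about 0.013 g. The known reference value is what makes the bias computable, matching the remark that bias can only be determined when the true quantity is known by external means.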
Outliers

Outliers are either
•data objects that have characteristics that are different from most of the other
data objects in the data set.
Or
•values of an attribute that are unusual with respect to the typical values for
that attribute.

Unlike noise, outliers can be legitimate data objects and values, and they may
sometimes be of interest.
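The slides do not prescribe a detection method. As one possible illustration, the sketch below flags attribute values that fall outside the interquartile-range fences, a common heuristic. The data and the conventional 1.5 multiplier are assumptions.

```python
import statistics

# Hypothetical attribute values with one unusual reading.
values = [10, 12, 11, 13, 12, 11, 95]

# Quartiles of the data; the interquartile range measures the typical spread.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3 (an assumed threshold).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
```

A flagged value is only a candidate. Whether it is an error or a legitimate, interesting object has to be decided from the application.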
Missing Values
It is not unusual for an object to be missing one or more attribute values.
Some reasons for missing values:
a) The information was not collected.
b) Some attributes are not applicable to all objects.

There are several strategies for dealing with missing data, each of which is
appropriate in certain circumstances.
a) Eliminate data objects or attributes.
b) Estimate missing values.
c) Ignore the missing value during analysis.
Eliminate Data Objects or Attributes
Advantages
•A simple and effective strategy is to eliminate objects with missing values.
•A related strategy is to eliminate attributes that have missing values.
Disadvantages
•If many objects have missing values, then a reliable analysis can be difficult or
impossible.
•Sometimes the eliminated attributes may be the ones that are critical to the
analysis.
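Both elimination strategies and their drawback can be seen on a toy example. The records are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing attribute value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 61000},
]

# Strategy 1: eliminate objects (rows) that have any missing value.
complete_records = [r for r in records if all(v is not None for v in r.values())]

# Strategy 2: eliminate attributes (columns) that have any missing value.
complete_attributes = [
    name for name in records[0] if all(r[name] is not None for r in records)
]
```

Half of the objects are lost under the first strategy, and only the `id` attribute survives the second, which illustrates the disadvantages listed above.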

Estimate Missing Values


Sometimes missing data can be reliably estimated.
Eg:- consider a time series that changes in a reasonably smooth fashion, but
has a few, widely scattered missing values. In such cases, the missing values
can be estimated (interpolated) by using the remaining values.
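A minimal sketch of such an estimate, assuming evenly spaced observations and using linear interpolation between the nearest known neighbors. The function name `interpolate` and the sample series are hypothetical.

```python
def interpolate(series):
    """Fill None gaps by linear interpolation between the nearest known values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None or right is None:
            continue  # no neighbor on one side: leave the gap unfilled
        weight = (i - left) / (right - left)
        filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


# Hypothetical smooth series with a few scattered missing values.
series = [10.0, None, 14.0, 15.0, None, None, 21.0]
filled = interpolate(series)
```

Gaps at the very start or end of the series are left as they are, since there is nothing to interpolate between.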
Ignore the Missing Value during Analysis

Many data mining approaches can be modified to ignore missing values.


Eg:- Suppose that objects are being clustered and the similarity between pairs
of data objects needs to be calculated. If one or both objects of a pair have
missing values for some attributes, then the similarity can be calculated by
using only the attributes that do not have missing values.
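As a sketch of this idea, the function below computes a Euclidean distance over only those attributes that are present in both objects. The function name, the use of `None` for missing values, and the choice of Euclidean distance are assumptions.

```python
import math


def distance_ignoring_missing(a, b):
    """Euclidean distance over the attributes that are present in both objects."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no shared attributes, so no distance can be computed
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


# The second attribute is missing in the first object and is skipped.
d = distance_ignoring_missing([1.0, None, 3.0], [4.0, 2.0, 7.0])
```

If many attributes are missing, distances computed this way rest on few attributes and become less comparable across pairs.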
Inconsistent values
Data can contain inconsistent values.
E.g., consider an address field, where both a zip code and city are listed, but the
specified zip code area is not contained in that city. Regardless of the cause of the
inconsistent value, it is important to detect and, if possible, correct such problems.

Some types of inconsistencies are easy to detect. For instance, a person’s height
should not be negative. In other cases, it can be necessary to consult an external
source of information.

Once an inconsistency has been detected, it is sometimes possible to correct the
data. The correction of an inconsistency requires additional or redundant
information.
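Both kinds of check can be sketched in a few lines. The field names and the small `zip_to_city` table, which stands in for the external source of information, are assumptions made for illustration.

```python
# Stand-in for an external reference source (illustrative entries only).
zip_to_city = {"10001": "New York", "60601": "Chicago"}


def find_inconsistencies(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Easy check: the value violates an obvious constraint.
    if record["height_cm"] < 0:
        problems.append("negative height")
    # Harder check: needs the external zip code table.
    expected_city = zip_to_city.get(record["zip"])
    if expected_city is not None and expected_city != record["city"]:
        problems.append("zip code does not match city")
    return problems


good = {"height_cm": 172, "zip": "10001", "city": "New York"}
bad = {"height_cm": -172, "zip": "60601", "city": "New York"}
```

Detection is the easy part. Deciding whether the zip code or the city is the wrong value needs further information, as noted above.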
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one
another.
To detect and eliminate such duplicates, two main issues must be addressed:
1. If there are two objects that actually represent a single object, then the values of
corresponding attributes may differ, and these inconsistent values must be resolved.
2. Care needs to be taken to avoid accidentally combining data objects that are
similar, but not duplicates.

NOTE: The term deduplication is often used to refer to the process of dealing with
these issues.
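A simple deduplication pass might look like the sketch below. The records, the matching key (normalized name plus email), and the rule of keeping the first non-empty value are all assumptions. Real deduplication usually needs fuzzier matching.

```python
def dedup_key(record):
    """Normalized (name, email) pair used to decide that two records match."""
    return (record["name"].strip().lower(), record["email"].strip().lower())


# Hypothetical records: the first two describe the same person.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": ""},
    {"name": " ada lovelace ", "email": "ADA@example.com", "phone": "555-0100"},
    {"name": "Ada Lovelace", "email": "a.lovelace@example.org", "phone": "555-0199"},
]

merged = {}
for record in records:
    key = dedup_key(record)
    if key not in merged:
        merged[key] = dict(record)
        continue
    # Resolve differing values: keep the first non-empty value per field.
    for field, value in record.items():
        if not merged[key][field]:
            merged[key][field] = value

deduplicated = list(merged.values())
```

The third record shares a name with the others but has a different email, so it is kept separate. This reflects the second issue: similar objects must not be merged by accident.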
2. Issues Related to Applications
A few issues related to applications are:
a) Timeliness
b) Relevance
c) Knowledge about the data

NOTE: “data is of high quality if it is suitable for its intended use”.


Timeliness
•Some data starts to age as soon as it has been collected.
•If the data provides a snapshot of some ongoing phenomenon or process, then
this snapshot represents reality for only a limited time.
•If the data is out of date, then so are the models and patterns that are based
on it.
Relevance
The available data must contain the information necessary for the application.
E.g., consider the task of building a model that predicts the accident rate for
drivers. If information about the age and gender of the driver is omitted, then
it is likely that the model will have limited accuracy unless this information is
indirectly available through other attributes.
A common problem is sampling bias, which occurs when a sample does not
contain different types of objects in proportion to their actual occurrence in the
population. Sampling bias will result in an erroneous analysis because the
results of a data analysis can reflect only the data that is present.
Knowledge about the data

Ideally, data sets are accompanied by documentation that describes different aspects
of the data. The quality of this documentation can either aid or hinder the
subsequent analysis.
If the documentation is poor and fails to provide the required information, then
our analysis of the data may be faulty.
Other important characteristics are the precision of the data, the type of
features (nominal, ordinal, interval, ratio), the scale of measurement (e.g.,
meters or feet for length), and the origin of the data.
