DM Introduction

Uploaded by

S190579 YALLA VENKATA SURESH

DATA MINING

R. Siva Narayana

RGUKT Nuzvid
Data Mining
• Data mining is the process of discovering
interesting patterns and knowledge from large
amounts of data.
– The data sources can include databases, data
warehouses, the Web, other information
repositories, or data that are streamed into the
system dynamically.
Data Mining Evolution
KDD
What kinds of data can be mined?
• Database data
• Data warehouses
• Transactional data
• Other kinds of data
– Time-related or sequence data (e.g., stock exchange)
– Data streams (e.g., video surveillance and sensor data)
– Spatial data (e.g., maps)
– Hypertext and multimedia
Getting to Know Your Data
• Real-world data are typically noisy, enormous in volume, and
may originate from heterogeneous sources.
• Knowledge about your data is useful for Data Preprocessing.
– What are the types of attributes?
– What kind of values does each attribute have?
– Which attributes are discrete and which are continuous valued?
– What do the data look like? How are the values distributed?
– How can we visualize the data to get a better sense of it?
– Can we spot any outliers?
– Can we measure the similarity of some data objects with respect
to others?
Data Objects and Attribute Types
• Datasets are made up of data objects.
• Data objects are typically described by attributes.
• An attribute is a data field representing a characteristic or feature
of a data object.
• Observed values for a given attribute are known as
observations.
• A set of attributes used to describe a given object is called
an attribute vector.

• The type of an attribute is determined by the set of possible values it can take. We have
– Nominal
– Binary
– Ordinal
– Numeric
Nominal Attributes
• The values of a nominal attribute are symbols
or names of things
• Each value represents some kind of category,
code, or state, and so nominal attributes are
also referred to as categorical
Nominal Attributes
• It is possible to represent such symbols or “names”
with numbers.
– For example, we can assign a code of 0 for black, 1 for brown, and so
on.
– Even when all possible values are numeric, the
numbers are not intended to be used quantitatively.
– Mathematical operations on values of nominal
attributes are not meaningful.
– One statistic that is of interest is the attribute’s most
commonly occurring value, known as the mode.
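The coding and mode ideas above can be sketched in a few lines of Python. The hair-colour observations and the variable names are illustrative assumptions, not data from the slides.

```python
from collections import Counter

# Hypothetical hair-colour observations (illustrative data only).
hair_color = ["black", "brown", "black", "blond", "brown", "black"]

# Assign an integer code to each distinct category, in order of first appearance.
# The codes are labels only; arithmetic on them has no meaning.
codes = {}
for value in hair_color:
    codes.setdefault(value, len(codes))
encoded = [codes[value] for value in hair_color]

# The mode is the most frequently occurring value of the nominal attribute.
mode_value, mode_count = Counter(hair_color).most_common(1)[0]
```

Note that the mode is computed on the category labels themselves, which avoids treating the integer codes as quantities.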
Binary Attributes
• A binary attribute is a nominal attribute with only two
categories or states: 0 or 1
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.
– For example, 1 = smokes, 0 = does not smoke
– For example, 1 = positive (+ve) test result, 0 = negative (-ve) test result
• A binary attribute is symmetric if both of its states are
equally valuable and carry the same weight.
• A binary attribute is asymmetric if the outcomes of the
states are not equally important, such as the positive and
negative outcomes of a medical test for HIV/Corona
Ordinal Attributes
• An ordinal attribute is an attribute with possible
values that have a meaningful order or ranking
among them, but the magnitude between successive
values is not known.
– That is, the values have a meaningful sequence, but we
cannot tell from the values how much bigger one is than another.
• Ordinal attributes are often used in surveys for ratings.
Ordinal Attributes
• Ordinal attributes may also be obtained from the
discretization of numeric quantities by splitting the value
range into a finite number of ordered categories.
• The central tendency of an ordinal attribute can be
represented by its mode and its median (the middle
value in an ordered sequence), but the mean cannot be
defined.
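As a rough sketch of discretization and of the median of an ordinal attribute, the following assumes three hypothetical ordered categories and made-up cut points (30 and 60). None of these values come from the slides.

```python
import statistics

LEVELS = ["low", "medium", "high"]  # hypothetical ordered categories


def discretize(value, cut_points=(30, 60)):
    """Map a numeric value to an ordered category using assumed cut points."""
    for level, upper in zip(LEVELS, cut_points):
        if value < upper:
            return level
    return LEVELS[-1]


scores = [12, 45, 71, 58, 29, 90, 33]  # illustrative numeric observations

# Work with rank positions so that only the order of categories is used.
ranks = sorted(LEVELS.index(discretize(score)) for score in scores)

# median_low always returns an actual data value, so the result is a real
# category rather than an average of two categories.
median_level = LEVELS[statistics.median_low(ranks)]
```

Using `median_low` is a design choice for this sketch: averaging two category ranks would implicitly assume equal spacing between categories, which ordinal data do not guarantee.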
Note
• Nominal, binary, and ordinal attributes are qualitative,
i.e., they describe a feature of an object without giving
an actual size or quantity.
Numerical Attributes
• A numeric attribute is quantitative; that is, it is
a measurable quantity, represented in integer
or real values.
• Numeric attributes can be interval-scaled or
ratio-scaled.
Interval-Scaled Attributes
• Interval-scaled attributes are measured on a
scale of equal-size units.
• The values of interval-scaled attributes have
order and can be positive, 0, or negative
Basic Statistical Description of Data
• Used to identify properties of the data
• Identify which data values should be treated as
Noise or Outliers
• Measuring Central Tendency – Measures the
location of the middle or center of a data
distribution
• Measuring the Dispersion of Data – How data
are spread out? Used to identify outliers.
• Graphic Displays – Visually inspect our data.
Measuring Central Tendency
(Mean, Median, Mode and Midrange)
• Let X be an attribute (e.g., salary) that has been recorded for
a set of objects.
• Let x1, x2, x3, ..., xN be the set of N observations for X.
• Mean: the most common and effective numeric
measure of the center of the data.
• It corresponds to the built-in aggregate function average (avg() in
SQL) provided by relational database systems.
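The arithmetic mean is the sum of the N observations divided by N. A minimal sketch follows, using illustrative salary values (in thousands) that are assumed for this example only.

```python
import statistics

# Illustrative salary observations in thousands (assumed data).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Mean = (x1 + x2 + ... + xN) / N
mean_salary = sum(salaries) / len(salaries)

# The standard library gives the same result.
library_mean = statistics.mean(salaries)
```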
Weighted Arithmetic Mean or Weighted Average
• Sometimes, each value xi in a set may be
associated with a weight wi, for i = 1, …, N.
• The weights reflect the significance,
importance, or occurrence frequency attached
to their respective values.
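Assuming the usual definition, in which each value is multiplied by its weight and the total is divided by the sum of the weights, the weighted mean can be sketched as below. The function name is hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Example: the third value counts twice as much as the others.
result = weighted_mean([10, 20, 30], [1, 1, 2])
```

With all weights equal, this reduces to the ordinary arithmetic mean.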
Problems and Solution
• A problem with the mean is its sensitivity to
extreme values, either large or small (outliers).
– Ex: The mean score of a class in an exam is pulled down
by a few very low scores.
• To offset this effect, we use the trimmed mean.
• The trimmed mean is obtained after chopping off a small
percentage (e.g., 2%) of the values at the high and low extremes.
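A minimal sketch of a trimmed mean, assuming the same proportion of values is dropped from each end of the sorted data. The function name, the default of 2%, and the salary values are assumptions for illustration.

```python
def trimmed_mean(values, proportion=0.02):
    """Mean after dropping `proportion` of the values at each extreme."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Illustrative salary observations in thousands (assumed data).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain = trimmed_mean(salaries, 0.0)    # nothing trimmed
trimmed = trimmed_mean(salaries, 0.1)  # drops the lowest and highest value
```

On small data sets a 2% trim removes nothing (k rounds down to 0), so a larger proportion is used in the example to make the effect visible.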
Median
• For skewed (asymmetric) data, the median is a
better measure of the center of the data.
• Median is the middle value in a set of ordered
data values.
• It separates the data set into two halves.
• If N is odd, then the median is the middle value
of the ordered set
• If N is even, then the median is the average of
middle two values.
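The odd and even cases above can be sketched directly. The function name is hypothetical, and the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics


def median(values):
    """Middle value of the sorted data, or the average of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                         # N is odd
    return (ordered[middle - 1] + ordered[middle]) / 2  # N is even


odd_case = median([3, 1, 2])
even_case = median([4, 1, 3, 2])
```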
Cont.,
• If data are grouped in intervals and the
frequency of each interval is known, then

\[ \text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

– $L_1$ is the lower boundary of the median interval
– $N$ is the number of values in the entire dataset
– $\left(\sum \text{freq}\right)_l$ is the sum of frequencies of all the intervals
that are lower than the median interval
– $\text{freq}_{\text{median}}$ is the frequency of the median interval
– width is the width of the median interval
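The grouped-data formula can be sketched as a small function. The input layout (a list of `(lower, upper, frequency)` tuples sorted by lower boundary), the function name, and the example intervals are assumptions for illustration.

```python
def grouped_median(intervals):
    """Approximate the median from (lower, upper, freq) intervals.

    Implements L1 + ((N/2 - sum of lower frequencies) / freq_median) * width.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("median interval not found")


# Illustrative grouped data: 20 values spread over three intervals.
approx = grouped_median([(0, 10, 5), (10, 20, 10), (20, 30, 5)])
```

Here N/2 = 10, the median interval is 10 to 20, and 5 values lie below it, so the estimate falls halfway through that interval.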
Mode and Midrange
• The mode is the most frequently occurring value in the set.
• It can be determined for both qualitative and
quantitative attributes.
• Data sets with one, two, or three modes are called
unimodal, bimodal, and trimodal, respectively.
• Empirical relation for unimodal, moderately skewed data:
mean − mode ≈ 3 × (mean − median)
• Midrange is the average of the minimum and maximum values: (min + max) / 2
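A short sketch of the mode and midrange, using the standard library's `statistics.multimode` (available from Python 3.8) so that data with more than one mode are handled. The salary values are illustrative assumptions.

```python
import statistics

# Illustrative salary observations in thousands (assumed data).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# multimode returns every value that shares the highest frequency,
# in order of first appearance.
modes = statistics.multimode(salaries)

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2
```

In this example two values tie for the highest frequency, so the data are bimodal.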
Symmetric vs Skewed Data
• Symmetric data distribution: the mean, median,
and mode are all at the same center value, or
nearly so.
[Figure: symmetric, positively skewed, and negatively skewed data distributions]
• Positively skewed: the mode occurs at a value that
is smaller than the median.
• Negatively skewed: the mode occurs at a value
greater than the median.
Dispersion of Data
• Spread of numeric data distribution.
• Range, Quartiles and Interquartile Range
• Five-Number summary, Boxplots and Outliers
• Variance and Standard Deviation
Range, Quartiles & IQR
• Range: the difference between the
largest and smallest values.
• Quantiles: data points that split the data distribution into
equal-size consecutive sets.
• The 2-quantile is the data point dividing the lower and upper
halves of the data distribution; it corresponds to the median.
• The 4-quantiles are the 3 data points that split the data
distribution into 4 equal parts.
• They are commonly referred to as quartiles.
• The 100-quantiles are commonly referred to as percentiles.
• Interquartile range (IQR) = Q3 − Q1
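Quartiles and the IQR can be sketched with `statistics.quantiles` (Python 3.8+). Several conventions exist for interpolating quantiles, so the exact values depend on the method; this sketch assumes the library's default ("exclusive") method and uses illustrative salary data.

```python
import statistics

# Illustrative salary observations in thousands (assumed data).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

# n=100 returns the 99 cut points known as percentiles.
percentiles = statistics.quantiles(salaries, n=100)

data_range = max(salaries) - min(salaries)
```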
Five-Number Summary and Boxplots
• In a symmetric distribution, the median splits
the data into equal-size halves.
• This does not hold for a skewed distribution, so it is more
informative to also provide the two quartiles Q1 and Q3 along with the median.
• A common rule of thumb for identifying suspected outliers is to single out
the values falling at least 1.5 × IQR above the
third quartile or below the first quartile.
Five-Number Summary and Boxplots, Cont.,
• The five-number summary of a distribution consists of
– Minimum
– Quartile Q1 (25th percentile)
– Median Q2
– Quartile Q3 (75th percentile)
– Maximum
• Boxplots are a popular way of visualizing a
distribution. A boxplot incorporates the five-number
summary.
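A minimal sketch of the five-number summary and the 1.5 × IQR rule for suspected outliers. It assumes the default quantile method of `statistics.quantiles`; the function names and the salary data are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))


def suspected_outliers(values):
    """Values at least 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v <= low_fence or v >= high_fence]


# Illustrative salary observations in thousands (assumed data).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(salaries)
outliers = suspected_outliers(salaries)
```

In a boxplot, the box spans Q1 to Q3, a line inside the box marks the median, and points beyond the fences are usually drawn individually.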
Boxplot representation
Variance and Standard Deviation
• Both indicate how spread out a data distribution is.
• A low standard deviation means that the data
observations tend to be very close to the mean.
• A high standard deviation indicates that the data
are spread out over a large range of values.
• Variance:

• Standard Deviation:
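Assuming the usual population definitions, the variance is the average squared deviation from the mean, (1/N) Σ (xi − mean)², and the standard deviation is its square root. A sketch with illustrative data follows. The sample versions, which divide by N − 1, are a different convention not shown here.

```python
import math
import statistics

# Illustrative observations (assumed data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Population variance: average squared deviation from the mean.
variance = sum((x - mean) ** 2 for x in values) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# Cross-check against the standard library's population versions.
library_variance = statistics.pvariance(values)
library_std_dev = statistics.pstdev(values)
```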
Graphic Displays
• Graphs give a helpful visual description of the
data, which is useful in data preprocessing for
identifying noise and outliers.
– Histograms
– Quantile Plot
– Quantile-Quantile plot
– Scatter Plots
Histograms
• Histos means pole and gram means chart, so a histogram is a
chart of poles.
• A histogram is used to summarize the distribution of a given attribute X.
• The height of a bar indicates the frequency (count) of the
corresponding value of X.
• Numeric attributes are preferred for plotting histograms.
• The range of values for X is partitioned into disjoint
consecutive equal subranges called buckets or bins.
• The range of a bin is known as its width.
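The binning step behind a histogram can be sketched without any plotting library. The function name and the data are assumptions, and the maximum value is placed in the last bin by convention.

```python
def histogram_counts(values, num_bins):
    """Count values in `num_bins` equal-width bins spanning min..max."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        raise ValueError("all values are equal; bin width would be zero")
    width = (highest - lowest) / num_bins
    counts = [0] * num_bins
    for value in values:
        # The maximum falls exactly on the upper edge; keep it in the last bin.
        index = min(int((value - lowest) / width), num_bins - 1)
        counts[index] += 1
    return width, counts


# Illustrative observations (assumed data).
width, counts = histogram_counts([1, 2, 2, 3, 5, 8, 9, 10], num_bins=3)
```

Each entry of `counts` is the height of one bar, and `width` is the bin width.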
Quantile Plot
• A simple and effective way to have a first look at a
univariate data distribution.
• It displays all of the data for the given attribute.
• It plots quantile information.
• Each observation $x_i$ (with the data sorted in increasing order) is paired with a percentage $f_i$,
which indicates that approximately $f_i \times 100\%$ of the data
are below the value $x_i$.
• $f_i = \dfrac{i - 0.5}{N}$
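The points of a quantile plot can be sketched as follows, assuming the common convention f_i = (i − 0.5) / N for the i-th smallest of N observations. The function name and data are illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Illustrative observations (assumed data).
points = quantile_plot_points([40, 10, 30, 20])
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the quantile plot.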
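Quantile-Quantile (q-q) Plot
• It graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• Let N and M be the numbers of observations in the two data sets.
• If M = N, plot the sorted values of one data set against the sorted values of the other.
• If M < N, there are only M points on the q-q plot.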
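One way to compute q-q plot points is sketched below, under two assumptions that the slides do not state: quantile levels follow f_i = (i − 0.5) / M for the smaller data set, and the matching quantile of the larger data set is found by linear interpolation. The function name and data are hypothetical.

```python
def qq_points(xs, ys):
    """Return (x quantile, y quantile) pairs; requires len(ys) <= len(xs)."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    if m == 0 or m > n:
        raise ValueError("ys must be non-empty and no longer than xs")
    points = []
    for i, y in enumerate(ys, start=1):
        # Position of the (i - 0.5) / m quantile within xs (0-based),
        # computed with integers to avoid rounding problems.
        lower, remainder = divmod((2 * i - 1) * n - m, 2 * m)
        upper = min(lower + 1, n - 1)
        fraction = remainder / (2 * m)
        x = xs[lower] + fraction * (xs[upper] - xs[lower])
        points.append((x, y))
    return points


same_size = qq_points([3, 1, 2], [30, 10, 20])    # M = N
fewer_points = qq_points([1, 2, 3, 4], [10, 20])  # M < N
```

When M = N the interpolation fraction is always zero, so the sketch reduces to pairing the sorted values one by one.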
Scatter Plot
• To construct a scatter plot, each pair of values
is treated as a pair of coordinates in an
algebraic sense and plotted as points in the
plane.
• Provides first look at bivariate data to see
clusters and outliers or to explore the
possibility of correlations.
Types of Datasets
• General characteristics of datasets
– Dimensionality (preprocessing may require
dimensionality reduction)
– Sparsity (storing only the non-zero values saves storage and
computation time)
– Resolution (the patterns seen in the data can differ
at different levels of resolution)
Types of Datasets
• Record data
– Transaction or market basket data
– Data matrix or pattern matrix
– Sparse data matrix or Document term matrix
• Graph based data
• Ordered data
– Sequential data
– Sequence data
– Time series data
– Spatial data
Record Data
• Dataset is a collection of records
(Data objects) each of which
consists of a fixed set of data fields
(attributes)
• There is no explicit relationship
among records or data fields and
every record has same set of
attributes.
• Record data usually stored either in
flat-files or in relational DBs or DB
Servers.
Transaction or Market Basket data
• It is a special type of record data, where each
record(transaction) includes a set of items.
• Transaction data is a collection of sets of items,
but it can also be viewed as a set of records whose
fields are asymmetric attributes.
• The attributes can be
– Discrete: number of items purchased
– Continuous: amount spent
Data Matrix or Pattern Matrix
• A set of data objects that all have the same numeric attributes can be
interpreted as an m × n matrix (one row per object, one column per attribute),
which is called a data matrix.
• A data matrix is a variation of
record data, but we can
apply standard matrix
operations to transform it.
• It is the standard data format
for most statistical data.
Sparse Data Matrix
• It is a special case of a data matrix in which the attributes
are of the same type and are asymmetric (only non-zero
values are important).
• Transaction data is an example of a sparse data matrix
that has only 0-1 entries.
• Ex: Document data
• A document can be represented as a term vector, where each term is
a component (attribute) of the vector and the value
of each component is the number of times the
corresponding term occurs.
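A small sketch of a document-term matrix and a sparse representation of it. The two documents, the whitespace tokenization, and the variable names are assumptions for illustration.

```python
from collections import Counter

# Two hypothetical documents (illustrative text only).
documents = [
    "data mining finds patterns in data",
    "graphs model relationships",
]

# Term counts per document, using simple whitespace tokenization.
term_counts = [Counter(doc.split()) for doc in documents]

# The vocabulary defines the columns of the document-term matrix.
vocabulary = sorted(set().union(*term_counts))

# Dense matrix: one row per document, one column per term.
dense = [[counts[term] for term in vocabulary] for counts in term_counts]

# Sparse form: keep only the non-zero entries of each row.
sparse = [dict(counts) for counts in term_counts]
```

Most entries of the dense matrix are zero, which is why storing only the non-zero counts saves space.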
Graph-Based Data
• Two cases
– The graph captures relationships among data
objects
– The data objects themselves are represented as
graphs.
Different variations of Graph Data
Ordered Data
• Different types of ordered data are:
– Sequential Data(Transaction)
– Sequence data
– Time series data
– Spatial data
Sequential Data
• Sequential data: an
extension of record data,
where each record has a
time associated with it.
– Ex: Retail transaction dataset
Sequence Data
• Sequence data contains a sequence of
individual entities, such
as a sequence of letters or
words.
– Ex: Genetic sequence
data, used to predict
similarities in the
structure and functions
of genes
Time Series Data
• Time series data is a special
type of sequential data in
which each record is a time
series, i.e., a series of measurements taken over time.
• Ex: Stock prices, temperature
• Temporal autocorrelation:
when working with temporal
data, if two measurements are
close in time, then the values
of those measurements are
often very similar.
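Temporal autocorrelation can be quantified in several ways. One common choice, assumed here, is the lag-k sample autocorrelation, which compares each measurement with the one k steps later. The function name and the two series are illustrative.

```python
def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a series with itself shifted by `lag`."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("series is constant")
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# A smoothly rising series: neighbouring values are similar.
smooth = lag_autocorrelation([1, 2, 3, 4, 5, 6, 7, 8])

# A series that flips sign at every step: neighbouring values differ.
alternating = lag_autocorrelation([1, -1, 1, -1, 1, -1, 1, -1])
```

A value near +1 indicates that measurements close in time tend to be similar, while a negative value indicates that they tend to differ.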
Spatial Data
• Some objects have
spatial attributes, such
as positions or areas, as well as
other types of
attributes.
• Ex: Weather data
collected from various
geographical areas.
Technologies adopted by Data Mining