DM Introduction

DATA MINING

By

R. Siva Narayana

RGUKT Nuzvid
Data Mining
• Data mining is the process of discovering
interesting patterns and knowledge from large
amounts of data.
– The data sources can include databases, data
warehouses, the Web, other information
repositories, or data that are streamed into the
system dynamically.
Data Mining Evolution
KDD
What kinds of data can be mined?
• Database data
• Data warehouses
• Transactional data
• Other kinds of data
– Time-related or sequence data (e.g., stock exchange data)
– Data streams (e.g., video surveillance and sensor data)
– Spatial data (e.g., maps)
– Hypertext and multimedia data
Getting to Know Your Data
• Real-world data are typically noisy, enormous in volume, and
may originate from heterogeneous sources.
• Knowledge about your data is useful for Data Preprocessing.
– What are the types of attributes?
– What kind of values does each attribute have?
– Which attributes are discrete and which are continuous valued?
– What do the data look like? How are the values distributed?
– How can we visualize the data to get a better sense of it?
– Can we spot any outliers?
– Can we measure the similarity of some data objects with respect
to others?
Data Objects and Attribute Types
• Datasets are made up of data objects
• Data objects are typically described by attributes
• An attribute is a data field, representing a characteristic or
feature of a data object
• Observed values for a given attribute are known as
observations
• A set of attributes used to describe a given object is called
an attribute vector
• The type of an attribute is determined by its set of
possible values. We have
– Nominal
– Binary
– Ordinal
– Numeric
Nominal Attributes
• The values of a nominal attribute are symbols
or names of things
• Each value represents some kind of category,
code, or state, and so nominal attributes are
also referred to as categorical
Nominal Attributes
• It is possible to represent such symbols or “names”
with numbers
– Ex: we can assign a code of 0 for black, 1 for brown, and so
on.
– Even when the possible values are all numeric, the
numbers are not intended to be used quantitatively.
– Mathematical operations on values of nominal
attributes are not meaningful.
– One thing that is of interest is the attribute’s most
commonly occurring value, known as the
mode.
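As a minimal sketch of the points above, the snippet below encodes a nominal attribute with integer codes and reports its mode. The codes for black and brown follow the slide; the third category and the observations are made up for illustration.

```python
from collections import Counter

# Illustrative codes: 0 and 1 follow the slide, "blond" is a hypothetical extra category.
codes = {"black": 0, "brown": 1, "blond": 2}
observations = ["black", "brown", "black", "blond", "black", "brown"]

# Encode each name as its numeric code; the codes are labels, not quantities.
encoded = [codes[value] for value in observations]

# The mode (most common value) is the meaningful summary for nominal data.
mode_code, mode_count = Counter(encoded).most_common(1)[0]
decode = {code: name for name, code in codes.items()}
mode_value = decode[mode_code]

print(mode_value, mode_count)
```

Averaging the codes would produce a number, but it would not correspond to any category, which is why only the mode is reported here.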
Binary Attributes
• A binary attribute is a nominal attribute with only two
categories or states: 0 or 1
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.
– Ex: 1 means the person smokes, 0 means the person does not
– Ex: 1 means a test result is positive (+ve), 0 means it is negative (-ve)
• A binary attribute is symmetric if both of its states are
equally valuable and carry the same weight.
• A binary attribute is asymmetric if the outcomes of the
states are not equally important, such as the positive and
negative outcomes of a medical test for HIV/Corona
Ordinal Attributes
• An ordinal attribute is an attribute with possible
values that have a meaningful order or ranking
among them, but the magnitude between successive
values is not known.
– That is, the values have a meaningful sequence, but we
cannot tell from the values how much bigger one is than another.
• Ordinal attributes are often used in surveys for
ratings.
Ordinal Attributes
• Ordinal attributes may also be obtained from the
discretization of numeric quantities by splitting the value
range into a finite number of ordered categories
• The central tendency of an ordinal attribute can be
represented by its mode and its median (the middle
value in an ordered sequence), but the mean cannot be
defined.
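A small sketch of how the median of an ordinal attribute can be found from the ordering alone. The rating scale and the survey responses are hypothetical.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
levels = ["poor", "fair", "good", "excellent"]
rank = {level: position for position, level in enumerate(levels)}

# Hypothetical survey responses (odd count, so there is a single middle value).
responses = ["good", "poor", "excellent", "good", "fair"]

# Sort by rank, then pick the middle element; no arithmetic on the values is needed.
ordered = sorted(responses, key=rank.__getitem__)
median_value = ordered[len(ordered) // 2]

print(ordered, median_value)
```

The ranks are used only for sorting. Adding or averaging them would assume equal gaps between levels, which an ordinal scale does not guarantee.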
Note
• Nominal, binary, and ordinal attributes are qualitative,
i.e., they describe a feature of an object without giving
an actual size or quantity.
Numerical Attributes
• A numeric attribute is quantitative; that is, it is
a measurable quantity, represented in integer
or real values.
• Numeric attributes can be interval-scaled or
ratio-scaled.
Interval-Scaled Attributes
• Interval-scaled attributes are measured on a
scale of equal-size units.
• The values of interval-scaled attributes have
order and can be positive, 0, or negative
Basic Statistical Description of Data
• Used to identify properties of the data
• Identify which data values should be treated as
Noise or Outliers
• Measuring Central Tendency – Measures the
location of the middle or center of a data
distribution
• Measuring the Dispersion of Data – How data
are spread out? Used to identify outliers.
• Graphic Displays – Visually inspect our data.
Measuring Central Tendency
(Mean, Median, Mode and Midrange)
• Suppose attribute X (e.g., salary) has been recorded for
a set of objects.
• Let x1, x2, ..., xN be the set of N observations for X.
• Mean: the most common and effective numeric
measure of the CENTER of the data:
mean = (x1 + x2 + ... + xN) / N
• It corresponds to the built-in aggregate function average (avg() in
SQL) in an RDBMS
Weighted arithmetic mean or Weighted
average
• Sometimes, each value xi in a set may be
associated with a weight wi, for i = 1, ..., N
• The weights reflect the significance,
importance, or occurrence frequency attached
to their respective values.
• weighted mean = (w1 x1 + w2 x2 + ... + wN xN) / (w1 + w2 + ... + wN)
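The mean and weighted mean can be sketched in a few lines. The salary values (in thousands) and the weights below are illustrative, not taken from the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: each value contributes in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Illustrative salary observations, in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries))

# Illustrative weights, e.g. occurrence frequencies of each value.
print(weighted_mean([30, 50, 70], [1, 2, 1]))
```

With all weights equal, the weighted mean reduces to the ordinary mean.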
Problems and Solution
• A problem with the mean is its sensitivity to
extreme values, either large or small (outliers)
– Ex: The mean score of a class in an exam is pulled down
by a few very low scores.
• To offset this effect, we use the Trimmed Mean
• The Trimmed Mean is obtained after chopping off
a fraction (e.g., 2%) of the values at the high and low extremes.
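A minimal sketch of a trimmed mean. The function name, the exam scores, and the 10% trimming proportion are assumptions chosen so that the effect is visible on a ten-value list; with so few values, a 2% trim would remove nothing.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical exam scores with one very low score.
scores = [5, 62, 65, 68, 70, 71, 73, 75, 78, 80]

print(trimmed_mean(scores, 0.0))  # ordinary mean, pulled down by the score of 5
print(trimmed_mean(scores, 0.1))  # drops the lowest and the highest score
```

Trimming too large a proportion discards useful information, so the proportion is usually kept small.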
Median
• For skewed (asymmetric) data, the median is a
better measure of the center of the data
• Median is the middle value in a set of ordered
data values.
• It separates the data set into two halves.
• If N is odd, then the median is the middle value
of the ordered set
• If N is even, then the median is the average of
middle two values.
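The odd and even cases described above can be sketched as follows; the input lists are illustrative.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))     # odd N: the middle value
print(median([4, 1, 3, 2]))  # even N: average of the two middle values
```

Python's standard library offers the same behavior through `statistics.median`.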
Cont.,
• If the data are grouped in intervals and the
frequency of each interval is known, then the median can be approximated by

$$median = L_1 + \left( \frac{N/2 - (\sum freq)_l}{freq_{median}} \right) \times width$$

– L_1 is the lower boundary of the median interval
– N is the number of values in the entire dataset
– (\sum freq)_l is the sum of frequencies of all the intervals
that are lower than the median interval
– freq_{median} is the frequency of the median interval
– width is the width of the median interval
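The grouped-data approximation can be sketched as below. The function name, the tuple layout, and the example intervals are assumptions made for illustration.

```python
def grouped_median(intervals):
    """Approximate median from (lower_boundary, width, frequency) tuples.

    The intervals are assumed to be sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq


# Hypothetical grouped data: intervals [0, 10), [10, 20), [20, 30).
groups = [(0, 10, 5), (10, 10, 8), (20, 10, 7)]
print(grouped_median(groups))
```

Here N = 20, so the median interval is the one in which the cumulative frequency first reaches 10, which is [10, 20).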
Mode and Midrange
• The mode is the most frequently occurring value in the set
• It can be determined for both qualitative and
quantitative attributes
• Data sets with one, two, or three modes are called
unimodal, bimodal, and trimodal, respectively.
• Empirical relation for unimodal, moderately skewed data:
mean − mode ≈ 3 × (mean − median)
• Midrange is the average of the minimum and maximum values: (min + max) / 2
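A short sketch of mode and midrange using Python's standard library (`statistics.multimode` needs Python 3.8 or later). The data values and the helper `estimate_mode` are illustrative; the helper simply rearranges the empirical relation and is only a rough guide for unimodal, moderately skewed data.

```python
from statistics import multimode

# Illustrative observations (in thousands).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# multimode returns every most-frequent value, so bimodal data yields two modes.
modes = multimode(values)

# Midrange: average of the smallest and largest values.
midrange = (min(values) + max(values)) / 2


def estimate_mode(mean, median):
    """Rearranged empirical relation: mode is roughly mean - 3 * (mean - median)."""
    return mean - 3 * (mean - median)


print(modes, midrange)
```

In this data set both 52 and 70 occur twice, so the set is bimodal and the empirical relation does not apply to it directly.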
Symmetric vs Skewed Data
• Symmetric data distribution: the mean, median,
and mode are all at the same center value or
very near it.
[Figure panels: symmetric data, positively skewed data, negatively skewed data]
• Positively skewed: the mode occurs at a value that
is smaller than the median
• Negatively skewed: the mode occurs at a value
greater than the median
Dispersion of Data
• Spread of numeric data distribution.
• Range, Quartiles and Interquartile Range
• Five-Number summary, Boxplots and Outliers
• Variance and Standard Deviation
Range, Quartiles & IQR
• Range: Difference between the
largest and smallest values.
• Quantiles: Data points that split the data distribution into
equal-size consecutive sets.
• The 2-quantile is the data point dividing the lower and upper
halves of the data distribution (the median)
• The 4-quantiles are the 3 data points that split the data
distribution into 4 equal parts
• The 4-quantiles are commonly referred to as quartiles.
• The 100-quantiles are commonly referred to as percentiles
• Interquartile range (IQR) = Q3 - Q1
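A minimal sketch of range, quartiles, and IQR using `statistics.quantiles` (Python 3.8+). Several conventions exist for computing quartiles, so other tools may return slightly different values on the same data; the data set here is illustrative.

```python
from statistics import quantiles

# Illustrative observations; quantiles() sorts them internally.
data = [7, 1, 9, 3, 11, 5, 2, 8, 4, 10, 6]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# n=4 asks for the three cut points that split the data into four equal parts.
q1, q2, q3 = quantiles(data, n=4)

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```

Calling `quantiles(data, n=100)` in the same way returns the 99 percentile cut points.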
Five-Number Summary and Boxplots
• In a symmetric distribution, the median splits
the data into equal-size halves.
• This does not hold for a skewed distribution, so it is more
informative to report the two quartiles Q1 and Q3 along with the median.
• A common rule for identifying suspected OUTLIERS is to single out
the values falling at least 1.5 × IQR above the
third quartile or below the first quartile.
Five-Number Summary and Boxplots,
Cont.,
• The five-number summary of a distribution consists of
– Minimum
– Quartile Q1 (25th percentile)
– Median Q2
– Quartile Q3 (75th percentile)
– Maximum
• Boxplots are a popular way of visualizing a
distribution. A boxplot incorporates the five-number
summary.
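The five-number summary and the 1.5 × IQR rule can be sketched together. The function names and data are illustrative, and the quartiles follow the default convention of `statistics.quantiles`, which may differ slightly from other tools.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


def suspected_outliers(values):
    """Values at least 1.5 * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v >= q3 + fence or v <= q1 - fence]


# Illustrative observations (in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))
print(suspected_outliers(salaries))
```

In a boxplot, the box spans Q1 to Q3, the line inside marks the median, and suspected outliers are drawn as individual points beyond the whiskers.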
Boxplot representation
Variance and Standard Deviation
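• Both indicate how spread out a data distribution is.
• A low standard deviation means that the data
observations tend to be very close to the mean
• A high standard deviation indicates that the data
are spread out over a large range of values.
• Variance: $$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
• Standard Deviation: $$\sigma = \sqrt{\sigma^2}$$, the square root of the variance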
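A small sketch of variance and standard deviation, assuming the population form that divides by N. The sample form divides by N − 1 instead. The data values are illustrative.

```python
import math


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    center = sum(values) / n
    return sum((x - center) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative observations with mean 5
print(variance(data), standard_deviation(data))
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.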
Graphic Displays
• Graphs give a visual description of the
data, which is useful in data preprocessing for
identifying noise and outliers
– Histograms
– Quantile Plot
– Quantile-Quantile plot
– Scatter Plots
Histograms
• Histos means pole, gram means chart, so a histogram is a
chart of poles
• A histogram is used to summarize the distribution of a given attribute X
• The height of each bar indicates the frequency (count) of the
corresponding X values
• Numeric attributes are preferred for plotting histograms
• The range of values for X is partitioned into disjoint
consecutive equal subranges called buckets or bins
• The range of a bin is known as its width
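A minimal sketch of equal-width binning, the counting step behind a histogram. The function name and the data are illustrative; the last bin is treated as closed on the right so that the maximum value is counted.

```python
def equal_width_histogram(values, num_bins):
    """Partition [min, max] into equal-width bins and count values per bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


values = [1, 2, 2, 3, 5, 6, 8, 9, 10]  # illustrative numeric attribute
edges, counts = equal_width_histogram(values, 3)
print(edges, counts)
```

Each count becomes the height of one bar; the bin edges mark where each bar starts and ends on the horizontal axis.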
Quantile Plot
• A simple and effective way to have a first look at a
univariate data distribution.
• It displays all of the data for the given attribute
• It plots quantile information
• Each observation xi, sorted in increasing order, is paired with a
percentage fi, which indicates that approximately fi × 100% of the data
are below the value xi.
• fi = (i − 0.5) / N
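The points of a quantile plot can be computed as sketched below, assuming the common convention f_i = (i − 0.5) / N for the i-th smallest of N observations. The function name and data are illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([40, 10, 30, 20])
for fraction, value in points:
    print(fraction, value)
```

Plotting the values on the vertical axis against the fractions on the horizontal axis gives the quantile plot.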
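q-q Plot
• It graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• Let the two data sets have N and M observations. If M = N, plot the
sorted values of one set directly against the sorted values of the other.
• If M < N, there can be only M points on the q-q plot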
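A sketch of how q-q plot points might be computed for two data sets with N and M observations, M ≤ N. For M < N, it assumes the (i − 0.5) / M quantile convention and uses linear interpolation in the larger set; the interpolation is one reasonable choice rather than the only one, and the function name is hypothetical.

```python
import math


def qq_points(first, second):
    """Return (x, y) pairs for a q-q plot; `second` must not be larger than `first`."""
    xs, ys = sorted(first), sorted(second)
    n, m = len(xs), len(ys)
    if m > n:
        raise ValueError("pass the larger data set first")
    if m == n:
        # Same size: pair the sorted values directly.
        return list(zip(xs, ys))
    points = []
    for i, y in enumerate(ys, start=1):
        f = (i - 0.5) / m           # quantile level of the i-th value in the smaller set
        pos = f * n - 0.5           # matching zero-based position in the larger set
        low = math.floor(pos)
        high = min(low + 1, n - 1)
        x = xs[low] + (pos - low) * (xs[high] - xs[low])  # linear interpolation
        points.append((x, y))
    return points


print(qq_points([1, 2, 3], [10, 20, 30]))
print(qq_points([1, 2, 3, 4], [10, 20]))
```

If the points lie close to a straight line, the two distributions have a similar shape.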
Scatter Plot
• To construct a scatter plot, each pair of values
is treated as a pair of coordinates in an
algebraic sense and plotted as a point in the
plane.
• It provides a first look at bivariate data to see
clusters and outliers, or to explore the
possibility of correlations.
Types of Datasets
• General characteristics of datasets
– Dimensionality (high-dimensional data may require
dimensionality reduction during preprocessing)
– Sparsity (storing only the non-zero values saves storage and
computation time)
– Resolution (the patterns seen in the data differ at
different levels of resolution)
Types of Datasets
• Record data
– Transaction or market basket data
– Data matrix or pattern matrix
– Sparse data matrix or Document term matrix
• Graph based data
• Ordered data
– Sequential data
– Sequence data
– Time series data
– Spatial data
Record Data
• Dataset is a collection of records
(Data objects) each of which
consists of a fixed set of data fields
(attributes)
• There is no explicit relationship
among records or data fields and
every record has same set of
attributes.
• Record data is usually stored either in
flat files or in relational databases or database
servers.
Transaction or Market Basket data
• It is a special type of record data, where each
record (transaction) includes a set of items.
• Transaction data is a collection of sets of items,
but it can also be viewed as a set of records whose
fields are asymmetric attributes.
• The attributes can be
– Discrete: number of items purchased
– Continuous: amount spent
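The view of transactions as records with asymmetric binary attributes can be sketched as follows. The item names and transactions are made up for illustration.

```python
# Hypothetical market basket transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers"},
]

# One column per distinct item, in a fixed (alphabetical) order.
items = sorted(set().union(*transactions))

# One row per transaction: 1 if the item was purchased, 0 otherwise.
matrix = [[1 if item in basket else 0 for item in items] for basket in transactions]

print(items)
for row in matrix:
    print(row)
```

Only the 1 entries carry information about what was bought, which is why these attributes are treated as asymmetric.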
Data Matrix or Pattern Matrix
• A set of data objects with the same fixed set of numeric
attributes can be interpreted as an m × n matrix
(m objects, n attributes), called a data matrix.
• A data matrix is a variation of
record data, but we can
apply standard matrix
operations to transform it.
• It is the standard data format
for most statistical data.
Sparse Data Matrix
• It is a special case of a data matrix in which the attributes
are of the same type and are asymmetric (only non-zero
values are important)
• Transaction data is an example of a sparse data matrix
that has only 0-1 entries.
• Ex: Document data
• A document can be represented as a term vector, where each term is
a component (attribute) of the vector and the value
of each component is the number of times the
corresponding term occurs in the document.
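A minimal sketch of building a document-term matrix from term counts. The two documents are made up, and splitting on whitespace stands in for real tokenization.

```python
from collections import Counter

# Hypothetical documents.
documents = [
    "data mining finds patterns in data",
    "mining data streams",
]

# Vocabulary: every distinct term, in a fixed (alphabetical) order.
vocabulary = sorted({term for doc in documents for term in doc.split()})

# One row per document; each entry is how often that term occurs in the document.
term_counts = [Counter(doc.split()) for doc in documents]
vectors = [[counts[term] for term in vocabulary] for counts in term_counts]

print(vocabulary)
for vector in vectors:
    print(vector)
```

With a realistic vocabulary most entries are zero, so such matrices are usually stored in a sparse format that keeps only the non-zero counts.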
Graph-Based Data
• Two cases
– The graph captures relationships among data
objects
– The data objects themselves are represented as
graphs.
Different variations of Graph Data
Ordered Data
• Different types of ordered data are:
– Sequential Data(Transaction)
– Sequence data
– Time series data
– Spatial data
Sequential Data
• Sequential data:
Extension of record data,
where each record has a
time associated with it.
– Ex: Retail transaction
Dataset
Sequence Data
• Contains a sequence of
individual entities, such
as a sequence of letters or
words.
– Ex: Genetic sequence
data, used to predict
similarities in the
structure and functions
of genes
Time Series Data
• Time series data is a special
type of sequential data in
which each record is a time
series, i.e., a series of measurements taken over time.
• Ex: Stock prices, temperature readings
• Temporal autocorrelation:
When working with temporal
data, if two measurements are
close in time, then the values
of those measurements are
often very similar
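Temporal autocorrelation can be quantified with a lag autocorrelation coefficient. The sketch below uses one common definition (sum of lagged products of deviations divided by the total sum of squared deviations); the function name and the two series are illustrative.

```python
def lag_autocorrelation(series, lag=1):
    """Correlation between a series and itself shifted by `lag` time steps."""
    n = len(series)
    center = sum(series) / n
    total = sum((x - center) ** 2 for x in series)
    if total == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    lagged = sum(
        (series[t] - center) * (series[t + lag] - center) for t in range(n - lag)
    )
    return lagged / total


smooth = [1, 2, 3, 4, 5, 6, 7, 8]  # neighbouring values are similar
alternating = [1, -1, 1, -1]       # neighbouring values are dissimilar

print(lag_autocorrelation(smooth))
print(lag_autocorrelation(alternating))
```

A value near +1 means measurements close in time tend to be similar; a negative value means they tend to alternate.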
Spatial Data
• Some objects have
spatial attributes, such
as positions or areas, in addition to
other types of
attributes
• Ex: Weather data
collected from various
geographical areas.
Technologies adopted by Data Mining