
Introduction to Data Mining

Data mining:

Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or
knowledge from huge amounts of data.

It is also known by various names: knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business
intelligence, etc.

Knowledge Discovery (KDD) Process

KDD is an iterative process where evaluation measures can be enhanced, mining can be refined,
new data can be integrated and transformed in order to get different and more appropriate results.

1. Data Integration: The process of collecting data from several different sources and storing
all those data in a common source (Database).
Data integration can be done using Data migration tools, data synchronization tools, ETL
(extract-transform-load) process.
2. Data Cleaning: Removal of noisy and irrelevant data from the collection. Irrelevant data here
means missing values, noisy data (noise here refers to random error or variance), etc.
Data cleaning can be done using data discrepancy detection and data transformation tools.
3. Data Warehousing: Now that we have cleaned the data, we need to store it in a common
source to perform the necessary analysis, so the data is stored in data warehouses. Since the
stored data is in a structured format, it becomes easier to perform data selection and
transformation.
4. Data Selection: The process where data relevant to the analysis is decided and retrieved
from the data collection. Data Selection can be done using Neural network, Decision trees,
Naïve Bayes, clustering, regression etc.
5. Data Transformation: The process of transforming data into appropriate form required by
the mining procedure. It is a two-step process:
Data Mapping: Assigning elements from source base to destination to capture
transformation
Code Generation: Creation of the actual Transformation program.
6. Data Mining: defined as the application of intelligent techniques to extract potentially
useful patterns.
Transforms task-relevant data into patterns.
Decides the purpose of the model, e.g., classification or characterization.
7. Pattern Evaluation: defined as identifying truly interesting patterns representing
knowledge based on given interestingness measures.
A pattern is considered interesting based on its interestingness score.
Uses summarization and visualization to make the data understandable to users.
8. Knowledge Representation: a technique which utilizes visualization tools to represent
data mining results. They could be in the form of reports, tables, discriminant rules,
classification rules, characterization rules, etc.
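The KDD steps above can be illustrated with a small, hypothetical pipeline. This is a minimal sketch, assuming pandas is available; the file names and column names (e.g., `price`, `quantity`) are made up for illustration:

```python
import pandas as pd

# Data integration: collect data from two hypothetical sources into one frame
sales = pd.read_csv("sales.csv")          # e.g., an export from the sales system
customers = pd.read_csv("customers.csv")  # e.g., an export from the CRM
data = sales.merge(customers, on="customer_id", how="left")

# Data cleaning: drop rows with missing values and remove obvious noise
data = data.dropna()
data = data[data["price"] > 0]

# Data selection: keep only the attributes relevant to the analysis
relevant = data[["customer_id", "age", "price", "quantity"]]

# Data transformation: derive a new attribute in the form the mining step expects
relevant = relevant.assign(revenue=relevant["price"] * relevant["quantity"])

# The mining, pattern evaluation and presentation steps would follow, e.g. a
# per-customer summary that could feed a clustering or classification model
summary = relevant.groupby("customer_id")["revenue"].sum().sort_values(ascending=False)
print(summary.head())
```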

Data Mining Functions

1. Characterization and Discrimination


Characterization: It’s the process of summarizing data of the class under study or it is the
summarization of general characteristics or features of the target class of data.
Example:
 Characteristics of the software product with sales that increased by 15% in the last
quarter
 Characteristics of customers who spend more than $10,000 a year on groceries
at a mall, e.g., age, employment, credit rating, etc.
Discrimination: It is comparison of features of the target class data object against the
features of the objects from other class of the data.
Example:
 Comparing features of the software product whose sales increased by 10% last quarter
against those whose sales decreased by 10% in the last quarter.
 Comparing two groups of customers: those who frequently shop for a specific product
and those who rarely shop for it. For example, customers who buy the product frequently
are between ages 20 and 30 and hold a college degree, whereas those who rarely buy it
are above 60 and have no college degree.

2. Association and Correlation


Association: The technique used to find interesting patterns among items that occur together
in a dataset.
Example: A typical association rule
■ Diaper => Beer [1%, 75%] (support, confidence)
■ Are strongly associated items also strongly correlated?
It is not necessary that strongly associated items are strongly correlated but strongly
correlated items are strongly associated.

3. Classification
 Describe and distinguish classes or concepts for future prediction
 E.g., classify countries based on (climate), or classify cars based on (gas mileage)
 Predict some unknown class labels
 Construct models (functions) based on some training examples
 Typical methods
 Decision trees, naïve Bayesian classification, support vector machines, neural networks,
rule-based classification, pattern-based classification, logistic regression, etc.

 Typical applications: Credit card fraud detection, direct marketing, classifying stars, diseases,
webpages, etc.

4. Cluster Analysis
 Unsupervised learning (i.e., Class label is unknown)
 Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution
patterns
 Principle: Maximizing intra-class similarity & minimizing interclass similarity
 Many methods and applications
 K-means, k-medoid, hierarchical clustering

5. Outlier Analysis
Outlier: A data object that does not comply with the general behavior of the data
■ Noise or exception? ― One person’s garbage could be another person’s treasure
■ Methods: by product of clustering or regression analysis, …
■ Useful in fraud detection, rare events analysis

Technologies Used:

Issues in Data Mining Research


 Data mining algorithms must be efficient and scalable
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 It is important to study impact of data mining on society
 How can we use data mining to benefit society? How can we guard against its misuse?
 Privacy-preserving data mining: DM poses risk of disclosing an individual’s personal
information

Getting to Know your Data

Data Objects

A dataset is made up of data objects and data objects represent the entities of the dataset.
Example:
Sales database: the data objects here are customers, store items, sales
Medical database: Patients, treatments
Data objects are described by attributes. Database Rows -> data objects, columns -> attributes

Attributes: Also called dimensions, features, or variables. An attribute is a data field that represents a
characteristic or feature of a data object.
Ex: customer id, name, address

Attribute Types:
1. Nominal: Qualitative in nature. It can be categories, states, or names of things.
Ex: Hair colour = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
2. Binary: Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: Both outcomes are equally important. Ex: gender
 Asymmetric binary: outcomes not equally important. Ex: medical test (positive vs.
negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)

3. Ordinal: Values from which meaningful order can be obtained but the magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings

4. Numeric: Quantitative in nature.


Interval: Measured on a scale of equal-sized units. Values have order. Ex: temperature in
°C or °F, calendar dates. There is no true zero-point, since it is a range.
Ratio: Has a true zero-point. We can speak of values as being an order of magnitude larger than
other values in terms of the unit of measurement. Ex: 10 K is twice as high as 5 K; length,
counts, monetary quantities.

Kinds of Attributes
1. Discrete Attribute: Has only a finite or countably infinite set of values. Sometimes,
represented as integer variables. Note: Binary attributes are a special case of discrete
attributes
E.g., zip codes, profession, or the set of words in a collection of documents.

2. Continuous Attribute: Has real numbers as attribute values. Practically, real values can only
be measured and represented using a finite number of digits. Continuous attributes are
typically represented as floating-point variables.

Measuring the Central Tendency


1. Mean (algebraic measure): the average value of the sample or population.

Sample mean: x̄ = (Σ xi) / n        Population mean: μ = (Σ xi) / N

Weighted mean: x̄ = (Σ wi·xi) / (Σ wi), where wi are the weights

2. Median: The midpoint of the values after they have been ordered from the minimum to the
maximum values. Middle value if odd number of values, or average of the middle two values
otherwise.

3. Mode: Value that occurs the most in the data. It can be Unimodal, bimodal, trimodal.
Empirical formula (for moderately skewed data): mean − mode ≈ 3 × (mean − median)
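A short sketch of these measures using only the Python standard library; the sample values are made up for illustration:

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 52, 56, 60, 63, 70, 110]

mean = statistics.mean(values)      # arithmetic mean: 56.5
median = statistics.median(values)  # average of the two middle values: 54.0
mode = statistics.mode(values)      # most frequent value: 52

print(mean, median, mode)

# The empirical relation mean - mode ≈ 3 * (mean - median) holds only roughly
print(mean - mode, 3 * (mean - median))
```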

Symmetric Versus Skewed data

Positively skewed data: mean > median > mode. Symmetric data: mean = median = mode.
Negatively skewed data: mode > median > mean.


Measuring the dispersion of Data
1. Range (Quartiles, outliers, and boxplots)
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1 , median, Q3 , max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
2. Variance and Standard deviation
Variance is the average squared deviations from the mean, while standard deviation is the
square root of this number.

Variance of sample: s² = (1 / (n − 1)) Σ (xi − x̄)²        Variance of population: σ² = (1 / N) Σ (xi − μ)²

SD for sample: s = √s²        SD for population: σ = √σ²
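A minimal sketch computing these dispersion measures with NumPy (assumed installed); the sample values are made up:

```python
import numpy as np

x = np.array([15, 36, 39, 40, 41, 42, 43, 47, 49, 110])

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number_summary = (x.min(), q1, med, q3, x.max())

# Usual boxplot outlier rule: beyond 1.5 * IQR from the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]

sample_var = x.var(ddof=1)      # divides by n - 1
population_var = x.var(ddof=0)  # divides by n
sample_sd = x.std(ddof=1)

print(five_number_summary, iqr, outliers, sample_var, sample_sd)
```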

Boxplot analysis: It displays the five-number summary of a distribution.

Boxplot
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended to Minimum and Maximum
 Outliers: points beyond a specified outlier threshold, plotted individually

Properties of Normal Distribution Curve


The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it
Graphic Displays of Basic Statistical Descriptions
1. Histogram Analysis:
 Histogram: Graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart, in that it is the area of the bar that denotes the value, not the
height as in bar charts, a crucial distinction when the categories are not of uniform
width.
 The categories are usually specified as non-overlapping intervals of some variable. The
categories (bars) must be adjacent.
 Histograms often tell more than boxplots
 Two histograms may have the same boxplot representation
 (the same values for min, Q1, median, Q3, max)
 but rather different data distributions

2. Quantile-Quantile Plot
The quantile-quantile (q-q) plot is a graphical method for determining whether two samples of data
came from the same population or not. A q-q plot plots the quantiles of the first data set against the
quantiles of the second data set.

3. Scatter Plot
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to
display values for typically two variables for a set of data.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Correlations:
The correlation coefficient measures the degree to which the data points of one variable tend to vary
with changes in the data points of another variable.

Types of correlation:
1. Positive: A positive correlation—when the correlation coefficient is greater than 0—signifies
that both variables move in the same direction. When ρ is +1, it signifies that the two
variables being compared have a perfect positive relationship; when one variable moves
higher or lower, the other variable moves in the same direction with the same magnitude.
The closer the value of ρ is to +1, the stronger the linear relationship

2. Negative: A negative (inverse) correlation occurs when the correlation coefficient is less than 0.
This is an indication that the two variables move in opposite directions. In short, if one
variable increases, the other variable decreases with the same magnitude (and vice versa).
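A small sketch of the (Pearson) correlation coefficient ρ with NumPy, on illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0                                       # perfectly positively related to x
y_neg = -0.5 * x + np.array([0.1, -0.2, 0.0, 0.2, -0.1])    # roughly negatively related to x

# np.corrcoef returns the correlation matrix; entry [0, 1] is rho(x, y)
print(np.corrcoef(x, y_pos)[0, 1])  # +1.0 -> perfect positive correlation
print(np.corrcoef(x, y_neg)[0, 1])  # close to -1 -> strong negative correlation
```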

Similarity and Dissimilarity


1. Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
2. Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
3. Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix


1. Data Matrix:
 n data points with p dimensions
 Two modes
2. Dissimilarity Matrix
 n data points, but registers only the distance
 A triangular matrix
 Single mode

Proximity Measure for Nominal Attributes


■ Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
■ Method 1: Simple matching
d(i, j) = (p − m) / p, where m: # of matches, p: total # of variables

■ Method 2: Use a large number of binary attributes


 creating a new binary attribute for each of the M nominal states

Proximity Measure for Binary Attributes


 A contingency table for binary data: for objects i and j, let q be the number of attributes that
equal 1 for both, r the number that are 1 for i and 0 for j, s the number that are 0 for i and 1
for j, and t the number that equal 0 for both.

 Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)

 Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)

 Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)

 Note: the Jaccard coefficient is the same as “coherence”
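A sketch of these proximity measures written directly from the definitions above (simple matching for nominal attributes; symmetric, asymmetric and Jaccard measures for binary attributes). The two example records are made up:

```python
def simple_matching_dissimilarity(i, j):
    """Nominal attributes: d(i, j) = (p - m) / p."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

def binary_measures(i, j):
    """Binary attributes: count the q, r, s, t cells of the contingency table."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    d_symmetric = (r + s) / (q + r + s + t)
    d_asymmetric = (r + s) / (q + r + s)   # negative matches t are ignored
    jaccard = q / (q + r + s)              # similarity, equals 1 - d_asymmetric
    return d_symmetric, d_asymmetric, jaccard

print(simple_matching_dissimilarity(["red", "M", "single"], ["red", "L", "single"]))
print(binary_measures([1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```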


Standardizing Numeric Data
Reason: Data standardization is about making sure that data is internally consistent; that is, each
data type has the same content and format. Standardized values are useful for tracking data that
isn't easy to compare otherwise.

The most widely used standardization method is the Z-score:

z = (X − μ) / σ

 X: raw score to be standardized, μ: mean of the population, σ: standard deviation


 the distance between the raw score and the population mean in units of the standard
deviation
 negative when the raw score is less than mean, positive when greater than mean.

Alternative ways: use the mean absolute deviation, MAD = (1/n) Σ |xi − mean|, in place of the
standard deviation in the denominator.

 Using the mean absolute deviation is more robust than using the standard deviation
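A brief sketch of z-score standardization and the mean-absolute-deviation variant, assuming NumPy; the input values are illustrative:

```python
import numpy as np

x = np.array([20000.0, 30000.0, 32000.0, 35000.0, 48000.0, 110000.0])

# Classic z-score: (x - mean) / standard deviation
z = (x - x.mean()) / x.std()

# Variant using the mean absolute deviation, which dampens the effect of outliers
mad = np.mean(np.abs(x - x.mean()))
z_mad = (x - x.mean()) / mad

print(np.round(z, 2))
print(np.round(z_mad, 2))
```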

Association Rule Mining

Definition: Association rule mining is a machine learning technique used to find associations/relations
between items/products/variables in large transactional data sets.
To find association between items we first need to find frequent item sets from transactional data.

Frequent Pattern Analysis: a pattern (a set of items, subsequences, substructures, etc.) that appears
frequently in a data set.
A typical example of frequent itemset mining is market basket analysis. It is the process of analysing
customer buying habits by finding associations between the different items that customers place in
their shopping baskets.

Applications
 Helps in business decision making processes such as catalogue design, cross-marketing,
customer shopping behaviour
 Products that are frequently purchased together can be bundled together and discount can
be offered to increase the sale.
 Design store layout
 Strategy 1: Items that are purchased together can be placed in proximity
 Strategy 2: Place them at opposite ends of the store, so that customers who purchase such
items pick up other items along the way.

Basic Concepts
Frequent Patterns: An itemset is a set of items, and an itemset that contains k items is called a k-itemset.
An itemset is called a frequent itemset if it satisfies the minimum support threshold condition and all its
non-empty subsets are also frequent.
Association Rules:
Frequent patterns are represented in the form of rules
Support: This says how popular an itemset is, as measured by the proportion of transactions in
which the itemset appears (a probability): support(X => Y) = P(X ∪ Y).

Confidence: This says how likely item Y is purchased when item X is purchased, measured by the
proportion of transactions with item X in which item Y also appears (a conditional probability):
confidence(X => Y) = P(Y | X) = support(X ∪ Y) / support(X).
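A small worked example of support and confidence on a made-up set of five transactions:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability of rhs given lhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"diaper", "beer"}))       # 3/5 = 0.6
print(confidence({"diaper"}, {"beer"}))  # 3/4 = 0.75
```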

Association rule mining can be viewed as a two-step process:


 Find all frequent itemsets (k-itemsets that are frequently purchased together)
 Generate strong rules from frequent itemsets

Apriori algorithm (finding frequent itemset)


Apriori is a seminal algorithm proposed by Agrawal and Srikant in 1994 for mining frequent itemsets.
The Apriori algorithm is based on the Apriori principle.
Apriori pruning principle:
 If any itemset is infrequent, its supersets should not be generated/tested.
 Every non-empty subset of a frequent itemset must also be frequent.

Method:
 Initially, scan the DB once to get the frequent 1-itemsets
 Generate frequent 2-itemsets
 Generate frequent 3-itemsets, ...
 Terminate when no frequent candidate set can be generated (see the sketch below)
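A compact, from-scratch sketch of the Apriori level-wise search, reusing the `transactions` list from the support/confidence example above. Real implementations add more efficient candidate counting; this is only an illustration of the join, prune and support-filter steps:

```python
from itertools import combinations

def apriori(transactions, min_support=0.6):
    n = len(transactions)
    # One scan of the database gives the frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = sum(itemset <= t for t in transactions) / n
        # Join step: build candidate (k+1)-itemsets from the frequent k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (Apriori principle): every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Keep only candidates that meet the minimum support threshold
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        k += 1
    return frequent

freq = apriori(transactions, min_support=0.6)
for itemset, sup in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(set(itemset), round(sup, 2))
```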

Generating Association rules


For each frequent itemset l, generate all nonempty subsets of l
For every nonempty subset s of l,
generate rule: s => (l-s) if support count (l) / support count (s) >= min-conf threshold

As the rules are generated from frequent itemsets, each rule automatically satisfies the minimum
support threshold.
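Continuing the sketch above, strong rules can be generated from the `freq` dictionary produced by the Apriori sketch by checking only the confidence threshold:

```python
from itertools import combinations

def generate_rules(frequent, min_conf=0.7):
    """For each frequent itemset l and non-empty proper subset s, emit s => (l - s)
    when support(l) / support(s) >= min_conf."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = sup_l / frequent[s]   # every subset s is itself frequent
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), round(sup_l, 2), round(conf, 2)))
    return rules

for lhs, rhs, sup, conf in generate_rules(freq, min_conf=0.7):
    print(f"{lhs} => {rhs}  [support={sup}, confidence={conf}]")
```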
Misleading Rules:

 The rule is misleading because the overall probability of purchasing videos is 75%, which is
greater than the confidence of the rule (66%).
 Computer games and videos are negatively correlated, because the purchase of one of these
items decreases the likelihood of purchasing the other.
 Use another measure lift

Lift: This says how likely item Y is purchased when item X is purchased, while controlling for how
popular item Y is: lift(A->B) = support(A ∪ B) / (support(A) × support(B)) = confidence(A->B) / support(B).

 if lift(A->B)=1 then occurrence of A is independent of occurrence of B. No association


between items.
 if lift(A->B) <1, then occurrence of A is negatively correlated with occurrence of B i.e.
occurrence of A decreases chances of occurrence of B by the factor of lift(A->B).
 if lift(A->B)>1 then occurrence of A is positively correlated with occurrence of B i.e.
occurrence of A increases chances of occurrence of B by the factor of lift(A->B)

Leverage:
 leverage(X→Y) = support(X ∪ Y) − support(X) × support(Y),
i.e., leverage = Pr(L, R) − Pr(L)·Pr(R)
 Leverage measures the difference between observed and expected joint probability of X and
Y
 Leverage value “0” indicates that occurrence of X and Y is independent of each other
 Value above zero is desired for Leverage
Conviction:
 conviction(X -> Y) = (1 − supp(Y)) / (1 − conf(X -> Y)) = P(X)·P(not Y) / P(X and not Y),
i.e., conviction = Pr(L)·Pr(not R) / Pr(L, not R)
 Conviction compares the probability that X appears without Y if they were dependent, with
the actual frequency of the appearance of X without Y.
 conviction value “1” indicates that occurrence of X and Y is independent of each other
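A short sketch computing lift, leverage and conviction directly from the probability definitions above; the support values used correspond to the toy transactions from the earlier example (X = {diaper}, Y = {beer}):

```python
def lift(sup_xy, sup_x, sup_y):
    return sup_xy / (sup_x * sup_y)

def leverage(sup_xy, sup_x, sup_y):
    return sup_xy - sup_x * sup_y

def conviction(sup_xy, sup_x, sup_y):
    conf = sup_xy / sup_x
    if conf == 1.0:
        return float("inf")   # the rule is never violated
    return (1.0 - sup_y) / (1.0 - conf)

# support(X) = 0.8, support(Y) = 0.6, support(X ∪ Y) = 0.6
print(lift(0.6, 0.8, 0.6))        # 1.25 -> positively correlated
print(leverage(0.6, 0.8, 0.6))    # 0.12 -> above zero, as desired
print(conviction(0.6, 0.8, 0.6))  # 1.6  -> above 1, X tends to occur with Y
```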

Classification

Supervised learning:
 The training dataset has labels indicating the class of each observation. New data is classified
based on the training set.
Unsupervised learning:
 The class labels of training dataset are unknown

Classification is a two-step process


 Model Construction: A model is built/constructed based on what kind of outcome the
developer is expecting i.e., a set of predetermined classes are used in describing the model.
It is assumed that each sample/tuple belongs to predefined class, which is determined by
the class label attribute.
A set of tuples from the dataset called as training dataset, is used for model construction.
The model is represented as classification rules, decision trees, or mathematical formulae.

 Model Usage: The built model is used in the future for classifying unknown objects.
The accuracy of the model is calculated: the known labels of the test samples are compared with
the results of the model.

Accuracy: The percentage of test samples correctly predicted by the model.


Test set is independent of training set. If the accuracy is acceptable, use the model to classify
new data. If the test set is used to select models, it is called the validation set
Decision Tree:
Decision Trees (DTs) are a non-parametric supervised learning method used
for classification and regression. The goal is to create a model that predicts the value of a target
variable by learning simple decision rules inferred from the data features. 

Algorithm used: A greedy and recursive algorithm (Greedy meaning that at every step it makes the
most optimal decision and recursive meaning it splits the larger question into smaller questions and
resolves them the same way)

 The DT is constructed in a top-down recursive divide-and-conquer manner


 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected based on a heuristic or statistical measure (e.g., information
gain)

Conditions for stopping partitioning


 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning – majority voting is employed for
classifying the leaf
 There are no samples left

Entropy (Information Theory)


It is a measure of the uncertainty associated with a random variable.
 Formula to calculate: H(X) = −Σi p(xi) log2 p(xi)

 High entropy means high uncertainty


 Low entropy means low uncertainty

Conditional Entropy: H(Y | X) = Σx p(x) H(Y | X = x)
Attribute Selection Measures
1. Information gain (models/algorithms: ID3/C4.5)
The attribute with the highest information gain is selected. Information gain is calculated
using the following steps and formula.
Let pi be the probability that an arbitrary tuple in D belongs to class Ci.

Entropy before splitting the data (calculated on the target variable or the overall result):
Info(D) = −Σi pi log2(pi)

Entropy after splitting the data using attribute A (which partitions D into D1, ..., Dv):
InfoA(D) = Σj (|Dj| / |D|) × Info(Dj)

Information gained by branching on attribute A:
Gain(A) = Info(D) − InfoA(D)

The decision to split at each node is made according to a metric called purity. A node is 100%
impure when it is split evenly 50/50 and 100% pure when all of its data belongs to a single
class.
Computing Information-Gain for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of attribute A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered as a
possible split point
 (ai + ai+1) / 2 is the midpoint between the values of ai and ai+1
 The point with the lowest entropy (minimum expected information requirement) for
A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying
A > split-point.
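A small sketch of entropy and information gain, including the midpoint-based split search for a continuous-valued attribute, written from the formulas above; the tiny dataset is made up for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj) for a categorical attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_after = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_after

def best_numeric_split(values, labels):
    """Try the midpoint of each pair of adjacent sorted values as a split point."""
    n = len(labels)
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        split = (a + b) / 2
        left = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        gain = entropy(labels) - (len(left) / n * entropy(left)
                                  + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (split, gain)
    return best

rows = [("sunny",), ("sunny",), ("overcast",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes", "no"]
print(round(info_gain(rows, labels, 0), 3))           # gain of the categorical attribute
print(best_numeric_split([64, 65, 68, 69, 70], labels))  # (split point, gain)
```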

2. Gain Ratio (C4.5)


 Information gain measure is biased towards attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to
information gain)

SplitInfoA(D) = −Σj (|Dj| / |D|) × log2(|Dj| / |D|), where |Dj| is the number of tuples having the
j-th value of attribute A. Using the previous example (income), there are 4 tuples with the low label,
6 with medium and 4 with high, so SplitInfo_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) −
(4/14)·log2(4/14) = 1.557.

Gain(A) is calculated using Information Gain, e.g., Gain(income) = 0.029.


gain ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute.
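A quick sketch of split information and gain ratio that reproduces the income example's numbers from the definitions above (the gain value 0.029 is taken as given):

```python
from math import log2

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum((d / n) * log2(d / n) for d in partition_sizes)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# income splits D into partitions of sizes 4 (low), 6 (medium), 4 (high)
print(round(split_info([4, 6, 4]), 3))         # ~1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # ~0.019
```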

3. Gini Index (CART, IBM Intelligent Miner)


The Gini index measures the probability of a particular element being classified incorrectly when it
is randomly labelled according to the class distribution.
If a data set D contains examples from n classes, the Gini index, gini(D), is defined as

gini(D) = 1 − Σj pj²

where pj is the relative frequency of class j in D.


If a data set D is split on A into two subsets D1 and D2, the Gini index giniA(D) is defined as

giniA(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)

Reduction in Impurity: Δgini(A) = gini(D) − giniA(D)
The attribute that provides the smallest giniA(D) (or, equivalently, the largest reduction in impurity) is
chosen to split the node (need to enumerate all the possible splitting points for each
attribute).
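A minimal sketch of the Gini index and the impurity reduction for a binary split, following the formulas above; the class counts are illustrative:

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

data = ["yes"] * 9 + ["no"] * 5   # 9 positive and 5 negative tuples in D
left = ["yes"] * 7 + ["no"] * 3   # candidate partition D1
right = ["yes"] * 2 + ["no"] * 2  # candidate partition D2

print(round(gini(data), 3))                            # gini(D)
print(round(gini_split(left, right), 3))               # gini_A(D)
print(round(gini(data) - gini_split(left, right), 3))  # reduction in impurity
```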

Gini Index vs Information Gain


The main differences between the Gini Index and Information Gain are:
1. The Gini Index favours larger partitions and is easy to implement, whereas Information Gain
favours smaller partitions with many distinct values.
2. The Gini Index is used by the CART algorithm; in contrast, Information Gain is used in the ID3
and C4.5 algorithms.
3. The Gini Index operates on categorical target variables in terms of “success” or “failure” and
performs only binary splits; in contrast, Information Gain computes the difference between entropy
before and after the split and indicates the impurity of the classes of elements.
Bayesian Classification:
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
■ Foundation: Based on Bayes’ Theorem.
■ Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance
with decision tree and selected neural network classifiers
Naïve Bayes Classifier: Comments
Advantages
 Easy to implement
 Good results obtained in most of the cases

Disadvantages
 Assumption: class conditional independence, therefore loss of accuracy
 Practically, dependencies exist among variables
1) E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.),
disease (lung cancer, diabetes, etc.)
2) Dependencies among these cannot be modelled by Naïve Bayes Classifier

 How to deal with these dependencies? Bayesian Belief Networks
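A short, hedged sketch of a naïve Bayesian classifier using scikit-learn's GaussianNB (assuming scikit-learn is installed). It applies Bayes' theorem, P(C|X) ∝ P(X|C)·P(C), under the class-conditional independence assumption; the feature matrix and labels here are made up:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training data: each row is [age, income_in_thousands], label is buys_computer
X_train = np.array([[25, 30], [35, 40], [45, 80], [20, 20], [50, 90], [30, 60]])
y_train = np.array(["no", "no", "yes", "no", "yes", "yes"])

model = GaussianNB()              # assumes class-conditional independence of features
model.fit(X_train, y_train)

X_new = np.array([[40, 75]])
print(model.predict(X_new))        # predicted class label
print(model.predict_proba(X_new))  # class membership probabilities
```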

Model Evaluation and Selection


Use a test set of class-labelled tuples, independent of the training set, when assessing accuracy
 Methods for estimating a classifier’s accuracy:
1) Holdout method, random subsampling
2) Cross-validation
3) Bootstrap
 Comparing classifiers:
1) Confidence intervals
2) Cost-benefit analysis and ROC Curves
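A brief sketch of the holdout method and k-fold cross-validation listed above, using scikit-learn (assumed installed) and a bundled toy dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Holdout method: keep an independent test set for estimating accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# 10-fold cross-validation on the training portion
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_train, y_train, cv=10)
print("cross-validation accuracy:", scores.mean())
```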
Overfitting:

Overfitting occurs when our machine learning model tries to cover all the data points, or more than
the required data points, present in the given dataset. Because of this, the model starts capturing the
noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and
accuracy of the model. The overfitted model has low bias and high variance.

How to avoid Overfitting:


o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling

Underfitting:

Underfitting occurs when our machine learning model is not able to capture the underlying trend of
the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it
reduces the accuracy and produces unreliable predictions.

How to avoid Underfitting:

o By increasing the training time of the model.


o By increasing the number of features.
