Data Mining Notes
Data mining:
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or
knowledge from huge amounts of data.
It is also known by various other names: knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business
intelligence, etc.
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined,
new data can be integrated and transformed in order to get different and more appropriate results.
1. Data Integration: The process of collecting data from several different sources and storing
all of it in a common source (database).
Data integration can be done using data migration tools, data synchronization tools, or the ETL
(extract-transform-load) process.
2. Data Cleaning: Removal of noisy and irrelevant data from the collection. Irrelevant data here
means missing values, noisy data (noise here refers to random or variance error), etc.
Data cleaning can be done using data discrepancy detection and data transformation tools.
3. Data Warehousing: Now that the data has been cleaned, it needs to be stored in a common
source to perform the necessary analysis, so it is stored in a data warehouse. Since the stored
data is in a structured format, it becomes easier to perform data selection and
transformation.
4. Data Selection: The process where the data relevant to the analysis is decided on and retrieved
from the data collection. Data selection can be done using neural networks, decision trees,
Naïve Bayes, clustering, regression, etc.
5. Data Transformation: The process of transforming data into the appropriate form required by
the mining procedure. It is a two-step process:
Data Mapping: assigning elements from the source base to the destination to capture the
transformation.
Code Generation: creation of the actual transformation program.
6. Data Mining: Defined as the application of intelligent techniques to extract potentially
useful patterns.
Transforms task-relevant data into patterns.
Decides the purpose of the model, e.g. classification or characterization.
7. Pattern Evaluation: Defined as identifying the truly interesting patterns representing
knowledge based on given interestingness measures.
A pattern can be considered interesting based on its interestingness score.
Uses summarization and visualization to make the data understandable for users.
8. Knowledge Representation: A technique which utilizes visualization tools to represent the
data mining results. These could be in the form of reports, tables, discriminant rules,
classification rules, characterization rules, etc.
3. Classification
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown class labels
Construct models (functions) based on training examples
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural networks,
rule-based classification, pattern-based classification, logistic regression, etc.
Typical applications: credit card fraud detection, direct marketing, classifying stars, diseases,
webpages, etc.
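To make this concrete, here is a minimal classification sketch using scikit-learn (assumed to be installed); the feature values and class labels are invented purely for illustration and are not from any real credit dataset.

```python
# Minimal classification sketch using scikit-learn (illustrative data only).
from sklearn.tree import DecisionTreeClassifier

# Toy training set: [annual_income_k, num_late_payments] -> "good"/"bad" credit risk
X_train = [[45, 0], [30, 4], [80, 1], [25, 5], [60, 0], [35, 3]]
y_train = ["good", "bad", "good", "bad", "good", "bad"]

# Construct a model (function) from the labelled training examples.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the class label of a previously unseen object.
print(model.predict([[50, 2]]))  # label depends on the learned tree
```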
4. Cluster Analysis
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution
patterns
Principle: Maximizing intra-class similarity & minimizing inter-class similarity
Many methods and applications
K-means, k-medoid, hierarchical clustering
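A minimal k-means sketch, again assuming scikit-learn is available; the 2-D points below are invented to show two obvious groups with no class labels given.

```python
# Minimal k-means clustering sketch using scikit-learn (illustrative data only).
from sklearn.cluster import KMeans

# Toy 2-D points, e.g. (house_size, price) pairs with no class labels.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centres
```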
5. Outlier Analysis
Outlier: A data object that does not comply with the general behavior of the data
■ Noise or exception? ― One person’s garbage could be another person’s treasure
■ Methods: by-product of clustering or regression analysis, …
■ Useful in fraud detection, rare events analysis
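As one simple illustrative approach (not one of the methods listed above), outliers can be flagged with the IQR rule that boxplots use; this sketch relies only on Python's standard library and invented values.

```python
# Simple IQR-based outlier check (the same threshold rule a boxplot typically uses).
import statistics

def iqr_outliers(values, k=1.5):
    values = sorted(values)
    q1, _, q3 = statistics.quantiles(values, n=4)   # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Anything outside [Q1 - k*IQR, Q3 + k*IQR] does not comply with the general behaviour.
    return [v for v in values if v < low or v > high]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # -> [95]
```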
Technologies Used:
Data Objects
A dataset is made up of data objects; a data object represents an entity in the dataset.
Example:
Sales database: the data objects here are customers, store items, sales
Medical database: patients, treatments
Data objects are described by attributes. Database rows -> data objects, columns -> attributes
Attributes: Also called dimensions, features, or variables. An attribute is a data field that represents a
characteristic or feature of a data object.
Ex: customer id, name, address
Attribute Types:
1. Nominal: Qualitative in nature. It can be categories, states, or names of things.
Ex: Hair colour = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
2. Binary: Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes are equally important. Ex: gender
Asymmetric binary: outcomes are not equally important. Ex: medical test (positive vs.
negative)
Convention: assign 1 to the more important outcome (e.g., HIV positive)
3. Ordinal: Values that have a meaningful order, but the magnitude between
successive values is not known.
Ex: size = {small, medium, large}, grades, army rankings
Kinds of Attributes
1. Discrete Attribute: Has only a finite or countably infinite set of values. Sometimes,
represented as integer variables. Note: Binary attributes are a special case of discrete
attributes
E.g., zip codes, profession, or the set of words in a collection of documents.
2. Continuous Attribute: Has real numbers as attribute values. Practically, real values can only
be measured and represented using a finite number of digits. Continuous attributes are
typically represented as floating-point variables.
1. Mean: The arithmetic average of the values: x̄ = (x1 + x2 + … + xN) / N.
Weighted Mean: x̄ = (w1·x1 + w2·x2 + … + wN·xN) / (w1 + w2 + … + wN), where each value xi carries a weight wi.
2. Median: The midpoint of the values after they have been ordered from the minimum to the
maximum values. Middle value if odd number of values, or average of the middle two values
otherwise.
3. Mode: The value that occurs most often in the data. The data can be unimodal, bimodal, or trimodal.
Empirical formula: mean − mode ≈ 3 · (mean − median)
Positively skewed data: Mean > Median > Mode; symmetric data: Mean = Median = Mode
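A quick sketch of these central tendency measures with Python's statistics module; the data values and weights below are arbitrary examples.

```python
# Central tendency measures with Python's statistics module (illustrative data).
import statistics

data = [52, 70, 70, 70, 110, 110, 120]

mean   = statistics.mean(data)    # arithmetic mean
median = statistics.median(data)  # middle value after ordering
mode   = statistics.mode(data)    # most frequent value (unimodal case)

# Weighted mean: each value x_i contributes in proportion to its weight w_i.
values  = [30, 45, 90]
weights = [5, 3, 2]
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

print(mean, median, mode, weighted_mean)
```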
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extended to Minimum and Maximum
Outliers: points beyond a specified outlier threshold, plotted individually
2. Quantile-Quantile Plot
The quantile-quantile plot is a graphical method for determining whether two samples of data came
from the same population or not. A q-q plot is a plot of the quantiles of the first data set against
the quantiles of the second data set.
3. Scatter Plot
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to
display values for typically two variables for a set of data.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
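The three plots above can be sketched with numpy and matplotlib (both assumed installed); the two samples are synthetic, and the two-sample q-q plot is built by plotting matching quantiles of the two samples against each other.

```python
# Boxplot, two-sample Q-Q plot and scatter plot (synthetic data for illustration).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)           # first sample
y = 0.8 * x + rng.normal(0, 5, 200)   # second, correlated sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].boxplot(x)                    # box at Q1/Q3, line at the median, whiskers, outliers
axes[0].set_title("Boxplot")

# Two-sample Q-Q plot: quantiles of x plotted against the same quantiles of y.
probs = np.linspace(0.01, 0.99, 99)
axes[1].plot(np.quantile(x, probs), np.quantile(y, probs), "o")
axes[1].set_title("Q-Q plot")

axes[2].scatter(x, y)                 # each (x_i, y_i) pair plotted as a point
axes[2].set_title("Scatter plot")

plt.tight_layout()
plt.show()
```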
Correlations:
The degree to which the data points of one variable tend to change together with the data points of
another variable is measured by the correlation coefficient.
Types of correlation:
1. Positive: A positive correlation—when the correlation coefficient is greater than 0—signifies
that both variables move in the same direction. When ρ is +1, it signifies that the two
variables being compared have a perfect positive relationship; when one variable moves
higher or lower, the other variable moves in the same direction with the same magnitude.
The closer the value of ρ is to +1, the stronger the linear relationship
Using mean absolute deviation is more robust than using standard deviation
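As a sketch, the Pearson correlation coefficient ρ can be computed directly from its definition; the hours/score values below are invented for illustration.

```python
# Pearson correlation coefficient computed from its definition (illustrative data).
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov   = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

hours_studied = [1, 2, 3, 4, 5]
exam_score    = [52, 58, 61, 70, 74]
print(pearson(hours_studied, exam_score))  # close to +1 -> strong positive linear relationship
```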
Definition: Association rule mining is a machine learning technique used to find associations/relations
between items/products/variables in large transactional data sets.
To find associations between items we first need to find frequent itemsets in the transactional data.
Frequent Pattern Analysis: a pattern (a set of items, subsequences, substructures, etc.) that appears
frequently in a data set.
A typical example of frequent itemset mining is market basket analysis. It is the process of analysing
customer buying habits by finding associations between the different items that customers place in
their shopping baskets.
Applications
Helps in business decision making processes such as catalogue design, cross-marketing,
customer shopping behaviour
Products that are frequently purchased together can be bundled together and discounts can
be offered to increase sales.
Design store layout
Strategy 1: Items that are purchased together can be placed in proximity
Strategy 2: Place them at opposite ends so that customers who purchase such items pick up
other items along the way.
Basic Concepts
Frequent Patterns: An itemset is a set of items, and an itemset that contains k items is called a k-itemset.
An itemset is called a frequent itemset if it satisfies the minimum support threshold condition; all of its
non-empty subsets are then also frequent (the Apriori property).
Association Rules:
Frequent patterns are represented in the form of rules
Support: This says how popular an itemset is, as measured by the proportion of transactions in
which the itemset appears (a probability): support(X→Y) = P(X ∪ Y)
Confidence: This says how likely item Y is purchased when item X is purchased. It is measured by
the proportion of transactions with item X in which item Y also appears (a conditional probability):
confidence(X→Y) = P(Y | X) = support(X ∪ Y) / support(X)
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate frequent 2-itemsets
Generate frequent 3-itemsets …
Terminate when no frequent candidate set can be generated
As rules are generated from frequent itemsets, each one automatically satisfies the minimum
support threshold
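A rough sketch of this level-wise method on a tiny, made-up transaction set: it joins frequent (k−1)-itemsets to form candidate k-itemsets and keeps those meeting the minimum support. Full Apriori additionally prunes candidates with infrequent subsets before counting, which is omitted here for brevity.

```python
# Level-wise (Apriori-style) frequent itemset mining on a toy transaction set.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 0.6  # an itemset must appear in at least 60% of the transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Scan once for frequent 1-itemsets, then extend level by level.
items = sorted({i for t in transactions for i in t})
frequent = {1: [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]}

k = 1
while frequent[k]:
    k += 1
    # Join frequent (k-1)-itemsets to form candidate k-itemsets, then check support.
    candidates = {a | b for a in frequent[k - 1] for b in frequent[k - 1] if len(a | b) == k}
    frequent[k] = [c for c in candidates if support(c) >= min_support]

for level, itemsets in frequent.items():
    for s in itemsets:
        print(level, sorted(s), round(support(s), 2))
```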
Misleading Rules:
The rule is misleading because the probability of purchasing video is 75%, which is greater than the
confidence of the rule (66%)
Computer games and videos are negatively correlated because the purchase of one of these
items decreases the likelihood of purchasing the other
Use another measure: lift
Lift: This says how likely item Y is purchased when item X is purchased, while controlling for how
popular item Y is: lift(X→Y) = confidence(X→Y) / support(Y) = P(X ∪ Y) / (P(X)·P(Y))
A lift value of 1 indicates that the occurrence of X and Y is independent; values above 1 indicate
positive correlation and values below 1 indicate negative correlation
Leverage:
leverage(X→Y) = support(X→Y) − support(X)·support(Y) = P(X, Y) − P(X)·P(Y)
Leverage measures the difference between the observed and the expected joint probability of X
and Y
A leverage value of 0 indicates that the occurrence of X and Y is independent of each other
A value above zero is desired for leverage
Conviction:
conviction(X→Y) = (1 − supp(Y)) / (1 − conf(X→Y)) = P(X)·P(not Y) / P(X and not Y)
Conviction compares the probability that X appears without Y if X and Y were independent with
the actual frequency of the appearance of X without Y
A conviction value of 1 indicates that the occurrence of X and Y is independent of each other
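A small sketch computing all of these rule measures for a single rule X → Y on a toy transaction set; the numbers only mimic the "misleading rule" pattern above (confidence below the popularity of Y), they are not the exact figures from that example.

```python
# Support, confidence, lift, leverage and conviction for one rule X -> Y (toy data).
transactions = [
    {"computer_game", "video"},
    {"computer_game", "video"},
    {"computer_game"},
    {"video"},
    {"video"},
]

def supp(items):
    return sum(items <= t for t in transactions) / len(transactions)

X, Y = {"computer_game"}, {"video"}

support    = supp(X | Y)
confidence = supp(X | Y) / supp(X)
lift       = confidence / supp(Y)              # 1 means X and Y are independent
leverage   = supp(X | Y) - supp(X) * supp(Y)   # 0 means independent
conviction = (1 - supp(Y)) / (1 - confidence)  # 1 means independent (undefined if confidence == 1)

print(support, confidence, lift, leverage, conviction)
# Here confidence (about 0.67) is below supp(Y) (0.8), so lift < 1, leverage < 0,
# conviction < 1: the rule is misleading despite its reasonable confidence.
```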
Classification
Supervised learning:
The training dataset has a label indicating the class of each observation. New data is classified
based on the training set.
Unsupervised learning:
The class labels of training dataset are unknown
Model Usage: The built model is used in the future for classifying unknown objects.
The accuracy of the model is calculated: the known labels of the test samples are compared with
the predictions of the model.
Algorithm used: A greedy and recursive algorithm (greedy meaning that at every step it makes the
locally optimal decision, and recursive meaning that it splits the larger problem into smaller problems
and solves them in the same way)
Conditional Entropy: H(Y | X) = Σx P(x) · H(Y | X = x), i.e., the expected entropy of the class Y that
remains after observing attribute X.
Attribute Selection Measures
1. Information gain (models/algorithms: ID3/C4.5)
The attribute with the highest information gain is selected. Information gain is calculated
using the following steps and formulas.
Let pi be the probability that an arbitrary tuple in D belongs to class Ci. The expected information
(entropy) needed to classify a tuple in D is Info(D) = −Σi pi·log2(pi).
Information gained by branching on attribute A: Gain(A) = Info(D) − InfoA(D), where
InfoA(D) = Σj (|Dj| / |D|)·Info(Dj).
The decision to split at each node is made according to a metric called purity. A node is 100%
impure when its data is split evenly 50/50 between classes and 100% pure when all of its data belongs
to a single class.
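A short sketch of the entropy and information-gain computation on a made-up categorical attribute; the income values and class labels below are illustrative only.

```python
# Entropy and information gain for a categorical attribute (toy data).
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    total = len(labels)
    gain = entropy(labels)  # Info(D)
    # Subtract the weighted entropy of each partition D_j induced by the attribute's values.
    for value in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# rows hold one attribute (income); labels are the class to predict.
rows   = [["low"], ["low"], ["medium"], ["medium"], ["high"], ["high"]]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(info_gain(rows, 0, labels))   # a larger gain means a better split attribute
```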
Computing Information-Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of attribute A in increasing order
Typically, the midpoint between each pair of adjacent values is considered as a
possible split point
(ai + ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the lowest entropy (minimum expected information requirement) for
A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying
A > split-point.
The number of partitions Dj equals the number of categories the attribute has. Using the previous example,
there are 4 rows with the low income label, 6 with the medium label and 4 with the high label.
Reduction in Impurity: ΔGini(A) = Gini(D) − Ginisplit(D), where Gini(D) = 1 − Σi pi²
The attribute that provides the smallest Ginisplit(D) (or, equivalently, the largest reduction in
impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each
attribute)
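A sketch of the Gini computation for one candidate binary split on made-up labels; among all candidate splits, the one with the smallest Gini_split would be chosen.

```python
# Gini index of a node before and after one candidate binary split (toy labels).
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

parent = ["yes", "yes", "yes", "no", "no", "no"]   # 50/50 -> maximally impure
left   = ["yes", "yes", "yes", "no"]               # candidate partition D1
right  = ["no", "no"]                              # candidate partition D2

gini_split = (len(left) / len(parent)) * gini(left) + (len(right) / len(parent)) * gini(right)
print(gini(parent), gini_split)   # reduction in impurity = gini(parent) - gini_split
```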
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
1) E.g., hospital patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.;
Disease: lung cancer, diabetes, etc.
2) Dependencies among these cannot be modelled by Naïve Bayes Classifier
Overfitting:
Overfitting occurs when our machine learning model tries to cover all the data points, or more than
the required data points, present in the given dataset. Because of this, the model starts capturing noise
and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy
of the model. The overfitted model has low bias and high variance.
Underfitting:
Underfitting occurs when our machine learning model is not able to capture the underlying trend of
the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it
reduces the accuracy and produces unreliable predictions.
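A small numpy sketch (synthetic sine-wave data) that illustrates the contrast: a degree-1 polynomial underfits, a very high-degree polynomial tends to overfit, and a moderate degree balances the two. The data and degrees are arbitrary choices for illustration.

```python
# Underfitting vs. overfitting illustrated with polynomial fits of increasing degree.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)  # noisy samples
x_test  = np.linspace(0, 1, 100)
y_test  = np.sin(2 * np.pi * x_test)                                      # true trend

for degree in (1, 4, 15):
    model = Polynomial.fit(x_train, y_train, degree)
    train_err = np.mean((model(x_train) - y_train) ** 2)
    test_err  = np.mean((model(x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))

# Degree 1 misses the underlying trend (underfitting, high error on both sets);
# degree 15 chases the noise (overfitting: very low training error, higher test error);
# degree 4 sits in between, with low bias and moderate variance.
```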