Fundamentals of Data Mining and Machine Learning
Dr. B. Santhosh Kumar,
Associate Professor,
G. Pulla Reddy Engineering College (Autonomous),
Kurnool.
Introduction
What is Data Mining?
The nontrivial extraction of implicit, previously unknown, and
potentially useful information from data.
Data mining is the process of automatically discovering useful
information in large data repositories.
Applications
Banking: loan/credit card approval
predict which applicants will be good customers based on data about past customers
Customer relationship management
identify those who are likely to leave for a competitor.
Targeted marketing
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
identify fraudulent events in an online stream of events
Applications (continued)
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationships between
diseases
Website/store design and promotion
find visitors' affinity for pages and modify the layout accordingly
Attribute
Types of Attributes
Data Mining Tasks
Predictive tasks: Predict the value of a particular
attribute based on the values of other attributes.
Descriptive tasks: Here, the objective is to derive
patterns (such as clusters and anomalies) that summarize the
underlying relationships in data.
Examples of Classification
Association Analysis
Cluster Analysis
Anomaly Detection
The task of identifying observations whose characteristics
are significantly different from the rest of the data. Such
observations are called anomalies or outliers.
Ex: Credit card fraud detection, network intrusions,
unusual patterns of disease.
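One simple way to illustrate the idea is to flag values that lie far from the mean. The sketch below is only one possible approach (a z-score rule), not a method prescribed by these slides; the function name `find_anomalies`, the threshold of 2.0, and the transaction amounts are all hypothetical.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` standard deviations (an assumed, illustrative rule)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [25.0, 30.0, 27.5, 22.0, 31.0, 28.0, 950.0, 26.5]
anomalies = find_anomalies(amounts)
print(anomalies)
```

A rule like this is sensitive to the outliers themselves, since they inflate the mean and the standard deviation, so practical systems often rely on more robust techniques.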
Machine Learning
Machine Learning is the science of programming computers
so they can learn from data.
Machine Learning is the field of study that gives computers
the ability to learn without being explicitly programmed.
A computer program is said to learn from experience E with
respect to some task T and some performance measure P, if
its performance on T, as measured by P, improves with
experience E.
Example
A spam filter is a Machine Learning program that can learn to
flag spam given examples of spam emails (e.g., flagged by
users) and examples of regular (nonspam, also called “ham”)
emails.
The examples that the system uses to learn are called the
training set. Each training example is called a training instance
(or sample).
In this case, the task T is to flag spam for new emails, the
experience E is the training data, and the performance measure
P needs to be defined; for example, you can use the ratio of
correctly classified emails.
This particular performance measure is called accuracy and it is
often used in classification tasks.
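As a small illustration of the performance measure P, accuracy can be computed as the fraction of emails whose predicted label matches the true label. The labels below are made up for the example, and `accuracy` is a hypothetical helper name.

```python
def accuracy(true_labels, predicted_labels):
    """Ratio of correctly classified examples to all examples."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical ground truth and spam-filter output for five emails.
true_labels = ["spam", "ham", "ham", "spam", "ham"]
predicted_labels = ["spam", "ham", "spam", "spam", "ham"]

score = accuracy(true_labels, predicted_labels)  # 4 of 5 are correct
print(score)
```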
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Data compression
Data transformation
Normalization
Forms of Data Preprocessing
Data Cleaning
Data in the real world is dirty: lots of potentially
incorrect data, e.g., from faulty instruments, human or
computer error, or transmission error
incomplete: lacking attribute values, lacking certain
attributes of interest
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” or Salary=“NaN” (an error)
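The two problems above (missing values and erroneous values) can be handled in many ways. The sketch below shows one assumed strategy: a blank occupation becomes "unknown", and a negative or NaN salary is replaced by the median of the valid salaries. The record layout, the function names, and the choice of the median are illustrative assumptions, not part of the slides.

```python
import math
from statistics import median


def is_valid_salary(value):
    """Treat a salary as valid if it is a non-negative, non-NaN number."""
    return (
        isinstance(value, (int, float))
        and not math.isnan(value)
        and value >= 0
    )


def clean_records(records):
    """Return cleaned copies of the records; the input is not modified."""
    valid = [r["salary"] for r in records if is_valid_salary(r["salary"])]
    if not valid:
        raise ValueError("no valid salary values to derive a fill value from")
    fill_salary = median(valid)  # assumed fill strategy

    cleaned = []
    for r in records:
        occupation = (r["occupation"] or "").strip() or "unknown"
        salary = r["salary"] if is_valid_salary(r["salary"]) else fill_salary
        cleaned.append({"occupation": occupation, "salary": salary})
    return cleaned


# Hypothetical records showing incomplete and noisy data.
records = [
    {"occupation": "engineer", "salary": 52000.0},
    {"occupation": " ", "salary": 48000.0},            # missing occupation
    {"occupation": "teacher", "salary": -10.0},        # impossible value
    {"occupation": "nurse", "salary": float("nan")},   # not a number
    {"occupation": "clerk", "salary": 61000.0},
]
cleaned = clean_records(records)
```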
Normalization
Min-max normalization: to [new_min_A, new_max_A]
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$
Ex. Let income range from 12,000 to 98,000, normalized to [0.0, 1.0]. Then 73,600 is mapped to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
Z-score normalization (μ: mean, σ: standard deviation):
$$v' = \frac{v - \mu_A}{\sigma_A}$$
Ex. Let μ = 54,000, σ = 16,000. Then 73,600 is mapped to
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
Normalization by decimal scaling
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that max(|v'|) < 1
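The three normalization formulas can be sketched in a few lines of Python. The function names are hypothetical, and the decimal-scaling helper assumes j is searched among non-negative integers only; the income figures are the ones used in the examples above, while the values passed to `decimal_scaling` are made up.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer
    (an assumption of this sketch) such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))   # income example, min-max
print(z_score(73_600, 54_000, 16_000))             # income example, z-score
scaled, j = decimal_scaling([-986, 917])           # hypothetical values
print(scaled, j)
```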
The Traditional Approach
Use of Machine Learning
Automatic Adaptation
Machine Learning helps Humans Learn
Types of Machine Learning Algorithms
Supervised learning
Examples: K-Nearest Neighbors, Linear Regression, Logistic Regression,
Support Vector Machines (SVMs), Decision Trees and Random Forests
Classification
Regression
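To make supervised classification concrete, here is a minimal K-Nearest Neighbors sketch: a new point receives the majority label among its k closest labeled training points. The data, the function name `knn_predict`, and the choice of Euclidean distance with k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among the k
    training points closest to it (Euclidean distance)."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled training set: two well-separated groups.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.8, 1.6)))
print(knn_predict(train_points, train_labels, (6.4, 6.2)))
```

The labels in the training set are what make this supervised: the algorithm is told the desired answer for every training instance.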
Unsupervised learning
Data is unlabeled
Examples:
Clustering: k-Means, Hierarchical Cluster Analysis (HCA)
Visualization and dimensionality reduction: Principal Component Analysis (PCA)
Anomaly detection
Association rule learning: Apriori
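As an illustration of learning from unlabeled data, the following is a bare-bones k-Means sketch. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Taking the first k points as initial centroids is a simplifying assumption of this sketch (real implementations usually pick initial centroids more carefully), and the points and the name `k_means` are hypothetical.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, max_iterations=100):
    """Return (centroids, labels) after running k-Means on `points`."""
    centroids = list(points[:k])  # naive initialization (assumption)
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the centroid of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


# Hypothetical unlabeled points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0),
          (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

No labels are supplied here; the grouping emerges from the distances between the points alone.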
Reinforcement Learning