Data Mining Unit 1
1. Data Mining
● Data mining is the process of discovering patterns, trends, and meaningful information
from large sets of data.
● It involves extracting knowledge from data and transforming it into an understandable
structure for further use.
● The goal of data mining is to uncover hidden patterns and relationships that can be used
to make informed decisions.
● The data sources can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
● It is also termed knowledge discovery from data, or KDD.
The knowledge discovery process involves an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
2. What Kinds of Data Can Be Mined?
Types of Data that Can be Mined:
Relational Databases and Data Warehouses:
● Data warehouses store large amounts of historical data from various sources. Data
mining can help analyze this data to identify trends and patterns.
● A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of
attributes in the schema, and each cell stores the value of some aggregate measure
such as count or sum(sales amount).
● A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
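A minimal, hedged sketch of a data-cube-style aggregation using pandas; the schema (branch, quarter, sales_amount) and the figures are hypothetical, chosen only to illustrate dimensions and an aggregate measure.
```python
# Minimal sketch of a data-cube-style aggregation with pandas.
# The schema and figures below are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "branch":       ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "quarter":      ["Q1", "Q2", "Q1", "Q2"],
    "sales_amount": [605, 825, 818, 746],
})

# Each dimension (branch, quarter) becomes an axis of the cube; each cell
# stores an aggregate measure, here sum(sales_amount).
cube = sales.pivot_table(index="branch", columns="quarter",
                         values="sales_amount", aggfunc="sum")
print(cube)
```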
Transactional Databases:
● These databases contain records of transactions over time. Data mining can reveal patterns in customer behavior, buying habits, and other transactional data.
● A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
Spatial and Temporal Data:
● Data mining can be applied to spatial data (geographical information) and temporal data (time-related information) to discover patterns in space and time.
Text and Web Data:
● Unstructured data, such as text and web content, can be mined for valuable insights.
This includes sentiment analysis, topic modeling, and information extraction from
documents.
Multimedia Data:
● Data mining techniques can be applied to multimedia data, such as images and
videos, to identify patterns and extract useful information.
Biological Data:
● In fields like bioinformatics, data mining is used to analyze biological data, including
DNA sequences, protein structures, and medical records.
Social Network Data:
● Social network analysis involves mining data from social media platforms to
understand user behavior, relationships, and trends.
3. What Kinds of Patterns Can Be Mined?
● Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks.
● In general, such tasks can be classified into two categories: descriptive and predictive.
● Descriptive mining tasks characterize properties of the data in a target data set.
● Predictive mining tasks perform induction on the current data in order to make
predictions.
● Data entries can be associated with classes or concepts, and their characteristics can
be summarized through data characterization or discrimination.
● Data characterization involves summarizing the general features of a target class,
often obtained through queries or OLAP operations.
● Discrimination compares the features of the target class with contrasting classes,
resulting in discriminant rules.
● Output forms include charts, tables, and rules.
Example:
Data Characterization: In a retail setting like AllElectronics, understanding customer
spending behavior involves summarizing characteristics of customers who spend more
than $5000 a year. This process yields generalized profiles like age, employment, and
credit ratings.
Data Discrimination: A Customer Relationship Manager at AllElectronics compares two
customer groups: frequent computer product purchasers (over twice a month) and
infrequent purchasers (less than three times a year).
The analysis reveals distinctions, such as 80% of frequent purchasers being 20-40 years
old with a university education, while 60% of infrequent purchasers are seniors or youths
without a university degree. Further exploration, including occupation and income level,
may unveil additional discriminative features between the two groups.
● Techniques like frequent itemset mining are fundamental for association rule mining; such patterns are commonly evaluated using support and confidence measures (see the sketch after this list).
● Classification involves finding models that describe and distinguish data classes.
● Regression predicts continuous values and can be used for numeric prediction.
● Decision trees, mathematical formulas, and neural networks can represent derived
models.
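To make association rule mining concrete, here is a minimal sketch that computes the support and confidence of a single rule over a few made-up transactions; the items and the rule {milk} -> {bread} are hypothetical.
```python
# Support and confidence for the rule {milk} -> {bread} over toy transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

n = len(transactions)
count_milk = sum(1 for t in transactions if "milk" in t)
count_both = sum(1 for t in transactions if {"milk", "bread"} <= t)

support = count_both / n               # fraction of all transactions containing both items
confidence = count_both / count_milk   # of the transactions with milk, fraction that also contain bread

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```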
Cluster Analysis:
● Clustering groups data objects based on similarity without using class labels.
● Outliers are data objects that deviate from the general behavior of the dataset.
● Interesting patterns are those easily understood, valid on new data, potentially
useful, and novel.
● Objective measures like support and confidence are used to assess patterns.
● Inferential statistics (or predictive statistics) models data in a way that accounts for
randomness and uncertainty in the observations and is used to draw inferences about
the process or population under investigation.
● A statistical hypothesis test (sometimes called confirmatory data analysis) makes
statistical decisions using experimental data.
● Supervised learning: the supervision in the learning comes from the labeled examples in the training data set.
● Unsupervised learning is essentially a synonym for clustering. The learning process
is unsupervised since the input examples are not class labeled.
● Semi-supervised learning is a class of machine learning techniques that make use of
both labeled and unlabeled examples when learning a model. In one approach,
labeled examples are used to learn class models and unlabeled examples are used to
refine the boundaries between classes.
● Active learning is a machine learning approach that lets users play an active role in
the learning process. An active learning approach can ask a user (e.g., a domain
expert) to label an example, which may be from a set of unlabeled examples or
synthesized by the learning program.
Nominal Attributes
● Nominal attribute values are symbols or names of things; such attributes are also called categorical. Examples include hair color and marital status.
● Numeric codes may represent nominal values, but the numbers lack quantitative
meaning.
● Measures of central tendency like mean and median are not meaningful for nominal
attributes.
Example Nominal attributes: Hair color (black, brown, blond, etc.) and Marital status (single,
married, divorced, widowed).
Binary Attributes
● Binary attributes have two categories (0 or 1), often indicating absence (0) or presence
(1).
● If states are equally valuable, it's symmetric; otherwise, it's asymmetric. Example:
Smoker (0 for non-smoker, 1 for smoker).
Example Binary attributes: Smoker (0, 1) and Medical test result (0 for negative, 1 for positive).
Ordinal Attributes
● Ordinal attributes have values with a meaningful order, but the magnitude between them
is unknown.
● Examples include drink size and professional rank. Ordinal attributes are useful for
subjective assessments and surveys.
Example Ordinal attributes: Drink size (small, medium, large) and Professional rank (assistant,
associate, full).
Numeric Attributes
● Numeric attributes are quantitative, measured in integer or real values. They can be
interval-scaled or ratio-scaled.
o Interval-Scaled Attributes have equal-size units, allowing ranking and
quantifying differences. Mean, median, and mode are applicable.
o Example Interval-scaled attributes: Temperature in Celsius, calendar dates.
o Ratio-Scaled Attributes have a true zero-point, allowing ratio comparisons.
Mean, median, mode, and ratios are applicable.
Example Ratio-scaled attributes: Kelvin temperature, years of experience, weight.
Discrete versus Continuous Attributes
● A discrete attribute has a finite or countably infinite set of values (e.g., hair color, drink size), whereas a continuous attribute is typically represented as a real number (e.g., temperature, weight).
Basic Statistical Descriptions of Data
● Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
Measuring the Central Tendency: Mean, Median, and Mode
● Measures of central tendency provide insights into the middle or center of a data
distribution. Common measures include mean, median, mode, and midrange.
● For instance, the mean (average) is calculated as the sum of all values divided by the
number of observations.
● The median becomes particularly useful for skewed data, representing the middle value
that separates the higher and lower halves.
● The mode, the most frequently occurring value, and the midrange, the average of the largest and smallest values, offer different perspectives on central tendency. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode.
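A small sketch of these central tendency measures, assuming Python 3.8+ (for statistics.multimode) and a hypothetical list of salary values in thousands:
```python
# Central tendency measures on a hypothetical salary list (in thousands).
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salaries)                # sum of values / number of observations
median = statistics.median(salaries)            # middle value separating higher and lower halves
modes = statistics.multimode(salaries)          # most frequent value(s); two modes here, so the data are bimodal
midrange = (min(salaries) + max(salaries)) / 2  # average of the largest and smallest values

print(mean, median, modes, midrange)
```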
Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range
● Understanding how data is spread out is crucial. Measures of dispersion such as range,
quartiles, and interquartile range are essential.
● The range of the set is the difference between the largest (max()) and smallest (min()) values. Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
● The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles
are the most widely used forms of quantiles.
● The interquartile range (IQR), the difference between the third and first quartiles, is a valuable indicator of spread (a short NumPy sketch follows this list):
IQR = Q3 − Q1
● A deeper exploration involves the five-number summary, which includes the minimum,
quartiles, and maximum, aiding in the identification of outliers.
● The five-number summary of a distribution consists of the median (Q2), the quartiles Q1
and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum.
● Boxplots are a popular way of visualizing a distribution.
● A boxplot incorporates the five-number summary as follows:
● Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
● The median is marked by a line within the box.
● Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
largest (Maximum) observations
● Boxplots visualize these summaries effectively
● Variance and standard deviation offer numeric insights into data spread.
● The variance is the average of squared differences from the mean, and the standard
deviation is its square root.
● Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is.
● A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large
range of values
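As referenced above, a short NumPy sketch of these dispersion measures, reusing the same hypothetical salary values:
```python
# Range, quartiles, IQR, five-number summary, variance, and standard deviation.
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

data_range = salaries.max() - salaries.min()
q1, median, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1                                          # IQR = Q3 - Q1

five_number_summary = (salaries.min(), q1, median, q3, salaries.max())

variance = salaries.var()                              # average squared difference from the mean
std_dev = salaries.std()                               # square root of the variance

print(data_range, iqr, five_number_summary, variance, std_dev)
```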
Graphic Displays of Basic Statistical Descriptions of Data
● Quantile plots depict the relationship between data points and their cumulative
distribution, aiding in the comparison of different distributions
8. Data Visualization
● Data visualization aims to communicate data clearly and effectively through graphical
representation.
● Data visualization has been used extensively in many applications—for example, at
work for reporting, managing business operations, and tracking progress of tasks
● To address layout challenges, space-filling curves such as the Hilbert curve and Gray code can be employed. Additionally, non-rectangular layouts, such as the circle segment technique, enhance dimension comparison by forming a circular arrangement.
9. Data Preprocessing
Data Cleaning: This involves addressing missing values, smoothing noisy data,
identifying and handling outliers, and rectifying inconsistencies. Cleaning ensures that
the data is free from errors, making it more reliable for subsequent analysis.
Data Integration: In scenarios where data is sourced from multiple origins, integration
ensures seamless alignment of attributes and resolves naming inconsistencies. The goal
is to create a unified dataset that avoids redundancies and discrepancies.
Data Reduction: With large datasets, reducing data volume becomes essential for
efficient analysis. Techniques such as dimensionality reduction and numerosity
reduction aim to represent the data in a more concise form without compromising
analytical outcomes.
Data Transformation: This involves preparing the data to meet the requirements of
specific analytical algorithms. Normalization, discretization, and concept hierarchy
generation are examples of data transformation techniques that facilitate effective
analysis.
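As one concrete illustration of data transformation, here is a hedged sketch of two common normalization methods (min-max and z-score) applied to a single hypothetical numeric attribute:
```python
# Two common normalization methods applied to a hypothetical attribute
# (e.g., annual income in thousands).
import numpy as np

income = np.array([30.0, 45.0, 60.0, 75.0, 120.0])

# Min-max normalization: rescale values to the [0, 1] range.
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization: center on the mean and scale by the standard deviation.
z_score = (income - income.mean()) / income.std()

print(min_max)
print(z_score)
```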
● Real-world data are often plagued by issues such as incompleteness, noise, and
inconsistencies.
● Data cleaning, also known as data cleansing, plays a crucial role in addressing these challenges. The fundamental methods employed in data cleaning focus on handling missing values, smoothing noisy data, and treating data cleaning as an ongoing process.
9.1 Missing Values
● When analyzing AllElectronics sales and customer data, it's common to encounter tuples
with missing values, such as customer income.
● Dealing with missing values involves several methods:
Ignore the tuple: Suitable when the class label is missing, but less effective when
multiple attributes have missing values.
Fill in manually: Time-consuming and impractical for large datasets.
Use a global constant: Replace missing values with a constant like "Unknown."
However, it may introduce biases.
Use central tendency measures: Replace missing values with the mean or median of
the attribute, depending on the data distribution.
Use attribute mean or median for the same class: Replace missing values based on
the class to which the tuple belongs.
Use the most probable value: Utilize regression, Bayesian inference, or decision trees
to predict missing values.
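A hedged pandas sketch of a few of these strategies; the table and column names (customer_id, income, category) are invented for illustration, not taken from a real AllElectronics schema:
```python
# Handling missing values with pandas: drop, fill with a constant, fill with the median.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income":      [45000.0, None, 62000.0, None],
    "category":    ["regular", None, "premium", "regular"],
})

# Ignore the tuple: drop rows with any missing value.
dropped = customers.dropna()

# Use a global constant for a nominal attribute.
customers["category"] = customers["category"].fillna("Unknown")

# Use a measure of central tendency (here the median) for a numeric attribute.
customers["income"] = customers["income"].fillna(customers["income"].median())

print(dropped)
print(customers)
```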
9.2 Noisy Data
● Many data smoothing methods serve dual purposes, aiding in data discretization and data reduction.
● Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other (a short sketch follows this list). Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
● Outlier analysis: Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values that fall outside of the
set of clusters may be considered outliers
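The linear-regression smoothing mentioned above can be sketched with NumPy's least-squares polynomial fit; the x and y values below are synthetic:
```python
# Smoothing by linear regression: fit a "best" line relating two attributes
# and replace noisy y-values with the fitted values.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # noisy observations of roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)      # least-squares line
smoothed = slope * x + intercept                # conform values to the fitted function

print(slope, intercept)
print(smoothed)
```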
9.3 Data Cleaning as a Process
● Data cleaning is not a one-time task but an ongoing process.
10. Data Integration
● Data mining often necessitates data integration, the process of merging data from various sources.
● Effective integration reduces redundancies and inconsistencies, enhancing the accuracy and speed of subsequent data mining endeavors.
10.1 Entity Identification Problem
● Data analysis frequently involves integrating data from diverse sources such as
databases, data cubes, or flat files.
● The entity identification problem arises during this integration, focusing on matching
equivalent real-world entities across multiple sources.
● Metadata, encompassing attributes' name, meaning, data type, and valid value range, aids
in schema integration and transforms data during the entity identification process.
● For instance, matching attributes like "customer id" in one database with "cust number"
in another requires meticulous consideration of data structure to prevent errors during
integration.
10.2 Redundancy and Correlation Analysis
● Correlation analysis helps detect redundancies by examining how strongly one attribute implies another. For nominal data, the χ2 (chi-square) test measures correlation, while numeric attributes use correlation coefficients and covariance (a short sketch follows this list).
● Examining stock prices for AllElectronics and HighTech at different time points
illustrates covariance analysis. Positive covariance implies that stock prices for both
companies tend to rise together.
● Denormalized tables, often employed for performance reasons, can introduce data
redundancy and inconsistencies, especially when updates are incomplete.
● Detecting and resolving duplicates at the tuple level ensures data integrity.
● Integration may lead to conflicts in attribute values, requiring detection and resolution.
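A hedged sketch of these redundancy checks, assuming SciPy and NumPy are available. The contingency counts and the two stock-price series are small illustrative values in the spirit of the examples above, not real data:
```python
# Chi-square test for two nominal attributes, plus correlation and covariance
# for two numeric attributes. All values are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table for two nominal attributes (e.g., gender vs. preferred reading).
observed = np.array([[250, 200],
                     [ 50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.3g}")

# Correlation coefficient and covariance for two numeric attributes
# (e.g., closing prices of AllElectronics and HighTech stock).
all_electronics = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
high_tech       = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

correlation = np.corrcoef(all_electronics, high_tech)[0, 1]
covariance = np.cov(all_electronics, high_tech)[0, 1]
print(correlation, covariance)   # positive covariance: the prices tend to rise together
```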
11. Data Reduction
Numerosity Reduction:
Techniques under this strategy replace the original data volume with more concise representations. Examples include regression models and histograms. For pricing analysis, a histogram could compactly represent the distribution of product prices.
Data Compression:
Transformations are applied to achieve a compressed data representation. Wavelet
transforms, for example, store a fraction of the strongest coefficients, allowing for a sparse
yet effective representation of the data.
Principal Components Analysis (PCA) involves the following steps:
Normalization:
Standardize input data to ensure attributes with varying scales contribute equally.
Principal Component Computation:
Compute k orthonormal vectors, termed principal components, representing
directions of maximum variance.
Sorting of Components:
Order principal components by significance, with the first components capturing the
most variance.
Dimensionality Reduction:
Eliminate less significant components to reduce data size while retaining essential
information.
This process transforms the original data into a new coordinate system, highlighting
key patterns and reducing dimensionality for efficient analysis.
Example:
Reducing Dimensionality: PCA identifies orthogonal vectors that capture the essence
of the data, allowing for dimensionality reduction. For instance, in customer data, it
might reveal that age and purchasing behavior are key components.
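A minimal scikit-learn sketch of these PCA steps; the tiny customer table (age, annual spending) is synthetic and chosen only to show normalization, component computation, and dimensionality reduction:
```python
# PCA sketch: standardize, compute principal components, keep the most significant one.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 1200],
              [32, 1800],
              [41, 2600],
              [48, 3100],
              [55, 3900]], dtype=float)

X_std = StandardScaler().fit_transform(X)   # normalize so each attribute contributes equally

pca = PCA(n_components=1)                   # keep only the most significant component
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)        # fraction of variance captured by the kept component
print(X_reduced)
```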
Attribute Subset Selection
Example:
Customer Classification: In a dataset containing customer information, attribute subset
selection might involve choosing relevant attributes like age and purchase history while
excluding less relevant ones such as telephone numbers.
Basic heuristic methods of attribute subset selection include the following techniques:
Stepwise Forward Selection:
Begin with an empty attribute set and iteratively add the best original attribute at each
step.
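A hedged sketch of stepwise forward selection: start from an empty attribute set and greedily add whichever attribute most improves a simple cross-validated model score. The attribute names and synthetic data are invented for illustration, and linear regression is just one reasonable scoring model:
```python
# Stepwise forward selection: begin with no attributes, repeatedly add the one
# that most improves cross-validated R^2, and stop when nothing helps.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
attributes = {
    "age":              rng.uniform(18, 70, n),
    "purchase_history": rng.uniform(0, 100, n),
    "phone_digits":     rng.uniform(0, 9, n),       # irrelevant attribute
}
y = 50 * attributes["age"] + 30 * attributes["purchase_history"] + rng.normal(0, 10, n)

selected, best_score = [], -np.inf
remaining = list(attributes)
while remaining:
    scores = {}
    for name in remaining:
        X = np.column_stack([attributes[a] for a in selected + [name]])
        scores[name] = cross_val_score(LinearRegression(), X, y, cv=5).mean()
    best = max(scores, key=scores.get)
    if scores[best] <= best_score:                   # stop when no candidate improves the score
        break
    selected.append(best)
    remaining.remove(best)
    best_score = scores[best]

print("selected attributes:", selected)
```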
● For instance, in sales prediction, a regression model can estimate sales based on
various attributes, reducing the complexity of the dataset while retaining
predictive accuracy.
Example:
Sales Prediction: Regression models can be applied to predict sales based on various
attributes, facilitating data reduction by focusing on the most influential factors.
11.6 Histograms
● Histograms bin data to approximate distributions effectively.
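A minimal NumPy sketch of binning prices into an equal-width histogram; the price list is hypothetical:
```python
# Approximate a price distribution with three equal-width bins.
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15,
                   15, 18, 18, 20, 20, 21, 21, 25, 25, 25, 28, 28, 30])

counts, bin_edges = np.histogram(prices, bins=3)   # equal-width bins
print(counts)       # number of prices falling in each bin
print(bin_edges)    # the bin boundaries
```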
11.7 Clustering
● Clustering techniques group data objects based on similarity, enhancing data
reduction for organized datasets.
● Customer segmentation, facilitated by clustering, groups similar customers
together, creating a more manageable representation for targeted analysis.
Example:
● Customer Segmentation: Clustering techniques can group customers based on
similarities, reducing the dataset by representing customers within each cluster
with common characteristics.
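A hedged scikit-learn sketch of k-means customer segmentation; the two attributes (age, annual spending) and the three synthetic groups are invented for illustration:
```python
# Customer segmentation with k-means: each cluster centroid acts as a compact
# representative of the customers in that segment.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
young_low   = rng.normal(loc=[25, 500],  scale=[3, 80],  size=(50, 2))
middle_mid  = rng.normal(loc=[45, 2000], scale=[4, 200], size=(50, 2))
senior_high = rng.normal(loc=[65, 4000], scale=[3, 300], size=(50, 2))
customers = np.vstack([young_low, middle_mid, senior_high])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

print(kmeans.cluster_centers_)     # one representative centroid per customer segment
print(np.bincount(kmeans.labels_)) # number of customers in each cluster
```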
11.8 Sampling
● Sampling, by obtaining a subset of data, provides an efficient means of data
reduction. It is particularly useful for estimating aggregate queries.
● In market research, sampling can offer representative insights without the need to
analyze the entire dataset.
Example:
● Market Research: Sampling can be used to obtain a representative subset of a
larger population for market research, reducing the need to analyze the entire
dataset.
Simple Random Sample With Replacement (SRSWR):
Similar to SRSWOR (simple random sample without replacement), but each drawn tuple is replaced back into D, allowing it to be selected again.
Cluster Sample:
Grouping tuples into M disjoint clusters, then obtaining an SRS of s clusters (where s <
M), providing a reduced data representation.
Stratified Sample:
Dividing D into disjoint parts called strata and obtaining an SRS within each stratum.
Useful for ensuring a representative sample, especially when data are skewed. For
example, creating strata based on customer age groups in customer data.
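A hedged pandas sketch of these sampling schemes (assuming a reasonably recent pandas with GroupBy.sample); the customer table is synthetic:
```python
# SRSWOR, SRSWR, and stratified sampling over a synthetic customer table D.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
D = pd.DataFrame({
    "customer_id": range(1000),
    "age_group": rng.choice(["youth", "adult", "senior"], size=1000, p=[0.2, 0.6, 0.2]),
})

# SRSWOR: simple random sample of s tuples without replacement.
srswor = D.sample(n=100, replace=False, random_state=0)

# SRSWR: each drawn tuple is placed back, so it may be drawn again.
srswr = D.sample(n=100, replace=True, random_state=0)

# Stratified sample: an SRS within each stratum (here, each age group).
stratified = D.groupby("age_group", group_keys=False).sample(frac=0.1, random_state=0)

print(len(srswor), len(srswr))
print(stratified["age_group"].value_counts())
```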
Data Cube Aggregation:
● Annual Sales Analysis: Aggregating quarterly sales data into annual summaries using data cubes makes the data more manageable for analysis and reduces the overall volume of data.
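A minimal pandas sketch of this kind of roll-up, aggregating hypothetical quarterly sales to annual totals:
```python
# Roll up quarterly sales to annual totals (a simple data-cube-style aggregation).
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224_000, 408_000, 350_000, 586_000, 680_000, 940_000, 720_000, 810_000],
})

annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```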