Data Mining Unit 1

Data mining is the process of extracting meaningful patterns and knowledge from large datasets, involving steps like data cleaning, integration, selection, transformation, mining, evaluation, and presentation. It can be applied to various data types, including relational databases, data warehouses, transactional databases, and unstructured data such as text and multimedia. The document also discusses mining methodologies, technologies, applications, and challenges in data mining, emphasizing the importance of user interaction and efficiency.


1. Data Mining

● Data mining is the process of discovering patterns, trends, and meaningful information
from large sets of data.
● It involves extracting knowledge from data and transforming it into an understandable
structure for further use.
● The goal of data mining is to uncover hidden patterns and relationships that can be used
to make informed decisions.
● The data sources can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
● It is also termed knowledge discovery from data, or KDD.

The knowledge discovery process involves an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
2. What Kinds of Data Can Be Mined?
Types of Data that Can be Mined:
Relational Databases:

● Data mining can be applied to traditional relational databases, extracting valuable


patterns and information from tables and records.
● A relational database is a collection of tables, each of which is assigned a unique
name. Each table consists of a set of attributes (columns or fields) and usually stores
a large set of tuples (records or rows).

● Relational data can be accessed by database queries written in a relational query


language (e.g., SQL) or with the assistance of graphical user interfaces.
● A given query is transformed into a set of relational operations, such as join,
selection, and projection, and is then optimized for efficient processing.
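To make the relational operations concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customer and purchase tables, their columns, and the sample rows are invented for illustration and are not any real AllElectronics schema.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customer (cust_id INTEGER, name TEXT, city TEXT)")
conn.execute("CREATE TABLE purchase (cust_id INTEGER, item TEXT, amount REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Alice", "Pune"), (2, "Bob", "Mumbai")])
conn.executemany("INSERT INTO purchase VALUES (?, ?, ?)",
                 [(1, "laptop", 55000.0), (1, "mouse", 700.0), (2, "printer", 9000.0)])

# Selection, projection, and join expressed in SQL; the query optimizer
# turns this into an efficient plan of relational operations.
rows = conn.execute("""
    SELECT c.name, p.item, p.amount          -- projection
    FROM customer c JOIN purchase p          -- join
         ON c.cust_id = p.cust_id
    WHERE p.amount > 1000                    -- selection
""").fetchall()
print(rows)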
Data Warehouses:

● Data warehouses store large amounts of historical data from various sources. Data
mining can help analyze this data to identify trends and patterns.
● A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of
attributes in the schema, and each cell stores the value of some aggregate measure
such as count or sum(sales amount).
● A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.

● By providing multidimensional data views and the precomputation of summarized


data, data warehouse systems can provide inherent support for OLAP.
● Examples of OLAP operations include drill-down and roll-up, which allow the
user to view the data at differing degrees of summarization.
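A hedged sketch of roll-up and drill-down using pandas group-by aggregation; the tiny sales table, its column names, and the country/city hierarchy are assumptions made only to show the idea.

import pandas as pd

sales = pd.DataFrame({
    "country": ["India", "India", "India", "USA"],
    "city":    ["Pune", "Pune", "Mumbai", "Chicago"],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "amount":  [1200.0, 900.0, 1500.0, 2000.0],
})

# Drill-down: view the aggregate measure at a finer level (country, city, quarter).
fine = sales.groupby(["country", "city", "quarter"])["amount"].sum()

# Roll-up: climb the concept hierarchy and summarize at a coarser level (country).
coarse = sales.groupby("country")["amount"].sum()

print(fine)
print(coarse)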
Transactional Databases:

● These databases contain records of transactions over time. Data mining can reveal
patterns in customer behavior, buying habits, and other transactional data.
● A transaction typically includes a unique transaction identity number (trans ID) and
a list of the items making up the transaction, such as the items purchased in the
transaction.

Other Kinds of Data


Spatial and Temporal Data:

● Data mining can be applied to spatial data (geographical information) and temporal
data (time-related information) to discover patterns in space and time.
Text and Web Data:

● Unstructured data, such as text and web content, can be mined for valuable insights.
This includes sentiment analysis, topic modeling, and information extraction from
documents.
Multimedia Data:

● Data mining techniques can be applied to multimedia data, such as images and
videos, to identify patterns and extract useful information.
Biological Data:

● In fields like bioinformatics, data mining is used to analyze biological data, including
DNA sequences, protein structures, and medical records.
Social Network Data:

● Social network analysis involves mining data from social media platforms to
understand user behavior, relationships, and trends.
3. What Kinds of Patterns Can Be Mined?

● Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks.
● In general, such tasks can be classified into two categories: descriptive and predictive.

● Descriptive mining tasks characterize properties of the data in a target data set.

● Predictive mining tasks perform induction on the current data in order to make
predictions.

Class/Concept Description (Characterization and Discrimination):

● Data entries can be associated with classes or concepts, and their characteristics can
be summarized through data characterization or discrimination.
● Data characterization involves summarizing the general features of a target class,
often obtained through queries or OLAP operations.
● Discrimination compares the features of the target class with contrasting classes,
resulting in discriminant rules.
● Output forms include charts, tables, and rules.

Example:
Data Characterization: In a retail setting like AllElectronics, understanding customer
spending behavior involves summarizing characteristics of customers who spend more
than $5000 a year. This process yields generalized profiles like age, employment, and
credit ratings.
Data Discrimination: A Customer Relationship Manager at AllElectronics compares two
customer groups: frequent computer product purchasers (over twice a month) and
infrequent purchasers (less than three times a year).
The analysis reveals distinctions, such as 80% of frequent purchasers being 20-40 years
old with a university education, while 60% of infrequent purchasers are seniors or youths
without a university degree. Further exploration, including occupation and income level,
may unveil additional discriminative features between the two groups.

Mining Frequent Patterns, Associations, and Correlations:

● Frequent patterns, such as itemsets or sequential patterns, are discovered by mining


transactional datasets.
● Association analysis identifies relationships between variables, presented as
association rules.
● Support and confidence thresholds filter out uninteresting rules.

● Techniques like frequent itemset mining are fundamental for association rule mining.

● Frequent Itemset Mining: Identifies sets of items frequently occurring together in


transactional datasets.
● Association Analysis: Generates rules like "buys(X, “computer”) ⇒
buys(X,“software”) [support = 1%, confidence = 50%]".
● Example : Association Analysis: A marketing manager at AllElectronics may want to
understand which items are frequently purchased together, leading to association
rules indicating buying patterns.
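The support and confidence of a rule such as buys(X, "computer") ⇒ buys(X, "software") can be computed directly from a transactional data set. A minimal self-contained sketch (the transactions are made up):

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "mouse"},
    {"printer", "paper"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {computer} => {software}
sup = support({"computer", "software"})                 # P(computer and software)
conf = sup / support({"computer"})                      # P(software | computer)
print(f"support = {sup:.2f}, confidence = {conf:.2f}")  # 0.50 and 0.67 for this toy data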

Classification and Regression for Predictive Analysis:

● Classification involves finding models that describe and distinguish data classes.

● Regression predicts continuous values and can be used for numeric prediction.

● Relevant attributes are identified through relevance analysis.

● Decision trees, mathematical formulas, and neural networks can represent derived
models.

Classification Models: Represented as IF-THEN rules, decision trees, mathematical


formulas, or neural networks.
Relevance Analysis: Identifies significant attributes for classification and regression.

Example Classification and Regression: A sales manager at AllElectronics might


classify items based on responses to a sales campaign or predict revenue using
regression analysis.
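A hedged sketch of building a classification model, assuming scikit-learn is available; the two attributes (age, income) and the "responded to campaign" labels are invented purely to show the fit/predict workflow, not a real AllElectronics data set.

from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age, annual income]; label 1 = responded to the campaign.
X = [[25, 30000], [40, 80000], [35, 60000], [50, 90000], [23, 20000], [45, 85000]]
y = [0, 1, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict the class of a previously unseen customer.
print(model.predict([[30, 70000]]))

The learned tree can also be printed as IF-THEN style rules, matching the model representations listed above.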

Cluster Analysis:

● Clustering groups data objects based on similarity without using class labels.

● Clusters can be formed to maximize intraclass similarity and minimize interclass


similarity.
● Clustering helps in taxonomy formation and organizing observations into hierarchies.
Clustering: Maximizes intraclass similarity and minimizes interclass similarity.
Clusters may represent classes of objects.

Example Cluster Analysis: In AllElectronics, cluster analysis might identify


homogeneous subpopulations of customers for targeted marketing.
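A minimal clustering sketch, assuming scikit-learn; the two-dimensional customer points (age, yearly spending) are arbitrary illustrative values.

from sklearn.cluster import KMeans

# Each row is a customer described by [age, yearly spending].
customers = [[22, 500], [25, 650], [24, 600],
             [45, 3000], [48, 3200], [50, 2900]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the discovered segment centers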
Outlier Analysis:

● Outliers are data objects that deviate from the general behavior of the dataset.

● Outlier analysis is essential for applications like fraud detection.

● Detection methods include statistical tests, distance measures, and density-based


approaches.
Outlier Detection: Uses statistical tests, distance measures, or density-based methods to
identify unusual data points.
Applications: Uncovering fraudulent credit card transactions by detecting unusual
purchase patterns.
Example Outlier Analysis: Detecting outliers in credit card transactions by identifying
purchases deviating from regular charges.
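One simple statistics-based way to flag unusual charges is the IQR rule (the same quartile idea used later in this unit). A self-contained sketch with made-up transaction amounts:

import statistics

amounts = [42.0, 55.0, 60.0, 47.0, 51.0, 58.0, 49.0, 950.0]  # one suspicious charge

q1, q2, q3 = statistics.quantiles(amounts, n=4)   # quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # common boxplot-style fences

outliers = [a for a in amounts if a < low or a > high]
print(outliers)                                    # [950.0]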
Are all Patterns Interesting?

● Interesting patterns are those easily understood, valid on new data, potentially
useful, and novel.
● Objective measures like support and confidence are used to assess patterns.

● Subjective measures reflect user beliefs and interests, focusing on unexpected


or actionable patterns.
● Completeness and optimization challenges exist in generating and identifying
only interesting patterns.
4. Which Technologies are used?
4.1. Statistics

● Statistics studies the collection, analysis, interpretation or explanation, and


presentation of data.
● Data mining has an inherent connection with statistics.

● A statistical model is a set of mathematical functions that describe the behavior of


the objects in a target class in terms of random variables and their associated probability
distributions.
● We can use statistics to model noise and missing data values.

● Inferential statistics (or predictive statistics) models data in a way that accounts for
randomness and uncertainty in the observations and is used to draw inferences about
the process or population under investigation.
● A statistical hypothesis test (sometimes called confirmatory data analysis) makes
statistical decisions using experimental data.

4.2. Machine learning


● Machine learning investigates how computers can learn (or improve their
performance) based on data.
● A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data.
● For example, a typical machine learning problem is to program a computer so that it
can automatically recognize handwritten postal codes on mail after learning from a
set of examples.
● Supervised learning is basically a synonym for classification.

● The supervision in the learning comes from the labeled examples in the training data
set.
● Unsupervised learning is essentially a synonym for clustering. The learning process
is unsupervised since the input examples are not class labeled.
● Semi-supervised learning is a class of machine learning techniques that make use of
both labeled and unlabeled examples when learning a model. In one approach,
labeled examples are used to learn class models and unlabeled examples are used to
refine the boundaries between classes.
● Active learning is a machine learning approach that lets users play an active role in
the learning process. An active learning approach can ask a user (e.g., a domain
expert) to label an example, which may be from a set of unlabeled examples or
synthesized by the learning program.

4.3. Database Systems and Data Warehouses

● Database systems research focuses on the creation, maintenance, and use of


databases for organizations and end-users.
● Database systems are often well known for their high scalability in processing very
large, relatively structured data sets.
● Many data mining tasks need to handle large data sets or even real-time, fast streaming
data. Therefore, data mining can make good use of scalable database
technologies to achieve high efficiency and scalability on large data sets.
● Recent database systems have built systematic data analysis capabilities on database
data using data warehousing and data mining facilities. A data warehouse integrates
data originating from multiple sources and various timeframes.
4.4. Information Retrieval

● Information retrieval (IR) is the science of searching for documents or information in


documents. Documents can be text or multimedia, and may reside on the Web.
● The differences between traditional information retrieval and database systems are
twofold: Information retrieval assumes that (1) the data under search are
unstructured; and (2) the queries are formed mainly by keywords, which do not have
complex structures (unlike SQL queries in database systems).
● The document’s language model is the probability density function that generates the
bag of words in the document.
● A topic in a set of text documents can be modeled as a probability distribution over
the vocabulary, which is called a topic model.

5. Which Kinds of Applications Are Targeted?


Business Intelligence:

● Business Intelligence (BI) plays a pivotal role in enhancing businesses'


understanding of their commercial landscape.
● Data mining serves as the cornerstone of BI, enabling effective market analysis,
competitor evaluation, and strategic decision-making.
● Online analytical processing relies on data warehousing and multidimensional data
mining, while predictive analytics leverages classification and prediction techniques
for applications such as market analysis and customer relationship management.

Web Search Engines:

● Web search engines, intricate data mining applications, navigate challenges in


crawling, indexing, and searching vast amounts of web data.
● Data mining techniques are crucial in decision-making processes, determining
crawling frequencies, selecting pages for indexing, and ranking search results.
● The scale of data processed by search engines necessitates the use of computer
clouds, posing challenges in scaling up data mining methods.
● Handling online data and addressing queries posed only a few times present
additional complexities for search engines, requiring real-time responsiveness and
adaptive model maintenance.
6. Major Issues in Data Mining
1. Mining Methodology:
Discovering Different Types of Knowledge: Data mining involves exploring various
aspects of data like patterns, relationships, and trends. It adapts to different tasks, making
it a dynamic and growing field.
Navigating Multidimensional Space: Imagine data as a puzzle with many pieces. Data
mining searches for interesting patterns among these pieces, making it more powerful.
Collaboration of Different Fields: Data mining becomes stronger when it combines
methods from different areas, like using language understanding to mine text data or
incorporating software knowledge to find bugs.
Handling Messy Data: Data is not always perfect. It can be messy, with errors or
missing parts. Techniques like cleaning and preprocessing help in dealing with these
challenges.
Guiding the Search: Not all patterns found by data mining are equally interesting. We
need ways to decide what's valuable based on user preferences and use these preferences
to guide the search.
2. User Interaction:
Interactive Exploration: Data mining should be like a friendly conversation. Users
should be able to explore, change their focus, and refine their searches, making it a
dynamic and interactive process.
Using Existing Knowledge: What you already know matters. Incorporating your
knowledge into the data mining process helps in evaluating patterns better.
Asking Questions Easily: Asking questions is like using a magic wand in data mining.
The process becomes flexible when users can easily ask questions and get meaningful
answers.
Making Results Easy to Understand: Imagine data mining results as a storybook. It
should be easy to read and understand, helping users make sense of the discovered
knowledge.
3. Efficiency and Scalability:
Quick and Predictable Algorithms: Algorithms are like recipes for data mining. They
should be quick, predictable, and efficient, ensuring that we get useful information
without waiting too long.
Teamwork of Computers: Big tasks need teamwork. Computers work together to
handle large amounts of data. Techniques like cloud computing and parallel processing
make data mining faster and more effective.
Adapting to Changes: Data is not static; it keeps changing. Incremental data mining
adapts to these changes, updating our knowledge without starting from scratch every
time.
4. Different Types of Data:
Handling Different Data Types: Data comes in many forms, like tables, text, images,
and more. Data mining tools need to be like superheroes, specialized for different tasks,
whether it's analyzing text or finding patterns in images.
Connecting Global Information: Imagine data as pieces of a giant puzzle spread all
over the world. Data mining explores interconnected information networks, revealing
more patterns than mining isolated data.
5. Data Mining and Society:
Impact on Everyday Life: Data mining affects our daily lives more than we realize. It
helps in scientific discoveries, business decisions, and even in providing personalized
recommendations when shopping online.
Protecting Privacy: While data mining is powerful, it should also respect our privacy.
Ongoing research focuses on finding ways to get valuable insights without
compromising personal information.
Behind-the-Scenes Mining: Sometimes, data mining happens without us knowing.
Systems like search engines use data mining to improve their services. It's like a helpful
friend quietly working in the background.

6. Data Objects and Attribute Types


What Is an Attribute?

● An attribute is a data field representing a characteristic or feature of a data object.


● Attributes describe entities such as customers, items, or sales in databases. Attributes are
often called dimensions, features, or variables, with the term dimension common in data
warehousing, feature in machine learning, and variable in statistics.
● In databases, data objects are referred to as tuples, with rows corresponding to data
objects and columns to attributes.
Nominal Attributes

● Nominal attributes represent categories without a meaningful order.

● Values are symbols or names, and these attributes are also called categorical. Examples
include hair color and marital status.
● Numeric codes may represent nominal values, but the numbers lack quantitative
meaning.
● Measures of central tendency like mean and median are not meaningful for nominal
attributes.
Example Nominal attributes: Hair color (black, brown, blond, etc.) and Marital status (single,
married, divorced, widowed).
Binary Attributes

● Binary attributes have two categories (0 or 1), often indicating absence (0) or presence
(1).
● If states are equally valuable, it's symmetric; otherwise, it's asymmetric. Example:
Smoker (0 for non-smoker, 1 for smoker).
Example Binary attributes: Smoker (0, 1) and Medical test result (0 for negative, 1 for positive).
Ordinal Attributes

● Ordinal attributes have values with a meaningful order, but the magnitude between them
is unknown.
● Examples include drink size and professional rank. Ordinal attributes are useful for
subjective assessments and surveys.
Example Ordinal attributes: Drink size (small, medium, large) and Professional rank (assistant,
associate, full).
Numeric Attributes

● Numeric attributes are quantitative, measured in integer or real values. They can be
interval-scaled or ratio-scaled.
o Interval-Scaled Attributes have equal-size units, allowing ranking and
quantifying differences. Mean, median, and mode are applicable.
o Example Interval-scaled attributes: Temperature in Celsius, calendar dates.
o Ratio-Scaled Attributes have a true zero-point, allowing ratio comparisons.
Mean, median, mode, and ratios are applicable.
Example Ratio-scaled attributes: Kelvin temperature, years of experience, weight.
Discrete versus Continuous Attributes

● Attributes can be classified as discrete or continuous. Discrete attributes have finite or


countably infinite values, while continuous attributes are represented as real numbers.
● Discrete Attributes: Finite or countably infinite values. Examples include hair color,
smoker, medical test.
● Continuous Attributes: Represented as real numbers. Examples include temperature,
years of experience, and weight.
7. Basic Statistical Descriptions of Data

● Basic statistical descriptions can be used to identify properties of the data and highlight
which data values should be treated as noise or outliers.
Measuring the Central Tendency: Mean, Median, and Mode

● Measures of central tendency provide insights into the middle or center of a data
distribution. Common measures include mean, median, mode, and midrange.
● For instance, the mean (average) is calculated as the sum of all values divided by the
number of observations.
● The median becomes particularly useful for skewed data, representing the middle value
that separates the higher and lower halves.

● The mode, the most frequently occurring value, and the midrange, the average of the
largest and smallest values, offer different perspectives on central tendency. Data sets
with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In
general, a data set with two or more modes is multimodal. At the other extreme, if each
data value occurs only once, then there is no mode.
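These measures are easy to compute with Python's standard statistics module; a minimal sketch on illustrative salary values (in thousands), which happen to be bimodal:

import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)                 # 58.0 here
median = statistics.median(values)             # middle value(s) of the sorted data
mode = statistics.mode(values)                 # one most frequent value; multimode() lists all
midrange = (min(values) + max(values)) / 2     # average of smallest and largest

print(mean, median, mode, midrange)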

Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range
● Understanding how data is spread out is crucial. Measures of dispersion such as range,
quartiles, and interquartile range are essential.
● The range of the set is the difference between the largest (max()) and smallest (min())
values. Quantiles are points taken at regular intervals of a data distribution, dividing it
into essentially equal-size consecutive sets.
● The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles
are the most widely used forms of quantiles.

● The interquartile range (IQR), the difference between the third and first quartiles, is a
valuable indicator of spread
IQR = Q3 − Q1

Suppose Q1 = $47,000 and Q3 = $63,000.


● Thus, the interquartile range is IQR = $63,000 − $47,000 = $16,000.
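As a hedged cross-check in code, assuming NumPy is available; note that software packages use slightly different quantile conventions, so the computed quartiles may differ a little from hand-worked figures.

import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]) * 1000

q1, q3 = np.percentile(salaries, [25, 75])   # quartiles (interpolation convention applies)
iqr = q3 - q1
print(q1, q3, iqr)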

Five-Number Summary, Boxplots, and Outliers

● A deeper exploration involves the five-number summary, which includes the minimum,
quartiles, and maximum, aiding in the identification of outliers.
● The five-number summary of a distribution consists of the median (Q2), the quartiles Q1
and Q3, and the smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum.
● Boxplots are a popular way of visualizing a distribution.
● A boxplot incorporates the five-number summary as follows:

● Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
● The median is marked by a line within the box.

● Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
largest (Maximum) observations
● Boxplots visualize these summaries effectively

● Variance and standard deviation offer numeric insights into data spread.

● The variance is the average of squared differences from the mean, and the standard
deviation is its square root.
● Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is.
● A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large
range of values
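Variance and standard deviation in code, again assuming NumPy; the values are illustrative, with one extreme observation to show how dispersion grows.

import numpy as np

values = np.array([47, 51, 49, 55, 48, 120], dtype=float)

variance = values.var()    # average squared deviation from the mean
std_dev = values.std()     # square root of the variance
print(variance, std_dev)   # the single extreme value inflates both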
Graphic Displays of Basic Statistical Descriptions of Data

● Visualizing data through various graphs enhances comprehension. Quantile plots,


quantile–quantile plots, histograms, and scatter plots are commonly used.

● Quantile plots depict the relationship between data points and their cumulative
distribution, aiding in the comparison of different distributions

● Quantile–quantile plots compare quantiles of two distributions, providing insights into


shifts and differences
● Histograms visually represent the distribution of numeric data, where frequency is
depicted for different price ranges.
● Scatter plots reveal relationships between two attributes, helping identify patterns or
correlations. Understanding scatter plots is crucial for assessing correlations, whether
positive or negative
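A minimal plotting sketch, assuming Matplotlib and NumPy are available; the randomly generated prices and sales counts are invented only to show the histogram and scatter-plot calls.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
prices = rng.normal(loc=100, scale=15, size=300)         # fake unit prices
items_sold = 500 - 3 * prices + rng.normal(0, 20, 300)   # fake negatively correlated attribute

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.hist(prices, bins=20)             # histogram: distribution of one numeric attribute
ax1.set_xlabel("price")
ax2.scatter(prices, items_sold, s=8)  # scatter plot: relationship between two attributes
ax2.set_xlabel("price")
ax2.set_ylabel("items sold")
plt.show()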

8. Data Visualization
● Data visualization aims to communicate data clearly and effectively through graphical
representation.
● Data visualization has been used extensively in many applications—for example, at
work for reporting, managing business operations, and tracking progress of tasks

Pixel-Oriented Visualization Techniques

● Pixel-oriented techniques offer a straightforward approach to visualizing


multidimensional data.
● In this method, each dimension corresponds to a window on the screen, and the
values are depicted using pixel colors.
● By ordering data records based on a global order, patterns and correlations become
evident.
● Example Pixel-oriented visualization of AllElectronics customer data
● Consider a dataset with dimensions like income, credit limit, transaction volume, and
age. Sorting customers in income-ascending order reveals insights: credit limit tends
to increase with income, mid-range income customers make more purchases, and no
clear correlation between income and age is observed

● To address challenges in layout, space-filling curves like the Hilbert curve and Gray code can
be employed. Additionally, non-rectangular windows, such as the circle segment
technique, enhance dimension comparison by forming a circular arrangement.

Geometric Projection Visualization Techniques

● While pixel-oriented techniques excel in representing individual dimensions,


understanding data distribution in a multidimensional space necessitates geometric
projection techniques.
● Scatter plots, utilizing Cartesian coordinates, provide insight into 2-D data
relationships. Techniques like 3-D scatter plots and scatter-plot matrices extend this
approach to visualize higher-dimensional data.

Example: Visualization of a 2-D data set using a scatter plot

● Different shapes represent a third dimension, aiding in the identification of patterns.


For datasets surpassing four dimensions, scatter plots become less effective.
● The scatter-plot matrix offers an alternative by displaying 2-D scatter plots for every
pair of dimensions

● Parallel coordinates, accommodating higher dimensions, involve drawing equally


spaced axes for each dimension.
● Each data record is represented by a polygonal line intersecting these axes at
corresponding dimension values.
● However, the technique faces limitations with large datasets, causing visual clutter.
● In essence, combining pixel-oriented and geometric projection techniques allows for
a comprehensive exploration of multidimensional data, offering valuable insights
through visualizations tailored to specific analytical needs.

9. Data preprocessing

● Data preprocessing is an important step in the data mining process.

● It refers to the cleaning, transforming, and integrating of data in order to make it


ready for analysis.
● The goal of data preprocessing is to improve the quality of the data and to make it
more suitable for the specific data mining task.

The major tasks in data preprocessing are:

Data Cleaning: This involves addressing missing values, smoothing noisy data,
identifying and handling outliers, and rectifying inconsistencies. Cleaning ensures that
the data is free from errors, making it more reliable for subsequent analysis.

Data Integration: In scenarios where data is sourced from multiple origins, integration
ensures seamless alignment of attributes and resolves naming inconsistencies. The goal
is to create a unified dataset that avoids redundancies and discrepancies.

Data Reduction: With large datasets, reducing data volume becomes essential for
efficient analysis. Techniques such as dimensionality reduction and numerosity
reduction aim to represent the data in a more concise form without compromising
analytical outcomes.

Data Transformation: This involves preparing the data to meet the requirements of
specific analytical algorithms. Normalization, discretization, and concept hierarchy
generation are examples of data transformation techniques that facilitate effective
analysis.
● Real-world data are often plagued by issues such as incompleteness, noise, and
inconsistencies.
● Data cleaning, also known as data cleansing, plays a crucial role in addressing these
challenges. The fundamental methods employed in data cleaning focus on handling
missing values, smoothing noisy data, and understanding data cleaning as a process.
9.1 Missing Values
● When analyzing AllElectronics sales and customer data, it's common to encounter tuples
with missing values, such as customer income.
● Dealing with missing values involves several methods:

Ignore the tuple: Suitable when the class label is missing, but less effective when
multiple attributes have missing values.
Fill in manually: Time-consuming and impractical for large datasets.
Use a global constant: Replace missing values with a constant like "Unknown."
However, it may introduce biases.
Use central tendency measures: Replace missing values with the mean or median of
the attribute, depending on the data distribution.
Use attribute mean or median for the same class: Replace missing values based on
the class to which the tuple belongs.
Use the most probable value: Utilize regression, Bayesian inference, or decision trees
to predict missing values.
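A hedged pandas sketch of two of these strategies (a global constant for a nominal attribute and a central-tendency fill for a numeric one); the customer table and its columns are invented.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "cust_id": [1, 2, 3, 4],
    "income":  [52000.0, np.nan, 61000.0, np.nan],
    "city":    ["Pune", None, "Mumbai", "Pune"],
})

df["city"] = df["city"].fillna("Unknown")                  # global constant
df["income"] = df["income"].fillna(df["income"].median())  # central tendency measure
print(df)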

9.2 Noisy Data


● Noise, represented by random errors or variance, can be problematic in numeric
attributes like price. Data smoothing techniques help mitigate noise:
● Binning: Distribute sorted values into bins and replace each value with the mean or
median of its bin.
● Regression: Conform data values to a function, such as linear regression or multiple
linear regression.
● Outlier analysis: Identify outliers using clustering methods.

● Many data smoothing methods serve dual purposes, aiding in data discretization and
reduction.

● Regression: Data smoothing can also be done by regression, a technique that conforms
data values to a function. Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to predict the other. Multiple
linear regression is an extension of linear regression, where more than two attributes are
involved and the data are fit to a multidimensional surface.
● Outlier analysis: Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values that fall outside of the
set of clusters may be considered outliers
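A small sketch of the binning approach described above (smoothing by bin means with equal-frequency bins); the sorted price values are illustrative.

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])

bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))  # replace each value by its bin mean

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]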
9.3 Data Cleaning as a Process
● Data cleaning is not a one-time task but an ongoing process.

● The process involves:

● Discrepancy Detection: Identify inconsistencies caused by various factors, including


errors in data entry, inconsistencies in representations, and data decay.
● Data Transformation: Correct discrepancies through manual correction, reference
correction, or data transformation using tools.
● Commercial tools like data scrubbing and auditing tools assist in discrepancy detection
and transformation. However, these processes often require iterations, and new
approaches emphasize increased interactivity for efficient data cleaning.

10. Data Integration

● Data mining often necessitates data integration, the process of merging data from various
sources.
● Effective integration reduces redundancies and inconsistencies, enhancing the accuracy
and speed of subsequent data mining endeavors.
10.1 Entity Identification Problem

● Data analysis frequently involves integrating data from diverse sources such as
databases, data cubes, or flat files.
● The entity identification problem arises during this integration, focusing on matching
equivalent real-world entities across multiple sources.
● Metadata, encompassing attributes' name, meaning, data type, and valid value range, aids
in schema integration and transforms data during the entity identification process.

● For instance, matching attributes like "customer id" in one database with "cust number"
in another requires meticulous consideration of data structure to prevent errors during
integration.

10.2 Redundancy and Correlation Analysis

● Redundancy, particularly in attributes like annual revenue, can be problematic during


data integration.

● Correlation analysis helps detect redundancies by examining how strongly one attribute
implies another. For nominal data, the χ2 (chi-square) test measures correlation, while
numeric attributes use correlation coefficients and covariance.

Example: Correlation analysis of nominal attributes using χ2


Consider a survey with attributes "gender" and "preferred reading." The χ2 test examines their
correlation, determining if they are statistically independent or correlated. If correlated, the test
helps quantify the strength of their association.
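A hedged sketch of the χ2 test on a small made-up gender/preferred-reading contingency table, assuming SciPy is available.

from scipy.stats import chi2_contingency

# Rows: gender (male, female); columns: preferred reading (fiction, non-fiction).
observed = [[250, 200],
            [ 50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # a very small p-value suggests the two attributes are correlated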

Covariance of Numeric Data


● For numeric attributes, covariance measures how two attributes change together. Positive
covariance indicates a likely simultaneous increase in attribute values, while negative
covariance suggests an inverse relationship.

Example Covariance analysis of numeric attributes

● Examining stock prices for AllElectronics and HighTech at different time points
illustrates covariance analysis. Positive covariance implies that stock prices for both
companies tend to rise together.
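Covariance of two numeric attributes in code, assuming NumPy; the two price series are invented stand-ins for the AllElectronics and HighTech stocks.

import numpy as np

all_electronics = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
high_tech       = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

cov_matrix = np.cov(all_electronics, high_tech)   # 2x2 covariance matrix
print(cov_matrix[0, 1])                           # positive: prices tend to move together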

10.3 Tuple Duplication

● Beyond attribute redundancy, tuple duplication detection is crucial.

● Denormalized tables, often employed for performance reasons, can introduce data
redundancy and inconsistencies, especially when updates are incomplete.
● Detecting and resolving duplicates at the tuple level ensures data integrity.

10.4 Data Value Conflict Detection and Resolution

● Integration may lead to conflicts in attribute values, requiring detection and resolution.

● Discrepancies arise from differences in representation, scaling, encoding, or abstraction


levels.
● Robust methodologies are needed to reconcile such conflicts, ensuring data consistency
across diverse sources.

11. Data Reduction Techniques:


● Data reduction is an indispensable step in the analysis of datasets from the expansive
AllElectronics data warehouse.
● It involves strategies such as dimensionality reduction, numerosity reduction, and data
compression to distill large volumes of data while preserving analytical integrity.

11.1 Data Reduction Strategies


Dimensionality Reduction:
In this strategy, the number of attributes is curtailed to enhance efficiency. Techniques like
Principal Components Analysis (PCA) and Attribute Subset Selection are employed. For
instance, in customer data, PCA may reveal that age and purchase history are pivotal
dimensions.

Numerosity Reduction:
Techniques under this strategy replace the original data volume with more concise
representations. Examples include regression models and histograms. For pricing analysis, a
histogram could compactly represent the distribution of product prices.

Data Compression:
Transformations are applied to achieve a compressed data representation. Wavelet
transforms, for example, store a fraction of the strongest coefficients, allowing for a sparse
yet effective representation of the data.

11.2 Wavelet Transforms


● The discrete wavelet transform (DWT) is a powerful linear signal processing
technique.
● It transforms data vectors, maintaining their length but allowing for truncation.

● By storing only significant coefficients, DWT facilitates efficient operations in


wavelet space, making it useful for data cleaning and noise removal.
Example:
Application in Data Reduction: The discrete wavelet transform (DWT) can be employed to
reduce the size of a dataset by storing only a subset of the strongest wavelet coefficients,
leading to a sparse representation.
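The idea of keeping only the strongest coefficients can be illustrated with a toy, hand-written one-level Haar transform; a real application would use a full DWT library, so treat this purely as a simplified sketch on invented numbers.

import numpy as np

x = np.array([2.0, 2.5, 0.0, 4.0, 3.0, 3.2, 6.0, 6.1])

# One-level Haar transform: pairwise averages (approximation) and differences (detail).
approx = (x[0::2] + x[1::2]) / np.sqrt(2)
detail = (x[0::2] - x[1::2]) / np.sqrt(2)

# Data reduction: zero out the weak detail coefficients, keeping only the strongest.
detail[np.abs(detail) < 1.0] = 0.0

# Inverse transform rebuilds an approximation of the original signal.
rebuilt = np.empty_like(x)
rebuilt[0::2] = (approx + detail) / np.sqrt(2)
rebuilt[1::2] = (approx - detail) / np.sqrt(2)
print(rebuilt)   # close to x, with the small fluctuations smoothed away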

11.3 Principal Components Analysis


● PCA identifies orthogonal vectors that best represent the data, enabling
dimensionality reduction.
● By sorting these vectors by significance, PCA captures variance, aiding in pattern
identification.
● For example, in sales data, PCA may reveal key components influencing trends.

Normalization:
Standardize input data to ensure attributes with varying scales contribute equally.
Principal Component Computation:
Compute k orthonormal vectors, termed principal components, representing
directions of maximum variance.
Sorting of Components:
Order principal components by significance, with the first components capturing the
most variance.
Dimensionality Reduction:
Eliminate less significant components to reduce data size while retaining essential
information.
This process transforms the original data into a new coordinate system, highlighting
key patterns and reducing dimensionality for efficient analysis.

Example:
Reducing Dimensionality: PCA identifies orthogonal vectors that capture the essence
of the data, allowing for dimensionality reduction. For instance, in customer data, it
might reveal that age and purchasing behavior are key components.
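A hedged PCA sketch, assuming scikit-learn; the three synthetic attributes (age, purchases correlated with age, and pure noise) are invented to show how most of the variance ends up in the first component.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=100)
purchases = 2.0 * age + rng.normal(0, 5, size=100)   # correlated with age
noise = rng.normal(0, 1, size=100)                   # mostly uninformative attribute
X = np.column_stack([age, purchases, noise])

pca = PCA(n_components=2)             # keep the two most significant components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # most variance captured by the first component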

11.4 Attribute Subset Selection


● Addressing datasets with numerous attributes, this technique aims to find a
minimal subset that maintains the original data's distribution.
● Greedy methods like forward selection or decision tree induction are applied. In
customer classification, relevant attributes like age and purchase history may be
retained.

Example:
Customer Classification: In a dataset containing customer information, attribute subset
selection might involve choosing relevant attributes like age and purchase history while
excluding less relevant ones such as telephone numbers.
Basic heuristic methods of attribute subset selection include the below techniques
Stepwise Forward Selection:
Begin with an empty attribute set and iteratively add the best original attribute at each
step (a code sketch of this greedy search appears after this list).

Stepwise Backward Elimination:


Start with the full set of attributes and, at each step, remove the least valuable attribute.

Combined Forward Selection and Backward Elimination:


Merge forward selection and backward elimination, selecting the best attribute and
removing the worst in each step.

Decision Tree Induction for Attribute Subset Selection:


Use decision tree algorithms to construct a tree from data, where non-appearing
attributes are deemed irrelevant. The tree's attributes form the reduced subset.
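A compact sketch of stepwise forward selection, assuming scikit-learn; the evaluation criterion (cross-validated decision-tree accuracy), the synthetic data, and the choice to stop after two attributes are arbitrary assumptions made only to show the greedy loop.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # five candidate attributes
y = (X[:, 0] + X[:, 2] > 0).astype(int)       # only attributes 0 and 2 matter

selected, remaining = [], list(range(X.shape[1]))
for _ in range(2):                            # greedily add the best attribute, twice
    scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print(selected)                               # typically [0, 2] or [2, 0]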

11.5 Regression and Log-Linear Models: Parametric Data Reduction


● Regression models approximate data using linear relationships.

● For instance, in sales prediction, a regression model can estimate sales based on
various attributes, reducing the complexity of the dataset while retaining
predictive accuracy.

Example:
Sales Prediction: Regression models can be applied to predict sales based on various
attributes, facilitating data reduction by focusing on the most influential factors.
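A minimal parametric sketch using NumPy's polyfit: the raw (x, y) points are replaced by just two model parameters (slope and intercept). The advertising-spend versus sales framing and the numbers are invented.

import numpy as np

ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # predictor attribute
sales    = np.array([2.1, 4.3, 5.9, 8.2, 9.8])       # response attribute

slope, intercept = np.polyfit(ad_spend, sales, deg=1)   # fit y ≈ w*x + b
print(slope, intercept)                                 # two numbers replace the full data
print(slope * 6.0 + intercept)                          # predict sales for a new value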

11.6 Histograms
● Histograms bin data to approximate distributions effectively.

● Examples include pricing analysis, where histograms reveal common price


ranges.
● The partitioning rules, such as equal-width or equal-frequency, contribute to the
adaptability of histograms.
Example:

Pricing Analysis: Histograms can be employed to represent the distribution of product


prices, helping to identify common price ranges and outliers.
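An equal-width histogram sketch, assuming NumPy; the price list is illustrative.

import numpy as np

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15, 18, 18, 20, 21, 25]

counts, bin_edges = np.histogram(prices, bins=3)   # three equal-width buckets
for count, lo, hi in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{lo:.1f}-{hi:.1f}: {count} items")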

11.7 Clustering
● Clustering techniques group data objects based on similarity, enhancing data
reduction for organized datasets.
● Customer segmentation, facilitated by clustering, groups similar customers
together, creating a more manageable representation for targeted analysis.
Example:
● Customer Segmentation: Clustering techniques can group customers based on
similarities, reducing the dataset by representing customers within each cluster
with common characteristics.

11.8 Sampling
● Sampling, by obtaining a subset of data, provides an efficient means of data
reduction. It is particularly useful for estimating aggregate queries.
● In market research, sampling can offer representative insights without the need to
analyze the entire dataset.
Example:
● Market Research: Sampling can be used to obtain a representative subset of a
larger population for market research, reducing the need to analyze the entire
dataset.

The different sampling techniques for data reduction are:

Simple Random Sample without Replacement (SRSWOR) of Size s:


Selecting s tuples from the data set D of N tuples (where s < N), giving each tuple an equal
chance of being drawn, so that all tuples are equally likely to be sampled.

Simple Random Sample with Replacement (SRSWR) of Size s:

Similar to SRSWOR, but each drawn tuple is replaced back into D, allowing it to be
selected again.

Cluster Sample:

Grouping tuples into M disjoint clusters, then obtaining an SRS of s clusters (where s <
M), providing a reduced data representation.

Stratified Sample:

Dividing D into disjoint parts called strata and obtaining an SRS within each stratum.
Useful for ensuring a representative sample, especially when data are skewed. For
example, creating strata based on customer age groups in customer data.
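A hedged sketch of the four sampling schemes using Python's random module; the data set D (100 tuple ids), the cluster partitioning, and the age-group strata are all invented.

import random

random.seed(0)
D = list(range(1, 101))       # pretend these are 100 tuple ids
s = 10

srswor = random.sample(D, s)                       # SRSWOR: without replacement
srswr  = [random.choice(D) for _ in range(s)]      # SRSWR: with replacement

# Cluster sample: partition D into M disjoint clusters, then sample whole clusters.
M = 10
clusters = [D[i::M] for i in range(M)]
cluster_sample = random.sample(clusters, 2)

# Stratified sample: draw an SRS within each stratum (e.g., an age group).
strata = {"young": D[:30], "middle": D[30:70], "senior": D[70:]}
stratified = {name: random.sample(group, 3) for name, group in strata.items()}

print(srswor, srswr, cluster_sample, stratified, sep="\n")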

11.9 Data Cube Aggregation


● Data cubes organize multidimensional aggregated information, enabling efficient
analysis.
● By aggregating data, such as summarizing quarterly sales into annual totals, data
cube aggregation reduces the dataset size, streamlining analysis while retaining
critical insights.

● Annual Sales Analysis: Aggregating quarterly sales data into annual summaries
using data cubes, making it more manageable for analysis and reducing the
overall volume of data.
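A minimal pandas sketch of the quarterly-to-annual roll-up described above; the sales figures are invented.

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224.0, 408.0, 350.0, 586.0, 300.0, 420.0, 380.0, 600.0],
})

# Aggregating away the quarter dimension shrinks eight rows to two.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)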
