DMW Module 2
Data Preprocessing: Data Preprocessing Concepts, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy
Data Preprocessing:
Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. "How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?"
Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format.
Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.
Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!
This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.
We look at the major steps involved in data preprocessing, namely, data cleaning, data integration, data reduction, and data transformation.
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines.
Getting back to your task at AllElectronics, suppose that you would like to include data from multiple sources in your analysis. This would involve integrating multiple databases, data cubes, or files, i.e., data integration. Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer id in one data store and cust id in another. Naming inconsistencies may also occur for attribute values. For example, the same first name could be registered as "Bill" in one database, "William" in another, and "B." in a third. Furthermore, you suspect that some attributes may be inferred from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies during data integration. Typically, data cleaning and data integration are performed as a preprocessing step when preparing data for a data warehouse. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
Getting back to your data, you have decided, say, that you would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering. Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a smaller range such as [0.0, 1.0]. Your customer data, for example, contain the attributes age and annual salary. The annual salary attribute usually takes much larger values than age. Therefore, if the attributes are left unnormalized, the distance measurements taken on annual salary will generally outweigh distance measurements taken on age.
Discretization and concept hierarchy generation can also be useful, where raw data values for attributes are replaced by ranges or higher conceptual levels. For example, raw values for age may be replaced by higher-level concepts, such as youth, adult, or senior.
Discretization and concept hierarchy generation are powerful tools for data mining in that they allow data mining at multiple abstraction levels. Normalization, data discretization, and concept hierarchy generation are forms of data transformation. You soon realize such data transformation operations are additional data preprocessing procedures that would contribute toward the success of the mining process.
"Hmmm," you wonder, as you consider your data even further. "The data set I have selected for analysis is HUGE, which is sure to slow down the mining process. Is there a way I can reduce the size of my data set without jeopardizing the data mining results?"
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction strategies include data aggregation (e.g., building a data cube), dimension reduction (e.g., removing irrelevant attributes through correlation analysis), data compression (e.g., using encoding schemes such as minimum length encoding or wavelets), and numerosity reduction (e.g., "replacing" the data by alternative, smaller representations such as clusters or parametric models).
In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or "compressed" representation of the original data. Examples include data compression techniques (e.g., wavelet transforms and principal components analysis), attribute subset selection (e.g., removing irrelevant attributes), and attribute construction (e.g., where a small set of more useful attributes is derived from the original set).
In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms, clusters, sampling, or data aggregation).
Figure 3.1 summarizes the data preprocessing steps described here. Note that the previous categorization is not mutually exclusive. For example, the removal of redundant data may be seen as a form of data cleaning, as well as data reduction.
In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.
Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning.
Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown." Hence, although this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: For example, suppose that the data distribution regarding the income of AllElectronics customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean or median for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, we may replace the missing
value with the mean income value for customers in the same credit risk category as that of the
given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
Methods 3 through 6 bias the data—the filled-in value may not be correct. Method 6, however, is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values. By considering the other attributes' values in its estimation of the missing value for income, there is a greater chance that the relationships between income and the other attributes are preserved.
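As a concrete illustration of methods 4 and 5, the following Python sketch fills missing income values first with the overall attribute mean and then with the mean of the tuple's own class; the toy customer records and the attribute names income and credit_risk are illustrative assumptions, not data from the text.

from statistics import mean

# Toy customer tuples; None marks a missing income value (illustrative data).
customers = [
    {"income": 52000, "credit_risk": "low"},
    {"income": None,  "credit_risk": "low"},
    {"income": 61000, "credit_risk": "high"},
    {"income": None,  "credit_risk": "high"},
    {"income": 58000, "credit_risk": "low"},
]

# Method 4: the overall attribute mean, used as a fallback.
overall = mean(c["income"] for c in customers if c["income"] is not None)

# Method 5: the mean income of each credit-risk class.
by_class = {}
for c in customers:
    if c["income"] is not None:
        by_class.setdefault(c["credit_risk"], []).append(c["income"])
class_means = {cls: mean(vals) for cls, vals in by_class.items()}

# Fill each missing value with its class mean, falling back to the overall mean.
for c in customers:
    if c["income"] is None:
        c["income"] = class_means.get(c["credit_risk"], overall)

print(customers)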
Noisy Data
"What is noise?" Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing techniques.
1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. (A short sketch of these smoothing rules is given at the end of this subsection.)
2. Clustering: Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers. [Refer Figure]
3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection.
In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative (e.g., identifying useful data exceptions, such as different versions of the characters "0" or "7") or "garbage" (e.g., mislabeled characters). Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones.
This is much faster than having to manually search through the entire database. The garbage patterns can then be removed from the (training) database and excluded from use in subsequent data mining.
4. Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
Many data smoothing methods are also used for data discretization (a form of data transformation) and data reduction. For example, the binning techniques described before reduce the number of distinct values per attribute. This acts as a form of data reduction for logic-based data mining methods, such as decision tree induction, which repeatedly makes value comparisons on sorted data. Concept hierarchies are a form of data discretization that can also be used for data smoothing. A concept hierarchy for price, for example, may map real price values into inexpensive, moderately priced, and expensive, thereby reducing the number of data values to be handled by the mining process.
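The sketch below illustrates the binning techniques described earlier in this subsection: equal-frequency bins of size 3, smoothing by bin means, and smoothing by bin boundaries. The values 4, 8, and 15 in the first bin come from the example above; the remaining prices are illustrative.

# Sorted price values; the first three are from the text, the rest are illustrative.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3  # equal-frequency bins of size 3

bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of the bin's min or max.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)        # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]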
Inconsistent Data
There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace. This may be coupled with routines designed to help correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect the violation of known data constraints. For example, known functional dependencies between attributes can be used to find values contradicting the functional constraints.
There may also be inconsistencies due to data integration, where a given attribute can have different names in different databases. Redundancies may also result.
Data mining often requires data integration—the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent data mining process.
Data Integration
It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute? Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values (Section 3.2). Such metadata can be used to help avoid errors in schema integration. The metadata may also be used to help transform the data.
Redundancy is another important issue in data integration. An attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis. For numeric attributes, the correlation between two attributes, A and B, can be measured by the correlation coefficient

rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB),

where n is the number of tuples, ai and bi are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, and σA and σB are the respective standard deviations of A and B. Note that −1 ≤ rA,B ≤ +1. If rA,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value, the stronger the correlation (i.e., the more each attribute implies the other). Hence, a higher value may indicate that A (or B) may be removed as a redundancy. If the resulting value is equal to 0, then A and B are independent and there is no correlation between them. If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease. This means that each attribute discourages the other.
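A minimal sketch of correlation-based redundancy checking, computing rA,B exactly as defined above; the two attribute value lists are illustrative.

import math

def correlation(a, b):
    # Pearson correlation coefficient r_{A,B} for two numeric attributes.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / n)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (std_a * std_b)

# Two attributes from different sources; a high |r| suggests one may be redundant.
annual_income = [40, 55, 70, 85, 100]
annual_spend = [10, 14, 18, 21, 26]
print(correlation(annual_income, annual_spend))  # close to +1: strongly positively correlated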
Data integration also involves the detection and resolution of data value conflicts. For example, for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding. For instance, a weight attribute may be stored in metric units in one system and British imperial units in another. For a hotel chain, the price of rooms in different cities may involve not only different currencies but also different services (e.g., free breakfast) and taxes. When exchanging information between schools, for example, each school may have its own curriculum and grading scheme. One university may adopt a quarter system, offer three courses on database systems, and assign grades from A+ to F, whereas another may adopt a semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult to work out precise course-to-grade transformation rules between the two universities, making information exchange difficult.
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.
2. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple granularities.
3. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies (e.g., attributes such as street can be generalized to higher-level concepts, like city or country).
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
5. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
In this section, we therefore concentrate on (5) attribute construction and (4) normalization.
To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as [−1, 1] or [0.0, 1.0].
Normalizing the data attempts to give all attributes an equal weight. Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining, normalizing the input values for each attribute measured in the training tuples will help speed up the learning phase.
For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). It is also useful when given no prior knowledge of the data. There are many methods for data normalization. We study min-max normalization, z-score normalization, and normalization by decimal scaling. For our discussion, let A be a numeric attribute with n observed values, v1, v2, ..., vn.
Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, vi, of A to vi' in the range [new minA, new maxA] by computing

vi' = ((vi − minA) / (maxA − minA)) × (new maxA − new minA) + new minA.

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value, vi, of A is normalized to vi' by computing

vi' = vi / 10^j,

where j is the smallest integer such that max(|vi'|) < 1.
Decimal scaling. Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
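The sketch below applies min-max normalization and decimal scaling as defined above. A z-score helper is included as well, since the text lists z-score normalization among the methods studied; its formula, vi' = (vi − Ā) / σA, is supplied here as standard background rather than taken from this section.

import math

values = [-986, 200, 300, 400, 600, 917]   # illustrative; spans the decimal-scaling example range

def min_max(v, new_min=0.0, new_max=1.0):
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in v]

def z_score(v):
    mu = sum(v) / len(v)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / len(v))
    return [(x - mu) / sigma for x in v]

def decimal_scaling(v):
    j = len(str(int(max(abs(x) for x in v))))   # smallest j such that max(|x| / 10**j) < 1
    return [x / 10 ** j for x in v]

print(min_max(values))          # scaled into [0.0, 1.0]
print(z_score(values))          # mean 0, standard deviation 1
print(decimal_scaling(values))  # -986 -> -0.986, 917 -> 0.917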
In attribute construction, new attributes are constructed from the given attributes and added in order to help improve the accuracy and understanding of structure in high-dimensional data. For example, we may wish to add the attribute area based on the attributes height and width. Attribute construction can help alleviate the fragmentation problem when decision tree algorithms are used for classification, where an attribute is repeatedly tested along a path in the derived decision tree. By combining attributes, attribute construction can discover missing information about the relationships between data attributes that can be useful for knowledge discovery.
Data Reduction
Imagine that you have selected data from the AllElectronics data warehouse for analysis. The data set will likely be huge! Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Data reduction strategies include the following:
1. Data cube aggregation, where aggregation operations are applied to the construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Data compression, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the original data volume is replaced by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples. Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction and are a powerful tool for data mining.
Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter, for the years 2008 to 2010. You are, however, interested in the annual sales (total per year), rather than the total per quarter. Thus, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. This aggregation is illustrated in Figure 3.10. The resulting data set is smaller in volume, without loss of information necessary for the analysis task.
Data cubes store multidimensional aggregated information. For example, Figure 3.11 shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch.
Each cell holds an aggregate data value, corresponding to the data point in multidimensional space. (For readability, only some cell values are shown.) Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction levels. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining. The cube created at the lowest abstraction level is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest such as sales or customer. In other words, the lowest level should be usable, or useful for the analysis. A cube at the highest level of abstraction is the apex cuboid. For the sales data in Figure 3.11, the apex cuboid would give one total—the total sales for all three years, for all item types, and for all branches. Data cubes created for varying levels of abstraction are often referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each higher abstraction level further reduces the resulting data size. When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.
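A small sketch of the quarter-to-year roll-up described above for one branch; the quarterly sales figures are invented for illustration.

from collections import defaultdict

# (year, quarter, sales) tuples for one branch; the figures are illustrative.
quarterly_sales = [
    (2008, "Q1", 224000), (2008, "Q2", 408000), (2008, "Q3", 350000), (2008, "Q4", 586000),
    (2009, "Q1", 231000), (2009, "Q2", 412000), (2009, "Q3", 353000), (2009, "Q4", 592000),
    (2010, "Q1", 240000), (2010, "Q2", 420000), (2010, "Q3", 360000), (2010, "Q4", 610000),
]

# Roll up from the quarter level to the year level.
annual_sales = defaultdict(int)
for year, quarter, amount in quarterly_sales:
    annual_sales[year] += amount

for year in sorted(annual_sales):
    print(year, annual_sales[year])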
Dimensionality Reduction
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. For example, if the task is to classify customers based on whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer's telephone number are likely to be irrelevant, unlike attributes such as age or music taste. Although it may be possible for a domain expert to pick out some of the useful attributes, this can be a difficult and time-consuming task, especially when the data's behavior is not well known. (Hence, a reason behind its analysis!) Leaving out relevant attributes or keeping irrelevant attributes may be detrimental, causing confusion for the mining algorithm employed. This can result in discovered patterns of poor quality. In addition, the added volume of irrelevant or redundant attributes can slow down the mining process.
Dimensionality reduction reduces the data set size by removing such attributes or dimensions from it. The methods of attribute subset selection are applied. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
"How can we find a 'good' subset of the original attributes?" For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n and the number of data classes increase. Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection. These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time. Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution. Such greedy methods are effective in practice and may come close to estimating an optimal solution.
Basic heuristic methods of attribute subset selection include the techniques that follow, some of which are illustrated in Figure 3.6.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
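A sketch of greedy stepwise forward selection. The scoring function used here is an assumed stand-in for whatever evaluation measure (e.g., an information-theoretic measure or cross-validated accuracy) would be used in practice; the attribute names echo the CD-purchase example above.

def forward_selection(attributes, score, k):
    # Greedy stepwise forward selection: repeatedly add the attribute that most
    # improves the score of the current reduced set, up to k attributes.
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                       # no remaining attribute improves the set
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scoring function (an assumption): higher is better, with a small size penalty.
weights = {"age": 0.9, "music_taste": 0.8, "income": 0.5, "phone_number": 0.0}
score = lambda attrs: sum(weights[a] for a in attrs) - 0.05 * len(attrs) ** 2

print(forward_selection(weights.keys(), score, k=3))  # ['age', 'music_taste', 'income']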
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients; a compressed approximation of the data can be retained by storing only a small fraction of the strongest coefficients. The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. In general, however, the DWT achieves better lossy compression. That is, if the same number of coefficients is retained for a DWT and a DFT of a given data vector, the DWT version will provide a more accurate approximation of the original data. Hence, for an equivalent approximation, the DWT requires less space than the DFT. Unlike the DFT, wavelets are quite localized in space, contributing to the conservation of local detail.
There is only one DFT, yet there are several families of DWTs. Figure 3.4 shows some wavelet families. Popular wavelet transforms include the Haar-2, Daubechies-4, and Daubechies-6.
The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed. The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i, x2i+1). This results in two data sets of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.
Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients, where the matrix used depends on the given DWT. The matrix must be orthonormal, meaning that its columns are unit vectors and are mutually orthogonal, so that the matrix inverse is just its transpose.
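A minimal sketch of the pyramid algorithm using the Haar-2 transform, with pairwise averages as the smoothing function and pairwise half-differences as the detail function; normalization constants are omitted, and the input length is assumed to be a power of 2 (pad with zeros otherwise).

def haar_dwt(x):
    # One-dimensional Haar wavelet transform via the hierarchical pyramid algorithm.
    coeffs = []
    data = list(map(float, x))
    while len(data) > 1:
        # Smoothing function: pairwise averages (low-frequency part).
        smooth = [(data[2 * i] + data[2 * i + 1]) / 2 for i in range(len(data) // 2)]
        # Difference function: pairwise half-differences (high-frequency detail).
        detail = [(data[2 * i] - data[2 * i + 1]) / 2 for i in range(len(data) // 2)]
        coeffs = detail + coeffs   # keep this level's detail coefficients
        data = smooth              # recurse on the smoothed, halved data
    return data + coeffs           # overall average followed by all detail coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]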
Principal components analysis (PCA) is another dimensionality reduction method: it searches for k n-dimensional orthogonal vectors (the principal components) that can best be used to represent the data, where k ≤ n, so that the original data are projected onto a much smaller space. PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data. Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions. Principal components may be used as inputs to multiple regression and cluster analysis. In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
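A sketch of PCA via eigen-decomposition of the covariance matrix, using NumPy; the random 5-dimensional data set is purely illustrative.

import numpy as np

def pca(X, k):
    # Project the n x d data matrix X onto its first k principal components.
    X_centered = X - X.mean(axis=0)            # normalize each attribute to zero mean
    cov = np.cov(X_centered, rowvar=False)     # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # strongest (largest-variance) components first
    components = eigvecs[:, order]
    return X_centered @ components             # reduced n x k representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # illustrative 5-dimensional data
print(pca(X, 2).shape)                         # (100, 2)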
Numerosity Reduction
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples. Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line. For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation

y = wx + b,

where the variance of y is assumed to be constant. In the context of data mining, x and y are numeric database attributes. The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual line separating the data and the estimate of the line. Multiple linear regression is an extension of (simple) linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
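A sketch of fitting y = wx + b by the method of least squares; the x and y values are illustrative.

def least_squares(x, y):
    # Fit y = w*x + b by minimizing the sum of squared errors.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    w = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
    b = mean_y - w * mean_x
    return w, b

x = [1, 2, 3, 4, 5]             # predictor variable (illustrative)
y = [2.1, 3.9, 6.2, 8.0, 9.8]   # response variable
w, b = least_squares(x, y)
print(w, b)                     # slope close to 2, intercept close to 0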
Log-linear models approximate discrete multidimensional probability distributions: they estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces. Log-linear models are therefore also useful for dimensionality reduction (since the lower-dimensional points together typically occupy less space than the original data points) and data smoothing (since aggregate estimates in the lower-dimensional space are less subject to sampling variations than the estimates in the higher-dimensional space).
Regression and log-linear models can both be used on sparse data, although their application may be limited. While both methods can handle skewed data, regression does exceptionally well. Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions.
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each bucket represents only a single attribute–value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
Example 3.3 Histograms. The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 3.7 shows a histogram for the data using singleton buckets. To further reduce the data, it is common to have each bucket denote a continuous value range for the given attribute. In Figure 3.8, each bucket represents a different $10 range for price.
"How are the buckets determined and the attribute values partitioned?" There are several partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., the width of $10 for the buckets in Figure 3.8).
Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples).
V-optimal: If we consider all of the possible histograms for a given number of buckets, the V-optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.
MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair of adjacent values having one of the k − 1 largest differences, where k is the user-specified number of buckets.
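The sketch below buckets the price list from Example 3.3 into an equal-width ($10-wide) histogram and an equal-frequency histogram with four buckets.

from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width histogram: $10-wide buckets, as in Figure 3.8.
width = 10
equal_width = Counter((p - 1) // width for p in prices)   # bucket 0 = 1-10, 1 = 11-20, 2 = 21-30
print({f"${b * width + 1}-${(b + 1) * width}": n for b, n in sorted(equal_width.items())})

# Equal-frequency histogram: 4 buckets with (roughly) the same number of values each.
k = 4
per_bucket = len(prices) // k
equal_freq = [prices[i * per_bucket:(i + 1) * per_bucket] for i in range(k - 1)]
equal_freq.append(prices[(k - 1) * per_bucket:])          # last bucket takes the remainder
print([(b[0], b[-1], len(b)) for b in equal_freq])        # (low, high, count) per bucket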
Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. In data reduction, the cluster representations of the data (e.g., a cluster's centroid and diameter) are used to replace the actual data. The effectiveness of this technique depends on the data's nature: it is much more effective for data that can be organized into distinct clusters than for smeared data.
The use of multidimensional index trees as a form of data reduction relies on an ordering of the attribute values in each dimension. Multidimensional index trees include R-trees, quad-trees, and their variations. They are well suited for handling both sparse and skewed data.
Sampling
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample (or subset). Suppose that a large data set, D, contains N tuples. Let's look at the most common ways that we could sample D for data reduction, as illustrated in Figure 3.9.
1. Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
2. Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, as opposed to N, the data set size. Hence, sampling complexity is potentially sublinear to the size of the data. Other data reduction techniques can require at least one complete pass through D. For a fixed sample size, sampling complexity increases only linearly as the number of data dimensions, n, increases, whereas techniques using histograms, for example, increase exponentially in n.
When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query. It is possible (using the central limit theorem) to determine a sufficient sample size for estimating a given function within a specified degree of error. This sample size, s, may be extremely small in comparison to N. Sampling is a natural choice for the progressive refinement of a reduced data set. Such a set can be further refined by simply increasing the sample size.
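A minimal sketch of SRSWOR and SRSWR using the standard library; D is an illustrative data set of N = 100 tuple identifiers.

import random

def srswor(data, s):
    # Simple random sample WITHOUT replacement of size s.
    return random.sample(data, s)

def srswr(data, s):
    # Simple random sample WITH replacement of size s: a tuple may be drawn again.
    return [random.choice(data) for _ in range(s)]

D = list(range(1, 101))   # a data set of N = 100 tuple ids (illustrative)
print(srswor(D, 10))      # 10 distinct tuples
print(srswr(D, 10))       # 10 tuples, possibly with repeats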
Techniques to generate decision trees for classification can be applied to discretization. Such techniques employ a top-down splitting approach. Unlike the other methods mentioned so far, decision tree approaches to discretization are supervised, that is, they make use of class label information.
For example, we may have a data set of patient symptoms (the attributes) where each patient has an associated diagnosis class label. Class distribution information is used in the calculation and determination of split-points (data values for partitioning an attribute range). Intuitively, the main idea is to select split-points so that a given resulting partition contains as many tuples of the same class as possible. Entropy is the most commonly used measure for this purpose. To discretize a numeric attribute, A, the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization. Such discretization forms a concept hierarchy for A.
The figure shows a concept hierarchy for the attribute price. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.
Manual definition of concept hierarchies can be a tedious and time-consuming task for a user or a domain expert. Fortunately, many hierarchies are implicit within the database schema and can be automatically defined at the schema definition level. The concept hierarchies can be used to transform the data into multiple levels of granularity.
Discretization and Concept Hierarchy Generation for Numeric Data
It is difficult to specify concept hierarchies for numeric attributes due to the wide diversity of possible data ranges and the frequent updates of data values. Such manual specification can also be quite arbitrary.
Concept hierarchies for numeric data can be constructed automatically based on data distribution analysis. We examine the following methods for numeric concept hierarchy generation: binning, histogram analysis, entropy-based discretization, and data segmentation by "natural partitioning."
1. Binning
Binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
2. Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. Histograms were introduced before. A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or bins. Various partitioning rules can be used to define histograms. In an equal-width histogram, for example, the values are partitioned into equal-size partitions or ranges (e.g., earlier in the figure for price, where each bucket has a width of $10). With an equal-frequency histogram, the values are partitioned so that, ideally, each partition contains the same number of data tuples. The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of concept levels has been reached.
A minimum interval size can also be used per level to control the recursive procedure. This specifies the minimum width of a partition, or the minimum number of values for each partition at each level. Histograms can also be partitioned based on cluster analysis of the data distribution.
3. Entropy-Based Discretization
An information-based measure called entropy can be used to recursively partition the values of a numeric attribute A, resulting in a hierarchical discretization. Such a discretization forms a numerical concept hierarchy for the attribute. Given a set of data tuples S, the basic method for entropy-based discretization is as follows:
1. Each value of A can be considered a potential interval boundary or threshold T. For example, a value v of A can partition S into two subsets satisfying the conditions A < v and A ≥ v, thereby creating a binary discretization.
2. Given S, the threshold value selected is the one that maximizes the information gain resulting from the subsequent partitioning, or equivalently minimizes the expected information requirement

I(S,T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2),

where S1 and S2 are the subsets of S satisfying the conditions A < T and A ≥ T, respectively. Given m classes, the entropy of S1 is

Ent(S1) = − Σ (i = 1 to m) pi log2(pi),

where pi is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1. The value of Ent(S2) can be calculated similarly.
3. The process of determining a threshold value is recursively applied to each partition obtained until some stopping criterion is met, such as

Ent(S) − I(S,T) < δ.
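The sketch below selects a single best split point T by minimizing I(S,T) as defined above (equivalently, maximizing the information gain). The recursive application and the stopping criterion are omitted for brevity, and the (value, class) pairs are illustrative.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    # samples: list of (value, class_label) pairs for attribute A.
    # Returns the split point T minimizing I(S,T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2).
    n = len(samples)
    best_t, best_info = None, float("inf")
    for t, _ in samples:
        s1 = [c for v, c in samples if v < t]
        s2 = [c for v, c in samples if v >= t]
        if not s1 or not s2:
            continue
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if info < best_info:
            best_t, best_info = t, info
    gain = entropy([c for _, c in samples]) - best_info
    return best_t, gain

data = [(5, "low"), (7, "low"), (9, "low"), (20, "high"), (25, "high"), (30, "high")]
print(best_split(data))   # T = 20 separates the classes perfectly; gain = 1.0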
Although binning, histogram analysis, clustering, and entropy-based discretization are useful in the generation of numerical hierarchies, many users would like to see numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear intuitive or "natural." For example, annual salaries broken into ranges like [$50,000, $60,000) are often more desirable than ranges like [$51,263.98, $60,872.34) obtained by some sophisticated clustering analysis.
4. Segmentation by Natural Partitioning: The 3-4-5 Rule
The 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals. The rule partitions a given range of data into 3, 4, or 5 relatively equal-width intervals, recursively and level by level, based on the value range at the most significant digit. The rule is as follows:
1) If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9, and 3 intervals in the grouping of 2-3-2 for 7).
2) If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equal-width intervals.
3) If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.
This rule can be recursively applied to each interval, creating a concept hierarchy for the given numeric attribute.
The following example illustrates the use of the 3-4-5 rule for the automatic construction of a
numerical hierarchy.
Ex: Suppose that profits at different branches of AllElectronics for the year 2004 cover a wide range, from −$351,976.00 to $4,700,896.50. A user desires the automatic generation of a concept hierarchy for profit. For improved readability, we use the notation (l...r] to represent the interval (l, r]. For example, (−$1,000,000...$0] denotes the range from −$1,000,000 (exclusive) to $0 (inclusive).
Suppose that the data within the 5th percentile and 95th percentile are between −$159,876 and $1,838,761. The results of applying the 3-4-5 rule are shown in the figure.
1. Based on the above information, the minimum and maximum values are MIN = −$351,976.00 and MAX = $4,700,896.50. The low (5th percentile) and high (95th percentile) values to be considered for the top or first level of discretization are LOW = −$159,876 and HIGH = $1,838,761.
2. Given LOW and HIGH, the most significant digit (msd) is at the million-dollar digit position (i.e., msd = 1,000,000). Rounding LOW down to the million-dollar digit, we get LOW' = −$1,000,000; rounding HIGH up to the million-dollar digit, we get HIGH' = +$2,000,000.
3. Since this interval ranges over three distinct values at the most significant digit, that is, (2,000,000 − (−1,000,000))/1,000,000 = 3, the segment is partitioned into three equal-width subsegments according to the 3-4-5 rule: (−$1,000,000...$0], ($0...$1,000,000], and ($1,000,000...$2,000,000]. This represents the top tier of the hierarchy.
4. We now examine the MIN and MAX values to see how they "fit" into the first-level partitions. Since the first interval, (−$1,000,000...$0], covers the MIN value, that is, LOW' < MIN, we can adjust the left boundary of this interval to make the interval smaller. The most significant digit of MIN is at the hundred-thousand digit position. Rounding MIN down to this position, we get MIN' = −$400,000. Therefore, the first interval is redefined as (−$400,000...$0].
Since the last interval, ($1,000,000...$2,000,000], does not cover the MAX value, that is, MAX > HIGH', we need to create a new interval to cover it. Rounding up MAX at its most significant digit position, the new interval is ($2,000,000...$5,000,000]. Hence, the topmost level of the hierarchy contains four partitions: (−$400,000...$0], ($0...$1,000,000], ($1,000,000...$2,000,000], and ($2,000,000...$5,000,000].
5. Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next lower level of the hierarchy: The first interval, (−$400,000...$0], is partitioned into 4 subintervals: (−$400,000...−$300,000], (−$300,000...−$200,000], (−$200,000...−$100,000], and (−$100,000...$0]. The second interval, ($0...$1,000,000], is partitioned into 5 subintervals: ($0...$200,000], ($200,000...$400,000], ($400,000...$600,000], ($600,000...$800,000], and ($800,000...$1,000,000]. The third interval, ($1,000,000...$2,000,000], is partitioned into 5 subintervals: ($1,000,000...$1,200,000], ($1,200,000...$1,400,000], ($1,400,000...$1,600,000], ($1,600,000...$1,800,000], and ($1,800,000...$2,000,000]. The last interval, ($2,000,000...$5,000,000], covers 3 distinct values at its most significant digit and is partitioned into 3 subintervals: ($2,000,000...$3,000,000], ($3,000,000...$4,000,000], and ($4,000,000...$5,000,000].
Similarly, the 3-4-5 rule can be carried on iteratively at deeper levels, as necessary.
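A sketch of the core partitioning step of the 3-4-5 rule (corresponding to steps 1 through 3 of the example above). The MIN/MAX boundary adjustments of steps 4 and 5 are omitted, and the 2-3-2 grouping for 7 distinct values is approximated by 3 equal-width intervals.

from math import ceil, floor, log10

def partition_3_4_5(low, high):
    # Partition (low, high] into 3, 4, or 5 equal-width intervals, chosen by the
    # number of distinct values the rounded range covers at its most significant digit.
    msd = 10 ** floor(log10(max(abs(low), abs(high))))       # most significant digit position
    lo, hi = floor(low / msd) * msd, ceil(high / msd) * msd  # round low down and high up
    distinct = round((hi - lo) / msd)
    if distinct in (3, 6, 7, 9):   # the rule uses a 2-3-2 grouping for 7; approximated here
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    else:                          # 1, 5, or 10 distinct values
        k = 5
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

# LOW = -$159,876 and HIGH = $1,838,761 from the worked example above.
for interval in partition_3_4_5(-159876, 1838761):
    print(interval)   # the three top-tier segments: (-1,000,000, 0], (0, 1,000,000], (1,000,000, 2,000,000]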
Concept Hierarchy Generation for Categorical Data
We now look at data transformation for categorical data. In particular, we study concept hierarchy generation for categorical attributes. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Examples include geographic location, job category, and item type.
We study four methods for the generation of concept hierarchies for nominal data, as follows.
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: Concept hierarchies for nominal attributes or dimensions typically involve a group of attributes. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level. For example, suppose that a relational database contains the following group of attributes: street, city, province or state, and country. Similarly, a data warehouse location dimension may contain the same attributes. A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. On the contrary, we can easily specify explicit groupings for a small portion of intermediate-level data. For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as "{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada" and "{British Columbia, prairies Canada} ⊂ Western Canada."
3. Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
"Without knowledge of data semantics, how can a hierarchical ordering for an arbitrary set of nominal attributes be found?" Consider the observation that since higher-level concepts generally cover several subordinate lower-level concepts, an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street). Based on this observation, a concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest hierarchy level. The lower the number of distinct values an attribute has, the higher it is in the generated concept hierarchy.
A concept hierarchy for location can be generated automatically, as illustrated in the figure. First, sort the attributes in ascending order based on the number of distinct values in each attribute. This results in the following (where the number of distinct values per attribute is shown in parentheses): country (15), province or state (365), city (3567), and street (674,339). A short sketch of this distinct-value heuristic is given after this list of methods.
Second, generate the hierarchy from the top down according to the sorted order, with the first attribute at the top level and the last attribute at the bottom level. Finally, the user can examine the generated hierarchy, and when necessary, modify it to reflect desired semantic relationships among the attributes. In this example, it is obvious that there is no need to modify the generated hierarchy. Note that this heuristic rule is not foolproof. For example, a time dimension in a database may contain 20 distinct years, 12 distinct months, and 7 distinct days of the week. However, this does not suggest that the time hierarchy should be "year < month < days of the week," with days of the week at the top of the hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be careless when defining a hierarchy, or have only a vague idea about what should be included in a hierarchy. Consequently, the user may have included only a small subset of the relevant attributes in the hierarchy specification. For example, instead of including all of the hierarchically relevant attributes for location, the user may have specified only street and city. To handle such partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together. In this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to be "dragged in" to form a complete hierarchy. Users, however, should have the option to override this feature, as necessary.
Example 3.8. Suppose that a data mining expert (serving as an administrator) has pinned together the five attributes number, street, city, province or state, and country, because they are closely linked semantically regarding the notion of location. If a user were to specify only the attribute city for a hierarchy defining location, the system can automatically drag in all five semantically related attributes to form a hierarchy. The user may choose to drop any of these attributes (e.g., number and street) from the hierarchy, keeping city as the lowest conceptual level.
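Returning to the distinct-value heuristic of method 3, here is a small sketch for the location example: attributes with fewer distinct values are placed higher in the generated hierarchy.

# Distinct-value counts per attribute, as in the location example above.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Fewer distinct values -> higher level in the generated hierarchy.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country

The generated ordering matches the schema-level hierarchy street < city < province or state < country given earlier, which is why the heuristic usually needs no manual correction for attribute sets of this kind.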
In summary, information at the schema level and on attribute–value counts can be used to generate concept hierarchies for nominal data. Transforming nominal data with the use of concept hierarchies allows higher-level knowledge patterns to be found. It allows mining at multiple levels of abstraction, which is a common requirement for data mining applications.