Data Mining
Ans:- Data mining is one of the most useful techniques that helps us extract valuable
information from huge sets of data. Data mining is also called Knowledge Discovery in
Databases (KDD).
The process of extracting information from huge sets of data in order to identify
patterns, trends, and useful knowledge is called Data Mining.
Data mining functionality can be divided into two categories:
o Descriptive
o Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions –
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and
printers, and concepts of customers include big spenders and budget spenders.
Such descriptions of a class or a concept are called class/concept descriptions.
Mining of Association
Association mining is used in retail sales to identify items that are frequently
purchased together.
Mining of Clusters
A cluster refers to a group of objects of a similar kind. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the
objects in other clusters.
2. Prediction:
It is used to predict missing or unavailable numerical data values rather than class
labels.
3.Outlier Analysis:
Outliers may be defined as the data objects that do not comply with the general
behavior or model of the data available.
4.Evolution Analysis:
Evolution analysis refers to describing and modeling regularities or trends for
objects whose behavior changes over time.
Different processes:
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources
and in different formats, it can't be used directly for the data mining procedure
because the data may not be complete and accurate. Several methods may be
performed on the data as part of selection, integration, and cleaning.
In other words, we can say these data sources are the root of our data mining architecture.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be
helpful to guide the search or evaluate the interestingness of the result patterns. The data
mining engine may receive inputs from the knowledge base to make the results more
accurate and reliable.
Operational systems vs. data warehousing systems:
o Operational systems are designed to support high-volume transaction processing, whereas data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
o Operational systems are usually concerned with current data, whereas data warehousing systems are usually concerned with historical data.
o Data within operational systems are mainly updated regularly according to need, whereas data in a warehouse is non-volatile: new data may be added regularly, but once added it is rarely changed.
o Operational systems are designed for real-time business dealings and processes, whereas data warehousing systems are designed for analysis of business measures by subject area, categories, and attributes.
o Operational systems are widely process-oriented, whereas data warehousing systems are widely subject-oriented.
o Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data, whereas data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.
o Relational databases are created for On-Line Transaction Processing (OLTP), whereas data warehouses are designed for On-Line Analytical Processing (OLAP).
DATA MART:
A data mart is a smaller form of data warehouse that serves specific data analysis
needs. It is usually derived as a small part of a bigger data warehouse.
Reasons for creating a data mart
o Provides a collective view of data for a specific group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential clients are more clearly defined than in a comprehensive data warehouse
o It contains only essential business data and is less cluttered.
1. Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems),
the reconciled layer, and the data warehouse layer (containing both data warehouses and
data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data
model for a whole enterprise. At the same time, it separates the problems of source data
extraction and integration from those of data warehouse population. In some cases, the
reconciled layer is also used directly to better accomplish some operational tasks, such as
producing daily reports that cannot be satisfactorily prepared using the corporate
applications, or generating data flows to feed external processes periodically so as to benefit
from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A
disadvantage of this structure is the extra file storage space used by the redundant
reconciled layer. It also places the analytical tools a little further away from being
real-time.
ROLAP vs. MOLAP vs. HOLAP:
o Storage location for summary aggregation: in ROLAP, a relational database is used; in MOLAP, a multidimensional database is used; in HOLAP, a multidimensional database is used.
o Processing time: ROLAP is very slow; MOLAP is fast; HOLAP is fast.
o Storage space requirement: ROLAP has a large storage space requirement compared to MOLAP and HOLAP; MOLAP has a medium storage space requirement compared to ROLAP and HOLAP; HOLAP has a small storage space requirement compared to MOLAP and ROLAP.
o Storage location for detail data: in ROLAP, a relational database is used; in MOLAP, a multidimensional database is used; in HOLAP, a relational database is used.
Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end
tools. To store and manage warehouse data, ROLAP uses a relational or extended-
relational DBMS.
ROLAP includes the following −
o Implementation of aggregation navigation logic
o Optimization for each DBMS back end
o Additional tools and services
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional
views of data. With multidimensional data stores, the storage utilization may be low
if the data set is sparse. Therefore, many MOLAP servers use two levels of data
storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow
large volumes of detailed data to be stored in a relational store, while the
aggregations are stored separately in a MOLAP store.
Data Cube
When data is grouped or combined into multidimensional matrices, the result is called a
Data Cube. The data cube method has a few alternative names or variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."
For example, a relation with the schema sales(part, supplier, customer, sale-price)
can be materialized into a set of eight views as shown in the figure,
where psc indicates a view consisting of aggregate function values (such as total-
sales) computed by grouping the three attributes part, supplier, and
customer; p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, and so on.
A data cube is created from a subset of attributes in the database. Specific attributes
are chosen to be measure attributes, i.e., the attributes whose values are of interest.
Other attributes are selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions.
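To make the eight views concrete, here is a minimal sketch in Python (using pandas); the sample rows, the column names, and the choice of a total-sales sum as the aggregate are illustrative assumptions rather than data from the text.

# Materialize all 2^3 = 8 group-by views (cuboids) of the
# sales(part, supplier, customer, sale_price) relation described above.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [100, 150, 200, 250],
})

dimensions = ["part", "supplier", "customer"]
views = {}

# One view per subset of dimensions: psc, ps, pc, sc, p, s, c, and the empty
# grouping (the grand total, i.e. the apex cuboid).
for r in range(len(dimensions) + 1):
    for dims in combinations(dimensions, r):
        if dims:
            views[dims] = sales.groupby(list(dims))["sale_price"].sum()
        else:
            views[dims] = sales["sale_price"].sum()  # total sales over all dimensions

for dims, view in views.items():
    print(dims or "(all)", "->")
    print(view, "\n")

Each subset of {part, supplier, customer} yields one view, which is why a cube with three dimension attributes produces 2^3 = 8 group-by combinations.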
The figure shows the essential elements of a typical warehouse. The Source
Data component is shown on the left. The Data Staging element serves as the next
building block. In the middle, we see the Data Storage component that handles the
data warehouse's data. This element not only stores and manages the data; it also
keeps track of the data using the metadata repository. The Information Delivery
component, shown on the right, consists of all the different ways of making the
information from the data warehouse available to the users.
Production Data: This type of data comes from the different operational systems of
the enterprise. Based on the data requirements in the data warehouse, we choose
segments of the data from the various operational systems.
Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part
of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business.
In every operational system, we periodically take the old data and store it in archived
files.
External Data: Most executives depend on information from external sources for a
large percentage of the information they use. They use statistics relating to their
industry produced by external agencies.
We will now discuss the three primary functions that take place in the staging area: data extraction, data transformation, and data loading.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a
database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information
about the indexes, and so on.
Data Marts
Data marts are smaller than data warehouses and usually contain data that relates to a
single part of the organization. The current trend in data warehousing is to develop a
data warehouse with several smaller related data marts for particular kinds of queries
and reports.
Smoothing:
Smoothing is used to remove noise from the data, using techniques such as binning,
regression, and clustering.
Aggregation:
Data aggregation is the process of gathering data from a number of sources and storing it in
a single, unified format. Aggregation itself improves the quality of the data, since it helps
gather information about data clusters and brings together large amounts of data.
Discretization:
Discretization is a transformation method that breaks up continuous data into small
intervals. Although real-world data is often continuous, many existing mining frameworks
can only handle discrete chunks of data, so continuous attributes are discretized before mining.
Attribute construction:
In attribute construction, new attributes are generated from the existing set of attributes and
applied in the mining process. It improves mining efficiency by simplifying the original data.
Generalization:
Generalization is used to convert low-level data attributes to high-level data attributes using
a concept hierarchy. An example: ages in the numerical form of raw data (22, 52) are
converted into the categorical values (young, old).
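As a minimal sketch of this kind of discretization/generalization (assuming pandas; the bin edges and the Young/Old labels are assumptions chosen to mirror the (22, 52) example):

import pandas as pd

# Illustrative raw ages; the bin edges (0-35, 35-120) and the labels
# ("Young", "Old") are assumptions for demonstration.
ages = pd.Series([22, 52, 17, 41, 66])

age_groups = pd.cut(
    ages,
    bins=[0, 35, 120],          # break the continuous range into intervals
    labels=["Young", "Old"],    # map each interval to a higher-level concept
)
print(age_groups.tolist())       # ['Young', 'Old', 'Young', 'Old', 'Old']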
Normalization:
Normalization scales attribute values so that they fall within a small, specified range, such
as 0.0 to 1.0 (for example, via min-max or z-score normalization).
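A minimal min-max normalization sketch in plain Python; the sample values and the 0.0 to 1.0 target range are assumptions for illustration:

# Min-max normalization: rescale values into the range [0.0, 1.0].
values = [200, 400, 800, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)  # [0.0, 0.25, 0.75, 1.0]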
Another example is web analytics, where we gather statistics about website visitors. For
example, all visitors who visit the site from an IP address in India are shown under
the country level India.
A particular city can be mapped to the country it belongs to. For example, New Delhi can be
mapped to India, and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts at the top with some general information and
ends at the bottom with specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with some specialized
information and ends at the top with generalized information.
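A minimal sketch of such a concept hierarchy in Python; the particular city -> country -> continent entries are illustrative assumptions built from the New Delhi example above:

# Bottom-up style lookup over a simple concept hierarchy:
# city -> country -> continent.
city_to_country = {"New Delhi": "India", "Mumbai": "India", "Tokyo": "Japan"}
country_to_continent = {"India": "Asia", "Japan": "Asia"}

def generalize(city):
    """Map a city up the hierarchy to its country and continent."""
    country = city_to_country[city]
    continent = country_to_continent[country]
    return country, continent

print(generalize("New Delhi"))  # ('India', 'Asia')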
Transactional data from Table 6.1 (excerpt):
TID     List of item IDs
T200    I2, I4
T300    I2, I3
T500    I1, I3
T600    I2, I3
T700    I1, I3
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong association rules
satisfy both minimum support and minimum confidence). This can be done
using Eq. (6.4) for confidence, which we show again here for completeness:

confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)

where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and
support_count(A) is the number of transactions containing the itemset A. Based on this
equation, association rules can be generated as follows: for every nonempty subset s of a
frequent itemset l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) ≥
min_conf, where min_conf is the minimum confidence threshold.
Because the rules are generated from frequent itemsets, each one automatically satisfies
the minimum support. Frequent itemsets can be stored ahead of time in hash tables along
with their counts so that they can be accessed quickly.
Example 6.4 Generating association rules. Let's try an example based on the transactional
data for AllElectronics shown before in Table 6.1. The data contain the frequent itemset
X = {I1, I2, I5}. What are the association rules that can be generated from X? The nonempty
subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules
are of the form s ⇒ (X − s): {I1, I2} ⇒ I5, {I1, I5} ⇒ I2, {I2, I5} ⇒ I1, I1 ⇒ {I2, I5},
I2 ⇒ {I1, I5}, and I5 ⇒ {I1, I2}, each with confidence support_count(X) / support_count(s).
If the minimum confidence threshold is, say, 70%, then only the second, third, and
last rules are output, because these are the only ones generated that are strong. Note
that, unlike conventional classification rules, association rules can contain more than
one conjunct in the right-hand side of the rule.
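To make the rule-generation step concrete, here is a minimal Python sketch; the support counts in the dictionary are illustrative assumptions (not the full Table 6.1 counts), and the function simply applies the s ⇒ (l − s) test described above.

from itertools import combinations

# Illustrative support counts for the frequent itemset {I1, I2, I5} and its
# subsets; these numbers are assumptions for demonstration only.
support_count = {
    frozenset(["I1", "I2", "I5"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2,
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I5"]): 2,
}

def generate_rules(l, min_conf):
    """Output every rule s => (l - s) whose confidence meets min_conf."""
    l = frozenset(l)
    rules = []
    for size in range(1, len(l)):               # all nonempty proper subsets of l
        for s in combinations(l, size):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

for lhs, rhs, conf in generate_rules({"I1", "I2", "I5"}, min_conf=0.70):
    print(f"{sorted(lhs)} => {sorted(rhs)} (confidence {conf:.0%})")

With these assumed counts, three of the six rules meet the 70% threshold.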
Apriori Algorithm
The Apriori algorithm is used to calculate association rules between objects: it describes how
two or more objects are related to one another. In other words, the Apriori algorithm is an
association rule learning algorithm that analyzes whether people who bought product A also
tended to buy product B. The Apriori algorithm helps customers buy their products with ease
and increases the sales performance of the particular store.
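Before turning to the measures listed below (support, confidence, and lift), here is a minimal frequent-itemset sketch of the Apriori idea in Python; the mini transaction set and the minimum support count of 2 are illustrative assumptions:

from itertools import combinations

def apriori(transactions, min_support_count):
    """Minimal level-wise Apriori: find all frequent itemsets by
    generating size-k candidates from frequent (k-1)-itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict((s, counts[s]) for s in frequent)

    k = 2
    while frequent:
        # Candidate generation: unions of (k-1)-itemsets that have size k.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset (Apriori property).
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        # Count the support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support_count}
        all_frequent.update((c, counts[c]) for c in frequent)
        k += 1
    return all_frequent

# Hypothetical mini transaction set, just to show the call.
txns = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]
print(apriori(txns, min_support_count=2))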
There are three major components of the Apriori algorithm:
o Support
o Confidence
o Lift
Support-- Support refers to the default popularity of any product. You find the support by
dividing the number of transactions containing that product by the total number of
transactions. Hence, we get Support(Biscuits) = (Transactions containing biscuits) /
(Total transactions) = 400/4000 = 10 percent.
Confidence-- Confidence refers to the likelihood that customers who bought biscuits also
bought chocolates. To get the confidence, you divide the number of transactions that comprise
both biscuits and chocolates by the number of transactions involving biscuits. Hence,
Confidence = (Transactions containing both biscuits and chocolates) / (Total transactions
involving biscuits) = 200/400 = 50 percent. It means that 50 percent of the customers who
bought biscuits also bought chocolates.
Lift-- Continuing the above example, lift refers to the increase in the ratio of the sale of
chocolates when you sell biscuits. Here, Lift = Confidence(Biscuits → Chocolates) /
Support(Biscuits) = 50/10 = 5. It means that the probability of people buying both biscuits
and chocolates together is five times higher than that of purchasing biscuits alone. If the lift
value is below one, it indicates that people are unlikely to buy both items together; the larger
the value, the better the combination.
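A minimal sketch that reproduces the arithmetic above; the transaction counts are the ones given in the text, and the lift calculation follows the Confidence/Support form used above:

# Worked numbers from the biscuits/chocolates example above.
total_transactions = 4000
biscuit_transactions = 400
biscuit_and_chocolate = 200

support_biscuits = biscuit_transactions / total_transactions   # 0.10
confidence = biscuit_and_chocolate / biscuit_transactions      # 0.50
lift = confidence / support_biscuits                           # 5.0

print(f"Support(Biscuits) = {support_biscuits:.0%}")
print(f"Confidence        = {confidence:.0%}")
print(f"Lift              = {lift:g}")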
The Apriori algorithm is an expensive way to find support, since the calculation has to pass
through the whole database. Some methods to improve the efficiency of the Apriori
algorithm are:
o Transaction reduction
o Partitioning
o Sampling
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for frequent
itemsets in S instead of D.
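A minimal sketch of the sampling idea, assuming some frequent-itemset mining routine is available (for example, the apriori() sketch shown earlier); the 10% sample fraction and the support threshold are illustrative assumptions:

import random

def mine_on_sample(D, find_frequent_itemsets, sample_fraction=0.10,
                   min_support_count=2):
    """Pick a random sample S of D and search for frequent itemsets in S
    instead of D. Typically a lower support threshold is used on the sample
    to reduce the chance of missing frequent itemsets."""
    sample_size = max(1, int(len(D) * sample_fraction))
    S = random.sample(D, sample_size)            # the random sample S
    return find_frequent_itemsets(S, min_support_count)

# Usage (assuming the apriori() sketch and txns list shown earlier):
# frequent = mine_on_sample(txns, apriori)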
A single-dimensional or intra-dimensional association rule contains a single distinct predicate
(e.g., buys) with multiple occurrences, i.e., the predicate occurs more than once within the rule.
b) What is the difference between clustering and prediction? List out some
of the issues regarding clustering and prediction.
What is Clustering?
Clustering is the process of grouping abstract objects into classes of
similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different
groups.
K-means clustering allows us to cluster the data into different groups and provides a
convenient way to discover the categories of groups in an unlabeled dataset on its own,
without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until the cluster assignments stop changing, i.e., until it
has found the best clusters. The value of k should be predetermined in this algorithm.
The k-means algorithm mainly performs two tasks:
o Determines the best values for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are nearest to
a particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is kept away from
the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
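In place of the diagram, here is a minimal K-means sketch in plain Python; the sample 2-D points and the choice of k = 2 are illustrative assumptions:

import random

def kmeans(points, k, iterations=100):
    """Minimal K-means: pick k centroids, assign each point to its closest
    centroid, recompute the centroids, and repeat until they stop moving."""
    centroids = random.sample(points, k)          # initial k center points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)         # keep the centroid of an empty cluster
        if new_centroids == centroids:            # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Illustrative 2-D points; k = 2 is an assumption for the demo.
data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centers, groups = kmeans(data, k=2)
print(centers)
print(groups)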
K-Medoids clustering