DWM cheatsheet sem 5
Data Mining vs Web Mining
• Data mining refers to the process of extracting useful information, patterns, and trends from huge data sets. Web mining refers to the process of extracting information from web documents and services, hyperlinks, and server logs.
• Data mining can be done by data engineers and data scientists. Web mining can be done by data scientists, data engineers, and data analysts.
• Data mining is based on pattern identification from data available in any system. Web mining is based on pattern identification from web data.
• Tools used in data mining are machine learning algorithms. Tools used in web mining are PageRank, Scrapy, and Apache logs.
• Applications of data mining are weather forecasting, market analysis, fraud detection, etc. Web mining uses the same process but on the web, using web documents.
• Skills needed for data mining are machine learning algorithms, probability, and statistics. Skills needed for web mining are application-level knowledge, probability, and statistics.
Classification vs Prediction
• Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. Prediction is the process of identifying the missing or unavailable numerical data for a new observation.
• In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.
• In classification, the model can be known as the classifier. In prediction, the model can be known as the predictor.
• A model or classifier is constructed to find the categorical labels. A model or predictor is constructed that predicts a continuous-valued function or ordered value.
• For example, the grouping of patients based on their medical records can be considered a classification. Prediction can be thought of as predicting the correct treatment for a particular disease for a person.
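To make the contrast concrete, here is a minimal sketch (not part of the notes; scikit-learn, the toy features, and the labels are illustrative assumptions): a classifier returns a categorical label, while a predictor (regressor) returns a continuous value.

```python
# Minimal sketch contrasting a classifier and a predictor on tiny made-up data.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy patient records: [age, blood_pressure]
X = [[25, 120], [40, 135], [60, 150], [35, 128], [70, 160]]

# Classification target: categorical label ("low" / "high" risk)
y_class = ["low", "low", "high", "low", "high"]
classifier = DecisionTreeClassifier().fit(X, y_class)
print(classifier.predict([[55, 148]]))   # -> a class label, e.g. ['high']

# Prediction target: continuous value (e.g. recovery time in days)
y_value = [5.0, 7.5, 14.0, 6.0, 18.0]
predictor = DecisionTreeRegressor().fit(X, y_value)
print(predictor.predict([[55, 148]]))    # -> a numeric estimate
```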
1) Top-down approach:
• Since the data marts are created from the data warehouse, it provides a consistent dimensional view of the data marts.
• Creating data marts from the data warehouse is easy.
Disadvantages of Top-Down Approach –
The cost and time taken in designing and maintaining it are very high.
2) Bottom-up approach:
1. First, the data is extracted from external sources (same as happens in the top-down approach).
2. Then, the data goes through the staging area (as explained above) and is loaded into data marts instead of the data warehouse. The data marts are created first and provide reporting capability. Each data mart addresses a single business area.
This approach is given by Kimball: data marts are created first and provide a thin view for analysis, and the data warehouse is created after the complete data marts have been created.
Advantages of Bottom-Up Approach –
1. As the data marts are created first, reports are generated quickly.
2. We can accommodate more data marts here, and in this way the data warehouse can be extended.
3. Also, the cost and time taken in designing this model are comparatively low.
• It all starts when the user puts up a data mining request; the request is then sent to the data mining engine for pattern evaluation.
• These applications try to find the solution to the query using the already present database.
• The metadata that is then extracted is sent for proper analysis to the data mining engine, which sometimes interacts with pattern evaluation modules to determine the result.
• This result is then sent to the front end in an easily understandable manner using a suitable interface.
6) Intrusion Detection:
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration.
4) Major issues in data mining
Mining Methodology
• As there are diverse applications, new mining tasks continue to emerge. These tasks
can use the same database in different ways and require the development of new
data mining techniques.
• While searching for knowledge in large datasets, we need to explore
multidimensional space. To find interesting patterns, various combinations of
dimensions need to be applied.
• Uncertain, noisy, and incomplete data can sometimes lead to erroneous derivations.
User Interaction:
• The construction of effective and efficient data analysis tools for diverse applications, covering a wide spectrum of data types from unstructured data, temporal data, hypertext, multimedia data, and software program code, remains a challenging and active area of research.
Social Impact:
• The disclosure of how data is used, the potential violation of individual privacy, and the protection of rights are areas of concern that need to be addressed.
5) What is Data Preprocessing & its Steps in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. It is a way of converting raw data into a much-desired, clean form so that useful information can be derived from it and fed into the training model. Data preprocessing is essential before the data is actually used: the dataset is preprocessed in order to check for missing values, noisy data, and other inconsistencies before passing it to the algorithm.
Steps Involved in Data Preprocessing:
1) Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
a) Missing Data: This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
2. Fill in the missing values: There are various ways to do this. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value.
b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways (see the sketch after this list):
1. Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately: one can replace all data in a segment by its mean, or boundary values can be used to complete the task.
2. Regression: Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
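A minimal sketch of two of these cleaning ideas, mean-filling a missing value and bin-mean smoothing; the data values are made up for illustration:

```python
# Minimal sketch of two cleaning steps on made-up data.

# 1) Fill a missing value with the attribute mean.
ages = [25, 30, None, 40, 45]
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)
ages = [mean_age if a is None else a for a in ages]
print(ages)  # None replaced by 35.0

# 2) Bin-mean smoothing: sort, split into equal-size bins, replace each bin by its mean.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]
prices.sort()
bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_ = prices[i:i + bin_size]
    bin_mean = sum(bin_) / len(bin_)
    smoothed.extend([bin_mean] * len(bin_))
print(smoothed)  # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```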
2) Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. It involves the following ways:
1. Normalization: It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (see the sketch after this list).
2. Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
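A minimal sketch of min-max normalization into the 0.0–1.0 range and a simple discretization of a numeric attribute into interval labels; the salary values and cut-offs are illustrative assumptions:

```python
# Min-max normalization into [0.0, 1.0] and a simple discretization, on made-up values.

salaries = [30000, 45000, 60000, 90000, 120000]

# Min-max normalization: (x - min) / (max - min)
lo, hi = min(salaries), max(salaries)
normalized = [(x - lo) / (hi - lo) for x in salaries]
print(normalized)  # [0.0, 0.166..., 0.333..., 0.666..., 1.0]

# Discretization: replace raw values by interval (conceptual) labels.
def discretize(x):
    if x < 50000:
        return "low"
    elif x < 100000:
        return "medium"
    return "high"

print([discretize(x) for x in salaries])  # ['low', 'low', 'medium', 'medium', 'high']
```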
3) Data Reduction:
Data mining is a technique used to handle huge amounts of data, and when working with such volumes, analysis becomes harder. To get rid of this, we use data reduction techniques. They aim to increase storage efficiency and reduce data storage and analysis costs. The various steps of data reduction are:
1. Data Cube Aggregation: Aggregation operations are applied to the data for the construction of the data cube.
2. Attribute Subset Selection: Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute with a p-value greater than the significance level can be discarded.
3. Numerosity Reduction: This enables storing a model of the data instead of the whole data, for example, regression models.
4. Dimensionality Reduction: This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a small PCA sketch follows this list.
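A minimal PCA sketch, assuming scikit-learn is available and using made-up 3-dimensional points, that reduces the data to 2 dimensions:

```python
# Dimensionality reduction with PCA on tiny made-up 3-D data (assumes scikit-learn).
from sklearn.decomposition import PCA

X = [
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.2],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
]

pca = PCA(n_components=2)        # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X) # each row is now described by 2 values instead of 3
print(X_reduced.shape)                # (5, 2)
print(pca.explained_variance_ratio_)  # how much variance each kept component explains
```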
6) Various OLAP Models and Their Architecture
There are 3 main types of OLAP servers, as follows:
I. Relational OLAP (ROLAP) – Star Schema based –
ROLAP is based on the premise that data need not be stored multidimensionally in order to be viewed multidimensionally, and that it is possible to exploit the well-proven relational database technology to handle the multidimensionality of data. In ROLAP, data is stored in a relational database. In essence, each action of slicing and dicing is equivalent to adding a “WHERE” clause to the SQL statement. ROLAP can handle large amounts of data and can leverage functionalities inherent in the relational database.
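A minimal sketch of the “slice = WHERE clause” idea, assuming Python's built-in sqlite3 and a made-up sales fact table:

```python
# ROLAP idea: slicing a cube == adding a WHERE clause on a relational fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("East", 2022, 100.0), ("East", 2023, 150.0),
     ("West", 2022, 80.0), ("West", 2023, 120.0)],
)

# Full view: total sales by region and year.
print(conn.execute(
    "SELECT region, year, SUM(amount) FROM sales_fact GROUP BY region, year"
).fetchall())

# Slice on the 'year' dimension: the same query with an extra WHERE clause.
print(conn.execute(
    "SELECT region, SUM(amount) FROM sales_fact WHERE year = 2023 GROUP BY region"
).fetchall())
```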
Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which can be in various formats like relational databases, NoSQL, XML, and flat files, into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, and not directly in the data warehouse, because the extracted data is in various formats and can also be corrupted. Loading it directly into the data warehouse may damage it, and rollback would be much more difficult. Therefore, this is one of the most important steps of the ETL process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve several processes/tasks.
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is loaded after longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
The ETL process can also use the pipelining concept: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. And while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
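A minimal sketch of the pipelining idea, assuming Python generators and made-up records, so that extraction, transformation, and loading overlap record by record:

```python
# Pipelined ETL sketch with generators: each record flows through extract ->
# transform -> load without waiting for the whole previous stage to finish.

def extract():
    # Stand-in for reading from source systems (values are made up).
    for raw in ["  Alice,100 ", " bob,200", "CAROL,300 "]:
        yield raw

def transform(records):
    # Convert each raw record into a single standard format.
    for raw in records:
        name, amount = raw.strip().split(",")
        yield {"name": name.strip().title(), "amount": int(amount)}

def load(records):
    # Stand-in for inserting into the warehouse.
    warehouse = []
    for rec in records:
        warehouse.append(rec)
    return warehouse

print(load(transform(extract())))
# [{'name': 'Alice', 'amount': 100}, {'name': 'Bob', 'amount': 200}, {'name': 'Carol', 'amount': 300}]
```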
8) What is Fact Constellation Schema?
A fact constellation means two or more fact tables sharing one or more dimension tables. It is also called a Galaxy schema.
FP-Tree Structure
The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database. Han defines the FP-tree as the tree structure below:
1. One root labeled “null”, with a set of item-prefix subtrees as children, and a frequent-item-header table (presented on the left side of Figure 1).
2. Each node in the item-prefix subtree consists of three fields:
• Item-name: registers which item is represented by the node;
• Count: the number of transactions represented by the portion of the path reaching the node;
• Node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
• Item-name: the same as in the node;
• Head of node-link: a pointer to the first node in the FP-tree carrying that item-name.
Additionally, the frequent-item-header table can hold the support count for each item.
Figure 1 below shows an example of an FP-tree.
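A minimal sketch of the node and header-table fields described above (an illustrative assumption, not Han's original implementation), including the insertion of one transaction's sorted frequent items:

```python
# Minimal FP-tree data-structure sketch: each node holds item-name, count, and a
# node-link; the header table gives an entry point into each item's node-link chain.

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item-name: which item this node represents
        self.count = 0            # number of transactions through this node
        self.parent = parent
        self.children = {}        # item -> child FPNode
        self.node_link = None     # next node carrying the same item-name (or None)

root = FPNode(None)               # the root labeled "null"
header_table = {}                 # item -> most recently linked node for that item

def insert_transaction(sorted_items):
    """Insert one transaction's frequent items, already sorted by frequency."""
    node = root
    for item in sorted_items:
        if item not in node.children:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # Thread the new node into the item's node-link chain.
            child.node_link = header_table.get(item)
            header_table[item] = child
        node = node.children[item]
        node.count += 1

insert_transaction(["f", "c", "a", "m", "p"])
insert_transaction(["f", "c", "a", "b", "m"])
print(root.children["f"].count)                # 2: both transactions share the prefix f
print(root.children["f"].children["c"].count)  # 2
```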
11) Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data Warehouse applications are designed to support the user ad-hoc data requirements,
an activity recently dubbed online analytical processing (OLAP). These include applications
such as forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the users' needs for sorting, combining, and summarizing data.
Star Schema vs Snowflake Schema
• Definition and Meaning: A star schema contains both dimension tables and fact tables. A snowflake schema contains all three: dimension tables, fact tables, and sub-dimension tables.
• Type of Model: The star schema is a top-down model type. The snowflake schema is a bottom-up model type.
• Time Taken for Queries: With the star schema, the execution of queries takes less time. With the snowflake schema, the execution of queries takes more time.
• Use of Normalization: The star schema does not make use of normalization. The snowflake schema makes use of both denormalization and normalization.
• Complexity of Design: The design of a star schema is very simple. The design of a snowflake schema is very complex.
14) Market Basket Analysis with Example?
In market basket analysis (also called association analysis or frequent itemset mining), you
analyze purchases that commonly happen together. For example, people who buy bread
and peanut butter also buy jelly. Or people who buy shampoo might also buy conditioner.
What relationships there are between items is the target of the analysis. Knowing what your
customers tend to buy together can help with marketing efforts and store/website layout.
Market basket analysis isn’t limited to shopping carts. Other areas where the technique is
used include analysis of fraudulent insurance claims or credit card purchases.
Market basket analysis can also be used to cross-sell products. Amazon famously uses an
algorithm to suggest items that you might be interested in, based on your browsing history
or what other people have purchased.
A popular urban legend is that a grocery store, after running market basket analysis, found that men were likely to buy beer and diapers together. Sales increased after the store placed beer next to the diapers. It sounds simple (and in many cases, it is). However, there are pitfalls to be aware of:
For large inventories (i.e. over 10,000 items), the combinations of items may explode into the billions, making the math almost impossible.
Data is often mined from large transaction histories, and such large amounts of data are usually handled by specialized statistical software.
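A minimal sketch of the core computation behind market basket analysis, counting item-pair co-occurrences and estimating support and confidence; the baskets are made up:

```python
# Market basket sketch on made-up transactions: co-occurrence counts, support,
# and confidence for simple "A -> B" rules.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "jelly"},
    {"shampoo", "conditioner"},
    {"bread", "peanut butter", "jelly", "milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), together in pair_counts.most_common(3):
    support = together / n                  # fraction of baskets containing both items
    confidence = together / item_counts[a]  # P(b in basket | a in basket)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```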
15) Demonstrate Multidimensional and Multilevel Association Rule Mining with Example?
MULTILEVEL ASSOCIATION RULES:
• Association rules generated from mining data at multiple levels of abstraction
are called multiple-level or multilevel association rules.
• Multilevel association rules can be mined efficiently using concept hierarchies
under a support- confidence framework.
• Rules at a high concept level may state common-sense knowledge, while rules at a low concept level may not always be useful.
• Using uniform minimum support for all levels:
• When a uniform minimum support threshold is used, the search procedure is
simplified.
• The method is also simple, in that users are required to specify only one
minimum support threshold.
• The same minimum support threshold is used when mining at each level of
abstraction.
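A minimal sketch of mining with a uniform minimum support at two concept levels; the items, hierarchy, and threshold are illustrative assumptions:

```python
# Multilevel support sketch: count support at the low level (specific items) and
# at the high level (concept-hierarchy parents), using one uniform threshold.
from collections import Counter

concept_of = {           # made-up concept hierarchy: item -> higher-level concept
    "2% milk": "milk",
    "skim milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}

baskets = [
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk", "white bread"},
    {"skim milk"},
]

low = Counter()
high = Counter()
for basket in baskets:
    low.update(basket)
    high.update({concept_of[item] for item in basket})

n = len(baskets)
min_support = 0.5  # the same (uniform) threshold is used at every level
for level, counts in (("high", high), ("low", low)):
    frequent = {item: c / n for item, c in counts.items() if c / n >= min_support}
    print(level, frequent)
# "milk" and "bread" pass at the high level; at the low level "wheat bread"
# falls below the threshold.
```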
• There are mainly 2 major approaches for data integration – one is the “tight
coupling approach” and another is the “loose coupling approach”.
1) Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and Loading.
2) Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms
it in a way the source database can understand, and then sends the query
directly to the source databases to obtain the result.
• And the data only remains in the actual source databases.
3) Issues in Data Integration:
There are three issues to consider during data integration: Schema Integration,
Redundancy Detection, and resolution of data value conflicts. These are explained in
brief below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching real-world entities from multiple sources is referred to as the entity identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting data
set.
• Some redundancies can be detected by correlation analysis (see the sketch below).
3. Detection and resolution of data value conflicts:
• This is the third critical issue in data integration.
• Attribute values from different sources may differ for the same real-world
entity.
• An attribute in one system may be recorded at a lower level of abstraction
than the “same” attribute in another.
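A minimal sketch of redundancy detection via correlation analysis, assuming numpy and made-up attribute values:

```python
# Redundancy detection sketch: a near-perfect correlation between two attributes
# suggests one can be derived from the other. Uses numpy's corrcoef on made-up data.
import numpy as np

annual_salary = np.array([30000, 45000, 60000, 90000, 120000])
monthly_salary = annual_salary / 12          # derivable attribute -> redundant
years_experience = np.array([1, 3, 4, 9, 12])

print(np.corrcoef(annual_salary, monthly_salary)[0, 1])   # ~1.0: perfectly correlated
print(np.corrcoef(annual_salary, years_experience)[0, 1]) # high, but below 1.0
```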
20) Describe Data Loading ?
• The loading process is the physical movement of data from the computer systems storing the source database(s) to the one that will store the data warehouse database.
• The whole process of moving data into the data warehouse repository is referred to
in the following ways:
• Initial Load : For the very first time loading all the data warehouse tables.
• Incremental Load: Periodically applying ongoing changes as per the requirement.
• Full Refresh: Deleting the contents of a table and reloading it with fresh data.
• Data Refresh versus Update:
• After the initial load, the data warehouse needs to be maintained and updated, and this can be done by the following two methods:
• Update: application of incremental changes in the data sources.
• Refresh: complete reload at specified intervals.
• Data Loading
• Data are physically moved to the data warehouse.
• The loading takes place within a "load window".
• The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications.
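A minimal sketch contrasting a full refresh with an incremental (update) load, assuming Python's built-in sqlite3 and made-up rows:

```python
# Full refresh vs incremental load, sketched with sqlite3 on a made-up table.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def full_refresh(rows):
    # Delete the contents of the table and reload it with fresh data.
    warehouse.execute("DELETE FROM customers")
    warehouse.executemany("INSERT INTO customers VALUES (?, ?)", rows)

def incremental_load(changed_rows):
    # Periodically apply only the ongoing changes (new or modified rows).
    warehouse.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", changed_rows)

full_refresh([(1, "Alice"), (2, "Bob")])          # initial load / full refresh
incremental_load([(2, "Bobby"), (3, "Carol")])    # later, apply only the changes
print(warehouse.execute("SELECT * FROM customers ORDER BY id").fetchall())
# [(1, 'Alice'), (2, 'Bobby'), (3, 'Carol')]
```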