DWM cheatsheet sem 5

The document provides a comprehensive overview of various data processing techniques, including data mining, data warehousing, OLAP, and OLTP, highlighting their differences and applications. It discusses the top-down and bottom-up approaches to data warehousing, the architecture of data mining systems, and the importance of data preprocessing. Additionally, it addresses major issues in data mining, such as methodology, user interaction, efficiency, and social impact.


Data Mining vs Web Mining

• Data mining refers to the process of extracting useful information, patterns, and trends from huge data sets. Web mining refers to the process of extracting information from web documents and services, hyperlinks, and server logs.
• Data engineers and data scientists can do data mining. Data scientists, data engineers, and data analysts can do web mining.
• Data mining is based on pattern identification from data available in any system. Web mining is based on pattern identification from web data.
• Tools used for data mining are machine learning algorithms. Tools used for web mining are PageRank, Scrapy, and Apache log analyzers.
• Applications of data mining include weather forecasting, market analysis, fraud detection, etc. Web mining uses the same process but on the web, using web documents.
• Skills needed for data mining are machine learning algorithms, probability, and statistics. Skills needed for web mining are application-level knowledge, probability, and statistics.

Data Warehouse vs Data Mart

• A data warehouse is a centralised system, while a data mart is a decentralised system.
• In a data warehouse, light denormalization takes place, while in a data mart, heavy denormalization takes place.
• A data warehouse follows a top-down model, while a data mart follows a bottom-up model.
• Building a warehouse is difficult, while building a mart is easy.
• In a data warehouse, the fact constellation schema is used, while in a data mart, the star schema and snowflake schema are used.
• A data warehouse is flexible, while a data mart is not flexible.
• A data warehouse is data-oriented in nature, while a data mart is project-oriented in nature.
• A data warehouse has a long life, while a data mart has a shorter life than a warehouse.
• In a data warehouse, data are contained in detailed form, while in a data mart, data are contained in summarized form.
• A data warehouse is vast in size, while a data mart is smaller than a warehouse.
OLAP (Online Analytical Processing) vs OLTP (Online Transaction Processing)

• OLAP is well-known as an online database query management system, while OLTP is well-known as an online database modifying system.
• OLAP consists of historical data from various databases, while OLTP consists of only operational current data.
• OLAP makes use of a data warehouse, while OLTP makes use of a standard database management system (DBMS).
• OLAP is subject-oriented and used for data mining, analytics, decision making, etc., while OLTP is application-oriented and used for business tasks.
• In an OLAP database, tables are not normalized, while in an OLTP database, tables are normalized (3NF).
• OLAP data is used in planning, problem-solving, and decision-making, while OLTP data is used to perform day-to-day fundamental operations.
• OLAP provides a multi-dimensional view of different business tasks, while OLTP reveals a snapshot of present business tasks.
• OLAP serves the purpose of extracting information for analysis and decision-making, while OLTP serves the purpose of inserting, updating, and deleting information from the database.
• In OLAP, a large amount of data is stored, typically in TB or PB, while in OLTP the size of the data is relatively small (MB, GB) as the historical data is archived.
• OLAP is relatively slow as the amount of data involved is large, and queries may take hours. OLTP is very fast as the queries operate on only about 5% of the data.

ER Modeling vs Dimensional Modeling

• ER modeling is transaction-oriented; dimensional modeling is subject-oriented.
• ER modeling uses entities and relationships; dimensional modeling uses fact tables and dimension tables.
• ER modeling has few levels of granularity; dimensional modeling has multiple levels of granularity.
• ER modeling holds real-time information; dimensional modeling holds historical information.
• ER modeling eliminates redundancy; dimensional modeling plans for redundancy.
• ER modeling handles high transaction volumes using few records at a time; dimensional modeling handles low transaction volumes using many records at a time.
• ER modeling deals with highly volatile data; dimensional modeling deals with non-volatile data.
• ER modeling covers both the physical and logical model; dimensional modeling covers the physical model.
• Normalization is suggested in ER modeling; de-normalization is suggested in dimensional modeling.
• ER modeling is used for OLTP applications; dimensional modeling is used for OLAP applications.
• An ER-modeled application is used for buying products from e-commerce websites like Amazon; a dimensionally modeled application is used to analyze the buying patterns of customers of various cities over the past 10 years.
Operational Database vs Data Warehouse

• Operational systems are designed to support high-volume transaction processing, while data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
• Operational systems are usually concerned with current data, while data warehousing systems are usually concerned with historical data.
• Data within operational systems are updated frequently according to need, while data in a warehouse is non-volatile: new data may be added regularly, but once included it is rarely changed.
• An operational database is designed for real-time business dealings and processes, while a data warehouse is designed for analysis of business measures by subject area, categories, and attributes.
• Relational databases are built for online transaction processing (OLTP), while a data warehouse is designed for online analytical processing (OLAP).
• Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data, while data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.
• An operational database is "data in"; a data warehouse is "data out".
• Operational database systems are generally application-oriented, while data warehouses are generally subject-oriented.

Classification vs Prediction

• Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. Prediction is the process of identifying the missing or unavailable numerical data for a new observation.
• In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.
• In classification, the model is known as the classifier. In prediction, the model is known as the predictor.
• A model or classifier is constructed to find categorical labels. A model or predictor is constructed to predict a continuous-valued function or ordered value.
• For example, grouping patients based on their medical records can be considered classification, while we can think of predicting the correct treatment for a particular disease for a person as prediction.
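To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and a toy medical-style dataset (the features, labels, and values are illustrative assumptions, not taken from the cheatsheet):

```python
# Minimal sketch: a classifier predicts a categorical label, a predictor
# (regressor) predicts a continuous value. The toy data is illustrative only.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 120], [40, 140], [60, 160], [35, 130], [70, 170]]  # [age, blood pressure]

# Classification: find the categorical label (e.g., a risk group).
clf = DecisionTreeClassifier().fit(X, ["low", "low", "high", "low", "high"])
print(clf.predict([[55, 150]]))          # -> a class label such as 'high'

# Prediction: estimate a continuous-valued attribute (e.g., cholesterol level).
reg = DecisionTreeRegressor().fit(X, [180.0, 210.0, 260.0, 200.0, 280.0])
print(reg.predict([[55, 150]]))          # -> a numeric estimate
```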
1) Top-down approach:

The essential components are discussed below:


1. External Sources – An external source is a source from which data is collected, irrespective of the type of data. Data can be structured, semi-structured, or unstructured.
2. Stage Area – Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
• E (Extract): Data is extracted from the external data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after transforming it into the standard format.
3. Data Warehouse – After cleansing, the data is stored in the data warehouse as a central repository. It actually stores the metadata, and the actual data gets stored in the data marts. Note that the data warehouse stores the data in its purest form in this top-down approach.
4. Data Marts – A data mart is also a part of the storage component. It stores the information of a particular function of an organisation which is handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
5. Data Mining – The practice of analysing the big data present in the data warehouse is data mining. It is used to find the hidden patterns present in the database or in the data warehouse with the help of data mining algorithms. This approach is defined by Inmon as: the data warehouse is a central repository for the complete organisation, and data marts are created from it after the complete data warehouse has been created.
Advantages of Top-Down Approach –

• Since the data marts are created from the data warehouse, it provides a consistent dimensional view of the data marts.
• Creating a data mart from the data warehouse is easy.
Disadvantages of Top-Down Approach –
The cost and time taken in designing and maintaining it are very high.

2) Bottom-up approach:

1. First, the data is extracted from external sources (same as in the top-down approach).

2. Then, the data goes through the staging area (as explained above) and is loaded into data marts instead of the data warehouse. The data marts are created first and provide reporting capability. Each data mart addresses a single business area.

3. These data marts are then integrated into the data warehouse.

This approach is given by Kimball as: data marts are created first and provide a thin view for analysis, and the data warehouse is created after the complete data marts have been created.
Advantages of Bottom-Up Approach –
1. As the data marts are created first, reports are generated quickly.

2. We can accommodate more data marts here, and in this way the data warehouse can be extended.

3. Also, the cost and time taken in designing this model are comparatively low.

Disadvantage of Bottom-Up Approach –

1. This model is not as strong as the top-down approach, as the dimensional view of the data marts is not consistent as it is in the above approach.
3) Architecture Of Data Mining System
Data mining refers to the detection and extraction of new patterns from already collected data. Data mining is the amalgamation of the fields of statistics and computer science, aiming to discover patterns in incredibly large datasets and then transform them into a comprehensible structure for later use. The architecture of data mining is described below.
Basic Working:

• It all starts when the user puts up certain data mining requests; these requests are then sent to the data mining engine for pattern evaluation.
• These applications try to find the solution to the query using the already present database.
• The metadata then extracted is sent for proper analysis to the data mining engine, which sometimes interacts with the pattern evaluation modules to determine the result.
• This result is then sent to the front end in an easily understandable manner using a suitable interface.

A detailed description of the parts of the data mining architecture follows:


1. Data Sources: Database, World Wide Web(WWW), and data warehouse are parts of
data sources. The data in these sources may be in the form of plain text, spreadsheets,
or other forms of media like photos or videos. WWW is one of the biggest sources of
data.
2. Database Server: The database server contains the actual data ready to be processed. It
performs the task of handling data retrieval as per the request of the user.
3. Data Mining Engine: It is one of the core components of the data mining architecture
that performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules: They are responsible for finding interesting patterns in the
data and sometimes they also interact with the database servers for producing the
result of the user requests.
5. Graphical User Interface: Since the user cannot fully understand the complexity of the data mining process, the graphical user interface helps the user communicate effectively with the data mining system.
6. Knowledge Base: Knowledge Base is an important part of the data mining engine that is
quite beneficial in guiding the search for the result patterns. Data mining engines may
also sometimes get inputs from the knowledge base. This knowledge base may contain
data from user experiences. The objective of the knowledge base is to make the result
more accurate and reliable.

Applications of data mining system


Here is the list of areas where data mining is widely used −
1. Financial Data Analysis
2. Retail Industry
3. Telecommunication Industry
4. Biological Data Analysis
5. Other Scientific Applications
6. Intrusion Detection
1 Financial Data Analysis:
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining.
2 Retail Industry:
Data mining has great application in the retail industry because it collects a large amount of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web. Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction.
3 Telecommunication Industry:
Today the telecommunication industry is one of the most emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, email, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business. Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service.
4 Biological Data Analysis:
In recent times, we have seen tremendous growth in the field of biology, such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics.
5 Other Scientific Applications:
The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. Large data sets are also being generated because of fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc.

6 Intrusion Detection:
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration.
4) Major issues in data mining
Mining Methodology
• As there are diverse applications, new mining tasks continue to emerge. These tasks
can use the same database in different ways and require the development of new
data mining techniques.
• While searching for knowledge in large datasets, we need to explore
multidimensional space. To find interesting patterns, various combinations of
dimensions need to be applied.
• Uncertain, noisy and incomplete data can sometimes lead to erroneous derivation.

User Interaction:

• The data analysis process should be highly interactive; keeping the mining process user-interactive is important for facilitating it.
• Domain knowledge, background knowledge, constraints, etc., should all be incorporated in the data mining process.
• The knowledge discovered by mining the data should be usable by humans. The system should adopt an expressive representation of knowledge, user-friendly visualization techniques, etc.
Efficiency And Scalability:

• Data mining algorithms should be efficient and scalable to effectively extract interesting information from the huge amounts of data in data repositories.
• The wide distribution of data and the complexity of computation motivate the development of parallel and distributed data-intensive algorithms.
Diversity of Database Types:

• The construction of effective and efficient data analysis tools for diverse applications and a wide spectrum of data types (unstructured data, temporal data, hypertext, multimedia data, and software program code) remains a challenging and active area of research.
Social Impact:

• The disclosure or use of data and the potential violation of individual privacy and protection of rights are areas of concern that need to be addressed.
5) What is Data Preprocessing & its Steps in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. It is a way of converting raw data into a much-desired form so that useful information can be derived from it and fed into a training model (for example, for medical decisions, diagnoses, and treatments). Data preprocessing is essential before the data's actual use; it is the concept of changing raw data into a clean data set. The dataset is preprocessed in order to check for missing values, noisy data, and other inconsistencies before feeding it to the algorithm.
Steps Involved in Data Preprocessing:

1) Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc. (a small sketch of these techniques follows below).
a) Missing Data: This situation arises when some data is missing in the data set. It can be handled in various ways.
Some of them are:

1. Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
2. Fill in the missing values: There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value.
b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
2. Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering: This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
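A minimal sketch of the cleaning techniques above, assuming pandas/NumPy and an illustrative numeric column: mean imputation for a missing value, then equal-depth binning with smoothing by bin means:

```python
import numpy as np
import pandas as pd

# Illustrative sorted attribute with one missing value.
prices = pd.Series([4.0, 8.0, np.nan, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0])

# Missing data: fill with the attribute mean (one of the options above).
prices = prices.fillna(prices.mean())

# Noisy data: partition into 3 equal-depth bins, then smooth by bin means.
bins = pd.qcut(prices, q=3, labels=False)
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.round(2).tolist())
```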
2) Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways (see the sketch after this list):
1. Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
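A minimal sketch, assuming NumPy and an illustrative "age" attribute, of two of the transformations above: min-max normalization into [0.0, 1.0] and discretization into conceptual labels:

```python
import numpy as np

ages = np.array([18, 22, 35, 47, 51, 63, 70], dtype=float)

# Normalization: scale values into the range [0.0, 1.0].
ages_norm = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw values with interval / conceptual labels.
bin_index = np.digitize(ages, bins=[30, 55])       # 0: <30, 1: 30-54, 2: >=55
concepts = np.array(["young", "middle-aged", "senior"])[bin_index]

print(ages_norm.round(2))
print(concepts)
```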
3) Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder while working with such volumes. To deal with this, we use data reduction techniques. Data reduction aims to increase storage efficiency and reduce data storage and analysis costs. The various steps of data reduction are (a PCA sketch follows the list):
1. Data Cube Aggregation: Aggregation operations are applied to the data for the construction of the data cube.
2. Attribute Subset Selection: Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute whose p-value is greater than the significance level can be discarded.
3. Numerosity Reduction: This enables storing a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction: This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. The two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
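A minimal sketch of dimensionality reduction with PCA (one of the two methods named above), assuming scikit-learn and a small made-up matrix of correlated attributes:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 6 records, 4 correlated attributes.
X = np.array([[2.5, 2.4, 1.2, 0.5],
              [0.5, 0.7, 0.3, 0.1],
              [2.2, 2.9, 1.1, 0.6],
              [1.9, 2.2, 0.9, 0.4],
              [3.1, 3.0, 1.4, 0.7],
              [2.3, 2.7, 1.0, 0.5]])

# Keep enough principal components to retain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # fewer columns, most information kept
```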
6) Various OLAP Models and Their Architecture
There are 3 main types of OLAP servers, as follows:
I. Relational OLAP (ROLAP) – Star Schema based –
ROLAP is based on the premise that data need not be stored multidimensionally in order to be viewed multidimensionally, and that it is possible to exploit well-proven relational database technology to handle multidimensionality of data. In ROLAP, data is stored in a relational database. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. ROLAP can handle large amounts of data and can leverage functionalities inherent in the relational database.

II. Multidimensional OLAP (MOLAP) – Cube based –

MOLAP stores data on disk in a specialized multidimensional array structure. OLAP is performed on it relying on the random-access capability of the arrays. Array elements are determined by dimension instances, and the fact data or measured value associated with each cell is usually stored in the corresponding array element. In MOLAP, the multidimensional array is usually stored in a linear allocation according to a nested traversal of the axes in some predetermined order.
Unlike ROLAP, where only records with non-zero facts are stored, all array elements are defined in MOLAP, and as a result the arrays generally tend to be sparse, with empty elements occupying a greater part of them. Since both storage and retrieval costs are important when assessing online performance efficiency, MOLAP systems typically include provisions such as advanced indexing and hashing to locate data while performing queries and to handle sparse arrays. MOLAP cubes provide fast data retrieval, are optimal for slicing and dicing, and can perform complex calculations. All calculations are pre-generated when the cube is created.
III. Hybrid OLAP (HOLAP) –
HOLAP is a combination of ROLAP and MOLAP. HOLAP servers allow storing large volumes of detail data. On the one hand, HOLAP leverages the greater scalability of ROLAP. On the other hand, HOLAP leverages cube technology for faster performance and for summary-type information. Cubes are smaller than in MOLAP since detail data is kept in the relational database. The databases are used to store data in the most functional way possible.
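As a small illustration of the cube operations these servers support, here is a sketch assuming pandas and made-up sales records; the pivot plays the role of a tiny cube, and the filter mirrors the ROLAP "WHERE"-clause analogy for slicing:

```python
import pandas as pd

# Illustrative fact records: (year, region, product, sales amount).
sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023],
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["TV", "TV", "Phone", "TV", "TV"],
    "amount":  [100, 150, 200, 120, 180],
})

# Cube-style view: aggregate the measure over two dimensions.
cube = sales.pivot_table(values="amount", index="region",
                         columns="year", aggfunc="sum", fill_value=0)
print(cube)

# Slice/dice: restrict one dimension, like adding a WHERE clause in ROLAP.
print(sales[sales["product"] == "TV"].groupby("region")["amount"].sum())
```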
7) Short note on ETL Process
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems, transforms
it in the staging area, and then finally, loads it into the Data Warehouse system.

Let us understand each step of the ETL process in-depth:

Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems, which can be in various formats like relational databases, NoSQL, XML, and flat files, is extracted into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, and not directly in the data warehouse, because the extracted data is in various formats and can also be corrupted. Loading it directly into the data warehouse may damage it, and rollback would be much more difficult. Therefore, this is one of the most important steps of the ETL process.

Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:

• Filtering – loading only certain attributes into the data warehouse.


• Cleaning – filling up the NULL values with some default values, mapping
U.S.A, United States, and America into USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally key-
attribute).

Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is done after longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
The ETL process can also use the pipelining concept: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. And while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
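A minimal sketch of the three ETL steps, assuming pandas and sqlite3; the staged records, the country-mapping rule, and the warehouse table name "fact_sales" are illustrative assumptions:

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from a source system into a staging DataFrame.
staged = pd.DataFrame({
    "customer": ["alice", "BOB", "carol"],
    "country":  ["U.S.A", "United States", "America"],
    "amount":   ["100", "250", None],
})

# Transform: clean, standardize, and filter in the staging area.
staged["customer"] = staged["customer"].str.title()
staged["country"] = staged["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})  # cleaning/mapping
staged["amount"] = pd.to_numeric(staged["amount"]).fillna(0)      # default for NULLs
transformed = staged[staged["amount"] > 0]                        # filtering

# Load: write the standardized data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("fact_sales", conn, if_exists="append", index=False)
```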
8) What is Fact Constellation Schema?
A fact constellation means two or more fact tables sharing one or more dimensions. It is also called a galaxy schema.

A fact constellation schema describes a logical structure of a data warehouse or data mart. It can be designed with a collection of de-normalized fact, shared, and conformed dimension tables.

A fact constellation schema is a sophisticated database design in which it is difficult to summarize information. It can be implemented between aggregate fact tables or by decomposing a complex fact table into independent simplex fact tables.

9) What are spatial data structures?


Spatial data structures store data objects organized by position and are an important class
of data structures used in geographic information systems, computer graphics, robotics, and
many other fields. A number of spatial data structures are used for storing point data in two
or more dimensions.

Why is spatial data important in GIS?


Spatial analysis is the most intriguing and remarkable aspect of GIS. Using spatial analysis,
you can combine information from many independent sources and derive new sets of
information (results) by applying a sophisticated set of spatial operators. This
comprehensive collection of spatial analysis tools extends your ability to answer complex
spatial questions. Statistical analysis can determine if the patterns that you see are
significant. You can analyze various layers to calculate the suitability of a place for a
particular activity. And by employing image analysis, you can detect change over time.
These tools and many others, which are part of ArcGIS, enable you to address critically
important questions and decisions that are beyond the scope of simple visual analysis.
10) Data Visualization
• Data visualization is actually a set of data points and information that are represented graphically to make it easy and quick for users to understand. Data visualization is good if it has a clear meaning and purpose and is very easy to interpret without requiring context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data by using visual effects or elements such as charts, graphs, and maps.
• Data visualization is a graphical representation of quantitative information and data by using visual elements like graphs, charts, and maps.
• Data visualization converts large and small data sets into visuals, which are easy for humans to understand and process.

Categories of Data Visualization:

Data visualization is very important to market research, where both numerical and categorical data can be visualized; this helps increase the impact of insights and also helps reduce the risk of analysis paralysis. Data visualization is therefore categorized into several categories.

FP-Tree structure
The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database. Han defines the FP-tree as the tree structure below:
1. One root labeled "null", with a set of item-prefix subtrees as children and a frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields:
• Item-name: registers which item is represented by the node;
• Count: the number of transactions represented by the portion of the path reaching the node;
• Node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
• Item-name: the same as in the node;
• Head of node-link: a pointer to the first node in the FP-tree carrying that item-name.
Additionally, the frequent-item-header table can hold the support count for an item. A small code sketch of this node structure and of inserting transactions follows below.
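A minimal sketch of this structure, assuming small in-memory transactions already sorted by descending item frequency; the class and function names are illustrative, not from a library:

```python
class FPNode:
    """FP-tree node with the three fields described above."""
    def __init__(self, item, parent=None):
        self.item = item          # item-name represented by this node
        self.count = 0            # transactions sharing the path to this node
        self.parent = parent
        self.children = {}
        self.node_link = None     # next node carrying the same item-name

def insert_transaction(root, sorted_items, header):
    """Insert one transaction; items must be sorted by descending frequency."""
    node = root
    for item in sorted_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            if item not in header:            # head of node-link for this item
                header[item] = child
            else:                             # append to the node-link chain
                tail = header[item]
                while tail.node_link is not None:
                    tail = tail.node_link
                tail.node_link = child
        child.count += 1
        node = child

root, header = FPNode(None), {}
for t in [["f", "c", "a", "m", "p"], ["f", "c", "a", "b", "m"], ["f", "b"]]:
    insert_transaction(root, t, header)
print(root.children["f"].count)   # -> 3: all three transactions share prefix 'f'
```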
11) Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added to reflect the needs of users for sorting, combining, and summarizing data.

Data Warehouse Architecture: Basic


12) Hierarchical Clustering
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following steps:
identify the 2 clusters which are closest together, and merge the 2 most comparable clusters. We need to continue these steps until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
The basic method to generate hierarchical clustering is:
Agglomerative: Initially consider every data point as an individual cluster, and at every step merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is considered an individual entity or cluster. At every iteration, the clusters merge with other clusters until one cluster is formed. The algorithm for agglomerative hierarchical clustering is (a sketch follows the list):
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (calculate the proximity matrix).
3. Merge the clusters which are highly similar or close to each other.
4. Recalculate the proximity matrix for each cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
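A minimal sketch of the agglomerative procedure, assuming SciPy and a few made-up 2-D points; `linkage` records the merge sequence (the dendrogram), and `fcluster` cuts it into flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D points.
points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

# Bottom-up merging of the closest clusters; Z encodes the dendrogram.
Z = linkage(points, method="single")

# Cut the hierarchy into (at most) 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g., [1 1 2 2 3]
```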
13) What is star schema and snowflake schema?
Star Schema: It is a multidimensional model type that one can use in the case of a data warehouse. A typical star schema contains both the dimension tables and the fact tables. It also makes use of fewer foreign-key joins. In simpler words, this type of schema leads to the formation of a star with dimension tables around a fact table.
Snowflake Schema: It is also a multidimensional model type that one can use in the case of a data warehouse. A typical snowflake schema contains all three: dimension tables, fact tables, and sub-dimension tables. In simpler words, this type of schema leads to the formation of a snowflake with dimension tables, fact tables, and also sub-dimension tables.

Difference Between Star Schema and Snowflake Schema

• Definition and Meaning: A star schema contains both dimension tables and fact tables. A snowflake schema contains all three: dimension tables, fact tables, and sub-dimension tables.
• Type of Model: The star schema is a top-down model type. The snowflake schema is a bottom-up model type.
• Space Occupied: The star schema makes use of more allotted space. The snowflake schema makes use of less allotted space.
• Time Taken for Queries: With the star schema, the execution of queries takes less time. With the snowflake schema, the execution of queries takes more time.
• Use of Normalization: The star schema does not make use of normalization. The snowflake schema makes use of both denormalization and normalization.
• Complexity of Design: The design of a star schema is very simple. The design of a snowflake schema is very complex.
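A minimal sketch, assuming pandas, of how a star-schema query works: a central fact table joined to its dimension tables through foreign keys and then aggregated (all table and column names are made up for illustration):

```python
import pandas as pd

# Dimension tables (illustrative).
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["TV", "Phone"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Mumbai", "Pune"]})

# Fact table referencing the dimensions through foreign keys.
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "store_id":   [10, 20, 10, 20],
    "amount":     [100, 150, 200, 120],
})

# Star-schema query: join the fact to its dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["city", "product"])["amount"].sum())
print(report)
```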
14) Market Basket Analysis With Example
In market basket analysis (also called association analysis or frequent itemset mining), you analyze purchases that commonly happen together. For example, people who buy bread and peanut butter also buy jelly. Or people who buy shampoo might also buy conditioner.
The relationships between items are the target of the analysis. Knowing what your customers tend to buy together can help with marketing efforts and store/website layout.
Market basket analysis isn't limited to shopping carts. Other areas where the technique is used include analysis of fraudulent insurance claims or credit card purchases.
Market basket analysis can also be used to cross-sell products. Amazon famously uses an algorithm to suggest items that you might be interested in, based on your browsing history or what other people have purchased.
A popular urban legend is that a grocery store, after running market basket analysis, found that men were likely to buy beer and diapers together. Sales increased by placing the beer next to the diapers. It sounds simple (and in many cases, it is). However, there are pitfalls to be aware of:
For large inventories (i.e., over 10,000 items), the combinations of items may explode into the billions, making the math almost impossible.

Data is often mined from large transaction histories. A large amount of data is usually handled by specialized statistical software.
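A minimal sketch, assuming a handful of in-memory baskets, of computing the support and confidence behind a rule such as {bread, peanut butter} -> {jelly}:

```python
baskets = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "jelly"},
    {"bread", "peanut butter", "jelly", "milk"},
    {"milk", "shampoo", "conditioner"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

antecedent, consequent = {"bread", "peanut butter"}, {"jelly"}
conf = support(antecedent | consequent) / support(antecedent)
print(f"support={support(antecedent | consequent):.2f}, confidence={conf:.2f}")
# -> support=0.40, confidence=0.67
```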
15) Demonstrate multidimensional and multilevel association rule mining with example
MULTILEVEL ASSOCIATION RULES:
• Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
• Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
• Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.
• Using uniform minimum support for all levels:
• When a uniform minimum support threshold is used, the search procedure is simplified.
• The method is also simple, in that users are required to specify only one minimum support threshold.
• The same minimum support threshold is used when mining at each level of abstraction.

MULTIDIMENSIONAL ASSOCIATION RULES:

In multidimensional association:

• Attributes can be categorical or quantitative.
• Quantitative attributes are numeric and incorporate a hierarchy.
• Numeric attributes must be discretized.
• A multidimensional association rule involves more than one dimension or predicate, for example: age(X, "20...29") AND occupation(X, "student") => buys(X, "laptop"). A small sketch of multilevel support checking follows below.
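A minimal sketch, assuming a made-up concept hierarchy and transactions, of how support changes across abstraction levels when a uniform minimum support threshold is applied:

```python
# Illustrative concept hierarchy: low-level item -> high-level concept.
hierarchy = {"2% milk": "milk", "skim milk": "milk",
             "wheat bread": "bread", "white bread": "bread"}

transactions = [
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk"},
    {"wheat bread"},
]

def support(item, txns):
    return sum(item in t for t in txns) / len(txns)

# Low level: supports of the raw items.
print({item: support(item, transactions) for item in hierarchy})

# High level: roll each transaction up through the concept hierarchy.
rolled_up = [{hierarchy[item] for item in t} for t in transactions]
print({concept: support(concept, rolled_up) for concept in set(hierarchy.values())})

# With a uniform min_sup of 0.5, "milk" and "bread" are frequent at the high
# level (support 0.75), while some low-level items fall below the threshold.
```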
16) Describe KDD Process?
The process of discovering knowledge in data through the application of data mining methods is referred to as knowledge discovery in databases (KDD).
The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:
1. Developing an understanding of the application domain, the relevant prior knowledge, and the goals of the end-user.
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
• Removal of noise or outliers, and collecting the necessary information to model or account for noise.
• Strategies for handling missing data fields.
• Accounting for time sequence information and known changes.
4. Data reduction and projection.
• Finding useful features to represent the data depending on the goal of the task.
• Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task.
• Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s).
• Selecting the method(s) to be used for searching for patterns in the data.
• Deciding which models and parameters may be appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining.
• Searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
17) What is the relationship between data warehousing and data replication? Which form of replication (synchronous or asynchronous) is better suited for data warehousing? Why? Explain with an appropriate example.
• Data warehouses are carefully designed databases that hold integrated data for secondary usage. If you only need data from one system but cannot impact the performance of that system, it is suggested to take a copy, i.e., a replicated data store; unless the complexity and width of the data are very large and the various ways access is required are very high, a data warehouse would be overkill.
• A replicated data store is a database that holds schemas from other systems but doesn't truly integrate the data. This means it is typically in a format similar to what the source systems had. The value of a replicated data store is that it provides a single source for resources to go to in order to access data from any system without negatively impacting the performance of that system.
• Data replication is simply a method for creating copies of data in a distributed environment. Replication technology can be used to capture changes to source data.
• Synchronous replication: Synchronous replication is used for creating replicas in real time. In synchronous replication, data is written to the primary storage and to the replica simultaneously. The primary copy and the replica should always remain synchronized.
• Asynchronous replication: It is used for creating time-delayed replicas. In asynchronous replication, data is written to the primary storage first and then copied to the replica.
• Asynchronous replication is better suited for data warehousing, because a data warehouse is refreshed periodically (typically during off-hours) rather than in real time, and time-delayed replication avoids slowing down the operational source systems. For example, the day's sales transactions can be replicated asynchronously from the OLTP database to the warehouse at night, so the OLTP system's performance is not affected during business hours.
18) Describe K-means Clustering?

• In 1967, J. MacQueen, and then in 1975, J. A. Hartigan and M. A. Wong, developed the K-means clustering algorithm. In the K-means approach, the data objects are classified based on their attributes or features into K clusters. The number of clusters, K, is an input given by the user.
• K-means is one of the simplest unsupervised learning algorithms.
• Define K centroids for K clusters, which are generally far away from each other. Then group the elements into the clusters whose centroids they are nearest to.
• After this first step, calculate a new centroid for each cluster based on the elements of that cluster. Follow the same method and group the elements based on the new centroids.
• In every step, the centroids change and elements move from one cluster to another. Repeat the same process till no element moves from one cluster to another, i.e., till two consecutive steps with the same centroids and the same elements are obtained (a minimal code sketch follows below).
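A minimal sketch of the K-means loop just described, assuming NumPy, K = 2, and a few made-up 2-D points:

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
                   [9.0, 8.5], [1.0, 0.5], [8.5, 9.0]])
k = 2
centroids = points[:k].copy()            # initial centroids (K supplied by the user)

for _ in range(100):
    # Assign each point to the nearest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid from the members of its cluster.
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):   # stop when nothing moves
        break
    centroids = new_centroids

print(labels, centroids)
```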
19) Describe Data Integration and Consolidation
• Data Integration is a data preprocessing technique that combines data from
multiple heterogeneous data sources into a coherent data store and provides
a unified view of the data. These sources may include multiple data cubes,
databases, or flat files.
• The data integration approaches are formally defined as triple <G, S, M>
where,
• G stand for the global schema,
• S stands for the heterogeneous source of schema,
• M stands for mapping between the queries of source and global schema.

• There are mainly 2 major approaches for data integration – one is the “tight
coupling approach” and another is the “loose coupling approach”.
1) Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and Loading.
2) Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms
it in a way the source database can understand, and then sends the query
directly to the source databases to obtain the result.
• And the data only remains in the actual source databases.
3) Issues in Data Integration:
There are three issues to consider during data integration: Schema Integration,
Redundancy Detection, and resolution of data value conflicts. These are explained in
brief below.
1. Schema Integration:
• Integrate metadata from different sources.
• Matching real-world entities from multiple sources is referred to as the entity identification problem.
2. Redundancy:
• An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting data
set.
• Some redundancies can be detected by correlation analysis (see the sketch after this list).
3. Detection and resolution of data value conflicts:
• This is the third critical issue in data integration.
• Attribute values from different sources may differ for the same real-world
entity.
• An attribute in one system may be recorded at a lower level of abstraction
than the “same” attribute in another.
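A minimal sketch of the correlation analysis mentioned under redundancy, assuming NumPy and two made-up attributes that record the same quantity in different units:

```python
import numpy as np

# Two attributes coming from different sources (illustrative values).
height_cm = np.array([150, 160, 170, 180, 190], dtype=float)
height_in = np.array([59.1, 63.0, 66.9, 70.9, 74.8])   # same quantity, other unit

# A Pearson correlation close to +1 or -1 suggests one attribute is redundant.
r = np.corrcoef(height_cm, height_in)[0, 1]
if abs(r) > 0.95:
    print(f"correlation {r:.3f}: attributes are likely redundant")
```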
20) Describe Data Loading?
• The loading process is the physical movement of the data from the computer systems storing the source database(s) to the one that will store the data warehouse database.
• The whole process of moving data into the data warehouse repository is referred to in the following ways:
• Initial Load: For the very first time, loading all the data warehouse tables.
• Incremental Load: Periodically applying ongoing changes as per the requirement.
• Full Refresh: Deleting the contents of a table and reloading it with fresh data.
• Data Refresh versus Update:
• After the initial load, the data warehouse needs to be maintained and updated, and this can be done by the following two methods:
- Update: application of incremental changes in the data sources.
- Refresh: complete reload at specified intervals.
• Data Loading:
• Data are physically moved to the data warehouse.
• The loading takes place within a "load window".
• The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications.

