Data Mining

DATA WAREHOUSE - A data warehouse is a relational database specifically designed for query and analysis rather than for transaction processing. It serves as a centralized repository where data from various sources is consolidated, facilitating easier and more efficient data analysis. This foundational structure supports data mining, which involves identifying patterns and extracting valuable information from large data sets.

OLTP SYSTEM - Online Transactional Processing (OLTP) is a data processing system that executes tasks like updating, inserting, or deleting database records. It is different from Online Analytical Processing (OLAP), which is designed for complex data analysis and reporting. OLTP is optimized for real-time updates and transactional processing. OLTP systems are used in many industries and consumer-facing systems, including online banking applications, ATMs, financial transaction systems, online bookings, and ticketing and reservation systems.

OLAP - OLAP, or Online Analytical Processing, is a technology that enables users to perform complex queries and analyze large sets of data from multiple database systems efficiently. It allows analysts, managers, and executives to extract, study, and summarize data for better decision-making and reporting.
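To make the OLTP/OLAP contrast concrete, here is a minimal sketch in Python using the standard-library sqlite3 module; the orders table and its columns are invented for illustration. The OLTP side touches individual rows inside a short transaction, while the OLAP side runs a read-only aggregate over many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style workload: short transactions that insert/update individual rows.
with conn:  # the connection context manager commits (or rolls back) the transaction
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("east", 120.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("west", 80.0))
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (130.0, 1))

# OLAP-style workload: a read-only query that scans and aggregates many rows.
for region, total, n in conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region"
):
    print(region, total, n)
```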
Data warehouses have many advantages and applications in data mining, including:

Data storage: Data warehouses provide a centralized location to store, access, and manage data. This can help reduce costs related to data storage and management.
Data security: Data warehouses can help improve data security by storing data in one location, which makes it easier for a cybersecurity team to secure. Cloud data warehouses can also provide stronger data security and encryption than on-premise data warehouses.
Data consistency: Data warehouses can lead to more accurate data by standardizing data across the business. This helps each department produce consistent results.
Data integration: Data warehouses can help consolidate siloed data through ETL pipelines. This can speed up queries and processing and enable more users to access data.
Scalability: Data warehouses can grow with your business.
Preservation of insights: Data warehouses can retain the output produced by data mining tools.
AI and machine learning: Data warehouses can store summarized data from multiple sources, such as databases, which can be used for machine learning or AI.
TYPES OF DATA WAREHOUSE -
1. Enterprise Data Warehouse (EDW): An Enterprise Data Warehouse refers to a comprehensive data repository that integrates data drawn from different areas of an organisation. It holds the information required by all business units, giving a consolidated and unified view of the organization.
2. Operational Data Store (ODS): An ODS is another form of data warehouse layer that holds data from operational systems in a consolidated and integrated format for near real-time reporting and operational analysis.
3. Data Mart: A Data Mart can be defined as an element of a Data Warehouse system designed to hold data for a particular business division, department, or user type. It is created to serve the specific interests of a specific class of users.
4. Cloud Data Warehouse: A Data Warehousing solution hosted on a cloud platform, offering a scalable platform for effective data storage and analysis.
Characteristics of Data Warehouse -
Subject-oriented: A data warehouse is subject-oriented because it delivers information about a theme rather than about the organization's ongoing operations. The warehousing process is designed around specific, well-defined themes, such as sales, distribution, or marketing. A data warehouse never puts emphasis only on current operations; instead, it focuses on presenting and analysing data to support decision-making, and it delivers an easy and precise view of a particular theme by excluding data that is not required for those decisions.
Integrated: Closely related to subject orientation, integration means establishing a shared representation for all the similar data drawn from different databases. The data must reside in the warehouse in a shared, commonly agreed format. A data warehouse is built by integrating data from various sources, such as a mainframe and a relational database, so it must enforce consistent naming conventions, formats, and codes. Consistency in naming conventions, column scaling, encoding structures, and so on should be confirmed; this integration enables effective analysis of the data.
Time-Variant: Data is maintained over different intervals of time, such as weekly, monthly, or annually. The time horizon of a data warehouse is much wider than that of operational (OLTP) systems. The data residing in the warehouse is identified with a specific interval of time and delivers information from a historical perspective; it contains an element of time, explicitly or implicitly. Another feature of time-variance is that once data is stored in the warehouse it cannot be modified, altered, or updated. Data is stored with a time dimension, allowing analysis of data over time.
Non-Volatile: As the name suggests, the data residing in a data warehouse is permanent: it is not erased or deleted when new data is inserted. Once stored, data is not updated, so the historical record is preserved; the data is read-only and refreshed at particular intervals. This is beneficial for analysing historical data and understanding how the business functions. The warehouse does not need transaction processing, recovery, or concurrency control mechanisms. Operations such as delete, update, and insert, which are routine in an operational application, are absent in the data warehouse environment. The two types of data operations done in the data warehouse are data loading and data access.

TOOLS FOR DATA WAREHOUSE DEVELOPMENT -
Amazon Redshift: A data warehouse service, fully managed by AWS, that can be optimized for a specific use case when analyzing huge volumes of data. It uses a columnar storage model to facilitate querying structured information.
Microsoft Azure: A suite of data warehouse offerings, such as Azure Synapse Analytics, that takes a cloud computing approach. It helps to build, deploy, and manage data warehousing solutions with machine learning capabilities within its architecture.
Oracle Autonomous Data Warehouse: A self-driving cloud data warehouse service by Oracle. It automates administration tasks like provisioning, scaling, and security, simplifying data warehouse management.
PostgreSQL: A powerful, open-source relational database management system (RDBMS) known for its reliability and feature richness. It supports complex queries and integrates well with various BI tools.
In the Top-Down approach, the process begins with a high-level overview of the system, breaking it down into smaller, more manageable components. This method often involves defining the overall goals and structure before diving into the details. It is typically used in scenarios like designing data warehouses, where a comprehensive understanding of the entire system is essential. The Bottom-Up approach starts with detailed data and builds up to a broader perspective. It emphasizes the creation of specific components, such as data marts, which can later be integrated into a larger system. This method is generally more flexible, allowing for adjustments based on specific needs and findings as the analysis progresses.

DATA MINING - Data mining is the process of extracting knowledge and insights from large volumes of data using various statistical and computational techniques. It involves automatically searching through data to identify patterns and trends that may not be apparent through simple analysis methods.

DATA MINING FUNCTIONALITIES -
Classification: Classification is the technique of categorizing elements in a collection on the basis of their predefined properties. In classification, the model can classify new instances whose class is unknown; the particular instances used to create the model are called training data. Such a classification mechanism uses if-then rules, decision trees, neural networks, or a set of classification rules, and the resulting model can be applied to classify future data.
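As a minimal sketch of classification, the snippet below trains a decision tree on scikit-learn's bundled iris dataset and then classifies held-out instances whose labels the model never saw; the split ratio and tree depth are illustrative choices, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is evaluated on unseen instances.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a decision tree on the labeled training data.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify new instances whose class label is unknown to the model.
print("predicted classes:", clf.predict(X_test[:2]))
print("test accuracy:", clf.score(X_test, y_test))
```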
Association Analysis: Association analysis is also called Market Basket Analysis. It is a prevalent data mining methodology, widely used in sales. Association analysis helps to find relations between elements that frequently occur together. It operates on a series of itemsets and produces rules that describe how the items are grouped within the cases. Association rules are used to predict the presence of an element in the database based on the occurrence of another element identified as important.
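The sketch below illustrates the idea on a handful of made-up shopping baskets: it counts how often each pair of items occurs together (support) and how often the presence of one item implies the other (confidence). The baskets are invented for illustration; a real association miner would use an algorithm such as Apriori rather than brute-force counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (market baskets).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n                  # fraction of baskets containing both items
    confidence = count / item_counts[a]  # how often baskets with a also contain b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```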
Cluster Analysis: The cluster analysis process is similar to that of classification: similar data points are grouped together; the only difference is that the class label is unknown. Clustering algorithms divide the data based on similarity, so that the data grouped together are more similar to each other than to the data in other groups. Cluster analysis is used in machine learning, deep learning, image processing, pattern recognition, NLP, etc.
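A minimal clustering sketch: k-means (one common clustering algorithm) groups unlabeled two-dimensional points into clusters. The sample points and the choice of two clusters are invented for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points forming two loose groups in 2-D, made up for the demo.
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # around (1, 1)
    [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],   # around (5, 5)
])

# k-means assigns each point to the nearest of k cluster centers.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", km.labels_)
print("cluster centers:", km.cluster_centers_)
```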
Data Characterization: This involves summarizing the generic features of the data, which can result in specific rules that define a target class. An attribute-oriented induction technique characterises the data without much user intervention or interaction. The resulting characterized data can be visualized through graphs, charts, or tables.

Data Discrimination: This functionality compares the features of a target class of data against those of one or more contrasting classes, helping to separate distinct data sets based on differences in their attribute values.

Data cube classification: The data cube can be classified into two categories:
Multidimensional data cube: It stores large amounts of data by making use of a multidimensional array, and it increases efficiency by keeping an index on each dimension. Thus it is able to retrieve data fast.
Relational data cube: It stores large amounts of data by making use of relational tables. Each relational table displays the dimensions of the data cube. It is slower compared to a multidimensional data cube.

Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for analysis. There are mainly 5 operations, listed below:
Roll-up: aggregates similar data attributes along a dimension, combining them together at a coarser granularity.
Drill-down: the reverse of the roll-up operation. It takes particular information and subdivides it further for finer-granularity analysis; it zooms into more detail.
Slicing: filters out the unneeded portions by fixing a value on one dimension. Suppose that, in a particular dimension, the user does not need everything for analysis but only a particular attribute value.
Dicing: performs a multidimensional cut: rather than cutting only one dimension, it can also go to other dimensions and cut a certain range of each. As a result, it yields a subcube out of the whole cube.
Pivot: important from the viewing point of view. It transforms the data cube in terms of view, rotating the axes; it does not change the data present in the data cube.
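As a rough sketch of these operations, the pandas snippet below treats a small sales table as a cube over the dimensions (region, product, year); the data and column names are made up. Roll-up becomes a coarser groupby, slice and dice become row filters, and pivot merely reshapes the view without changing any values.

```python
import pandas as pd

# Hypothetical fact table: one row per (region, product, year) cell of the cube.
sales = pd.DataFrame({
    "region":  ["east", "east", "west", "west", "east", "west"],
    "product": ["pen",  "book", "pen",  "book", "pen",  "pen"],
    "year":    [2023,   2023,   2023,   2023,   2024,   2024],
    "amount":  [100,    150,    90,     120,    110,    95],
})

# Roll-up: aggregate away the product dimension (coarser granularity).
rollup = sales.groupby(["region", "year"])["amount"].sum()

# Slice: fix a single value on one dimension.
slice_2023 = sales[sales["year"] == 2023]

# Dice: select values/ranges on several dimensions at once (a subcube).
dice = sales[(sales["year"] == 2023) & (sales["region"] == "east")]

# Pivot: rotate the view; the underlying numbers are unchanged.
view = sales.pivot_table(index="region", columns="year", values="amount", aggfunc="sum")
print(rollup, slice_2023, dice, view, sep="\n\n")
```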
Data cubes in data mining are a multidimensional array of values, primarily used for online analytical processing (OLAP). They allow for the organization and summarization of large datasets, enabling users to view and analyze data across various dimensions, such as time, geography, and product categories. This structure facilitates quick and efficient data analysis, making it easier to derive insights from complex datasets.

Functions of Data Warehouse: A data warehouse works as an organized collection of data that supports functions for retrieving and analysing it; it keeps stored facts about the tables with high transaction levels, which are observed in order to define the data warehousing techniques. The major functions involved are mentioned below (a small ETL-style sketch of the first three follows the list):
Data Consolidation: The process of combining multiple data sources into a single data repository in a data warehouse. This ensures a consistent and accurate view of the data.
Data Cleaning: The process of identifying and removing errors, inconsistencies, and irrelevant data from the data sources before they are integrated into the data warehouse. This helps ensure the data is accurate and trustworthy.
Data Integration: The process of combining data from multiple sources into a single, unified data repository in a data warehouse. This involves transforming the data into a consistent format and resolving any conflicts or discrepancies between the data sources. Data integration is an essential step in the data warehousing process to ensure that the data is accurate and usable for analysis; data from multiple sources can be integrated into a single data repository for analysis.
Data Storage: A data warehouse can store large amounts of historical data and make it easily accessible for analysis.
Data Transformation: Data can be transformed and cleaned to remove inconsistencies, duplicate data, or irrelevant information.
Data Analysis: Data can be analyzed and visualized in various ways to gain insights and make informed decisions.
Data Reporting: A data warehouse can provide various reports and dashboards for different departments and stakeholders.
Data Mining: Data can be mined for patterns and trends to support decision-making and strategic planning.
Performance Optimization: Data warehouse systems are optimized for fast querying and analysis, providing quick access to data.
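As referenced above, here is a minimal ETL-style sketch in pandas covering consolidation, cleaning, and integration of two made-up source extracts; every table and column name here is hypothetical.

```python
import pandas as pd

# Extract: two hypothetical source systems with inconsistent conventions.
crm = pd.DataFrame({"cust_id": [1, 2, 2], "region": ["East", "west", "west"]})
billing = pd.DataFrame({"customer": [1, 2, 3], "amount": [100.0, None, 250.0]})

# Clean: drop duplicates, standardize codes, handle missing values.
crm = crm.drop_duplicates(subset="cust_id")
crm["region"] = crm["region"].str.lower()          # consistent encoding
billing["amount"] = billing["amount"].fillna(0.0)  # simple imputation policy

# Integrate: resolve the naming conflict, then consolidate into one repository.
billing = billing.rename(columns={"customer": "cust_id"})
warehouse = crm.merge(billing, on="cust_id", how="outer")

# Load: in a real pipeline this frame would be written to the warehouse tables.
print(warehouse)
```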
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.

Steps of Data Preprocessing - Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation, removal, and transformation.
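A minimal cleaning sketch in pandas on a made-up table: drop a duplicate row, impute a missing value, and flag an implausible outlier. The column names and the validity range used for outlier detection are illustrative assumptions.

```python
import pandas as pd

# Hypothetical raw table with a missing value, a duplicate row, and an outlier.
df = pd.DataFrame({"age": [25, None, 31, 31, 980], "city": ["A", "B", "C", "C", "D"]})

df = df.drop_duplicates()                         # remove the duplicate record
df["age"] = df["age"].fillna(df["age"].median())  # impute the missing age with the median

# Flag outliers with a simple validity rule: ages outside a plausible range.
is_valid = df["age"].between(0, 120)
print("outliers:\n", df[~is_valid])   # the age-980 row
print("cleaned:\n", df[is_valid])
```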
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration.

Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, while standardization is used to transform the data to have zero mean and unit variance. Discretization is used to convert continuous data into discrete categories.

Data Reduction: This involves reducing the size of the dataset while preserving the important information. Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature extraction involves transforming the data into a lower-dimensional space while preserving the important information.
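A minimal sketch of both reduction styles on scikit-learn's bundled iris data: feature selection keeps a subset of the original columns, while feature extraction (here PCA, one common choice) projects the data into a lower-dimensional space. Reducing four features to two is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
print("original shape:", X.shape)        # (150, 4)

# Feature selection: keep the 2 original features most related to the label.
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)
print("after selection:", X_sel.shape)   # (150, 2)

# Feature extraction: project onto 2 new axes that preserve most of the variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("after PCA:", X_pca.shape, "variance kept:", pca.explained_variance_ratio_.sum())
```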
Data Discretization: This involves dividing continuous data into discrete categories or intervals. Discretization is often used in data mining and machine learning algorithms that require categorical data. Discretization can be achieved through techniques such as equal-width binning, equal-frequency binning, and clustering.

Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1 and 1. Normalization is often used to handle data with different units and scales. Common normalization techniques include min-max normalization, z-score normalization, and decimal scaling.
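To close, a minimal sketch of these last two steps on a made-up value column: min-max normalization rescales into [0, 1], z-score normalization centers to zero mean and unit variance, and equal-width binning discretizes the range into three intervals. The column name and bin count are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"value": [10.0, 20.0, 30.0, 40.0, 100.0]})  # hypothetical column

# Min-max normalization: x' = (x - min) / (max - min), mapping values into [0, 1].
df["minmax"] = (df["value"] - df["value"].min()) / (df["value"].max() - df["value"].min())

# Z-score normalization: x' = (x - mean) / std, giving zero mean and unit variance.
df["zscore"] = (df["value"] - df["value"].mean()) / df["value"].std()

# Equal-width discretization: split the value range into 3 equal-width intervals.
df["bin"] = pd.cut(df["value"], bins=3, labels=["low", "mid", "high"])
print(df)
```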
