Datascience Unit 02 1
UNIT-02
WHAT IS A DATA WAREHOUSE?
• Complexity: OLAP systems can be complex to set up and maintain, requiring specialized technical
expertise.
• Data size limitations: OLAP systems can struggle with very large data sets and may require
extensive data aggregation or summarization.
• Performance issues: OLAP systems can be slow when dealing with large amounts of data,
especially when running complex queries or calculations.
• Data integrity: Inconsistent data definitions and data quality issues can affect the accuracy of
OLAP analysis.
• Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the need
for specialized hardware and software.
• Inflexibility: OLAP systems may not easily accommodate changing business needs and may
require significant effort to modify or extend.
DATA PREPROCESSING IN DATA MINING
• Data cleaning helps us remove inaccurate, incomplete, and incorrect data from
the dataset. Some techniques used in data cleaning are −
• Binning − This method smooths noisy data. The sorted data is divided into
equal-sized bins, and each bin is then smoothed, for example by replacing its
values with the bin mean, the bin median, or the nearest bin boundary.
• Regression − Regression functions are used to smooth the data by fitting it
to a function. Regression can be linear (one independent variable) or
multiple (several independent variables).
• Clustering − Groups similar data into clusters. Values that fall outside
every cluster can be treated as outliers.
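The binning and regression techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `smooth_by_bin_means` and `smooth_by_linear_regression` and the sample `prices` list are hypothetical, and the binning variant shown assumes equal-frequency bins smoothed by their means.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]  # last bin may be shorter
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


def smooth_by_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares and return the
    fitted values. Assumes at least two distinct x values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


# Hypothetical sample data, used only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries follows the same loop; only the value substituted for each bin changes.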
DATA INTEGRATION
• In this step, the format or structure of the data is changed to make it suitable for the
mining process. Methods for data transformation are −
• Normalization − Scales the data so that it falls within a specific smaller range (for
example, -1.0 to 1.0).
• Discretization − Divides continuous data into intervals, which also reduces the data size.
• Attribute Selection − New attributes are derived from the given attributes to help the
mining process.
• Concept Hierarchy Generation − Attributes are changed from a lower level to a higher
level in the hierarchy.
• Aggregation − A summary of the data is stored. How useful the summary is depends on the
quality and quantity of the data.
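Normalization and discretization, as described above, can be illustrated with a short Python sketch. It assumes min-max scaling for normalization and equal-width intervals for discretization. Both are common choices, but the notes do not say which variant is intended, and the function names and sample `ages` list are hypothetical.

```python
def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical: map them to the middle of the range.
        return [(new_min + new_max) / 2.0 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]


def discretize_equal_width(values, n_bins):
    """Replace each continuous value with the index (0 .. n_bins - 1)
    of the equal-width interval it falls into."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample data, used only for illustration.
ages = [18, 25, 40, 60, 78]
print(min_max_normalize(ages))
print(discretize_equal_width(ages, 3))
```

After discretization each age is represented by a small interval index instead of its exact value, which is how the method reduces the data size.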
DATA REDUCTION