Data Mining Unit-1
Some people treat data mining as a synonym for Knowledge Discovery from Data (KDD), while others view data mining as merely an essential step in the process of knowledge discovery. The steps involved in the knowledge discovery process are listed below (a minimal code sketch of the pipeline follows the list) −
• Data Cleaning − In this step, noise and inconsistent data are removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
• Data Transformation − In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
• Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
• Pattern Evaluation − In this step, data patterns are evaluated.
• Knowledge Presentation − In this step, the discovered knowledge is represented and presented to the user.
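Below is a minimal, illustrative sketch of this pipeline in Python using pandas. The table names, column names, and the threshold used in the "mining" step are assumptions made up for the example, not part of any standard.

import pandas as pd

# Data integration: combine two illustrative sources on a shared key.
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [120.0, None, 80.0, 300.0]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3], "region": ["N", "S", "S"]})
data = sales.merge(profiles, on="cust_id")

# Data cleaning: drop rows with missing values.
data = data.dropna(subset=["amount"])

# Data selection + transformation: keep relevant columns and aggregate per region.
summary = data[["region", "amount"]].groupby("region")["amount"].sum()

# "Data mining" placeholder: flag regions whose total exceeds an arbitrary threshold.
patterns = summary[summary > 150]

# Pattern evaluation / knowledge presentation: report the result.
print(patterns)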
3. Data Warehouse
• A data warehouse is a collection of data integrated from multiple sources that supports querying and decision making.
• There are three types of data warehouse: Enterprise Data Warehouse, Data Mart, and Virtual Warehouse.
• Two approaches can be used to update data in a data warehouse: the Query-driven Approach and the Update-driven Approach.
• Application: Business decision making, Data mining, etc.
4. Transactional Databases
• A transactional database is a collection of records organized by timestamps, dates, etc., each record representing a transaction.
• This type of database can roll back or undo an operation when a transaction is not completed or committed (a minimal sketch of this follows the list).
• Highly flexible system where users can modify information without changing any sensitive information.
• Follows the ACID properties of a DBMS.
• Application: Banking, Distributed systems, Object databases, etc.
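A minimal sketch of the rollback behaviour described above, using Python's built-in sqlite3 module; the accounts table and the transfer amount are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('A', 100.0), ('B', 50.0)")
conn.commit()

try:
    # A transfer is one transaction: both updates succeed or neither does.
    conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'A'")
    # Simulate a failure (e.g., a crash) before the matching credit is applied.
    raise RuntimeError("simulated failure mid-transaction")
except Exception:
    conn.rollback()  # undo the partial debit so the data stays consistent

print(conn.execute("SELECT * FROM accounts").fetchall())  # balances unchanged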
5. Multimedia Databases
• Multimedia databases contain audio, video, image, and text media.
• They can be stored in object-oriented databases.
• They are used to store complex information in pre-specified formats.
• Application: Digital libraries, video-on-demand, news-on-demand, music databases, etc.
6. Spatial Database
• Stores geographical information.
• Stores data in the form of coordinates, topology, lines, polygons, etc.
• Application: Maps, Global positioning, etc.
7. Time-series Databases
• Time-series databases contain data such as stock exchange data and user-logged activities.
• They handle arrays of numbers indexed by time, date, etc. (a small sketch follows the list).
• They often require real-time analysis.
• Application: eXtremeDB, Graphite, InfluxDB, etc.
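A small sketch of an array of numbers indexed by time, using a pandas DatetimeIndex; the price values and dates are made up for illustration.

import pandas as pd

# Hypothetical closing prices indexed by date.
prices = pd.Series(
    [101.2, 102.8, 101.9, 104.5],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Typical time-indexed queries: slicing by date and a rolling aggregate.
print(prices["2024-01-02":"2024-01-03"])
print(prices.rolling(window=2).mean())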
8. WWW
• WWW (the World Wide Web) is a collection of documents and resources such as audio, video, and text, which are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessed through web browsers over the Internet.
• It is the most heterogeneous repository, as it collects data from multiple sources.
• It is dynamic in nature, as the volume of data is continuously increasing and changing.
• Application: Online shopping, Job search, Research, Studying, etc.
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sale include computers and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of a class
or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways (a small sketch of both appears after the list) −
• Data Characterization − This refers to summarizing the data of the class under study. The class under study is called the target class.
• Data Discrimination − This refers to comparing the features of the target class with those of one or more contrasting (predefined) classes or groups.
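A small, illustrative sketch of characterization (summarizing the target class) and discrimination (comparing it with a contrasting class) using pandas; the customer table and the "big spender" threshold are assumptions made up for the example.

import pandas as pd

customers = pd.DataFrame({
    "spend": [1200, 90, 150, 3000, 60],
    "visits": [10, 2, 4, 15, 1],
})

# Characterization: summarize the target class ("big spenders").
target = customers[customers["spend"] > 1000]
print(target.describe())

# Discrimination: compare the target class against the contrasting class.
contrast = customers[customers["spend"] <= 1000]
print(target.mean() - contrast.mean())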
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two itemsets, in order to determine whether they have a positive, negative, or no effect on each other.
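One simple way to check whether two itemsets affect each other positively, negatively, or not at all is the lift measure; the sketch below computes it over a handful of hypothetical transactions.

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def lift(a, b, txns):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); >1 positive, <1 negative, =1 independent."""
    n = len(txns)
    p_a = sum(a <= t for t in txns) / n
    p_b = sum(b <= t for t in txns) / n
    p_ab = sum((a | b) <= t for t in txns) / n
    return p_ab / (p_a * p_b)

print(lift({"bread"}, {"milk"}, transactions))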
5. Mining of Clusters
A cluster is a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly dissimilar to the objects in other clusters.
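A small sketch of cluster analysis with k-means from scikit-learn, assuming scikit-learn is available; the 2-D points are made up so that they form two obvious groups.

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical 2-D points forming two loose groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # objects within a cluster are close to its centre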
1. Classification (IF-THEN) Rules
2. Prediction
3. Decision Trees
4. Mathematical Formulae
5. Neural Networks
6. Outlier Analysis
7. Evolution Analysis
3. Decision Trees − A decision tree is a structure that includes a root node, branches,
and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes
the outcome of a test, and each leaf node holds a class label.
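A short sketch of a decision tree built with scikit-learn on a toy dataset; the features (age, income) and labels are illustrative assumptions, and export_text prints the tree's internal tests, branches, and leaf labels.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [age, income] -> buys_computer (1 = yes, 0 = no).
X = [[25, 30], [45, 80], [35, 60], [22, 20], [50, 90]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Internal nodes test an attribute, branches are test outcomes, leaves hold class labels.
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 70]]))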
6. Outlier Analysis − Outliers may be defined as the data objects that do not comply
with the general behavior or model of the data available.
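A minimal sketch of one common outlier-detection rule: flag values that lie far from the mean in standard-deviation units. The data and the threshold of 2 standard deviations are illustrative choices, not part of any standard.

import statistics

values = [10.1, 9.8, 10.3, 10.0, 9.9, 25.7, 10.2]  # 25.7 does not comply with the rest

mean = statistics.mean(values)
std = statistics.stdev(values)

# Flag values more than 2 standard deviations away from the mean.
outliers = [v for v in values if abs(v - mean) / std > 2]
print(outliers)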
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
• Set of task relevant data to be mined.
• Kind of knowledge to be mined.
• Background knowledge to be used in discovery process.
• Interestingness measures and thresholds for pattern evaluation.
• Representation for visualizing the discovered patterns.
1. Statistics:
• It uses mathematical analysis to express, model, and summarize empirical data or real-world observations.
• Statistical analysis involves a collection of methods, applicable to large amounts of data, used to draw conclusions and report trends.
2. Machine learning
• Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
• When new data is entered into the computer, machine learning algorithms allow the learned models to grow or change accordingly.
• In machine learning, an algorithm is constructed to predict the data from the available
database (Predictive analysis).
• It is related to computational statistics.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the Internet and the availability of tools and tricks for intruding into and attacking networks have made intrusion detection a critical component of network administration.
Major Issues in data mining:
Data mining is a dynamic and fast-expanding field with great strengths. The major issues can be divided into five groups:
a) Mining Methodology
b) User Interaction
c) Efficiency and scalability
d) Diverse Data Types
e) Data Mining and Society
a) Mining Methodology:
It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
• Mining knowledge in multidimensional space − When searching for knowledge in large datasets, we can explore the data in multidimensional space.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. If data cleaning methods are not applied, the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not useful, so effective interestingness measures are needed.
b) User Interaction:
• Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns. It may be used to express discovered patterns not only in concise terms but also at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable.
2. Binary Attributes: Binary data has only 2 values/states. For example: yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (e.g., Gender).
ii) Asymmetric: Both values are not equally important (e.g., Result).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between successive values is not actually known; the order shows what is important but does not indicate how important it is.
Attribute    Values
Grade        O, S, A, B, C, D, F
5. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.
Example:
Attribute    Values
Profession   Teacher, Businessman, Peon
ZIP Code     521157, 521301
6. Continuous: Continuous data have an infinite number of possible values and are typically of float type. There can be many values between 2 and 3.
Example:
Attribute    Values
Height       5.4, 5.7, 6.2, etc.
Weight       50, 65, 70, 73, etc.
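A small sketch of how these attribute types might be represented in pandas; the column names and values are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "affected": [True, False, True],                       # binary attribute
    "grade": pd.Categorical(["A", "O", "B"],
                            categories=["F", "D", "C", "B", "A", "S", "O"],
                            ordered=True),                  # ordinal: order known, distance not
    "zip_code": ["521157", "521301", "521157"],             # discrete, categorical in nature
    "height": [5.4, 5.7, 6.2],                              # continuous (float)
})

print(df.dtypes)
print(df["grade"].min())  # ordering is meaningful for ordinal attributes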
The data values can be represented as bar charts, pie charts, line graphs, etc.
Quantile plots:
➢ A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
➢ Plots quantile information
• For data x_i sorted in increasing order, f_i indicates that approximately 100*f_i% of the data are less than or equal to the value x_i
➢ Note that
• the 0.25 quantile corresponds to quartile Q1,
• the 0.50 quantile is the median, and
• the 0.75 quantile is Q3.
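A minimal sketch of a quantile plot with matplotlib, using the common convention f_i = (i - 0.5)/N for the fraction of data at or below the i-th sorted value; the sample values are made up.

import matplotlib.pyplot as plt

data = sorted([12, 7, 3, 15, 9, 11, 5, 8])
n = len(data)

# f_i indicates that roughly 100*f_i % of the data are <= x_i.
f = [(i - 0.5) / n for i in range(1, n + 1)]

plt.plot(f, data, marker="o")
plt.xlabel("f-value (quantile)")
plt.ylabel("data value")
plt.title("Quantile plot")
plt.show()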
Scatter Plot:
➢ Scatter plot
• Is one of the most effective graphical methods for determining if there appears to
be a relationship, clusters of points, or outliers between two numerical attributes.
➢ Each pair of values is treated as a pair of coordinates and plotted as points in the plane
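A small matplotlib sketch that treats each pair of attribute values as (x, y) coordinates and plots them as points in the plane; the two attributes and their values are made up.

import matplotlib.pyplot as plt

# Hypothetical paired observations of two numerical attributes.
height = [5.0, 5.4, 5.7, 6.0, 6.2, 5.5]
weight = [48, 55, 63, 72, 80, 58]

plt.scatter(height, weight)
plt.xlabel("height")
plt.ylabel("weight")
plt.title("Scatter plot of two numerical attributes")
plt.show()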
Data Visualization:
Visualization is the use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data.
Categorization of visualization methods:
a) Pixel-oriented visualization techniques
b) Geometric projection visualization techniques
c) Icon-based visualization techniques
d) Hierarchical visualization techniques
e) Visualizing complex data and relations
a) Pixel-oriented visualization techniques
➢ For a data set of m dimensions, create m windows on the screen, one for each
dimension
➢ The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows
➢ The colors of the pixels reflect the corresponding values
➢ To save space and show the connections among multiple dimensions, space filling is
often done in a circle segment
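A rough sketch of the pixel-oriented idea: one window per dimension, with each record's value mapped to a coloured pixel at the same position in every window. The random data, window size, and colour map are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dataset: 400 records, 3 dimensions, values in [0, 1].
rng = np.random.default_rng(0)
data = rng.random((400, 3))

# Sort records by the first dimension so related values line up across windows.
data = data[data[:, 0].argsort()]

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for dim, ax in enumerate(axes):
    # Lay the 400 values of this dimension out as a 20x20 block of pixels.
    ax.imshow(data[:, dim].reshape(20, 20), cmap="viridis")
    ax.set_title(f"dimension {dim + 1}")
    ax.axis("off")
plt.show()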
Examples of hierarchical visualization techniques include InfoCube and Worlds-within-Worlds (figures omitted).
a) Euclidean Distance
Assume that we have measurements x_ik, i = 1, ..., N, on variables k = 1, ..., p (also called attributes).
The Euclidean distance between the ith and jth objects is

d_E(i, j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 )

Note that λ and p are two different parameters: p is the number of variables, while λ is the exponent in the more general Minkowski distance, of which the Euclidean distance is the special case λ = 2. The dimension of the data matrix remains finite.
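A small sketch of this formula with NumPy; the two sample vectors are made up.

import numpy as np

def euclidean(x_i, x_j):
    """Euclidean distance between the ith and jth objects over p variables."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0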