DWM Practical

The document provides an introduction to WEKA and R tools, detailing their features, installation procedures, and applications in data analysis and machine learning. WEKA is a Java-based software suite for data mining, offering various algorithms for tasks like classification and clustering, while R is an open-source programming language known for statistical computing and data visualization. The document also outlines practical steps for data preprocessing, emphasizing its importance in improving data quality for analysis.


Practical-1

AIM:- Introduction to WEKA and R tools.


WEKA:- Weka contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to these functions. The
original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modelling
algorithms implemented in other programming languages, plus data preprocessing utilities in C and a
makefile-based system for running machine learning experiments.
This original version was primarily designed as a tool for analyzing data from agricultural domains.
Still, the more recent fully Java-based version (Weka 3), developed in 1997, is now used in many
different application areas, particularly for educational purposes and research. Weka has the following advantages:
 Free availability under the GNU General Public License.
 Portability, since it is fully implemented in the Java programming language and thus runs on
almost any modern computing platform.
 A comprehensive collection of data preprocessing and modelling techniques.
 Ease of use due to its graphical user interfaces.
Weka supports several standard data mining tasks, specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according to the Attribute-Relation File Format (ARFF) and stored in a file with the .arff extension.
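For illustration, a minimal .arff file might look like the following (the relation, attributes, and values here are made up, not taken from any particular dataset):
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes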
All Weka's techniques are predicated on the assumption that the data is available as one flat file
or relation, where a fixed number of attributes describes each data point (numeric or nominal
attributes, but also supports some other attribute types). Weka provides access to SQL
databases using Java Database Connectivity and can process the result returned by a database
query.

R tools:- R is a powerful open-source programming language and software environment used for
statistical computing and graphics. It is widely used in academia, research, and industry for data
analysis, data visualization, and machine learning.
Key Features of R:
1. Data Handling: Efficient handling of data structures like vectors, matrices, data frames, and
lists.

2. Statistical Analysis: Built-in functions for various statistical tests, linear and nonlinear
modeling.
3. Visualization: Provides advanced plotting capabilities (e.g., ggplot2, lattice) for beautiful
visualizations.
4. Machine Learning: Includes various libraries for machine learning (e.g., caret, randomForest, e1071).
5. Integration with Other Tools: Can integrate with databases, big data frameworks, and other
programming languages.
6. Extensibility: R has an extensive package ecosystem available via CRAN for specialized tasks.
Applications of R:
 Statistical analysis and hypothesis testing.
 Data science and machine learning.
 Data visualization for exploratory data analysis.
 Bioinformatics, finance, and social science research.

Practical-2
AIM:- Installation of Weka/ R Tool.
Follow the below steps to install Weka on Windows:-
Step 1: Visit the official Weka website (http://www.cs.waikato.ac.nz/ml/weka/) using any web browser. Click on Free Download.

Step 2: It will redirect to a new webpage; click on Start Download. The download of the executable file will start shortly. The file is about 118 MB, so it may take a few minutes.

Step 3: Now check for the executable file in downloads in your system and run it.

Step 4: It will prompt confirmation to make changes to your system. Click on Yes.

Step 5: Setup screen will appear, click on Next.

Step 6: The next screen will be of License Agreement, click on I Agree.

Step 7: Next screen is of choosing components, all components are already marked so don’t change
anything just click on the Install button.

Step 8: The next screen asks for the installation location, so choose a drive with sufficient free space. The installation requires about 301 MB of disk space.

Step 9: The next screen is for choosing the Start menu folder, so leave the default and just click on the Install button.

Step 10: After this, the installation process will start and will take about a minute to complete.

Step 11: Click on the Next button after the installation process is complete.

Step 12: Click on Finish to finish the installation process.

Step 13: Weka is successfully installed on the system and an icon is created on the desktop.

Step 14: Run the software and see the interface.

Follow the below steps to install R on Windows:-


Step 1:- First, we have to download the R setup from https://cloud.r-project.org/bin/windows/base/.

Step 2:- When we click on Download R 3.6.1 for Windows, the download of the R setup will start. Once the download has finished, we run the R setup in the following way:
1) Select the path where we want to install R and proceed to Next.

2) Select all components which we want to install, and then we will proceed to Next.

3) In the next step, we have to select either customized startup or accept the default, and then we
proceed to Next.

4) When we proceed to next, our installation of R in our system will get started:

5) Finally, we click on Finish to complete the installation of R on our system.

Practical-3
AIM: - Introduction of various components of WEKA/ R tool.
Weka:
Weka is data mining software that uses a collection of machine learning algorithms. These algorithms
can be applied directly to the data or called from the Java code.
Weka is a collection of tools for:
 Regression
 Clustering
 Association
 Data pre-processing
 Classification
 Visualisation

The features of Weka are shown in Figure 1.

Weka’s application interfaces:

Installation of Weka:

You can download Weka from the official website http://www.cs.waikato.ac.nz/ml/weka/.
On Unix-like systems, execute the following commands (csh syntax) to set the Weka environment variables for Java:
setenv WEKAHOME /usr/local/weka/weka-3-0-2
setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH
Once the download is completed, run the exe file and choose the default set-up.
Weka application interfaces:
There are five application interfaces available for Weka in total. When we open Weka, it will start
the Weka GUI Chooser screen from where we can open the Weka application interface.
The Weka GUI screen and the available application interfaces are seen in Figure 2.

Weka data formats:


Weka uses the Attribute Relation File Format for data analysis, by default. But listed below are
some formats that Weka supports, from where data can be imported:
 CSV
 ARFF
 Database using ODBC
Weka Explorer:
1. Preprocessing:
Data preprocessing is a must. There are three ways to load the data for preprocessing:
 Open File – enables the user to select the file from the local machine
 Open URL – enables the user to select the data file from a different location via a URL
 Open Database – enables users to retrieve a data file from a database source
A screen for selecting a file from the local machine to be preprocessed is shown in Figure 5.
After loading the data in Explorer, we can refine the data by selecting different options. We can also
select or remove the attributes as per our need and even apply filters on data to refine the result.

2. Classification:

Selecting a Classifier: At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box with the left mouse button brings up a Generic Object Editor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right click (or Alt+Shift+left click) you can once again copy the setup string to the clipboard or display the properties in a Generic Object Editor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.
Test Options: The result of applying the chosen classifier will be tested according to the options that
are set by clicking in the options box.
There are four test modes:
1. Use training set: The classifier is evaluated on how well it predicts the class of the instances it
was trained on.
2. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation: The classifier is evaluated by cross-validation, using the number of folds that
are entered in the Folds text field.
4. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the
data which is held out for testing. The amount of data held out depends on the value entered in the
% field.
3. Clustering:

Cluster Modes: The Cluster Mode box is used to choose what to cluster and how to evaluate the
results. The first three options are the same as for classification: Use training set, Supplied test set,
and Percentage split.
4. Associating:

Setting Up: This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.
5. Selecting Attributes:

Attribute selection crawls through all possible combinations of attributes in the data to decide
which of these will best fit the desired calculation—which subset of attributes works best for
prediction. The attribute selection method contains two parts.
 Search method: Best-first, forward selection, random, exhaustive, genetic algorithm, ranking
algorithm
 Evaluation method: Correlation-based, wrapper, information gain, chi-squared
All the available attributes are used in the evaluation of the data set by default. But it enables
users to exclude some of them if they want to.
6. Visualization:

The user can see the final piece of the puzzle, derived throughout the process. It allows users to
visualise a 2D representation of data, and is used to determine the difficulty of the learning
problem. We can visualise single attributes (1D) and pairs of attributes (2D), and rotate 3D
visualisations in Weka. It has the Jitter option to deal with nominal attributes and to detect ‘hidden’
data points.
R Tool:
R is open-source software created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
Key Features of R:
Statistical analysis: R has a vast array of statistical libraries and functions that make it easy to
perform complex statistical analysis.

Data visualization: R's ggplot2 library provides an extensive range of data visualization tools,
making it easy to create interactive and informative plots.

Data manipulation: R's dplyr library provides a powerful set of tools for data manipulation,
including filtering, grouping, and summarizing data.
Machine learning: R has a wide range of machine learning libraries, including caret, e1071, and randomForest, making it easy to build and train models.

Practical-4
AIM: Fundamental programming using WEKA/ R tool.
WEKA (Waikato Environment for Knowledge Analysis) is a software suite for machine learning
and data mining. It's user-friendly and provides a variety of algorithms for data analysis.
1. Getting Started:
o Installation: Download WEKA from the official website and install it. It runs as a Java
application.
o Data Import: Load datasets into WEKA using formats like ARFF (Attribute-Relation File
Format) or CSV. You can use the "Explorer" interface to load and preprocess data.
2. Data Preparation:
o Preprocessing: Use the "Preprocess" tab to clean and transform data. This includes handling
missing values, normalizing data, and selecting attributes.
o Visualization: WEKA provides tools to visualize data distributions and relationships.
3. Applying Algorithms:
o Classification: Use the "Classify" tab to apply algorithms like J48 (decision tree), Naive
Bayes, or SVM (Support Vector Machine) to your data.
o Clustering: For unsupervised learning, use the "Cluster" tab to apply clustering algorithms
like K-means or EM (Expectation-Maximization).
o Association Rule Mining: Use the "Associate" tab for algorithms like Apriori to discover
interesting rules in your data.
4. Evaluation:
o Performance Metrics: Evaluate models using metrics such as accuracy, precision, recall, and
F1-score. WEKA provides these metrics in the output after running algorithms.
o Cross-Validation: Use cross-validation to assess the generalization ability of your models.
5. Scripting:
o Command Line: Advanced users can use WEKA’s command-line interface or write scripts in
Java to automate tasks.
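For example, Weka learners can be invoked directly from a terminal. A rough sketch (the jar path, dataset name, and folds value are assumptions, not taken from the text):
java -cp weka.jar weka.classifiers.trees.J48 -t training.arff -x 10
This would train the J48 decision tree on training.arff and evaluate it with 10-fold cross-validation.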
R
R is a programming language and environment for statistical computing and graphics. It is highly
extensible and provides a wide range of packages for data analysis.
1. Getting Started:
o Installation: Download and install R from CRAN. You might also want to install RStudio,
which provides a more user-friendly interface.
2. Basic Programming:
o Data Import: Use functions like read.csv() to load data from CSV files or readRDS() for R-
specific formats.

o Data Frames: R’s primary data structure is the data frame, which you can manipulate using
functions like subset(), merge(), or dplyr package functions.
3. Data Preparation:
o Cleaning: Handle missing values with functions like na.omit() or is.na(). Transform data
using functions from packages like dplyr.
o Visualization: Use the ggplot2 package to create various types of plots and visualize data.
4. Applying Algorithms:
o Classification and Regression: Use functions like glm() for generalized linear models, rpart()
for decision trees, or packages like caret for a unified interface to multiple algorithms.
o Clustering: Apply clustering algorithms such as K-means (kmeans()) or hierarchical
clustering (hclust()).
o Association Rules: Use the arules package to discover association rules.
5. Evaluation:
o Performance Metrics: Evaluate model performance using metrics like accuracy, confusion
matrices, and ROC curves. The caret package provides tools for this.
o Cross-Validation: Use functions from the caret package to perform cross-validation.
6. Scripting:
o R Scripts: Write scripts in R to automate data analysis tasks. R scripts are executed in the R
console or RStudio.

Practical-5
AIM: - Implementing data preprocessing.
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.

Steps of Data Preprocessing:


Data preprocessing is an important step in the data mining process that involves cleaning and transforming raw data to make it suitable for analysis. The common steps in data preprocessing are discussed below.
Why Preprocess the Data?
In the real world, many databases and data warehouses have noisy, missing, and inconsistent data due to their huge size. Low-quality data leads to low-quality data mining results.
Noisy: Containing errors or outliers. E.g., Salary = “-10”
Noisy data may come from
 Human or computer error at data entry.
 Errors in data transmission.
Missing: lacking certain attribute values or containing only aggregate data. E.g., Occupation = ""
Missing (incomplete) data may come from
 "Not applicable" data values when collected.
 Human/hardware/software problems.
Major Tasks in Data Preprocessing:
Data preprocessing is an essential step in the knowledge discovery process, because quality decisions
must be based on quality data. Data preprocessing involves data cleaning, data integration, data reduction, and data transformation.
Steps in Data Preprocessing:
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.

2. Data Integration: This involves combining data from multiple sources to create a unified dataset.
Data integration can be challenging as it requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be used for data integration.

3. Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to convert continuous data into
discrete categories.

4. Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-dimensional space while preserving the
important information.

5. Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical
data. Discretization can be achieved through techniques such as equal width binning, equal frequency
binning, and clustering.

6. Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1
and 1. Normalization is often used to handle data with different units and scales. Common
normalization techniques include min-max normalization, z-score normalization, and decimal
scaling.
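As a small illustration of the cleaning and normalization steps described above, the following Python sketch (using pandas, with made-up values) fills missing values with the attribute mean and then applies min-max and z-score normalization:
import pandas as pd

# Hypothetical toy data with a missing value and attributes on different scales
df = pd.DataFrame({
    "age": [25, 32, None, 51, 46],
    "salary": [30000, 48000, 52000, None, 61000],
})

# Data cleaning: fill missing values with the attribute mean
df = df.fillna(df.mean(numeric_only=True))

# Min-max normalization: scale each attribute to the range [0, 1]
minmax = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: zero mean and unit variance
zscore = (df - df.mean()) / df.std()

print(minmax)
print(zscore)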
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the
analysis results. The specific steps involved in data preprocessing may vary depending on the nature
of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become
more accurate.
Preprocessing in Data Mining
Data preprocessing is a data mining technique which is used to transform the raw data into a useful and efficient format.

Steps Involved in Data Preprocessing


1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
 Missing Data: This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
o Fill the Missing values: There are various ways to do this task. You can choose to fill the missing
values manually, by attribute mean or the most probable value.

 Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
o Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
o Regression: Here data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple independent
variables).

o Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they may fall outside the clusters.

2. Data Transformation: This step is taken in order to transform the data in appropriate forms suitable
for mining process. This involves following ways:
 Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)
 Attribute Selection: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
 Discretization: This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
 Concept Hierarchy Generation: Here attributes are converted from lower level to higher level
in hierarchy. For Example-The attribute “city” can be converted to “country”.
3. Data Reduction: Data reduction is a crucial step in the data mining process that involves reducing
the size of the dataset while preserving the important information. This is done to improve the
efficiency of data analysis and to avoid overfitting of the model. Some common steps involved in
data reduction are:
 Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
 Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original
features are high-dimensional and complex. It can be done using techniques such as PCA,
linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
 Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be done
using techniques such as random sampling, stratified sampling, and systematic sampling.
 Clustering: This involves grouping similar data points together into clusters. Clustering is often
used to reduce the size of the dataset by replacing similar data points with a representative
centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-
based clustering.
 Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gif compression.
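As a minimal sketch of the reduction techniques listed above (assuming scikit-learn is available, and using random data purely for illustration), feature extraction with PCA and simple random sampling can be done as follows:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical 100 x 5 numeric dataset

# Feature extraction: project the data onto its first 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                 # (100, 2)

# Simple random sampling: keep 30 of the 100 rows
sample_idx = rng.choice(len(X), size=30, replace=False)
X_sample = X[sample_idx]
print(X_sample.shape)                  # (30, 5)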

Practical-6
AIM: - Implementing apriori algorithm.
The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association rules. It uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
Problem:
TID ITEMS
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5

To find frequent item sets for the above transactions with a minimum support count of 2 and a confidence measure of 70% (i.e., 0.7).
Procedure:
Step 1: Count the number of transactions in which each item occurs.

ITEM NO. OF TRANSACTIONS
1 2
2 3
3 3
4 1
5 3

Step 2: Eliminate all those items that occur in fewer transactions than the minimum support (2 in this case).

ITEM NO. OF TRANSACTIONS
1 2
2 3
3 3
5 3

This is the single items that are bought frequently. Now let’s say we want to find a pair of items that
are bought frequently. We continue from the above table (Table in step 2).
Step 3: We start making pairs from the first item like 1,2;1,3;1,5 and then from second item like
2,3;2,5. We do not perform 2,1 because we already did 1,2 when we were making pairs with 1 and
buying 1 and 2 together is same as buying 2 and 1 together. After making all the pairs we get,
ITEM PAIRS
1,2
1,3
1,5
2,3
2,5
3,5

Step 4: Now, we count how many times each pair is bought together.

ITEM PAIRS NO. OF TRANSACTIONS
1,2 1
1,3 2
1,5 1
2,3 2
2,5 3
3,5 2

Step 5: Again, remove all item pairs having a number of transactions less than 2.

ITEM PAIRS NO. OF TRANSACTIONS
1,3 2
2,3 2
2,5 3
3,5 2
These pairs of items are bought together frequently. Now, let's say we want to find a set of three items that are bought together. We use the above table (of step 5) and make a set of three items.

Step 6: To make the set of three items we need one more rule (termed self-join): from the item pairs in the above table, we find two pairs with the same first item, so we get (2,3) and (2,5), which join to give (2,3,5). Then we find how many times (2,3,5) is bought together in the original table, and we get the following:
ITEM SET NO. OF TRANSACTIONS
(2,3,5) 2
Thus, the set of three items that are bought together from this data are (2, 3, 5)

Confidence: We can take our frequent item set knowledge even further, by finding association rules
using the frequent item set. In simple words, we know (2, 3, 5) are bought together frequently, but
what is the association between them. To do this, we create a list of all subsets of frequently bought
items (2, 3, 5) in our case we get following subsets:
 {2}
{3}
{5}
{2,3}
{3,5}
{2,5}
Now, we find association among all the subsets.
{2} => {3,5}: (If 2 is bought, what is the probability that 3 and 5 would be bought in the same transaction?)
Confidence = P(2∧3∧5)/P(2) = 2/3 = 67%
{3} => {2,5} = P(2∧3∧5)/P(3) = 2/3 = 67%
{5} => {2,3} = P(2∧3∧5)/P(5) = 2/3 = 67%
{2,3} => {5} = P(2∧3∧5)/P(2∧3) = 2/2 = 100%
{3,5} => {2} = P(2∧3∧5)/P(3∧5) = 2/2 = 100%
{2,5} => {3} = P(2∧3∧5)/P(2∧5) = 2/3 = 67%
Also, considering the remaining 2-item sets, we would get the following associations:
{1} => {3} = P(1∧3)/P(1) = 2/2 = 100%
{3} => {1} = P(1∧3)/P(3) = 2/3 = 67%
{2} => {3} = P(2∧3)/P(2) = 2/3 = 67%
{3} => {2} = P(2∧3)/P(3) = 2/3 = 67%

{2} => {5} = P(2∧5)/P(2) = 3/3 = 100%
{5} => {2} = P(2∧5)/P(5) = 3/3 = 100%
{3} => {5} = P(3∧5)/P(3) = 2/3 = 67%
{5} => {3} = P(3∧5)/P(5) = 2/3 = 67%
Eliminate all those rules having confidence less than 70%. Hence, the rules would be:
{2,3} => {5}, {3,5} => {2}, {1} => {3}, {2} => {5}, {5} => {2}.
Now these manual results should be checked with the rules generated in WEKA
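The manual counts can also be cross-checked with a short Python script. This is only a brute-force sketch of the idea (it enumerates every itemset rather than performing Apriori's level-wise candidate generation), but on this toy data it reproduces the same five rules:
from itertools import combinations

# The four transactions from the problem statement
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUPPORT = 2        # minimum support count
MIN_CONFIDENCE = 0.7   # 70%

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})

# Frequent itemsets of every size (brute force is fine for a toy example)
frequent = []
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= MIN_SUPPORT:
            frequent.append((set(combo), s))

# Association rules A => B with confidence = support(A u B) / support(A)
for itemset, s in frequent:
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = set(antecedent)
            conf = s / support(a)
            if conf >= MIN_CONFIDENCE:
                print(sorted(a), "=>", sorted(itemset - a),
                      "support:", s, "confidence:", round(conf, 2))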

So first create a CSV file for the above problem; it will contain the rows and columns shown in the figure above. This file can be prepared in an Excel sheet and saved in CSV format.
Procedure for running the rules in weka:
Step 1:
Open weka explorer and open the file and then select all the item sets. The figure gives a better
understanding of how to do that.

Step 2:
Now select the association tab and then choose apriori algorithm by setting the minimum support and
confidence as shown in the figure

Step 3:
Now run the apriori algorithm with the set values of minimum support and the confidence. After
running the weka generates the association rules and the respective confidence with minimum
support as shown in the figure.
The above csv file has generated 5 rules as shown in the figure:

Conclusion:
As we have seen, the rules generated manually and the rules generated by WEKA match; in both cases, 5 rules are generated.

Practical-7
AIM: - Implementing classification using decision tree.
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.

o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies:

Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

How does the Decision Tree algorithm Work?


In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
The complete process can be better understood using the below algorithm:

Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is then called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node
(Salary attribute by ASM). The root node splits further into the next decision node (distance from
the office) and one leaf node based on the corresponding labels. The next decision node further
gets split into one decision node (Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offers and Declined offer). Consider the below diagram:

Attribute Selection Measures:

While implementing a Decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. So, to solve such problems there is a technique which is called
as Attribute selection measure or ASM. By this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,
o S= Total number of samples
o P(yes)= probability of yes
o P(no)= probability of no

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj²
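As a small worked illustration of the two measures (a sketch only, using a hypothetical list of class labels rather than any particular dataset):
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini Index = 1 - sum(p_j^2) over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

sample = ["yes", "yes", "no", "yes", "no"]   # hypothetical class column
print(entropy(sample))   # about 0.971
print(gini(sample))      # 0.48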

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision
tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learning tree without
reducing accuracy is known as Pruning. There are mainly two types of tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning.
Advantages of the Decision Tree:
o It is simple to understand as it follows the same process which a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree:
o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Python Implementation of Decision Tree:
Now we will implement the Decision tree using Python. For this, we will use the dataset
"user_data.csv," which we have used in previous classification models. By using the same
dataset, we can compare the Decision tree classifier with other classification models such
as KNN, SVM, Logistic Regression, etc.
Steps will also remain the same, which are given below:
o Data Pre-processing step
o Fitting a Decision-Tree algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result.
1. Data Pre-Processing Step:
Below is the code for the pre-processing step:
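The original code listing is not reproduced here, but a minimal sketch consistent with the description (the file name user_data.csv comes from the text; the column positions for Age, EstimatedSalary, and Purchased are assumptions) would look like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values     # Age and EstimatedSalary (assumed positions)
y = dataset.iloc[:, 4].values          # Purchased (assumed position)

# Split into training and test sets, then scale the features
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)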

In the above code, we have pre-processed the data and loaded the dataset, which is given as:

2. Fitting a Decision-Tree algorithm to the Training set


Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:
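A minimal sketch of this step, using the criterion and random_state values described in the next paragraph:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)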
In the above code, we have created a classifier object, in which we have passed two main parameters:
o criterion='entropy': Criterion is used to measure the quality of a split, which is calculated by the information gain given by entropy.
o random_state=0: For generating the random states.
Output:

3. Predicting the test result


Now we will predict the test set result. We will create a new prediction vector y_pred. Below is the
code for it:
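A minimal one-line sketch of this step, assuming the classifier and x_test from the sketches above:
y_pred = classifier.predict(x_test)   # predictions for the test set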

Output:
In the below output image, the predicted output and real test output are given. We can clearly see that
there are some values in the prediction vector, which are different from the real vector values.
These are prediction errors.

4. Test accuracy of the result (Creation of Confusion matrix)

In the above output, we have seen that there were some incorrect predictions, so if we want to know
the number of correct and incorrect predictions, we need to use the confusion matrix. Below is the
code for it:
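A minimal sketch of this step, assuming the y_test and y_pred vectors from the previous steps:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)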

Output:

In the above output image, we can see the confusion matrix, which has 6+3 = 9 incorrect predictions and 62+29 = 91 correct predictions. Therefore, we can say that, compared to other classification models, the Decision Tree classifier made a good prediction.
5. Visualizing the training set result:

Here we will visualize the training set result. To visualize the training set result we will plot a
graph for the decision tree classifier. The classifier will predict yes or No for the users who have
either Purchased or Not purchased the SUV car as we did in Logistic Regression.

Output:

The above output is completely different from the other classification models. It has both vertical and horizontal lines that split the dataset according to the age and estimated salary variables. As we can see, the tree is trying to capture each data point, which is a case of overfitting.
6. Visualizing the test set result:
Visualization of test set result will be similar to the visualization of the training set except that the
training set will be replaced with the test set.

Output:

Practical-8
AIM:- Implementing classification using decision tree induction.
Decision tree learning is one of the most widely used and practical methods for inductive inference
over supervised data. It represents a procedure for classifying categorical data based on their attributes.
This representation of acquired knowledge in tree form is intuitive and easy to assimilate by humans.
ILLUSTRATION:
Build a decision tree for the following data
AGE INCOME STUDENT CREDIT_RATING BUYS_COMPUTER
Youth High No Fair No
Youth High No Excellent No
Middle aged High No Fair Yes
Senior Medium No Fair Yes
Senior Low Yes Fair No
Senior Low Yes Excellent No
Middle aged Medium Yes Excellent Yes
Youth Low No Fair No
Youth Low Yes Fair Yes
Senior Medium Yes Fair Yes
Youth Medium Yes Excellent Yes
Middle aged High No Excellent Yes
Middle aged High Yes Fair Yes
Senior Medium No Excellent No
The entropy is a measure of the uncertainty associated with a random variable. As uncertainty increases, so does entropy; its values lie in the range [0-1]. The entropy of a data partition D is
Entropy(D) = − Σi pi log2(pi)
Information gain is used as an attribute selection measure; we pick the attribute having the highest information gain. The gain is calculated by:
Gain(D, A) = Entropy(D) − Σj (|Dj|/|D|) × Entropy(Dj)
Where D is a given data partition, A is an attribute, and v is the number of distinct values of A; D is split into v partitions or subsets (D1, D2, ..., Dv), where Dj contains those tuples in D that have outcome aj of A.
Class P: buys_computer = "yes"; Class N: buys_computer = "no"
Entropy(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Compute the expected information requirement for each attribute, starting with the attribute age:
Gain(age, D) = Entropy(D) − [(5/14) Entropy(Syouth) + (4/14) Entropy(Smiddle-aged) + (5/14) Entropy(Ssenior)] = 0.940 − 0.694 = 0.246
Similarly, for the other attributes:
Gain(Income, D) = 0.029
Gain(Student, D) = 0.151
Gain(credit_rating, D) = 0.048

Income Student Credit_rating Class

High No Fair No
High No Excellent No
Medium No Fair No
Low Yes Fair Yes
Medium Yes Excellent Yes

The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age, and the tuples are partitioned accordingly. Now, calculating the information gain for the above sub-table (the age = youth branch), whose entropy (2 yes, 3 no) is I(2, 3) = 0.971:
Income = "high": S11 = 0, S12 = 2, I = 0
Income = "medium": S21 = 1, S22 = 1, I(S21, S22) = 1
Income = "low": S31 = 1, S32 = 0, I = 0
Entropy for income: E(income) = (2/5)(0) + (2/5)(1) + (1/5)(0) = 0.4
Gain(income) = 0.971 − 0.4 = 0.571
Similarly, Gain(student) = 0.971 and Gain(credit_rating) = 0.0208.
Gain(student) is the highest, so student becomes the splitting attribute for this branch.
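The same gains can be computed programmatically. Below is a small Python sketch (the attribute encoding and the few sample rows shown are illustrative; the full 14-tuple table above can be encoded in the same way):
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, class_index=-1):
    """Gain(D, A) = Entropy(D) - sum_j |Dj|/|D| * Entropy(Dj)."""
    labels = [row[class_index] for row in rows]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attr_index]].append(row[class_index])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# (age, income, student, credit_rating, buys_computer) -- a few rows for illustration
rows = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
]
print(info_gain(rows, 0))   # information gain of the 'age' attribute on this sample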
A decision tree for the concept buys_computer indicates whether a customer at All Electronics is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute, and each leaf node represents a class (either buys_computer = "yes" or buys_computer = "no").
First create a CSV file for the above problem; it will contain the rows and columns shown in the figure above. This file can be prepared in an Excel sheet and saved in CSV format.
Procedure for running the rules in weka:
Step 1:
Open weka explorer and open the file and then select all the item sets. The figure gives a better
understanding of how to do that.
Step 2:
Now select the Classify tab in the tool and click on the Start button; we can then see the result of the problem as below.
Step 3:
Compare the result we obtained manually with the result in WEKA by right-clicking on the result and visualizing the tree. The visualized tree in WEKA is as shown below:

Practical-9
AIM:- Implementation k-mean clustering.
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science. In this topic, we will learn what the K-means clustering algorithm is, how the algorithm works, and its Python implementation.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabelled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data point and their corresponding
clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center. The data points which are near to a particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and it is away from other clusters.

The working of the K-Means algorithm is explained in the below steps:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
Python Implementation of K-means Clustering Algorithm
In the above section, we have discussed the K-means algorithm, now let's see how it can be
implemented using Python.
Before implementation, let's understand what type of problem we will solve here. So, we have a
dataset of Mall_Customers, which is the data of customers who visit the mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is a calculated value of how much a customer has spent in the mall; the higher the value, the more he has spent). From this dataset, we need to find some patterns; as it is an unsupervised method, we don't know exactly what to look for.
The steps to be followed for the implementation are given below:
1. Data Pre-Processing
2. Finding the optimal number of clusters using the elbow method
3. Training the k-means algorithm on the training dataset
4. Visualizing the clusters

Step-1: Data pre-processing Step
The first step will be the data pre-processing, as we did in our earlier topics of Regression and
Classification. But for the clustering problem, it will be different from other models. Let's discuss
it:
o Importing Libraries
As we did in the previous topics, firstly, we will import the libraries for our model, which is part of
data pre-processing. The code is given below:
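A minimal sketch of the imports described below (the mtp alias for matplotlib.pyplot matches its later use in the text):
import numpy as np                  # numerical computations
import matplotlib.pyplot as mtp     # plotting
import pandas as pd                 # dataset handling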

In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.
o Importing the Dataset:
Next, we will import the dataset that we need to use. So here, we are using the
Mall_Customer_data.csv dataset. It can be imported using the below code:
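A one-line sketch of the import (the file name is taken from the text and must match the actual file on disk):
dataset = pd.read_csv('Mall_Customer_data.csv')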

By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks
like the below image:

o Extracting Independent Variables


Here we don't need any dependent variable for data pre-processing step as it is a clustering problem,
and we have no idea about what to determine. So we will just add a line of code for the matrix of
features.
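A sketch of that line (the column positions of Annual Income and Spending Score are assumptions based on the column list given earlier):
x = dataset.iloc[:, [3, 4]].values   # Annual Income and Spending Score columns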

As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem. So,
as discussed above, here we are going to use the elbow method for this purpose.

As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS values on
the Y-axis and the number of clusters on the X-axis. So we are going to calculate the value for WCSS
for different k values ranging from 1 to 10. Below is the code for it:
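A sketch of the elbow-method loop consistent with the description below (init='k-means++' and random_state=42 are assumed parameter choices):
from sklearn.cluster import KMeans

wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)   # WCSS for this number of clusters

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()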
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable to initialize an empty list, which is used to contain the
value of wcss computed for different values of k ranging from 1 to 10.
After that, we have initialized the for loop for the iteration over the different values of k ranging from 1 to 10; since the for loop in Python excludes the upper bound, it is taken as 11 to include the 10th value.

The rest part of the code is similar as we did in earlier topics, as we have fitted the model on a matrix
of features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters here will be 5.
Step- 3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, so we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we have used in the above section, but
here instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The
code is given below:
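A sketch of the two lines described below (parameter choices other than n_clusters=5 are assumptions):
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)   # cluster index (0-4) for every customer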
The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the dependent variable y_predict to train the model.
By executing the above lines of code, we will get the y_predict variable. We can check it under the
variable explorer option in the Spyder IDE. We can now compare the values of y_predict with our
original dataset. Consider the below image
Step-4: Visualizing the Clusters
The last step is to visualize the clusters. As we have 5 clusters for our model, so we will visualize
each cluster one by one.
To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.
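A sketch of the plotting step (the colours and labels are illustrative choices, not taken from the text):
colors = ['blue', 'green', 'red', 'cyan', 'magenta']
for c in range(5):
    mtp.scatter(x[y_predict == c, 0], x[y_predict == c, 1],
                s=100, c=colors[c], label='Cluster ' + str(c + 1))
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()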
Output:
The output image is clearly showing the five different clusters with different colors. The clusters are
formed between two parameters of the dataset; Annual income of customer and Spending. We can
change the colors and labels as per the requirement or choice. We can also observe some points from
the above patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as average.
o Cluster2 shows the customer has a high income but low spending, so we can categorize them as
careful.
o Cluster3 shows the low income and also low spending so they can be categorized as sensible.

o Cluster4 shows the customers with low income with very high spending so they can be
categorized as careless.
o Cluster5 shows the customers with high income and high spending so they can be categorized as
target, and these customers can be the most profitable customers for the mall owner.

Practical-10
AIM:- Implementing different Data visualization tools.
As the demand for data visualization and analysis grows, the tools and solutions in this area are developing fast and extensively. Novel 3D visualizations, immersive experiences, and shared VR offices are becoming common alongside traditional web and desktop interfaces. Here are three categories of data visualization technologies and tools for different types of users and purposes.
Tableau:
Tableau is one of the leaders in this field. Startups and global conglomerates like Verizon and Henkel
rely on this platform to derive meaning from data and use insights for effective decision making.

Apart from a user-friendly interface and a rich library of interactive visualizations and data
representation techniques, Tableau stands out for its powerful capabilities.
The platform provides diverse integration options with various data storage, management, and
infrastructure solutions, including Microsoft SQL Server, Databricks, Google BigQuery, Teradata,
Hadoop, and Amazon Web Services. This is a great tool for both occasional data visualizations and
professional data analytics.
The system can easily handle any type of data, including streaming performance data, and allows users to combine visualizations into functional dashboards. Tableau, as part of Salesforce since 2019, invests
in AI and augmented analytics and equips customers with tools for advanced analytics and
forecasting.

FusionCharts:
FusionCharts is a versatile platform for creating interactive dashboards on web and mobile. It offers
rich integration capabilities with support for various frontend and backend frameworks and
languages, including Angular, React, Vue, ASP.NET, and PHP.
FusionCharts caters to diverse data visualization needs, offering rich customization options, pre-built
themes, 100 ready-to-use charts and 2000 maps, and extensive documentation to make developers’
lives easier. This explains the popularity of the platform. Over 800,000 developers and 28,000
organizations such as Dell, Apple, Adobe, and Google, already use this platform.

Sisense:
Sisense is another industry-grade data visualization tool with rich analytics capabilities. This cloud-
based platform has a drag-and-drop interface, can handle multiple data sources, and supports natural
language queries. Sisense dashboards are highly customizable. You can personalize the look and feel,
add images, text, videos, and links, add filters and drill-down features, and transform static
visualizations into interactive storytelling experiences.
The platform has a strong focus on AI and ML to provide actionable insights for users. The platform
stands out for its scalability and flexibility. It’s easy to integrate Sisense analytics and visualizations
using their flexible developer toolkit and SDKs to either build a new data application or embed
dashboards and visualizations into an existing one.

Also in this category: Plotly is a popular platform mainly focused on developing data apps with
Python. It offers rich data visualization tools and techniques and enables integrations with ChatGPT
and LLMs to create visualizations using prompts. Plotly’s open-source libraries for Python, R,
JavaScript, F#, Julia, and other programming languages help developers create various interactive
visualizations, including complex maps, animations, and 3D charts. So if you focus on Python
software development services, consider adding Plotly to your toolset.
IBM Cognos Analytics is known for its NLP capabilities. The platform supports conversational data
control and provides versatile tools for dashboard building and data reporting. The AI assistant uses
natural language queries to build stunning visualizations and can even choose optimal visual data
analysis techniques based on what insights you need to get.

If MongoDB is a part of your stack, consider also MongoDB Charts for your MongoDB data. It
seamlessly integrates with the core platform’s tools and offers various features for creating charts and
dashboards.

Tools for complex data visualization and analytics


The growing adoption of connected technology presents many opportunities to companies and organizations. To deal with large volumes of multi-source, often unstructured data, businesses search
for more complex visualization and analytics solutions. This category includes Microsoft Azure
Power BI, ELK stack Kibana, and Grafana. Power BI is exceptional for its highly intuitive drag-and-
drop interface, short learning curve, and large integration capabilities, including Salesforce and
MailChimp. Not to mention moderate pricing ($10 per month for a Pro version).
Microsoft Power BI:
Thanks to Azure services, Power BI became one of the most robust data visualization and analytics
tools that can handle nearly any amount and any type of data. First of all, the platform allows you to
create customized reports from different data sources and get insights in a couple of clicks. Secondly,
Power BI is powerful and can easily work with streaming real-time data. Finally, it’s not only fully
compatible with Azure and other Microsoft services but also can directly connect to existing apps
and drive analytics to custom systems.
Kibana:
Kibana is the part of the ELK Stack that turns data into actionable insights. It’s built on and designed
to work with Elasticsearch data. This exclusivity, however, does not prevent it from being one of the
best data visualization tools for log data. Kibana allows you to explore various big data visualization
techniques in data science — interactive charts, maps, histograms, etc.
Moreover, Kibana goes beyond building standard dashboards for data visualization and analytics.
This tool will help you leverage various visual data analysis techniques in big data: combine
visualizations from multiple sources to find correlations, explore trends, and add machine learning
features to reveal hidden relationships between events. Drag-and-drop Kibana Lens helps you
explore visualized data and get quick insights in just a few clicks. And a rich toolkit for developers
and APIs come as a cherry on top.

