DVDA Laboratory Manual for Data Analysis
EXPERIMENT NO: 1
AIM: Use MS-Excel to create pivot table & apply statistical measures to it.
CODE:
Step 1: Create a normal table.
Step 2: Fill the table with random values using the formula =RANDBETWEEN(m, n),
where m is the starting number and n is the ending number.
Step 3: After filling in all the data, take the sum of all subjects for each student
using the formula =SUM(cell1:cellN). After doing this for one student, drag the
formula downwards so it calculates the sum of marks for every student.
Step 4: Compute each student's average as =sum/n,
where n is the total number of subjects and sum is the sum of all subjects.
Step 5: To find the grade of the student, use a nested if-else condition, with a
formula like e.g.
=IF(I2>=60,"A",IF(I2>=50,"B",IF(I2>=40,"C","F")))
Step 6: Select the table, click Insert and then PivotTable, choose whether you want
the pivot table in the existing sheet or in a new sheet, then press OK.
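The same workflow can be sketched in Python with pandas; the student names, subjects and mark ranges below are illustrative, not taken from the manual's worksheet:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Random marks for 5 students in 3 subjects, mirroring =RANDBETWEEN(20, 100)
df = pd.DataFrame({
    "Student": ["Asha", "Ben", "Chitra", "Dev", "Esha"],
    "Maths": rng.integers(20, 101, 5),
    "Physics": rng.integers(20, 101, 5),
    "Chemistry": rng.integers(20, 101, 5),
})

# Total per student, mirroring the =SUM step
df["Total"] = df[["Maths", "Physics", "Chemistry"]].sum(axis=1)
avg = df["Total"] / 3  # average over n = 3 subjects

def grade(a):
    # Same thresholds as the nested IF in the worksheet
    if a >= 60:
        return "A"
    if a >= 50:
        return "B"
    if a >= 40:
        return "C"
    return "F"

df["Grade"] = avg.map(grade)

# A pivot table of average marks per grade, mirroring Insert > PivotTable
pivot = pd.pivot_table(df, index="Grade",
                       values=["Maths", "Physics", "Chemistry"],
                       aggfunc="mean")
print(pivot)
```

The `aggfunc` argument plays the role of the statistical measure chosen in the PivotTable field settings; swapping in `"sum"`, `"median"`, or `"count"` reproduces the other measures.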
OUTPUT:
EXPERIMENT NO: 2
Aim: Use the table created in the above practical to generate different charts.
Code:
Column Labels
Values          MOTO    Grand Total
Sum of JAN       212            212
Sum of FEB       469            469
Sum of MAR       161            161
Sum of APR       150            150
Sum of MAY       125            125
Sum of JUN       297            297
Sum of JUL       438            438
Sum of AUG       381            381
Sum of SEP       398            398
Sum of OCT       571            571
Sum of NOV       445            445
Sum of DEC       288            288
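The charts can also be generated programmatically; this sketch plots the MOTO column of the pivot table above as a column chart with matplotlib:

```python
import matplotlib.pyplot as plt

months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
# Monthly sums for MOTO, copied from the pivot table above
moto = [212, 469, 161, 150, 125, 297, 438, 381, 398, 571, 445, 288]

plt.bar(months, moto)                 # column chart, one bar per month
plt.title("MOTO")
plt.ylabel("Sum of monthly sales")
plt.show()
```

Replacing `plt.bar` with `plt.plot` or `plt.pie` gives the other chart types Excel offers from the same pivot data.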
OUTPUT:
[Column chart: monthly sums (Sum of JAN to Sum of DEC) for MOTO, y-axis 0 to 600]
[Clustered column chart: monthly sums (Sum of JAN to Sum of DEC) for each brand
(GOOGLE, IPHONE, IQOO, MOTO, NOKIA, ONE PLUS, OPPO, SAMSUNG, VIVO), y-axis 0 to 800]
[Stacked column chart: brand totals stacked by month (Sum of SEP to Sum of DEC
visible in legend), y-axis 0 to 6000]
EXPERIMENT NO: 3
AIM: Perform histogram analysis of a given dataset using the Data Analysis
ToolPak of Excel.
Code:
Step 1: Create a table of 50 students' marks in 5 subjects and fill the data with
random values using the formula/function =RANDBETWEEN(m, n), as in Experiment 1.
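Behind the ToolPak dialog, a histogram is just a frequency count over a bin range; a Python sketch of the same computation (the random marks and bin edges are illustrative) is:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
marks = rng.integers(0, 101, 50)   # 50 students' marks in one subject

bins = [0, 20, 40, 60, 80, 100]    # the "bin range" from the ToolPak dialog
freq, edges = np.histogram(marks, bins=bins)
print(dict(zip(["0-20", "20-40", "40-60", "60-80", "80-100"], freq)))

plt.hist(marks, bins=bins, edgecolor="black")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.title("Histogram of marks")
plt.show()
```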
Output:
EXPERIMENT NO: 4
AIM: Use python libraries to generate chart from data stored in excel.
CODE:
import pandas as pd
import matplotlib.pyplot as plt

# The workbook path follows Experiment 5; adjust it to your own file
d = pd.read_excel(r'H:\CODES\PYTHON\JEET.xlsx')
meanm = d["Maths"].mean()
medianm = d["Maths"].median()
modem = d["Maths"].mode()
meanp = d["Physics"].mean()
medianp = d["Physics"].median()
modep = d["Physics"].mode()
cor = d.corr(numeric_only=True)
print("Maths:")
print("Mean: ",meanm)
print("Median: ",medianm)
print("Mode: ",modem)
print("Physics:")
print("Mean: ",meanp)
print("Median: ",medianp)
print("Mode: ",modep)
print("Correlation: ")
print(cor)
# Plot the marks so plt.show() has a chart to display (assumed intent of the listing)
d.plot(kind='bar')
plt.show()
OUTPUT:
EXPERIMENT NO: 5
AIM: Perform multiple linear regression on data.
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
df=pd.read_excel(r'H:\CODES\PYTHON\JEET.xlsx')
df.head()
x=df.drop(['V'],axis=1).values
y=df['V'].values
print(x)
print(y)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
ml=LinearRegression()
ml.fit(x_train,y_train)
y_pred=ml.predict(x_test)
print(y_pred)
print(r2_score(y_test,y_pred))
plt.figure(figsize=(15,10))
plt.scatter(y_test,y_pred)
plt.xlabel('Original')
plt.ylabel('Predicted')
plt.title('Original vs Predicted')
plt.show()
pred_y_df=pd.DataFrame({'Original Value':y_test,'Predicted Value':y_pred,'Difference':y_test-y_pred})
print(pred_y_df[0:20])
OUTPUT
EXPERIMENT NO: 6
AIM: Perform the Logistic Regression on a dataset and interpret the
regression table.
CODE:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
digits = load_digits()
print(dir(digits))
print(digits.data[4])
plt.gray()
plt.matshow(digits.images[1])
print(digits.target[0:5])
x_train,x_test,y_train,y_test = train_test_split(digits.data,digits.target,test_size=0.2)
print(len(x_test))
logistic = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on digits
logistic.fit(x_train,y_train)
print(logistic.score(x_test,y_test))
plt.matshow(digits.images[84])
print(digits.target[84])
print(logistic.predict([digits.data[84]]))
OUTPUT
EXPERIMENT NO: 7
AIM: Use a dataset and apply KNN to get insights from data
CODE:
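The original code listing is not reproduced in this copy of the manual. As a hedged sketch, KNN classification on scikit-learn's built-in Iris dataset (the dataset and parameter choices are illustrative) might look like:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Classify each test sample by majority vote of its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
score = knn.score(x_test, y_test)
print("Accuracy:", score)
```

Varying `n_neighbors` and re-checking the accuracy is the usual way to pick k for a given dataset.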
EXPERIMENT NO: 8
AIM: Use a dataset and apply K-means clustering to get insights from data
CODE:
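The original code listing is not reproduced in this copy of the manual. As a hedged sketch, K-means clustering on scikit-learn's built-in Iris dataset (the dataset and k = 3 are illustrative) might look like:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# Partition the 150 samples into 3 clusters by minimising within-cluster variance
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(iris.data)

# Inspect cluster sizes and centres to get a feel for the grouping
sizes = np.bincount(labels)
print("Cluster sizes:", sizes)
print("Cluster centres:\n", km.cluster_centers_)
```

In practice k is chosen by plotting `km.inertia_` for several values of k and looking for the "elbow" in the curve.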
EXPERIMENT NO: 9
AIM: Study tools such as Orange, Tableau, and Weka for data visualization.
Weka Tool:
Weka is a very popular open-source data mining tool developed at the University of
Waikato in New Zealand in 1992. It is a Java-based tool and can be used to
implement various machine learning and data mining algorithms written in Java. The
simplicity of using Weka has made it a landmark for machine learning and data
mining implementation. Weka supports reading files from several different databases
and also allows importing data from the internet, from web pages, or from a
remotely located SQL database server by entering the URL of the resource. Among the
available data mining tools, Weka is one of the most commonly used due to its fast
performance and its support for the major classification and clustering algorithms.
Weka can be easily downloaded and deployed. Weka provides both a GUI and a CLI for
performing data mining and does a good job of supporting all the data mining
tasks [16]. Weka supports a variety of data formats like CSV (Comma-Separated
Values), ARFF and binary. Weka focuses more on textual representation of the data
than on visualization; it does provide some visualizations, but those are very
generic. Also, Weka does not present the results of processing in as effective and
understandable a manner as RapidMiner. Weka performs accurately when the data set
is not large; with large data sets it does experience some performance issues.
Weka provides support for filtering out data or attributes.
Tableau Tool:
Tableau is a powerful data visualization tool used in business intelligence
and data analysis. Tableau Software was founded by Chris Stolte, Christian Chabot
and Pat Hanrahan in January 2003 [18]. The visualization provided by Tableau has
greatly enhanced the ability to gain knowledge about the data we are working on and
can be used to provide more accurate predictions. "The product queries relational
databases, cubes, cloud databases, and spreadsheets and then generates a number of
graph types that can be combined into dashboards which can be securely shared over
a computer network or the internet" [18]. Unlike RapidMiner and Weka, Tableau does
not implement data mining algorithms; it provides visualizations of the data. For
data mining support, Tableau integrates with another popular statistical analysis
tool, R. "Tableau offers five main products, namely Tableau Desktop, Tableau
Server, Tableau Online, Tableau Reader and Tableau Public. Tableau Public and
Tableau Reader are available freely, whereas Tableau Server and Tableau Desktop
come with a free trial period after which the user has to pay." Tableau has made it
possible to explore and present data in a much simpler and more attractive manner.
Working on projects using Tableau is less time-consuming and easy to handle.
Tableau uses a feature called a Dashboard, which is a collection of worksheets
that can easily be imported from anywhere.
Orange Tool:
Orange is a software suite for machine learning and data mining. It is
component-based and particularly well suited to data visualization.
Its components are called 'widgets'.
Widgets offer major functionalities such as:
• Showing the data table and allowing feature selection
• Reading the data
• Training predictors and comparing learning algorithms
• Visualizing data elements
Additionally, it brings a more interactive and fun feel to otherwise dull analytic
tools, and it is quite interesting to operate.
PRE-PROCESSING IN WEKA
Step 1: Start Weka.
Step 2: From the Filters tab, select any filters you want to apply to the data set.
Step 3: Select the attributes from the data set you want to keep for analysis.
In this example, we have selected all the attributes for sample visualization.
Again, depending on your needs, you can select however many attributes you want
for the implementation.
STEPS IN WEKA
Depending on which classification algorithm you want to implement, select the
particular classifier.
Step 4: Double-click on the classifier to open the NaiveBayes object editor.
Step 5: Select the attribute to use as the class label.
Step 6: After selecting the label, click Start to begin execution of the
classification algorithm.
Step 7: The classification of the various attributes based on the label will be
displayed in the result screen.
EXPERIMENT NO: 10
AIM: Given a case study: Interactive Data Analytics with Power BI.
1. HEATHROW
• Heathrow Airport is an international airport in London. It is the second busiest
international airport in the world after Dubai International Airport, and also
the seventh largest in terms of total passenger traffic.
THE CHALLENGE
• Being the world’s seventh busiest airport in overall passenger traffic, one can only
imagine the level of efficiency and efforts expected from the airport’s ground
management to keep the airport functioning properly. Managing over 200,000
passengers every day can be quite a challenging task for airport authorities and
ground staff. Every department needs to be in absolute coordination and sync to be
able to manage the passenger traffic and give them a smooth experience at the
airport. At such busy airports, every day brings new challenges and
uncertainties. Unexpected disruptions in the smooth workflow of operations at
the airport, caused by stormy weather, delayed flights, cancelled flights,
shifts in jet streams, and so on, disturb the airport's entire functioning.
Such problems send the passengers as well as the airport employees into turmoil.
• The airport needed a central digitalized management system as a solution to this
problem. Such a system would use the large amounts of data being produced by
operational systems at the airport and transform it into useful visual insights. The
interpretations produced by the BI tool can be used by airport staff for better
functioning and passenger management.
THE CHANGE
• Heathrow group went with Microsoft Power BI as their business intelligence
software and Microsoft Azure for cloud services. The airport has deployed Microsoft
Azure technology to collect data from back-end operational systems at the airport.
These systems are check-in counters, baggage tracking systems, flight schedules,
and weather tracking systems, cargo tracking and many more.
• The operational data from these systems are forwarded to business intelligence
platforms like Power BI. In Power BI, users shape this data into useful information
that the airport staff can use.
• Services such as Azure Stream Analytics, Azure Data Lake Analytics, and Azure SQL
Database are used to extract, clean and prepare operational data in real-time. This
data is about flight movements, security queues, passenger transfers, and
immigration queues. Ultimately, Power BI uses data from these Azure services for
analysis and interpretation.
• Operational data from different data sources come into Power BI. Then Power BI
tools are used to transform that data into meaningful insights with the help of visual
reports, graphics, and dashboards. About 75,000 airport employees have
information at their fingertips by virtue of Power BI.
• Let us understand this with the help of a real-world example. If there is a change in
the jet stream, it may delay about 20 flights in a day. This will result in about 6,000
passengers waiting at the airport at a given point of time. It will increase passenger
traffic and density at the airport. Power BI works like the centralized information
system. The airport uses it to inform about the sudden passenger influx. This
information goes out to different sections of the airport, such as food outlets,
immigration, customs, gate attendants, and baggage handlers, giving them time
to prepare to attend to the passengers.
• With the presence of smart BI solutions like Power BI, airport staff is notified in
advance about the probable delays and the sudden rush of passengers at the airport.
This helps management groups and other employees take suitable actions in
advance, like increasing the food stock, adding extra passenger buses, increasing
the ground staff, directing passengers to the waiting area, etc., to avoid any
last-minute hassle.
• Thus, with the help of a powerful BI tool like Power BI, Heathrow has benefited
in more than one way. They are extremely satisfied with how Power BI helps them
give a hassle-free airport experience to their passengers. Heathrow is also
extending its Power BI applications by trying to anticipate passenger flow at
the airport, to avoid any unexpected disruptions for the passengers.
Step 2: After that, download a dataset for performing the data analysis. Here I
have downloaded a dataset of chocolate sales exported to different countries,
containing information about each dealer.
Step 3: Then open Power BI. The home page will look like the page below.