Data Science Python Notebook (1)

This document is a Jupyter notebook focused on data visualization using Python, specifically exploring weather data from Jaipur. It introduces essential Python libraries such as Pandas, NumPy, and Matplotlib, and guides users through data manipulation techniques including reading CSV files, exploring datasets, and sorting values. The notebook provides practical examples and additional resources to enhance understanding of data visualization in data science.

Uploaded by

shivankwadhwa
Copyright
© All Rights Reserved

29/04/2023, 09:39 2023 Data Science Python Notebook.ipynb - Colaboratory

Data Sciences

Introduction
Data visualization is part of data exploration, which is a critical step in the AI cycle. You will use this technique to understand and gain
insights into the data you have gathered, and to decide whether the data is ready for further processing or whether you need to collect more
data or clean it.

You will also use this technique to present your results.

In this notebook, we will explore Python packages that are crucial for Data Sciences. Packages like Pandas, NumPy and Matplotlib are used
throughout the process.

About the Notebook


This Jupyter notebook focuses on Data Visualisation in Python. To help learners understand it as well as possible, many additional resources
have been provided in the notebook as links. Readers can follow those links to explore the subject further.

Context
We will be working with Jaipur weather data obtained from Kaggle, a platform for data enthusiasts to gather, share knowledge and compete for
many prizes!

The data has been cleaned and simplified, so that we can focus on data visualization instead of data cleaning. Our data is stored in the file
named mydata.csv. This file contains weather information of Jaipur and is saved at the same location as the notebook.

What do you do next?

Side note: What is csv?


A CSV (Comma-Separated Values) file is a plain-text file that stores a table of data, with the values in each row separated by commas.

We usually open these files with spreadsheet applications such as Excel or Google Sheets. Do you know how this is done?

Today, we will learn how to use Python to open csv files.
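
To make the format concrete, here is a small sketch that parses CSV text with Python's built-in csv module. The three rows are copied from the first rows of the weather data shown later in this notebook; the variable names are only illustrative. Note that every value comes back as text, which is one reason we switch to pandas next.

```python
import csv
import io

# CSV text: the first line holds the column names, each later line is one row.
sample_text = """date,mean_temperature,rainfall
2016-05-04,34,0.0
2016-05-05,31,0.0
2016-05-06,28,5.0
"""

# csv.DictReader splits each line at the commas and pairs the values
# with the column names from the header line.
rows = list(csv.DictReader(io.StringIO(sample_text)))

for row in rows:
    # Every value is still a string at this point, e.g. "34" rather than 34.
    print(row["date"], row["mean_temperature"], row["rainfall"])
```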

Use Python to open csv files


We will use the pandas library to work with our dataset. Pandas is a popular Python library for data science. It offers powerful and flexible data
structures that make data manipulation and analysis easier.

Import Pandas

import pandas as pd #import pandas as pd means we can type "pd" to call the pandas library

Now that we have imported pandas, let's start by reading the csv file.

#saving the csv file into a variable which we will call data frame
dataframe = pd.read_csv("mydata.csv")

Exploring our data

Great! We now have a variable that contains our weather data. Let's explore it. Use the .head() method to see the first few rows of data.

#dataframe.head() means we are getting the first 5 rows of data


# try running it to see what data is in the jaipur csv file
print (dataframe.head())

date mean_temperature max_temperature min_temperature \


0 2016-05-04 34 41 27
1 2016-05-05 31 38 24
2 2016-05-06 28 34 21
3 2016-05-07 30 38 23
4 2016-05-08 34 41 26

Mean_dew_pt mean_pressure max_humidity min_humidity max_dew_pt_1 \


0 6 1006.00 27 5 12

1 7 1005.65 29 6 13
2 11 1007.94 61 13 16
3 13 1008.39 69 18 17
4 10 1007.62 50 8 14

max_dew_pt_2 min_dew_pt_1 min_dew_pt_2 max_pressure_1 max_pressure_2 \


0 10 -2 -2 1009 1008
1 12 0 -2 1008 1009
2 13 6 0 1011 1008
3 16 9 6 1011 1011
4 17 6 9 1010 1011

min_pressure_1 min_pressure_2 rainfall


0 1000 1001 0.0
1 1001 1000 0.0
2 1003 1001 5.0
3 1004 1003 0.0
4 1002 1004 0.0

Display the first 10 rows of data by modifying the function above

print (dataframe.head(10))

date mean_temperature max_temperature min_temperature \


0 2016-05-04 34 41 27
1 2016-05-05 31 38 24
2 2016-05-06 28 34 21
3 2016-05-07 30 38 23
4 2016-05-08 34 41 26
5 2016-05-09 34 42 27
6 2016-05-10 34 41 27
7 2016-05-11 32 40 25
8 2016-05-12 34 42 27
9 2016-05-13 34 42 26

Mean_dew_pt mean_pressure max_humidity min_humidity max_dew_pt_1 \


0 6 1006.00 27 5 12
1 7 1005.65 29 6 13
2 11 1007.94 61 13 16
3 13 1008.39 69 18 17
4 10 1007.62 50 8 14
5 8 1006.73 32 7 12
6 11 1005.75 45 7 16
7 16 1007.10 51 12 18
8 16 1006.78 66 16 22
9 13 1003.83 58 9 20

max_dew_pt_2 min_dew_pt_1 min_dew_pt_2 max_pressure_1 max_pressure_2 \


0 10 -2 -2 1009 1008
1 12 0 -2 1008 1009
2 13 6 0 1011 1008
3 16 9 6 1011 1011
4 17 6 9 1010 1011
5 14 6 6 1010 1010
6 12 7 6 1008 1010
7 16 13 7 1010 1008
8 18 10 13 1011 1010
9 22 10 10 1007 1011

min_pressure_1 min_pressure_2 rainfall


0 1000 1001 0.0
1 1001 1000 0.0
2 1003 1001 5.0
3 1004 1003 0.0
4 1002 1004 0.0
5 1002 1002 0.0
6 1000 1002 0.3
7 1002 1000 0.8
8 1001 1002 2.0
9 998 1001 0.3

Find out your data type


You can use the dtypes attribute to find out the type of data (i.e. string, float, integer) in each column. In the output below, text columns such as date are listed as object.

dataframe.dtypes

date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64

max_humidity int64
min_humidity int64
max_dew_pt_1 int64
max_dew_pt_2 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
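
The date column above is stored as text rather than as real dates. As a hedged sketch (not part of the original notebook), pd.to_datetime can convert such a column. The small in-memory table below is a stand-in for mydata.csv, using three rows from the output above.

```python
import io

import pandas as pd

# Stand-in for mydata.csv: three rows taken from the output shown earlier.
sample_csv = io.StringIO(
    "date,mean_temperature,mean_pressure\n"
    "2016-05-04,34,1006.00\n"
    "2016-05-05,31,1005.65\n"
    "2016-05-06,28,1007.94\n"
)
sample = pd.read_csv(sample_csv)
print(sample.dtypes)  # the date column is read as text

# Convert the text column into real dates so pandas can do date arithmetic.
sample["date"] = pd.to_datetime(sample["date"])
print(sample.dtypes)  # the date column is now a datetime type
```

This notebook keeps the dates as text, so the conversion is optional here.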

Remove unwanted columns

Looks like there are 17 columns in this dataset, and we don't need all of them for this activity. One way to handle this is
to drop the columns that we don't need. Pandas provides an easy way to drop columns: the .drop() method.

dataframe = dataframe.drop(["max_dew_pt_2"], axis=1) # no output will be generated , the column will be removed

Let's check that the column has been dropped. Try printing the data with head() or dtypes.

dataframe.dtypes

date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object

Drop the following columns: (min_dew_pt_2, max_pressure_2, min_pressure_2)

dataframe = dataframe.drop(["min_dew_pt_2", "max_pressure_2", "min_pressure_2"], axis=1)

Now check again if these columns have been dropped

dataframe.dtypes

date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
max_pressure_1 int64
min_pressure_1 int64
rainfall float64
dtype: object

Great! We can now focus on this set of data!
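
The drop steps above can be sketched on a tiny hypothetical table. The sketch illustrates two points: drop returns a new DataFrame and leaves the original untouched unless you reassign it, and the errors="ignore" option avoids a KeyError if you re-run a cell after the column is already gone. The table and the names sample and trimmed are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the weather table.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05"],
    "max_dew_pt_1": [12, 13],
    "max_dew_pt_2": [10, 12],
    "min_dew_pt_2": [-2, -2],
})

# columns=[...] is equivalent to passing the list with axis=1.
trimmed = sample.drop(columns=["max_dew_pt_2", "min_dew_pt_2"])
print(list(sample.columns))   # the original still has all four columns
print(list(trimmed.columns))  # the new frame has only the two kept columns

# Dropping a column that is already gone normally raises KeyError;
# errors="ignore" skips missing names instead.
trimmed = trimmed.drop(columns=["max_dew_pt_2"], errors="ignore")
```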

Sorting values using pandas


Often you want a sense of the range of your data to help you understand it better. Another feature of pandas DataFrames is sorting of
values. You can sort by a column using the sort_values() method.

jaipur_weather = dataframe.sort_values(by='date',ascending = False)


print(jaipur_weather.head(5))

date mean_temperature max_temperature min_temperature \


675 2018-03-11 26 34 18
674 2018-03-10 26 34 19
673 2018-03-09 26 33 19
672 2018-03-08 24 32 15
671 2018-03-07 24 32 15

Mean_dew_pt mean_pressure max_humidity min_humidity max_dew_pt_1 \


675 4 1013.76 38 6 8
674 3 1014.16 37 8 6
673 1 1014.41 42 7 5
672 2 1014.07 55 5 8
671 4 1015.39 48 6 9

min_dew_pt_1 max_pressure_1 min_pressure_1 rainfall


675 0 1017 1009 0.0
674 -1 1017 1009 0.0
673 -5 1017 1011 0.0
672 -6 1017 1011 0.0
671 -3 1018 1012 0.0

What do you notice about the row numbers? Look at the dates. Can you see how the function helps us sort the data by date?
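
The row numbers in the output above (675, 674, ...) are the original index labels: sort_values reorders the rows but keeps each row's label. A minimal sketch on a hypothetical three-row table shows this, along with reset_index, which renumbers the rows if you want them to start from 0 again.

```python
import pandas as pd

# Hypothetical miniature of the weather table.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05", "2016-05-06"],
    "mean_temperature": [34, 31, 28],
})

# Dates written as YYYY-MM-DD sort correctly even as text.
newest_first = sample.sort_values(by="date", ascending=False)
print(newest_first)  # rows are reordered, index labels travel with them

# drop=True discards the old labels instead of keeping them as a column.
renumbered = newest_first.reset_index(drop=True)
print(renumbered)
```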

Sort the values in ascending order of mean temperature and print the first 5 rows

jaipur_weather = dataframe.sort_values(by='mean_temperature',ascending = True)


print(jaipur_weather.head(5))

date mean_temperature max_temperature min_temperature \


252 2017-01-11 10 18 3
253 2017-01-12 12 19 4
254 2017-01-13 12 20 4
255 2017-01-14 12 20 5
258 2017-01-17 12 20 5

Mean_dew_pt mean_pressure max_humidity min_humidity max_dew_pt_1 \


252 3 1017.00 94 17 9
253 -3 1017.54 70 13 2
254 -5 1017.24 75 4 2
255 -1 1017.75 70 10 1
258 3 1017.35 74 15 7

min_dew_pt_1 max_pressure_1 min_pressure_1 rainfall


252 -5 1019 1015 0.0
253 -7 1020 1015 0.0
254 -93 1020 1015 0.0
255 -8 1020 1016 0.0
258 -2 1019 1015 0.0

Look at the max and min temperature! See the range of temperature that one can experience within a day.

Sort the values in descending order of mean temperature and print the first 5 rows

jaipur_weather = dataframe.sort_values(by='mean_temperature',ascending = False)


print(jaipur_weather.head(5))

date mean_temperature max_temperature min_temperature \


32 2016-06-05 38 45 31
15 2016-05-19 38 46 29
31 2016-06-04 38 44 31
34 2016-06-07 38 45 30
35 2016-06-08 38 44 31

Mean_dew_pt mean_pressure max_humidity min_humidity max_dew_pt_1 \


32 5 1004.67 27 4 18
15 11 999.88 45 5 17
31 13 1004.93 34 10 18
34 13 1003.29 51 5 21
35 12 1002.83 47 4 22

min_dew_pt_1 max_pressure_1 min_pressure_1 rainfall


32 2 1007 999 0.0
15 6 1002 994 0.0
31 7 1008 999 0.0
34 5 1007 997 0.0
35 2 1006 996 0.0

dframe1=pd.read_csv("mydata.csv",nrows=3)
print(dframe1)

date mean_temperature max_temperature min_temperature \


0 2016-05-04 34 41 27
1 2016-05-05 31 38 24
2 2016-05-06 28 34 21

Mean_dew_pt mean_pressure max_humidity min_humidity max_dew_pt_1 \


0 6 1006.00 27 5 12
1 7 1005.65 29 6 13
2 11 1007.94 61 13 16

max_dew_pt_2 min_dew_pt_1 min_dew_pt_2 max_pressure_1 max_pressure_2 \


0 10 -2 -2 1009 1008
1 12 0 -2 1008 1009
2 13 6 0 1011 1008

min_pressure_1 min_pressure_2 rainfall


0 1000 1001 0
1 1001 1000 0
2 1003 1001 5

dframe2=pd.read_csv("mydata.csv", usecols=['date','mean_temperature'])
print(dframe2.head(10))

date mean_temperature
0 2016-05-04 34
1 2016-05-05 31
2 2016-05-06 28
3 2016-05-07 30
4 2016-05-08 34
5 2016-05-09 34
6 2016-05-10 34
7 2016-05-11 32
8 2016-05-12 34
9 2016-05-13 34

Now we have a clearer picture of our dataset. Using these functions, we can analyze our data and gain insights from it.

However, we want to get an even better picture. We want to learn how to explore these data visually.

Let's now use the matplotlib library to help us with data visualization in Python.

Importing matplotlib
Matplotlib is a Python 2D plotting library that we can use to produce high-quality data visualizations. It is highly usable (as you will soon find out):
you can create simple and complex graphs with just a few lines of code!

Now let's load matplotlib to start plotting some graphs

import matplotlib.pyplot as plt


import numpy as np

Scatter plot
Scatter plots use a collection of points on a graph to display values from two variables. This allows us to see if there is any relationship or
correlation between the two variables.

Let's see how mean temperature changes over the years!

x = dataframe.date
y = dataframe.mean_temperature

plt.scatter(x,y)
plt.show()


Do you see that the x axis is filled with a thick line, and that there's no tick label available? This makes us unable to analyze the data.

Let's try to modify this scatter plot so that we can see the ticks!

Choose only several ticks


The first thing we are going to do is reduce the number of ticks on the x axis. We do this using the np.arange function as below:

plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 180)) #numpy.arange(start, stop, step)
plt.show()

What interval would you use so that you can read all the dates? Notice that we now have only a few ticks.
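
Why do plain numbers work as tick positions for dates? When the x values are text, Matplotlib treats them as categories and places them at positions 0, 1, 2, and so on, so np.arange(start, stop, step) picks every step-th date. The sketch below shows this on 30 made-up days; the data and names are assumptions for illustration, not the Jaipur file.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 30 consecutive days stored as text.
dates = list(pd.date_range("2016-05-04", periods=30).strftime("%Y-%m-%d"))
temps = np.linspace(28, 38, 30)

fig, ax = plt.subplots()
ax.scatter(dates, temps)

# Positions 0, 10 and 20 select the 1st, 11th and 21st date.
tick_positions = np.arange(0, 30, 10)
ax.set_xticks(tick_positions)

fig.canvas.draw()  # make sure the tick labels are filled in
labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
```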

Let's try to rotate our ticks. See the example on Stack Overflow!

Note: Stack Overflow is a site where programmers ask and answer technical questions. You can search the site for your question and see
whether others have already solved it!

Rotate the x tick labels so that we can show more ticks clearly

plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=90)
plt.show()


Now we can see the x-ticks clearly.

Notice how the temperature changes with the time of year. Compare it with this website. Does it tell you the best time to plant your
crops?

Giving labels to the x and y axes


You can also label the x and y axes. This makes your plot easier to read and to share.

plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels and set a font size


plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)

plt.show()

Looks good!

Now, let's add a title.


See how to do it here.

plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels and set a font size


plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)
plt.title('Mean Temperature at Jaipur')

plt.show()


Task 11: Change the title size to be bigger than the x and y labels!

plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels, title and set a font size


plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)
plt.title('Mean Temperature at Jaipur', fontsize = 20)

plt.show()

Change your marker shape!

# Change the default figure size (default figure size is 6.4 for the width and 4.8 for the height (in inches))
plt.figure(figsize=(10,10)) #figure(figsize=(WIDTH_SIZE,HEIGHT_SIZE))

plt.scatter(x,y, marker='*') # see https://matplotlib.org/stable/api/markers_api.html for more marker styles
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels, title and set a font size


plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)

# Set the font size of the number labels on the axes


plt.xticks (fontsize = 12)
plt.yticks (fontsize = 12)

plt.xticks (rotation=30, horizontalalignment='right')


plt.show()

Changing color
You can also change the marker color. Check out the code below, which shows you how to do it!

# Change the default figure size


plt.figure(figsize=(10,10))

plt.scatter(x,y, c='green', marker='*')


plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels, title and set a font size


plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)

# Set the font size of the number labels on the axes


plt.xticks (fontsize = 12)
plt.yticks (fontsize = 12)

plt.xticks (rotation=30, horizontalalignment='right')

plt.show()


Saving plot
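You can use plt.savefig("figurename.png") to save the figure. Write the command before plt.show(), in the same cell as the plotting code. When it is run on its own in a new cell, as below, it saves an empty figure, which is why the output reports 0 Axes.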

plt.savefig("graph1.png")

<Figure size 640x480 with 0 Axes>
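
A minimal sketch of the ordering described above: plot, then save, then show, all in one cell. The five temperatures are the first five mean_temperature values from the data; the temporary folder and the name out_path are assumptions chosen so the example does not overwrite any of your files.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

days = [0, 1, 2, 3, 4]
temps = [34, 31, 28, 30, 34]

plt.figure()
plt.scatter(days, temps)
plt.title("Mean Temperature at Jaipur")

# Save while the figure still holds the plot, i.e. before plt.show().
out_path = os.path.join(tempfile.mkdtemp(), "graph1.png")
plt.savefig(out_path)

plt.show()  # with the off-screen backend this only emits a warning

print(os.path.getsize(out_path) > 0)
```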

Line Plots

Besides showing relationships with a scatter plot, time data like the above can also be represented with a line plot. Let's see how this is done!

plt.figure(figsize=(10,10))
y = dataframe.mean_temperature

plt.plot(x,y, "o:r") #the points are marked with circle and connected via dotted lines in red color ; refer to https://www.w3s
plt.ylabel("Mean Temperature")
plt.xlabel("Time")

plt.xticks(np.arange(0, 731, 60), rotation=30) # choose the tick positions and rotate the labels in one call

plt.show()
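
The short string "o:r" in the plot call above packs three settings together: marker (o for circle), line style (: for dotted) and color (r for red). As a sketch on a few sample values, the same line can be written with explicit keyword arguments, which some readers find easier to follow. The names short_form and long_form are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

days = [0, 1, 2, 3, 4]
temps = [34, 31, 28, 30, 34]

fig, ax = plt.subplots()

# Format-string version, as used in the notebook.
(short_form,) = ax.plot(days, temps, "o:r")

# Keyword version of the same style.
(long_form,) = ax.plot(days, temps, marker="o", linestyle=":", color="r")

print(short_form.get_marker(), short_form.get_linestyle())
print(long_form.get_marker(), long_form.get_linestyle())
```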


Change the labels and add a title so that it is clearer and easier for you to show this graph to others

Drawing multiple lines in a plot

x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature

plt.plot(x,y_1, label = "Max temp")


plt.plot(x,y_2, label = "Min temp")

plt.xticks(np.arange(0, 731, 60))


plt.xticks (rotation=30)

plt.legend()
plt.show()


Draw at least 3 line graphs in one plot!

x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature
y_3 = dataframe.mean_temperature

z = y_1-y_2

plt.plot(x,y_1, label = "Max temp")


plt.plot(x,y_2, label = "Min temp")
plt.plot(x,y_3, label = "Mean temp")
plt.plot(x,z, label = "range")

plt.xticks(np.arange(0, 731, 60))


plt.xticks (rotation=30)

plt.legend()
plt.show()

Bar Charts

import matplotlib.pyplot as plt


import numpy as np

plt.figure(figsize=(10,10))

plt.bar(x,y, align='center')

plt.show()
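
With one bar per day, the chart above can get crowded. One possible variation, offered only as a sketch and not as part of the original notebook, is to average the temperature per month first and draw one bar per month. The four rows are taken from the outputs earlier in the notebook; the month column and the name monthly_mean are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Four rows copied from the outputs shown earlier.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05", "2016-06-04", "2016-06-05"],
    "mean_temperature": [34, 31, 38, 38],
})

# The first seven characters of YYYY-MM-DD give the year and month.
sample["month"] = sample["date"].str[:7]
monthly_mean = sample.groupby("month")["mean_temperature"].mean()
print(monthly_mean)

plt.figure(figsize=(6, 4))
plt.bar(list(monthly_mean.index), monthly_mean.values, align="center")
plt.xlabel("Month")
plt.ylabel("Mean Temperature")
```

In a notebook, add plt.show() at the end to display the chart.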


Great! You have now gained the ability to visualize data using matplotlib.

#creating pie chart


z=[12,23,34,45,56]
plt.pie(z)
plt.show()

#By default the plotting of the first wedge starts from the x-axis and moves counterclockwise

z=[23,34,45,56]
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]

plt.pie(z, labels = mylabels, explode = myexplode) #The explode parameter, if specified, and not None, must be an array with o
#Each value represents how far from the center each wedge is displayed
plt.show()
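
A pie chart sizes each wedge by its share of the total, so the values do not need to add up to 100. The sketch below computes those shares for the fruit values used above and uses the autopct option to print them on the chart; the name shares is only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

z = np.array([23, 34, 45, 56])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]

# Each wedge covers value / total of the circle.
shares = z / z.sum()
print(np.round(shares * 100, 1))

fig, ax = plt.subplots()
# autopct writes each wedge's percentage on the chart, here with one decimal.
ax.pie(z, labels=mylabels, autopct="%1.1f%%")
```

In a notebook, add plt.show() at the end to display the chart.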


z=[23,34,45,56]
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]
mycolors = ["black", "pink", "b", "#4CAF50"]

plt.pie(z, labels = mylabels, explode = myexplode,colors = mycolors)


plt.legend()

plt.show()

#creating histogram
#A histogram is a graph showing frequency distributions.
#It is a graph showing the number of observations within each given interval.

plt.hist(y)
plt.show()
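
By default plt.hist chooses the intervals (bins) for you. As a sketch, you can also pass your own bin edges and read back how many observations fell into each interval. The ten values are the first ten mean_temperature readings from the data; the bin edges are an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

# First ten mean_temperature values from the dataset.
temps = [34, 31, 28, 30, 34, 34, 34, 32, 34, 34]

fig, ax = plt.subplots()
# Each bin includes its left edge; only the last bin also includes its right edge.
counts, edges, patches = ax.hist(temps, bins=[28, 30, 32, 34, 36])

print(counts)  # number of observations per interval
print(edges)   # the interval boundaries
```

In a notebook, add plt.show() at the end to display the chart.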
