Data Science Python Notebook (1)
Data Sciences
Introduction
Data visualization is part of data exploration, which is a critical step in the AI cycle. You will use this technique to gain understanding and
insights into the data you have gathered, and to determine whether the data is ready for further processing or whether you need to collect more
data or clean it.
In this notebook, we will explore Python packages that are crucial for data science. Packages such as Pandas, NumPy and Matplotlib are used
throughout the process.
Context
We will be working with Jaipur weather data obtained from Kaggle, a platform for data enthusiasts to gather, share knowledge and compete for
many prizes!
The data has been cleaned and simplified, so that we can focus on data visualization instead of data cleaning. Our data is stored in the file
named mydata.csv. This file contains weather information of Jaipur and is saved at the same location as the notebook.
We usually open these files with spreadsheet applications such as Excel or Google Sheets. Do you know how this is done?
Import Pandas
import pandas as pd #import pandas as pd means we can type "pd" to call the pandas library
Now that we have imported pandas, let's start by reading the csv file.
#saving the csv file into a variable which we will call data frame
dataframe = pd.read_csv("mydata.csv")
Great! We now have a variable that contains our weather data. Let's explore it. Use the .head() function to see the first few rows of data.
https://colab.research.google.com/drive/1t6dZnaLbMdyIjY39F5ueQ8wrCQ0ypOv1#scrollTo=aYokM5PqFVdb&printMode=true 1/15
29/04/2023, 09:39 2023 Data Science Python Notebook.ipynb - Colaboratory
print(dataframe.head(10))
[Output: the first 10 rows of the dataframe]
dataframe.dtypes
date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
max_dew_pt_2 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
Looks like there are 17 columns in this dataset (the date plus 16 measurements), and we don't need all of them for this activity. One way to
handle this is to drop the columns that we don't need. Pandas provides an easy way to drop columns: the .drop() method.
dataframe = dataframe.drop(["max_dew_pt_2"], axis=1) # no output will be generated , the column will be removed
Let's check that the column has been dropped. Try printing the dataframe with head() or dtypes.
dataframe.dtypes
date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
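Several columns can be dropped in one call by passing a list of names. The sketch below uses a tiny stand-in dataframe so that it runs on its own: the column names follow the dtypes listing above, but the dew point and pressure values are made up for illustration.

```python
import pandas as pd

# Tiny stand-in dataframe; the dew point and pressure values are made up.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05"],
    "mean_temperature": [34, 31],
    "min_dew_pt_2": [1, 2],
    "max_pressure_2": [1009, 1008],
    "min_pressure_2": [1003, 1002],
})

# columns=[...] is equivalent to passing the same list with axis=1.
trimmed = sample.drop(columns=["min_dew_pt_2", "max_pressure_2", "min_pressure_2"])
print(trimmed.dtypes)
```

Note that .drop() returns a new dataframe, so the result has to be assigned to a variable, just as the notebook does with dataframe = dataframe.drop(...).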
dataframe.dtypes
date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
max_pressure_1 int64
min_pressure_1 int64
rainfall float64
dtype: object
What do you notice about the numbers? Look at the date. Can you see how the function helps us sort the data by date?
Sort the values in ascending order of mean temperature and print the first 5 rows
Look at the max and min temperature! See the range of temperature that one can experience within a day.
Sort the values in descending order of mean temperature and print the first 5 rows
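A minimal sketch of the two sorting tasks above, assuming that pandas' sort_values method is the function meant here. The five rows are taken from the date and mean_temperature output printed in this notebook; the variable names are only illustrative.

```python
import pandas as pd

# First five date/mean_temperature rows from the notebook's own output.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05", "2016-05-06", "2016-05-07", "2016-05-08"],
    "mean_temperature": [34, 31, 28, 30, 34],
})

# sort_values sorts in ascending order unless ascending=False is given.
coolest_first = sample.sort_values("mean_temperature")
warmest_first = sample.sort_values("mean_temperature", ascending=False)

print(coolest_first.head(5))
print(warmest_first.head(5))
```

Passing "date" instead of "mean_temperature" sorts the rows by date in the same way.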
# nrows=3 reads only the first 3 rows of the file
dframe1 = pd.read_csv("mydata.csv", nrows=3)
print(dframe1)
# usecols reads only the listed columns
dframe2 = pd.read_csv("mydata.csv", usecols=["date", "mean_temperature"])
print(dframe2.head(10))
date mean_temperature
0 2016-05-04 34
1 2016-05-05 31
2 2016-05-06 28
3 2016-05-07 30
4 2016-05-08 34
5 2016-05-09 34
6 2016-05-10 34
7 2016-05-11 32
8 2016-05-12 34
9 2016-05-13 34
Now we have a clearer picture of our dataset. Using these functions, we can analyze our data and gain insights from it.
However, we want an even better picture. We want to learn how to explore this data visually.
Let's now use the matplotlib library to help us with data visualization in Python.
Importing matplotlib
Matplotlib is a Python 2D plotting library that we can use to produce high-quality data visualizations. It is easy to use (as you will soon find
out): you can create simple and complex graphs with just a few lines of code!
Scatter plot
Scatter plots use a collection of points on a graph to display values from two variables. This allows us to see whether there is any relationship
or correlation between the two variables.
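import matplotlib.pyplot as plt  # plotting functions, called through "plt"
import numpy as np  # used below to set the tick positions

x = dataframe.date
y = dataframe.mean_temperature
plt.scatter(x, y)  # one point per day: date on the x axis, mean temperature on the y axis
plt.show()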
Do you see that the x axis is filled with a thick line and that no individual tick label can be read? This makes it hard to analyze the data.
Let's try to modify this scatter plot so that we can see the ticks!
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 180)) #numpy.arange(start, stop, step)
plt.show()
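The axis is crowded because the date column is stored as text (dtype object in the listing above), so Matplotlib treats every date as its own category and draws one tick per row. As an alternative to placing the ticks by hand, the sketch below converts the text to real dates with pd.to_datetime, after which Matplotlib chooses a readable set of date ticks by itself. It assumes ISO-formatted date strings, as in this notebook's output, and uses a stand-in dataframe with made-up temperatures.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: 60 consecutive days as text, with made-up temperatures.
dates_as_text = pd.date_range("2016-05-04", periods=60, freq="D").strftime("%Y-%m-%d")
sample = pd.DataFrame({
    "date": dates_as_text,
    "mean_temperature": [30 + (i % 5) for i in range(60)],
})

# Convert the text column into real datetime values.
sample["date"] = pd.to_datetime(sample["date"])

fig, ax = plt.subplots()
ax.scatter(sample["date"], sample["mean_temperature"])
fig.autofmt_xdate()  # slant the date labels so that they do not overlap
plt.show()
```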
What interval would you use so that you can see all the dates? Do you notice that we now have only a few ticks?
Note: Stack Overflow is a site where programmers gather and share their knowledge. You can search the site for your question and see
whether others have already solved it!
Rotate the x tick labels so that we can fit more ticks and still read them clearly
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks(rotation=90)  # rotate the tick labels by 90 degrees
plt.show()
Notice how the temperature changes with the time of year. Compare it with this website. Does it tell you when it is best to plant your
crops?
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks(rotation=30)  # a gentler 30-degree rotation
plt.show()
Looks good!
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks(rotation=30)
plt.show()
Task 11: Change the title size to be bigger than the x and y labels!
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks(rotation=30)
plt.show()
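For Task 11, plt.title, plt.xlabel and plt.ylabel each accept a fontsize argument. Here is a minimal sketch using the ten mean temperatures from this notebook's output; the label texts and the sizes 18 and 12 are illustrative choices, not values from the notebook.

```python
import matplotlib.pyplot as plt

days = list(range(10))  # stand-in for the date column
temps = [34, 31, 28, 30, 34, 34, 34, 32, 34, 34]  # mean temperatures from the output above

plt.figure()
plt.scatter(days, temps)
title = plt.title("Mean temperature in Jaipur", fontsize=18)  # larger than the axis labels
xlabel = plt.xlabel("Day", fontsize=12)
ylabel = plt.ylabel("Mean temperature", fontsize=12)
plt.show()
```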
# Change the figure size (the default is 6.4 inches wide by 4.8 inches high)
plt.figure(figsize=(10, 10))  # figure(figsize=(WIDTH_SIZE, HEIGHT_SIZE))
# See https://matplotlib.org/stable/api/markers_api.html for more marker styles
plt.scatter(x, y, marker='*')
plt.xticks(np.arange(0, 731, 60))
plt.xticks(rotation=30)
plt.show()
Changing color
You can also change the marker color. Check out the code below, which shows you how to do it!
plt.show()
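One way to set the marker color is the color keyword of plt.scatter, which accepts color names, short codes such as "r", or hex strings. Here is a minimal sketch with the ten mean temperatures from this notebook's output; the choice of red is arbitrary.

```python
import matplotlib.pyplot as plt

days = list(range(10))  # stand-in for the date column
temps = [34, 31, 28, 30, 34, 34, 34, 32, 34, 34]  # mean temperatures from the output above

plt.figure()
points = plt.scatter(days, temps, marker="*", color="red")  # star markers drawn in red
plt.show()
```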
Saving plot
You can use plt.savefig("figurename.png") to save the figure. Call it before plt.show().
plt.savefig("graph1.png")
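The order matters because plt.show() can release the figure once it has been displayed, so a savefig call made afterwards may write an empty image. The sketch below saves first and shows second. It writes to the system's temporary folder only so that it can run anywhere; out_path is an illustrative name, and in the notebook you would simply pass "graph1.png".

```python
import os
import tempfile
import matplotlib.pyplot as plt

days = list(range(10))  # stand-in for the date column
temps = [34, 31, 28, 30, 34, 34, 34, 32, 34, 34]  # mean temperatures from the output above

# Illustrative location; plt.savefig("graph1.png") saves next to the notebook instead.
out_path = os.path.join(tempfile.gettempdir(), "graph1.png")

plt.figure()
plt.scatter(days, temps)
plt.savefig(out_path)  # save first ...
plt.show()             # ... then display
print(os.path.exists(out_path))
```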
Line Plots
Besides showing relationships with a scatter plot, time-based data like the above can also be represented with a line plot. Let's see how this is done!
plt.figure(figsize=(10,10))
y = dataframe.mean_temperature
plt.plot(x,y, "o:r") #the points are marked with circle and connected via dotted lines in red color ; refer to https://www.w3s
plt.ylabel("Mean Temperature")
plt.xlabel("Time")
plt.show()
Change the labels and add a title so that the graph is clearer and easier to show to others
x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature
plt.legend()
plt.show()
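plt.legend() only has entries to show when the plotted lines carry a label. Here is a minimal sketch that draws the maximum and minimum temperature on the same axes; the temperature values and label texts are made up for illustration.

```python
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]            # stand-in for the date column
max_temp = [40, 38, 36, 37, 41]   # made-up values
min_temp = [27, 25, 22, 24, 28]   # made-up values

plt.figure()
plt.plot(days, max_temp, label="Max temperature")
plt.plot(days, min_temp, label="Min temperature")
legend = plt.legend()  # builds the legend from the label of each line
plt.show()
```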
x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature
y_3 = dataframe.mean_temperature
z = y_1-y_2
plt.legend()
plt.show()
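Subtracting one column from another works row by row, so z = y_1 - y_2 holds the temperature range of each day. The sketch below shows the subtraction on a stand-in dataframe with made-up values and plots the result next to the two original lines; daily_range is an illustrative name for z.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in dataframe with made-up values.
sample = pd.DataFrame({
    "max_temperature": [40, 38, 36],
    "min_temperature": [27, 25, 22],
})

# Row-by-row difference: 40-27, 38-25, 36-22
daily_range = sample["max_temperature"] - sample["min_temperature"]
print(daily_range.tolist())  # [13, 13, 14]

plt.figure()
plt.plot(sample.index, sample["max_temperature"], label="Max")
plt.plot(sample.index, sample["min_temperature"], label="Min")
plt.plot(sample.index, daily_range, label="Range (max - min)")
plt.legend()
plt.show()
```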
Bar Charts
plt.figure(figsize=(10,10))
plt.bar(x,y, align='center')
plt.show()
Great! You have now gained the ability to visualize data using matplotlib.
#By default the plotting of the first wedge starts from the x-axis and moves counterclockwise
z=[23,34,45,56]
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]
plt.pie(z, labels = mylabels, explode = myexplode) #The explode parameter, if specified, and not None, must be an array with o
#Each value represents how far from the center each wedge is displayed
plt.show()
z=[23,34,45,56]
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]
mycolors = ["black", "pink", "b", "#4CAF50"]
plt.show()
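A list such as mycolors can be handed to plt.pie through its colors parameter, with one entry per wedge. Here is a minimal sketch reusing the values from the cell above; pie_parts and wedges are illustrative names.

```python
import matplotlib.pyplot as plt

z = [23, 34, 45, 56]
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]
mycolors = ["black", "pink", "b", "#4CAF50"]  # names, a short code and a hex string all work

plt.figure()
# colors assigns one color to each wedge, in the same order as z.
pie_parts = plt.pie(z, labels=mylabels, explode=myexplode, colors=mycolors)
wedges = pie_parts[0]  # the first returned item holds the wedge patches
plt.show()
```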
#creating histogram
#A histogram is a graph showing frequency distributions.
#It is a graph showing the number of observations within each given interval.
plt.hist(y)
plt.show()
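By default plt.hist groups the values into 10 equal-width intervals; the bins argument changes that number. np.histogram computes the same counts without drawing anything, which is handy for checking what the bars mean. The sketch uses the ten mean temperatures from this notebook's output, and 3 bins is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

temps = [34, 31, 28, 30, 34, 34, 34, 32, 34, 34]  # mean temperatures from the output above

# Count how many values fall into each of 3 equal-width intervals.
counts, edges = np.histogram(temps, bins=3)
print(counts)  # [1 2 7]
print(edges)   # [28. 30. 32. 34.]

plt.figure()
plt.hist(temps, bins=3)  # draws the same three bars
plt.show()
```

The last interval includes its right edge, which is why all six readings of 34 are counted in the third bar.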