Exercise 3
Aim:
Description:
1. Import Pandas:
Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Create an alias with the as keyword while importing:
import pandas as pd
The version string is stored in the __version__ attribute:
import pandas as pd
print(pd.__version__)
Output: 1.0.3
2. Pandas Series
A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any
type.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Return the first value of the Series:
print(myvar[0])
Output: 1
Create Labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
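A Series can also be created from a key/value object such as a dictionary. The keys of the dictionary become the labels:
import pandas as pd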
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
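Once a Series has labels, a value can be read by its label instead of its position. The sketch below reuses the values from the examples above; the variable names labelled, value_y and subset are only illustrative.

```python
import pandas as pd

# Series with explicit labels, as in the "Create Labels" example
labelled = pd.Series([1, 7, 2], index=["x", "y", "z"])
value_y = labelled["y"]  # access a value by its label

# Series built from a dictionary; the keys become the labels
calories = {"day1": 420, "day2": 380, "day3": 390}
# Passing index keeps only the listed keys (here day1 and day2)
subset = pd.Series(calories, index=["day1", "day2"])

print(value_y)
print(subset)
```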
3. DataFrames:
Data sets in Pandas are usually multi-dimensional tables, called DataFrames. A Series is
like a column; a DataFrame is the whole table.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
Locate Row:
The DataFrame is like a table with rows and columns. Pandas uses the loc attribute to
return one or more specified row(s).
To return row 0:
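A minimal sketch of loc, assuming the calories/duration DataFrame from the example above. The names row0 and rows01 are only illustrative.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)

# loc with a single label returns that row as a Series
row0 = df.loc[0]
print(row0)

# loc with a list of labels returns a DataFrame
rows01 = df.loc[[0, 1]]
print(rows01)
```

Note the double brackets in df.loc[[0, 1]]: the inner brackets form the list of row labels.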
Named Indexes:
With the index argument, you can name your own indexes.
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
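With named indexes, loc can take a row name instead of a number. The following sketch assumes the same DataFrame as above; the variable day2 is only illustrative.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Look up a row by its name rather than by a number
day2 = df.loc["day2"]
print(day2)
```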
Read CSV Files:
A simple way to store big data sets is to use CSV (comma-separated values) files. CSV files
contain plain text in a well-known format that most tools, including Pandas, can read.
Here a CSV file called 'data.csv' is used. Printing a DataFrame directly may truncate a large
data set to its first and last rows:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Use to_string() to print the entire DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
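If 'data.csv' is not at hand, read_csv can be tried on CSV text held in memory. This is only a sketch: the three rows below are made-up values, and the column names Duration, Pulse and Calories are an assumption about what data.csv contains.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the contents of data.csv
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,282.4
60,103,340.0
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.to_string())
print(df.shape)  # (number of rows, number of columns)
```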
Find max_rows:
The number of rows returned is defined in the Pandas option settings. You can check your
system's maximum rows with the pd.options.display.max_rows statement.
import pandas as pd
print(pd.options.display.max_rows)
Output: 60
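The effect of max_rows can be seen by printing a DataFrame that has more rows than the limit. The sketch below assumes the default settings and uses a made-up 100-row DataFrame in place of data.csv; pd.option_context raises the limit only inside the with block.

```python
import pandas as pd

# Hypothetical stand-in for data.csv: 100 rows of made-up numbers
df = pd.DataFrame({"Duration": range(100)})

default_max = pd.options.display.max_rows
print(default_max)

# With more rows than max_rows, the printed view is truncated
truncated_text = repr(df)
print(truncated_text)

# Temporarily raise the limit so that all rows are shown
with pd.option_context("display.max_rows", 200):
    full_text = repr(df)

print(len(truncated_text.splitlines()), len(full_text.splitlines()))
```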
4. Analyzing DataFrames:
The head() method returns the headers and a specified number of rows, starting from the top.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
If the number of rows is not specified, head() returns the top 5 rows:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the
bottom.
print(df.tail())
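The behaviour of head() and tail() can be checked on a small DataFrame. This sketch uses 20 made-up rows in place of data.csv; the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for data.csv: 20 rows of made-up numbers
df = pd.DataFrame({"Duration": range(20)})

first_default = df.head()   # no argument: first 5 rows
first_three = df.head(3)    # first 3 rows
last_default = df.tail()    # no argument: last 5 rows

print(first_three)
print(last_default)
```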
5. Data Cleaning
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
Remove Rows
One way to deal with empty cells is to remove the rows that contain them. This is usually
acceptable, since data sets can be very big and removing a few rows will not have a big impact on the result.
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
# In the result, some rows have been removed (rows 18, 22 and 28).
Another way of dealing with empty cells is to insert a new value instead. The fillna() method
allows us to replace empty cells with a value:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
# Empty cells got the value 130 (in row 18, 22 and 28)
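The difference between dropna() and fillna() can be seen on a small example. This is a sketch with made-up values; passing a dictionary to fillna() is one way to fill only a chosen column (here the assumed column name Calories) instead of the whole DataFrame.

```python
import pandas as pd

# Made-up data with two empty cells in the Calories column
df = pd.DataFrame({
    "Duration": [60, 45, 60, 45],
    "Calories": [409.1, float("nan"), 340.0, float("nan")],
})

# Without inplace=True, both methods return a new DataFrame
dropped = df.dropna()                   # rows with empty cells removed
filled = df.fillna({"Calories": 130})   # fill only the Calories column

print(dropped)
print(filled)
```

Because inplace = True is not used here, the original df still contains its empty cells.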
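Removing Duplicates
Duplicate rows are rows that have been registered more than once. The drop_duplicates()
method removes them:
import pandas as pd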
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
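Before removing duplicates, it can help to see which rows are duplicates. The sketch below uses made-up values; duplicated() returns True for every row that repeats an earlier row.

```python
import pandas as pd

# Made-up data: the third row repeats the first
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 117, 110],
})

dup_mask = df.duplicated()       # True for rows that repeat an earlier row
deduped = df.drop_duplicates()   # returns a new DataFrame without them

print(dup_mask)
print(deduped)
```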
6. Data Correlations
A great aspect of the Pandas module is the corr() method. The corr() method calculates the
relationship (correlation) between each pair of columns in your data set. Each value lies between -1 and 1.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.corr())
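To see how the numbers from corr() can be read, here is a sketch with made-up columns. Calories is exactly 5 times Duration, and the hypothetical RestMinutes column is exactly 100 minus Duration, so the correlations come out at the two extremes.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150, 225, 300, 450],   # rises exactly with Duration
    "RestMinutes": [70, 55, 40, 10],    # falls exactly as Duration rises
})

corr = df.corr()
print(corr)

# 1 means a perfect positive relationship, -1 a perfect negative one
print(corr.loc["Duration", "Calories"])
print(corr.loc["Duration", "RestMinutes"])
```

Values close to 0 would indicate that the two columns have little linear relationship.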
7. Plotting
Pandas uses the plot() method to create diagrams. Pyplot, a submodule of the Matplotlib
library, is used to display the diagram on the screen.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
Scatter Plot:
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x- and a y-axis. Here "Duration" is used for the x-axis and "Calories" for the y-axis.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Histogram:
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram needs only one column. It shows the frequency of each interval, e.g. how many
workouts lasted between 50 and 60 minutes.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')
plt.show()
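The counts behind a histogram can also be computed without drawing anything. This is a sketch with made-up durations; the bin edges 40, 50, 60, 70, 80 are an assumption chosen for illustration, and the plot itself picks its own edges.

```python
import pandas as pd

# Made-up workout durations in minutes
duration = pd.Series([45, 50, 55, 58, 60, 60, 75], name="Duration")

# How many workouts lasted at least 50 and less than 60 minutes?
in_50_60 = int(((duration >= 50) & (duration < 60)).sum())
print(in_50_60)

# pd.cut assigns each value to an interval; value_counts tallies them
bins = pd.cut(duration, bins=[40, 50, 60, 70, 80], right=False)
counts = bins.value_counts().sort_index()
print(counts)
```

Each count corresponds to the height of one bar in a histogram drawn with the same intervals.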