Exercise 3

good

Uploaded by

Ram Aypn
Copyright
© © All Rights Reserved

Exercise 3: Working with Pandas DataFrames

Aim:

To use Pandas functions for analyzing, cleaning, exploring, and manipulating data.

Description:

1. Import Pandas:

Once Pandas is installed, import it in your applications by adding the import keyword:

import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Pandas is usually imported under the alias pd, created with the as keyword.

After this import, the Pandas package can be referred to as pd instead of pandas:

import pandas as pd

Checking Pandas Version:

The version string is stored in the __version__ attribute.

import pandas as pd
print(pd.__version__)

Output: 1.0.3 (the exact value depends on the installed version)

2. Pandas Series:

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any
type.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Return the first value of the Series:

print(myvar[0])

Output: 1

Create Labels:

With the index argument, you can name your own labels.

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
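
Once labels are set, a value can be accessed by its label. A minimal sketch using the same Series as above:

```python
import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])

# Access a single value by its label instead of its position
print(myvar["y"])
```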

Key/Value Objects as Series:

Create a simple Pandas Series from a dictionary:

import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
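
To include only some of the dictionary's items, the index argument can list just the keys to keep. A short sketch with the same calories dictionary:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys listed in index end up in the Series
myvar = pd.Series(calories, index=["day1", "day2"])
print(myvar)
```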
3. DataFrames:

Data sets in Pandas are usually multi-dimensional tables, called DataFrames. A Series is
like a column; a DataFrame is the whole table.

Create a DataFrame from a dictionary of two lists, where each list becomes one column:

import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

Locate Row:

The DataFrame is like a table with rows and columns. Pandas uses the loc attribute to
return one or more specified rows.

To return row 0:

# refer to the row by its index
print(df.loc[0])

To return rows 0 and 1:

# use a list of indexes
print(df.loc[[0, 1]])
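
The two forms of loc return different types. The sketch below rebuilds the calories/duration DataFrame and checks what each call gives back:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)

row = df.loc[0]        # a single index returns a Series
rows = df.loc[[0, 1]]  # a list of indexes returns a DataFrame

print(type(row).__name__)
print(type(rows).__name__)
```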

Named Indexes:
With the index argument, you can name your own indexes.
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
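
With named indexes, loc takes the row name instead of a number. A minimal sketch using the same data:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Refer to the row by its name
print(df.loc["day2"])
```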

Load a CSV file into a Pandas DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Read CSV Files:
A simple way to store big data sets is to use CSV files (comma-separated values files). CSV files
contain plain text in a well-known format that can be read by almost any program, including Pandas.
Here a CSV file called 'data.csv' is used. The to_string() method is used to print the entire DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
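
If no data.csv file is at hand, read_csv() can be tried on in-memory text instead. In this sketch the CSV content is made up for illustration (it is not the real data.csv), and io.StringIO stands in for the file:

```python
import io
import pandas as pd

# Hypothetical CSV content, not the real data.csv
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,282.4
30,103,195.1
"""

# read_csv() accepts a file-like object as well as a file name
df = pd.read_csv(io.StringIO(csv_text))
print(df.to_string())
```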

Find max_rows:

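The number of rows printed is defined in the Pandas option settings. You can check your system's
maximum number of rows with the pd.options.display.max_rows statement. If a DataFrame has more
rows than this, print(df) shows only the headers and the first and last 5 rows.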

Check the number of maximum returned rows:

import pandas as pd
print(pd.options.display.max_rows)

Output: 60 (the default value; it can differ if the setting has been changed)
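
The limit can also be changed. The sketch below assumes a made-up 100-row DataFrame and uses pd.option_context(), which changes the setting only inside the with block:

```python
import pandas as pd

# Made-up DataFrame with 100 rows, for illustration only
df = pd.DataFrame({"n": range(100)})

default_rows = pd.options.display.max_rows

# Raise the limit only inside the with block
with pd.option_context("display.max_rows", 200):
    inside_rows = pd.options.display.max_rows
    print(df)  # 100 rows fit within the raised limit, so none are hidden

# The previous setting is restored afterwards
print(pd.options.display.max_rows == default_rows)
```

To change the setting for the rest of the program, assign to pd.options.display.max_rows directly.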

4. Analyzing DataFrames:

Viewing the Data

The head() method returns the headers and a specified number of rows, starting from the top.

Print the first 10 rows of the DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))

If the number of rows is not specified, head() returns the first 5 rows:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())

There is also a tail() method for viewing the last rows of the DataFrame.
It returns the headers and a specified number of rows, starting from the bottom.

Print the last 5 rows of the DataFrame:

print(df.tail())
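
Both methods can be tried without a CSV file. A small sketch on made-up data (20 rows with a single Duration column, chosen only for illustration):

```python
import pandas as pd

# Made-up data: Duration runs from 10 to 200 in steps of 10 (20 rows)
df = pd.DataFrame({"Duration": range(10, 210, 10)})

first = df.head(3)  # first 3 rows
last = df.tail(3)   # last 3 rows

print(first)
print(last)
```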
5. Data Cleaning

Data cleaning means fixing bad data in the data set.

Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

Empty Cells

Empty cells can potentially give you a wrong result when you analyze data.

Remove Rows

One way to deal with empty cells is to remove the rows that contain them. This is usually
acceptable, since data sets can be very big and removing a few rows will not have a big impact on the result.

import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
# In the result, some rows have been removed (rows 18, 22 and 28).
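
By default, dropna() returns a new DataFrame and leaves the original unchanged. A sketch on made-up data with two empty cells:

```python
import pandas as pd

# Made-up data with two empty cells (None becomes NaN)
df = pd.DataFrame({
    "Duration": [60, 45, None],
    "Calories": [409.1, None, 300.0]
})

# dropna() returns a new DataFrame; df itself is not changed
new_df = df.dropna()

print(new_df.to_string())
print(len(df), len(new_df))
```

To change the original DataFrame instead, dropna(inplace = True) can be used.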

Replace Empty Values:

Another way of dealing with empty cells is to insert a new value instead. The fillna() method
allows us to replace empty cells with a value:

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
print(df.to_string())

# Empty cells now hold the value 130 (in rows 18, 22 and 28)
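
Filling every empty cell with one fixed number is rarely ideal. A common alternative, sketched here on made-up data, is to fill only one column with that column's mean:

```python
import pandas as pd

# Made-up data with one empty Calories cell
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [400.0, None, 200.0]
})

# Mean of the non-empty Calories values (empty cells are skipped)
mean_calories = df["Calories"].mean()

# Fill only the Calories column and assign the result back
df["Calories"] = df["Calories"].fillna(mean_calories)
print(df)
```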

Remove Duplicates:

Duplicate rows are rows that have been registered more than once. To remove duplicates, use the drop_duplicates() method.

import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())

#Notice that row 12 has been removed from the result
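
Before removing anything, the duplicated() method can show which rows are duplicates. It returns True for every row that repeats an earlier one. A sketch on made-up data:

```python
import pandas as pd

# Made-up data: the last row repeats the first one
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 117, 110]
})

# True marks a row that duplicates an earlier row
print(df.duplicated())

# drop_duplicates() keeps the first occurrence of each row
deduped = df.drop_duplicates()
print(deduped)
```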

6. Data Correlations
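The corr() method calculates the correlation between each pair of columns in your data set.
The result is a table of values between -1 and 1, where 1 means a perfect positive relationship,
-1 a perfect negative relationship, and 0 no linear relationship. In Pandas 2.0 and later,
non-numeric columns must be excluded first, for example with df.corr(numeric_only = True).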
import pandas as pd
df = pd.read_csv('data.csv')
print(df.corr())
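
How to read the result is easiest to see on data with a known relationship. In this sketch the columns are made up: Calories is always 5 times Duration, and Remaining is always 120 minus Duration:

```python
import pandas as pd

# Made-up data with exact linear relationships
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150, 225, 300, 450],  # 5 * Duration
    "Remaining": [90, 75, 60, 30]      # 120 - Duration
})

corr = df.corr()
print(corr)

# Duration vs Calories is close to 1, Duration vs Remaining close to -1
print(round(corr.loc["Duration", "Calories"], 2))
print(round(corr.loc["Duration", "Remaining"], 2))
```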

7. Plotting

Pandas uses the plot() method to create diagrams. Pyplot, a submodule of the Matplotlib
library, is used to display the diagram on the screen.

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
Scatter Plot:

Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x-axis and a y-axis. Here "Duration" is used for the x-axis and "Calories" for the y-axis.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()

Histogram:

The kind argument is used to specify a histogram:

kind = 'hist'

A histogram needs only one column. It shows the frequency of each interval, for example
how many workouts lasted between 50 and 60 minutes.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df["Duration"].plot(kind = 'hist')
plt.show()
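
On a machine without a display (for example a server or an online editor), the diagram can be written to an image file instead of shown in a window. A sketch, assuming made-up data and a hypothetical file name scatter.png; matplotlib.use("Agg") selects a backend that does not open a window:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Made-up workout data, for illustration only
df = pd.DataFrame({
    "Duration": [30, 45, 60, 60, 90],
    "Calories": [195.1, 282.4, 409.1, 380.3, 600.1]
})

df.plot(kind="scatter", x="Duration", y="Calories")

# Hypothetical output path in the system's temporary directory
out_path = os.path.join(tempfile.gettempdir(), "scatter.png")
plt.savefig(out_path)
plt.close("all")
print(out_path)
```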
