
Week - 5 Pandas essentials

The document explains the differences between Series and DataFrame objects in Pandas, highlighting that a Series is a one-dimensional labeled array while a DataFrame is a two-dimensional table. It also discusses correlation and covariance, noting that covariance indicates the direction of a relationship between variables, whereas correlation measures both strength and direction in a standardized way. Additionally, it describes the .loc and .iloc functions for selecting data in DataFrames, emphasizing that .loc uses label-based indexing while .iloc uses position-based indexing.

Uploaded by

nghiemhoa4895

Pandas functions

Series and Dataframe Objects in Pandas


In Python's pandas library, the main difference between a DataFrame and a
Series object is their structure and intended use:

1. Series:

A Series is essentially a one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.).

It has an index that labels each element in the series.

It can be thought of as a single column of data.

Example:

import pandas as pd
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(s)

Output:

a    1
b    2
c    3
d    4
dtype: int64
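As a small sketch of what "labeled" means here, the snippet below reads a value through its index label and shows a Series holding strings. The variable names (`value_b`, `labels`, `names`) are illustrative only and do not come from the original notes.

```python
import pandas as pd

# Same Series as in the example above.
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# The index labels each element, so a value can be looked up by its label.
value_b = s['b']

# The labels and the values can be inspected separately.
labels = list(s.index)
values = s.to_list()

# A Series is not limited to integers; here it holds strings.
names = pd.Series(['x', 'y'], index=['first', 'second'])

print(value_b)            # 2
print(labels, values)     # ['a', 'b', 'c', 'd'] [1, 2, 3, 4]
print(names['second'])    # y
```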

2. DataFrame:

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types (like a table with rows and columns).

Each column in a DataFrame is a Series, and the DataFrame is a collection of Series aligned along a common index.

It can be thought of as a table (or spreadsheet) in which each column can hold a different type of data.

Example:

import pandas as pd
data = {'Column1': [1, 2, 3, 4], 'Column2': ['A', 'B', 'C', 'D']}
df = pd.DataFrame(data)
print(df)

Output:

   Column1 Column2
0        1       A
1        2       B
2        3       C
3        4       D
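To make the "each column is a Series" point concrete, the following sketch selects one column from the same DataFrame and checks its type and index. The names `col`, `is_series`, and `same_index` are introduced here for illustration.

```python
import pandas as pd

data = {'Column1': [1, 2, 3, 4], 'Column2': ['A', 'B', 'C', 'D']}
df = pd.DataFrame(data)

# Selecting a single column returns a Series.
col = df['Column1']
is_series = isinstance(col, pd.Series)

# The column shares the DataFrame's row index.
same_index = col.index.equals(df.index)

print(is_series, same_index)   # True True
print(df.shape)                # (4, 2): four rows, two columns
```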

In summary:

A Series is a one-dimensional array, while a DataFrame is a two-dimensional table.

A DataFrame is essentially a collection of Series aligned in a tabular format.

Correlation and Covariance

Correlation and Covariance are both measures used in statistics and data
analysis to describe the relationship between two variables. However, they differ
in their interpretation, scale, and how they measure this relationship:

1. Covariance:
Definition: Covariance measures the direction of the linear relationship
between two variables. It tells us whether two variables tend to increase or
decrease together.

Formula: $\operatorname{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$

Interpretation:

If the covariance is positive, both variables tend to increase or decrease together.

If the covariance is negative, when one variable increases, the other tends to decrease.

If the covariance is zero, there is no linear relationship between the variables.

Scale: The magnitude of covariance is not standardized, meaning it can take any value and is sensitive to the units of the variables. This makes it difficult to compare across different datasets.
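The sketch below computes the covariance by hand using the 1/n formula and compares it with `Series.cov`, which by default divides by n - 1 (the sample covariance), so the two numbers differ slightly. The data values are made up for illustration, and the rescaling step shows the unit sensitivity described above.

```python
import pandas as pd

# Made-up sample data for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)

# Covariance with the 1/n formula (population covariance).
cov_population = ((x - x.mean()) * (y - y.mean())).sum() / n

# pandas divides by n - 1 by default (sample covariance).
cov_sample = x.cov(y)

# Changing the units of x (e.g. metres to centimetres) rescales the covariance.
cov_scaled = (x * 100).cov(y)

print(cov_population)   # approximately 1.2
print(cov_sample)       # approximately 1.5
print(cov_scaled)       # approximately 150.0
```

Both versions have the same sign, so they agree on the direction of the relationship; only the magnitude changes.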

2. Correlation:
Definition: Correlation measures both the strength and the direction of the
linear relationship between two variables, but it is normalized and unit-free,
making it easier to interpret and compare across different datasets.

Formula (for Pearson correlation):

$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.

Interpretation:

The correlation coefficient (r) ranges from -1 to 1:

r = 1: Perfect positive correlation (variables increase together).

r = -1: Perfect negative correlation (one variable increases while the other decreases).

r = 0: No linear correlation.

Scale: Correlation is standardized, meaning the values range between -1 and 1, making it easier to compare relationships across different variables or datasets.
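As a rough check of the Pearson formula, the sketch below divides the covariance by the product of the standard deviations and compares the result with `Series.corr`. The data values are made up for illustration. Note that `cov` and `std` both use the n - 1 convention by default, so the factor cancels in the ratio.

```python
import pandas as pd

# Made-up sample data for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson correlation from its definition: Cov(X, Y) / (sigma_X * sigma_Y).
r_manual = x.cov(y) / (x.std() * y.std())

# pandas computes the same quantity directly (Pearson is the default method).
r_pandas = x.corr(y)

# Unlike covariance, correlation does not change when x is rescaled.
r_scaled = (x * 100).corr(y)

print(round(r_manual, 4))   # 0.7746
print(round(r_pandas, 4))   # 0.7746
print(round(r_scaled, 4))   # 0.7746
```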

Key Differences:
| Feature | Covariance | Correlation |
|---|---|---|
| Meaning | Measures the direction of the relationship between variables | Measures the strength and direction of the relationship |
| Range | Unbounded (can be any positive or negative number) | Always between -1 and 1 |
| Interpretation | Difficult to interpret due to scale | Easier to interpret due to standardization |
| Units | Depends on the units of the variables | Unit-free |
| Comparison | Difficult to compare across datasets | Easier to compare across datasets |

In summary:

Covariance tells you if two variables move together (positively or negatively), but its magnitude depends on the units, so it does not tell you how strongly.

Correlation tells you both the direction and strength of the linear relationship in
a more interpretable and standardized form.
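For several variables at once, `DataFrame.cov` and `DataFrame.corr` return the pairwise matrices. The sketch below uses hypothetical columns `x`, `y`, and `z` (with `z` built as an exact negative multiple of `x`) to show that the signs of the two measures agree while only the correlation is bounded.

```python
import pandas as pd

# Hypothetical data: z is an exact negative multiple of x.
df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 5.0, 4.0, 5.0],
})
df['z'] = -10 * df['x']

cov_matrix = df.cov()    # pairwise covariances (unit-dependent)
corr_matrix = df.corr()  # pairwise Pearson correlations (between -1 and 1)

cov_xz = cov_matrix.loc['x', 'z']
corr_xz = corr_matrix.loc['x', 'z']

print(cov_xz)    # negative, and its size depends on the factor 10
print(corr_xz)   # approximately -1: a perfect negative linear relationship
```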

.loc and .iloc

In pandas, .loc and .iloc are used to select rows and columns from a DataFrame,
but they differ in how they index the data:

1. .loc : Label-based indexing


Definition: .loc is primarily used for selecting data by label or index name. It
includes both the start and end labels when selecting a range.

Behavior:

It allows for label-based indexing, which means you can select rows and
columns based on their explicit labels (index names or column names).

It supports slicing and selecting specific rows and columns by their labels.

Example:

import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

# Using .loc to select by labels
df_loc = df.loc['row1', 'A']  # Selects the value in 'row1' and column 'A'
print(df_loc)

Output:

1

Selecting rows by label:

df.loc['row2']  # Selects all columns of row2

Selecting multiple rows and columns:

df.loc['row1':'row2', ['A', 'B']]  # Slices rows from 'row1' to 'row2' (inclusive) and selects columns 'A' and 'B'
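The following sketch rebuilds the same DataFrame and checks that `.loc` slices include the end label, for rows and for columns alike. The variable names `rows`, `cols`, and `subset` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['row1', 'row2', 'row3'])

# Label slices include both endpoints: 'row1' and 'row2' are both returned.
rows = df.loc['row1':'row2']

# The same holds for column label slices: 'A' and 'B' are both returned.
cols = df.loc[:, 'A':'B']

# A list of row labels with a single column label returns a Series.
subset = df.loc[['row1', 'row3'], 'C']

print(list(rows.index))     # ['row1', 'row2']
print(list(cols.columns))   # ['A', 'B']
print(subset.to_list())     # [7, 9]
```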

2. .iloc : Position-based indexing


Definition: .iloc is used for selecting data by integer position. It excludes the
end index when selecting a range.

Behavior:

It allows for position-based indexing, which means you select rows and
columns based on their integer index positions, regardless of the actual
labels.

Like Python slicing, .iloc excludes the ending position when slicing
ranges.

Example:

# Using .iloc to select by positions
df_iloc = df.iloc[0, 0]  # Selects the value at the 0th row and 0th column (first row, first column)
print(df_iloc)

Output:

1

Selecting rows by position:

df.iloc[1]  # Selects all columns of the second row (index 1)

Selecting multiple rows and columns:

df.iloc[0:2, 0:2]  # Slices rows from position 0 to 1 and columns from position 0 to 1
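As a sketch of position-based selection on the same DataFrame, the snippet below checks that the end position of a slice is excluded, and shows that `.iloc` also accepts negative positions and lists of positions, as ordinary Python sequences do. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['row1', 'row2', 'row3'])

# Position slices exclude the end: rows 0-1 and columns 0-1 only.
top_left = df.iloc[0:2, 0:2]

# Negative positions count from the end, as in Python lists.
last_row = df.iloc[-1]

# Lists of positions pick out specific rows and columns.
picked = df.iloc[[0, 2], [0, 2]]

print(top_left.shape)            # (2, 2)
print(last_row.to_list())        # [3, 6, 9]
print(picked.values.tolist())    # [[1, 7], [3, 9]]
```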

Key Differences:

| Feature | .loc (Label-based) | .iloc (Position-based) |
|---|---|---|
| Indexing method | Label-based (index/column names) | Position-based (integer positions) |
| Start/end inclusion | Includes both start and end in ranges | Excludes the end in ranges |
| Type of input | Row and column labels | Integer index positions |
| Use case | When you know the label/index of rows/columns | When you want to select by integer positions |

Summary:

Use .loc when you want to select rows and columns by labels.

Use .iloc when you want to select rows and columns by position.
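The distinction matters most when the index labels are themselves integers. The sketch below uses a hypothetical DataFrame whose index is 10, 20, 30, so labels and positions no longer coincide.

```python
import pandas as pd

# Hypothetical DataFrame with integer labels that differ from the positions.
df = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

by_label = df.loc[10, 'A']     # label 10 is the first row
by_position = df.iloc[0, 0]    # position 0 is also the first row

label_slice = df.loc[10:20]    # labels 10 and 20: end label included
position_slice = df.iloc[0:1]  # position 0 only: end position excluded

# There is no row labelled 0, so .loc raises a KeyError.
try:
    df.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(by_label, by_position)                   # 1 1
print(len(label_slice), len(position_slice))   # 2 1
print(missing_label)                           # True
```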
