Pandas Library Documentation


Introduction to Pandas

Pandas is a powerful open-source data analysis and manipulation library for Python. It provides the
data structures and functions needed to work with structured data, including handling missing values,
filtering, and aggregation. Pandas is widely used in data science, machine learning, and data
analysis because of its ease of use and versatility.

Why Use Pandas?

Ease of Use: Pandas simplifies complex data manipulations and provides a straightforward
syntax.
Performance: Built on top of NumPy, it provides high-performance data structures for data
analysis.
Flexibility: Handles a variety of data formats, including CSV, Excel, SQL databases, and more.
Data Handling: Offers robust tools for cleaning, transforming, and analyzing data.

Getting Started with Pandas

Before you start using Pandas, ensure you have it installed. You can install it using pip:

bash

pip install pandas

Importing Pandas

To start using Pandas, you need to import it in your Python script or Jupyter notebook:

python

import pandas as pd

Data Structures in Pandas

Pandas provides two primary data structures:

1. Series: A one-dimensional labeled array capable of holding any data type.


2. DataFrame: A two-dimensional labeled data structure with columns of potentially different
types, similar to a table in a database or a spreadsheet.

Printed using Save ChatGPT as PDF, powered by PDFCrowd HTML to PDF API. 1/16
Creating a Series

A Series can be created from a list or an array. Here's how:

python

# Creating a Series from a list


data = [10, 20, 30, 40]
series = pd.Series(data)

print(series)

Output:

0 10
1 20
2 30
3 40
dtype: int64

Explanation:

The index of the Series is automatically generated (0, 1, 2, 3).


The data type is inferred from the values.
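
A Series need not use the automatically generated integer index; you can supply your own labels. A minimal sketch with hypothetical subject grades:

```python
import pandas as pd

# A Series with a custom labeled index
grades = pd.Series([88, 92, 85], index=['Math', 'English', 'Science'])
print(grades['English'])  # 92
```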

Creating a DataFrame

A DataFrame can be created from a dictionary of lists, where each key-value pair corresponds to a
column label and the data.

python

# Creating a DataFrame from a dictionary


data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 30, 22],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

print(df)

Output:

Name Age City


0 Alice 24 New York
1 Bob 30 Los Angeles
2 Charlie 22 Chicago

Explanation:

Each column is labeled (Name, Age, City).


The DataFrame automatically assigns an index.

Real-World Example: Analyzing a Dataset

Let's consider a dataset of students' grades. We'll analyze their performance using Pandas.

Step 1: Loading the Data

Assuming you have a CSV file named `students.csv` with the following content:

csv

Name,Math,English,Science
Alice,88,92,85
Bob,79,85,78
Charlie,95,90,92

You can load this data into a DataFrame:

python

# Loading data from a CSV file


df = pd.read_csv('students.csv')

print(df)

Output:

Name Math English Science


0 Alice 88 92 85
1 Bob 79 85 78
2 Charlie 95 90 92

Explanation:

`pd.read_csv()` is used to read the CSV file and convert it into a DataFrame.

Step 2: Exploring the Data

You can explore the data using several functions:

python

# Display the first few rows


print(df.head())

# Get summary statistics
print(df.describe())

# Get information about the DataFrame


print(df.info())

Output of `df.describe()`:

Math English Science


count 3.000000 3.000000 3.000000
mean 87.333333 89.000000 85.000000
std 8.020806 3.605551 7.000000
min 79.000000 85.000000 78.000000
25% 83.500000 87.500000 81.500000
50% 88.000000 90.000000 85.000000
75% 91.500000 91.000000 88.500000
max 95.000000 92.000000 92.000000

Explanation:

`head()` shows the first five rows of the DataFrame.


`describe()` provides summary statistics for numerical columns.
`info()` gives a concise summary of the DataFrame.

Data Manipulation

Selecting Columns

You can select a column from the DataFrame as follows:

python

# Selecting the 'Math' column


math_scores = df['Math']
print(math_scores)

Output:

0 88
1 79
2 95
Name: Math, dtype: int64

Explanation:

You can access a column by using its label.

Filtering Rows

To filter rows based on a condition:

python

# Filtering students who scored more than 85 in Math


high_math_scores = df[df['Math'] > 85]
print(high_math_scores)

Output:

Name Math English Science


0 Alice 88 92 85
2 Charlie 95 90 92

Explanation:

You can use boolean indexing to filter rows that meet a certain condition.
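
Conditions can also be combined with `&` (and) and `|` (or); each condition needs its own parentheses. A sketch recreating the students table inline:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [88, 79, 95],
    'English': [92, 85, 90],
    'Science': [85, 78, 92],
})

# Combine two conditions; parentheses around each comparison are required
strong = df[(df['Math'] > 85) & (df['English'] > 90)]
print(strong['Name'].tolist())  # ['Alice']
```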


Data Cleaning with Pandas

Data cleaning is a crucial step in data analysis. Pandas provides several tools to handle missing data,
duplicates, and outliers effectively.

Handling Missing Data

Missing values can skew your analysis. Here’s how to deal with them:

1. Identifying Missing Values: You can check for missing values using `isnull()` or `isna()`.

python

# Check for missing values


print(df.isnull().sum())

Output:

Name 0
Math 0
English 0
Science 0
dtype: int64

Explanation:

This will show the number of missing values in each column.

2. Dropping Missing Values: If you want to remove rows with missing values:

python

# Drop rows with any missing values


df_cleaned = df.dropna()

print(df_cleaned)

3. Filling Missing Values: Alternatively, you can fill missing values with a specific value or the
mean/median of the column.

python

# Fill missing values with the column mean (assigning the result back
# avoids the chained inplace=True pattern, which newer pandas warns about)
df['Math'] = df['Math'].fillna(df['Math'].mean())

Explanation:

`dropna()` removes rows with missing values.


`fillna()` replaces missing values with a specified value, here the column mean.
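
The sample DataFrame above has no missing values, so here is a small self-contained sketch where `fillna()` actually has a gap to fill:

```python
import pandas as pd
import numpy as np

scores = pd.Series([88.0, np.nan, 95.0])

# Replace the missing entry with the mean of the observed values
filled = scores.fillna(scores.mean())
print(filled.tolist())  # [88.0, 91.5, 95.0]
```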

Removing Duplicates

Duplicated entries can lead to inaccurate analysis. Here’s how to find and remove them:

python

# Create a DataFrame with duplicates


data_with_duplicates = {
'Name': ['Alice', 'Bob', 'Alice'],
'Math': [88, 79, 88],
'English': [92, 85, 92],
'Science': [85, 78, 85]
}
df_duplicates = pd.DataFrame(data_with_duplicates)

# Check for duplicates


print(df_duplicates.duplicated())

# Remove duplicates

df_unique = df_duplicates.drop_duplicates()
print(df_unique)

Output:

0 False
1 False
2 True
dtype: bool

Explanation:

`duplicated()` returns a boolean Series indicating duplicate rows.


`drop_duplicates()` removes duplicate rows from the DataFrame.

Grouping Data

Grouping data is essential for performing operations on subsets of the data. The `groupby()` function
allows you to group data by one or more columns.

Example: Grouping by a Column

Let's say you want to find the average scores for each student. (In this small dataset each name appears only once, so each group mean simply equals the original row, but the same pattern applies when names repeat.)

python

# Grouping by 'Name' and calculating the mean of each group


grouped = df.groupby('Name').mean()
print(grouped)

Output:

Math English Science


Name
Alice 88.0 92.0 85.0
Bob 79.0 85.0 78.0
Charlie 95.0 90.0 92.0

Explanation:

The `groupby()` function groups the DataFrame by the specified column(s).


The `mean()` function computes the average for each group.
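
Groups are not limited to a single aggregation. A sketch with hypothetical repeated scores, using `agg()` to compute several statistics per group at once:

```python
import pandas as pd

df_scores = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Score': [88, 79, 92, 85],
})

# agg() applies several aggregations to each group in one pass
summary = df_scores.groupby('Name')['Score'].agg(['mean', 'max'])
print(summary)
```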

Merging DataFrames
Sometimes, you need to combine multiple DataFrames. Pandas offers functions like `merge()` and
`concat()` to help with this.

Example: Merging Two DataFrames

Assume you have another DataFrame containing additional student information.

python

# Additional DataFrame
data_additional = {
'Name': ['Alice', 'Bob', 'Charlie'],
'School': ['School A', 'School B', 'School C']
}
df_additional = pd.DataFrame(data_additional)

# Merging DataFrames on 'Name'


merged_df = pd.merge(df, df_additional, on='Name')
print(merged_df)

Output:

Name Math English Science School


0 Alice 88 92 85 School A
1 Bob 79 85 78 School B
2 Charlie 95 90 92 School C

Explanation:

`pd.merge()` combines two DataFrames based on a common column (in this case, 'Name').
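
By default `merge()` performs an inner join, keeping only rows present in both frames. The `how` parameter changes this; a sketch with a hypothetical unmatched student:

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'], 'Math': [88, 79, 91]})
right = pd.DataFrame({'Name': ['Alice', 'Bob'], 'School': ['School A', 'School B']})

# how='left' keeps every row of the left frame; unmatched rows get NaN
merged = pd.merge(left, right, on='Name', how='left')
print(merged)
```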

Example: Concatenating DataFrames

If you have DataFrames with the same columns and want to stack them vertically, you can use
`concat()`.

python

# Concatenating DataFrames
data_more = {
'Name': ['David', 'Eva'],
'Math': [91, 84],
'English': [89, 90],
'Science': [88, 80]
}
df_more = pd.DataFrame(data_more)

# Concatenating the original DataFrame with the new one


concatenated_df = pd.concat([df, df_more], ignore_index=True)
print(concatenated_df)

Output:

Name Math English Science


0 Alice 88 92 85
1 Bob 79 85 78
2 Charlie 95 90 92
3 David 91 89 88
4 Eva 84 90 80

Explanation:

`pd.concat()` stacks the DataFrames vertically, and `ignore_index=True` resets the index.
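
`concat()` can also join frames side by side with `axis=1`, aligning rows on the index. A minimal sketch:

```python
import pandas as pd

math = pd.DataFrame({'Math': [88, 79]}, index=['Alice', 'Bob'])
english = pd.DataFrame({'English': [92, 85]}, index=['Alice', 'Bob'])

# axis=1 concatenates columns, matching rows by index label
combined = pd.concat([math, english], axis=1)
print(combined)
```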

Data Visualization with Pandas

Pandas integrates well with libraries like Matplotlib for data visualization. You can create various plots
directly from DataFrames.

Example: Plotting Data

python

import matplotlib.pyplot as plt

# Bar plot for Math scores


df.plot(x='Name', y='Math', kind='bar', title='Math Scores by Student')
plt.ylabel('Scores')
plt.show()

Output:

This code will generate a bar plot showing the Math scores of each student.

Explanation:

The `plot()` method allows you to create various types of plots (line, bar, scatter, etc.) directly
from a DataFrame.

Conclusion

Pandas is an essential tool for data analysis and manipulation in Python. It provides powerful data
structures and functions for cleaning, exploring, and visualizing data efficiently.

With the knowledge of how to create Series and DataFrames, handle missing data, manipulate data,
group data, and visualize results, you are well on your way to mastering data analysis with Pandas!


Advanced Features of Pandas

Now that you have a solid foundation in the basic functionalities of Pandas, let’s explore some
advanced features that can enhance your data manipulation and analysis capabilities.

Time Series Data

Pandas has powerful tools for handling time series data. It allows you to work with dates and times,
making it ideal for analyzing time-dependent data.

Creating a Time Series

You can create a time series by specifying a date range. For example, let's create a series of daily data
for a week:

python

# Create a date range


dates = pd.date_range(start='2024-01-01', periods=7, freq='D')

# Create a Series with the date range


temperature = pd.Series([22, 23, 21, 20, 19, 22, 24], index=dates)

print(temperature)

Output:

2024-01-01 22
2024-01-02 23
2024-01-03 21
2024-01-04 20
2024-01-05 19
2024-01-06 22
2024-01-07 24
Freq: D, dtype: int64

Explanation:

`pd.date_range()` creates a range of dates.


The Series is indexed by these dates, allowing easy access to time-based data.
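
One benefit of a `DatetimeIndex` is that you can slice it with date strings. Continuing the temperature example:

```python
import pandas as pd

dates = pd.date_range(start='2024-01-01', periods=7, freq='D')
temperature = pd.Series([22, 23, 21, 20, 19, 22, 24], index=dates)

# Slicing a DatetimeIndex with date strings includes both endpoints
first_three = temperature['2024-01-01':'2024-01-03']
print(first_three.tolist())  # [22, 23, 21]
```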

Resampling Time Series Data

You can resample time series data to different frequencies (e.g., daily to monthly).

python

# Create a DataFrame with random data


data = {
'Date': pd.date_range(start='2024-01-01', periods=10, freq='D'),
'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
df_time_series = pd.DataFrame(data)
df_time_series.set_index('Date', inplace=True)

# Resample to weekly frequency and calculate the sum


weekly_data = df_time_series.resample('W').sum()
print(weekly_data)

Output:

Value
Date
2024-01-07 28
2024-01-14 27

Explanation:

The `resample()` method allows you to change the frequency of the time series.
In this case, we resampled the daily data to weekly data and calculated the sum.
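
Any aggregation can follow `resample()`, not just `sum()`. A sketch resampling the same ten daily values into two-day means:

```python
import pandas as pd

s = pd.Series(range(1, 11),
              index=pd.date_range('2024-01-01', periods=10, freq='D'))

# Two-day bins, averaged instead of summed
two_day_mean = s.resample('2D').mean()
print(two_day_mean.tolist())  # [1.5, 3.5, 5.5, 7.5, 9.5]
```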

Pivot Tables

Pivot tables are an excellent way to summarize and aggregate data. They allow you to reshape data
and extract meaningful insights.

Creating a Pivot Table

Let’s say you have a dataset of sales data:

python

# Sample sales data


sales_data = {
'Date': ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-02'],
'Product': ['A', 'A', 'B', 'B'],
'Sales': [100, 200, 150, 250]
}
df_sales = pd.DataFrame(sales_data)

# Create a pivot table


pivot_table = df_sales.pivot_table(values='Sales', index='Date', columns='Product',
                                   aggfunc='sum', fill_value=0)
print(pivot_table)

Output:

Product A B
Date
2024-01-01 100 150
2024-01-02 200 250

Explanation:

The `pivot_table()` function reshapes the data, aggregating sales by date and product.
`aggfunc='sum'` specifies that we want to sum the sales, and `fill_value=0` replaces any missing
values with zero.
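
`pivot_table()` also accepts `margins=True`, which appends row and column totals under the label `All`. A sketch with the same sales data:

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-02'],
    'Product': ['A', 'A', 'B', 'B'],
    'Sales': [100, 200, 150, 250],
})

# margins=True adds 'All' totals for each row and column
pivot = df_sales.pivot_table(values='Sales', index='Date', columns='Product',
                             aggfunc='sum', margins=True)
print(pivot)
```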

Applying Functions to DataFrames

Pandas allows you to apply custom functions to your DataFrames, which can be useful for more
complex transformations.

Using `apply()`

You can use the `apply()` method to apply a function along an axis of the DataFrame.

python

# Define a custom function to calculate the square of a value


def square(x):
return x ** 2

# Apply the function to the 'Sales' column


df_sales['Sales Squared'] = df_sales['Sales'].apply(square)

print(df_sales)

Output:

Date Product Sales Sales Squared


0 2024-01-01 A 100 10000
1 2024-01-02 A 200 40000
2 2024-01-01 B 150 22500
3 2024-01-02 B 250 62500

Explanation:

The `apply()` method applies the `square()` function to each element in the 'Sales' column.
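
`apply()` can also work row-wise with `axis=1`, which is useful for combining several columns. A sketch with hypothetical scores:

```python
import pandas as pd

df_demo = pd.DataFrame({'Math': [88, 79, 95], 'English': [92, 85, 90]})

# axis=1 passes each row to the function as a Series
df_demo['Total'] = df_demo.apply(lambda row: row['Math'] + row['English'], axis=1)
print(df_demo['Total'].tolist())  # [180, 164, 185]
```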

Using `applymap()`

To apply a function element-wise across a whole DataFrame, you can use `applymap()`. Note that in pandas 2.1 and later this method has been renamed `DataFrame.map()`, and `applymap()` is deprecated.

python

# Apply a function to square each value in the DataFrame


df_squared = df_sales.applymap(lambda x: x ** 2 if isinstance(x, (int, float)) else x)

print(df_squared)

Output:

Date Product Sales Sales Squared


0 2024-01-01 A 10000 100000000
1 2024-01-02 A 40000 1600000000
2 2024-01-01 B 22500 506250000
3 2024-01-02 B 62500 3906250000

Explanation:

The `applymap()` method applies the lambda function to every element in the DataFrame,
checking if the element is a number before squaring it.

Handling Categorical Data

Pandas provides efficient methods for working with categorical data, which can save memory and
improve performance.

Converting to Categorical

You can convert a column to a categorical type, which is particularly useful for columns with a limited
number of unique values.

python

# Creating a DataFrame with categorical data


data_categorical = {
'Category': ['A', 'B', 'A', 'C', 'B', 'C'],
'Value': [1, 2, 3, 4, 5, 6]
}
df_categoric = pd.DataFrame(data_categorical)

# Convert 'Category' to a categorical type


df_categoric['Category'] = df_categoric['Category'].astype('category')

print(df_categoric.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Category 6 non-null category
1 Value 6 non-null int64
dtypes: category(1), int64(1)
memory usage: 173.0 bytes

Explanation:

The `astype('category')` method converts the 'Category' column to a categorical data type,
reducing memory usage.
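
You can verify the memory saving directly with `memory_usage(deep=True)`, which counts the actual string storage. A sketch with a larger repeated column:

```python
import pandas as pd

labels = pd.Series(['A', 'B', 'C'] * 1000)
as_category = labels.astype('category')

# deep=True includes the bytes used by the Python string objects
print(labels.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```

With only three unique values among 3,000 entries, the categorical version stores each label once plus small integer codes, so it uses far less memory.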

Analyzing Categorical Data

You can easily analyze categorical data using methods like `value_counts()`.

python

# Count the occurrences of each category


category_counts = df_categoric['Category'].value_counts()
print(category_counts)

Output:

A 2
B 2
C 2
Name: Category, dtype: int64

Explanation:

`value_counts()` returns the count of unique values in the specified column.
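
Passing `normalize=True` returns proportions instead of raw counts:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A', 'C', 'B', 'C'])

# Proportion of each unique value rather than its count
props = s.value_counts(normalize=True)
print(props)
```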

Real-World Example: Customer Segmentation

Let’s say you have a dataset of customer purchases, and you want to segment customers based on
their purchase behavior.

Step 1: Loading Customer Data

Assuming you have a CSV file named `customers.csv`:

csv

CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Female,23,40,60
2,Male,30,50,70
3,Female,35,60,80

4,Male,40,80,90
5,Female,25,70,60

You can load this data into a DataFrame:

python

df_customers = pd.read_csv('customers.csv')
print(df_customers)

Output:

CustomerID Gender Age Annual Income (k$) Spending Score (1-100)


0 1 Female 23 40 60
1 2 Male 30 50 70
2 3 Female 35 60 80
3 4 Male 40 80 90
4 5 Female 25 70 60

Explanation:

The CSV is loaded into a DataFrame for analysis.

Step 2: Data Cleaning

You can check for missing values and clean the data as needed.

python

# Check for missing values


print(df_customers.isnull().sum())

Output:

CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64

Explanation:

This shows that there are no missing values in the dataset.

Step 3: Analyzing the Data

You can create a pivot table to analyze spending scores based on gender:

python

# Create a pivot table to summarize spending scores by gender


spending_by_gender = df_customers.pivot_table(values='Spending Score (1-100)',
                                              index='Gender', aggfunc='mean')
print(spending_by_gender)

Output:

Spending Score (1-100)


Gender
Female 66.666667
Male 80.000000

Explanation:

The pivot table shows the average spending score for each gender.

Step 4: Visualization

Finally, you can visualize the spending scores using a bar plot:

python

# Bar plot of average spending scores


spending_by_gender.plot(kind='bar', title='Average Spending Score by Gender')
plt.ylabel('Average Spending Score')
plt.show()

Output:

A bar plot displaying the average spending score by gender.

Conclusion

Pandas is an incredibly powerful library for data manipulation and analysis in Python. From basic
operations to advanced functionalities like time series analysis, pivot tables, and handling categorical
data, Pandas provides the tools necessary to make data analysis efficient and effective.

As you continue to explore Pandas, try applying these concepts to your own datasets.

