pandas (1)

Uploaded by

krushnasil123
Copyright
© © All Rights Reserved

Pandas

Pandas is a powerful, open-source Python library for data manipulation and
analysis. It consists of data structures and functions for performing efficient
operations on data.
Pandas is well-suited for working with tabular data, such as spreadsheets or SQL
tables.
The Pandas library is an essential tool for data analysts, scientists, and engineers
working with structured data in Python.

It is built on top of the NumPy library, which means that many NumPy structures
are used or replicated in Pandas.
The data produced by Pandas is often used as input for plotting functions in
Matplotlib, statistical analysis in SciPy, and machine learning algorithms in
Scikit-learn.
Pandas is used throughout the data analysis workflow. With pandas, you can:
• Import datasets from databases, spreadsheets, comma-separated values (CSV)
files, and more.
• Clean datasets, for example, by dealing with missing values.
• Tidy datasets by reshaping their structure into a suitable format for analysis.
• Aggregate data by calculating summary statistics such as the mean of columns,
correlation between them, and more.
• Visualize datasets and uncover insights.
pandas also contains functionality for time series analysis and for analyzing
text data.
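As a rough illustration of the clean and aggregate steps listed above, the following sketch uses a small made-up table. The DataFrame name sales and its columns (region, units, price) are assumptions for illustration only and do not come from the slides.

```python
import pandas as pd

# Hypothetical sales records; the column names are made up for illustration.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "units": [10, None, 7, 5],
    "price": [2.5, 3.0, 4.0, 4.5],
})

# Clean: treat the missing unit count as 0.
sales["units"] = sales["units"].fillna(0)

# Aggregate: mean of each numeric column per region.
summary = sales.groupby("region")[["units", "price"]].mean()
print(summary)

# Correlation between two numeric columns.
print(sales["units"].corr(sales["price"]))
```

Filling with 0 is only one possible cleaning choice; the mean or a dropped row may suit other datasets better.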
• Data Structures in Pandas Library
• Pandas provides two main data structures for manipulating data. They
are:
• Series
• DataFrame
• A DataFrame is a 2-dimensional data structure that can store data of
different types (including characters, integers, floating point values,
categorical data and more) in columns. It is similar to a spreadsheet, a
SQL table or the data.frame in R.
• Each column in a DataFrame is a Series.
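The relationship between the two structures can be checked directly. This is a minimal sketch; the column names name and age are assumptions chosen for illustration.

```python
import pandas as pd

# A small DataFrame with two columns of different types.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# Selecting a single column returns a Series.
col = df["age"]

print(type(df).__name__)   # DataFrame
print(type(col).__name__)  # Series
print(col.dtype)           # each Series has one dtype for all its values
```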
Installing pandas
pip install pandas

Checking the pandas version


import pandas
print(pandas.__version__)

import pandas
data = [1, 2, 3, 4]
ser = pandas.Series(data)  # a Series is a one-dimensional labeled array
print(ser)

Example 1
import pandas as pd
mydataset = { 'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2]}
myvar = pd.DataFrame(mydataset)
print(myvar)

Example 2
import pandas as pd
data = { "calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
print(df)

Pandas uses the loc attribute to return one or more specified rows.

import pandas as pd

data = { "calories": [420, 380, 390], "duration": [50, 40, 45] }

df = pd.DataFrame(data)

print(df.loc[1])  # a single label returns the row as a Pandas Series

print(df.loc[[0, 1]])  # when using [], the result is a Pandas DataFrame

df.iloc[:3]  # iloc selects by position: accesses the first three rows


Named Indexes: With the index argument, you can
name your own indexes.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
# Use the named index in the loc attribute to return the specified row(s)
print(df.loc["day2"])
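Once the index is named, loc and iloc stop being interchangeable: loc looks rows up by label, iloc by integer position. The sketch below reuses the calories/duration data from the slides; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

by_label = df.loc["day2"]     # label-based lookup
by_position = df.iloc[1]      # position-based lookup of the same row

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc["day1":"day2"]
position_slice = df.iloc[0:2]

print(by_label)
print(label_slice)
```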
Pandas Read CSV
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
print(df.to_string())  # use to_string() to print the entire DataFrame
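To try read_csv without a file on disk, the CSV text can be wrapped in io.StringIO. The rows below are a made-up stand-in for the contents of data.csv, which the slides do not show.

```python
import io

import pandas as pd

# Stand-in for the contents of data.csv; these rows are made up.
csv_text = """calories,duration
420,50
380,40
390,45
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.to_string())
```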
You can check your system's maximum number of displayed rows with the
pd.options.display.max_rows setting.
import pandas as pd
print(pd.options.display.max_rows)

On my system the number is 60, which means that if the DataFrame contains more
than 60 rows, the print(df) statement will show only the headers and the first and
last 5 rows.
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
Viewing Data
df.head()
Shows the first 5 rows
df.tail()
Shows the last 5 rows.
df.shape
Gives the dimensions (rows, columns)
Inspecting Columns
df.columns
Lists all column names
df.dtypes
Shows data types for each column
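The inspection calls above can be tried on a small table. The data here is made up; the column names Name, Age and City are chosen to match the ones the later slides refer to.

```python
import pandas as pd

# Hypothetical people table for trying out the inspection calls.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank"],
    "Age": [24, 31, 28, 45, 22, 37],
    "City": ["New York", "Boston", "Chicago", "Boston", "Chicago", "New York"],
})

print(df.head(2))        # first 2 rows; head() with no argument gives 5
print(df.tail(2))        # last 2 rows
print(df.shape)          # (6, 3)
print(list(df.columns))  # ['Name', 'Age', 'City']
print(df.dtypes)         # one dtype per column
```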
Condition-based Selection

df[df['Age'] > 25] # Select rows where Age is greater than 25
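The single comparison above extends to several conditions by combining boolean masks with & (and) and | (or). Each condition needs its own parentheses. The table below is made up for illustration.

```python
import pandas as pd

# Hypothetical data with the Age and City columns used on the slides.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [24, 31, 28, 45],
    "City": ["New York", "Boston", "Chicago", "Boston"],
})

older = df[df["Age"] > 25]                                 # one condition
both = df[(df["Age"] > 25) & (df["City"] == "Boston")]     # both must hold
either = df[(df["Age"] < 25) | (df["City"] == "Boston")]   # at least one holds

print(both)
```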

Modifying DataFrames : Adding a Column

df['Country'] = ['USA', 'USA', 'USA']

Updating Values:

df.loc[df['Name'] == 'Alice', 'City'] = 'San Francisco'


Removing Columns:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Explanation of Parameters:
• labels: Specifies labels to drop. It can be used as an alternative to index or columns by specifying either
row or column labels.
• axis: Defines which axis to drop from. Use 0 for rows and 1 for columns. The default is 0.
• index: Specifies row labels to drop.
• columns: Specifies column labels to drop, which is used in the example below.
• level: Useful when working with MultiIndex (hierarchical) DataFrames to select labels at a specific level.
• inplace: If set to True, it performs the operation in place without returning a new DataFrame.
Default is False.
• errors: If set to 'raise', an error is raised if labels aren't found. If set to 'ignore', no error is raised if the
specified labels do not exist.

df.drop(columns=['Country'], inplace=True)
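The parameters described above can be compared side by side. This sketch uses a made-up three-column table; the column name Zip is a deliberately missing label used to show errors='ignore'.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "City": ["New York", "Boston"],
    "Country": ["USA", "USA"],
})

# Without inplace=True, drop returns a new DataFrame and leaves df unchanged.
no_country = df.drop(columns=["Country"])

# labels plus axis=1 is equivalent to columns=...
same_result = df.drop(labels=["Country"], axis=1)

# index=... drops rows by their labels.
no_first_row = df.drop(index=[0])

# errors='ignore' skips labels that do not exist instead of raising KeyError.
unchanged = df.drop(columns=["Zip"], errors="ignore")

print(no_country)
```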


print(df.describe()) # Gives basic statistics for numerical columns

Other Useful methods


df.mean(numeric_only=True)  # Mean of each numeric column
df.value_counts('City')  # Number of rows for each City value
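A self-contained run of these summary methods might look like the sketch below. The table is made up. numeric_only=True is passed to mean() because recent pandas versions raise an error when averaging text columns such as Name.

```python
import pandas as pd

# Hypothetical data with one numeric and two text columns.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [24, 31, 28],
    "City": ["Boston", "Boston", "Chicago"],
})

print(df.describe())                 # count, mean, std, min, quartiles, max

means = df.mean(numeric_only=True)   # skip the text columns
counts = df["City"].value_counts()   # rows per distinct City value

print(means)
print(counts)
```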

Checking for Missing Values:


df.isnull().sum() # Shows the count of missing values per column

Filling Missing Values


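# Replace NaNs with the column mean. Assigning the result back avoids the
# chained inplace call, which newer pandas versions warn about.
df['Age'] = df['Age'].fillna(df['Age'].mean())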

Dropping Missing Data

df.dropna(inplace=True)
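The three missing-value steps can be seen together on a small example. The data is made up, with one missing Age and one missing City.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing missing values.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [30.0, np.nan, 40.0],
    "City": ["Boston", None, "Chicago"],
})

# Count the missing values in each column.
missing = df.isnull().sum()
print(missing)

# Option 1: fill the missing Age with the mean of the known ages.
filled = df.assign(Age=df["Age"].fillna(df["Age"].mean()))

# Option 2: drop every row that has at least one missing value.
complete = df.dropna()

print(filled)
print(complete)
```

Filling keeps all rows but introduces an estimated value; dropping keeps only observed data but shrinks the dataset.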
Visualization in Python

Data visualization in Python refers to the practice of transforming data into
graphical representations, such as charts, graphs, and plots, to make
complex information easier to understand and interpret.
Using libraries like Matplotlib, Seaborn, and Plotly, Python provides
extensive tools for creating a wide variety of visualizations, from basic line
and bar graphs to advanced scatter plots, heatmaps, and interactive
dashboards.
The primary purpose of data visualization is to make data analysis more
accessible by highlighting patterns, trends, and outliers in a visual format.
Matplotlib is a widely used plotting library for Python that provides a flexible way
to create static, animated, and interactive visualizations.
It allows users to generate a variety of plots, such as line graphs, scatter plots,
bar charts, histograms, and more.
import matplotlib.pyplot as plt

1. Plotting Graphs
Basic Line Plot: Start with a basic plt.plot() call to see how Matplotlib handles
line graphs.
Line plots are used to display trends or change over time. They are ideal for
continuous data or time series where you want to track movement over intervals
(Ex. stock prices over days, temperature changes over hours, monthly sales revenue).


import matplotlib.pyplot as plt
# Line plot
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.title("Basic Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

# Controlling Graph
Axis Limits: Use plt.xlim() and plt.ylim() to set specific axis limits.
Line Styles and Colors: Customize line styles and colors using parameters in plt.plot()
(e.g., color, linestyle, linewidth).
Grid and Background: Add grid lines with plt.grid() and background color using
plt.gca().set_facecolor().
# Customizing plot
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], color='red', linestyle='--', linewidth=2)
plt.xlim(0, 5)
plt.ylim(0, 20)
plt.grid(True)
plt.title("Controlled Graph")
plt.show()
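The example above does not use the background color call mentioned earlier. A possible variant that adds plt.gca().set_facecolor() is sketched here; the color name "lightgray" is an arbitrary choice, and plt.show() is left as a comment so the sketch also runs in a script without a display.

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16], color="red", linestyle="--", linewidth=2)

# Fix the visible range of each axis.
plt.xlim(0, 5)
plt.ylim(0, 20)

# Grid lines plus a background color for the plotting area.
plt.grid(True)
ax = plt.gca()                 # the current Axes object
ax.set_facecolor("lightgray")

plt.title("Controlled Graph with Background")
# plt.show()  # uncomment to display the figure interactively
```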

# Adding Text
Title and Labels: Set the title with plt.title() and the axis labels with
plt.xlabel() and plt.ylabel().
Annotations: Use plt.annotate() to add annotations directly on specific data
points.
Legend: Use plt.legend() to label different series on a graph.
# Adding text to the plot
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], label="Data")
plt.title("Plot with Text and Annotations")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.annotate('Highest Point', xy=(4, 16), xytext=(3, 12),
arrowprops=dict(facecolor='black', shrink=0.05))
plt.legend()
plt.show()
Scatter Plot:
Use plt.scatter() for scatter plots, which show the relationship or
correlation between two variables.
Useful for identifying patterns, clusters, and potential outliers, and for
examining the relationship between two continuous variables.
Ex: Exam scores vs. study hours, height vs. weight, SGPA vs. attendance,
etc.
import matplotlib.pyplot as plt
# Sample data
age = [22, 25, 26, 30, 32, 35, 40, 42, 45, 50]
income = [5000, 7000, 8000, 12000, 15000, 18000, 20000, 22000, 24000, 26000]
# Plot
plt.scatter(age, income, color='blue', marker='o')
plt.title('Income vs Age')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.show()
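The relationship a scatter plot shows can also be summarized as a number. As a rough sketch, np.corrcoef gives the Pearson correlation coefficient for the same age and income lists; a value close to 1 means a strong positive linear relationship.

```python
import numpy as np

# Same sample data as the scatter plot above.
age = [22, 25, 26, 30, 32, 35, 40, 42, 45, 50]
income = [5000, 7000, 8000, 12000, 15000, 18000, 20000, 22000, 24000, 26000]

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
# between the two variables.
r = np.corrcoef(age, income)[0, 1]
print(round(r, 3))
```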
Bar Chart: Use plt.bar() for bar charts, which are ideal for categorical data.
Used to compare quantities of discrete categories.
Suitable for categorical data where you want to show the quantity or
frequency of each category.
Ex: Number of students per class, sales by product category.

import matplotlib.pyplot as plt
# Sample data
categories = ['Apples', 'Bananas', 'Oranges', 'Grapes']
quantities = [25, 15, 30, 20]
# Plot
plt.bar(categories, quantities, color='green')
plt.title('Fruit Quantities')
plt.xlabel('Fruit')
plt.ylabel('Quantity')
plt.show()

Histogram
To show the distribution of a continuous variable.
Helps in understanding the frequency distribution, range, and shape of data.
Ex: Age distribution and income distribution within a population.
Box Plot
To display the distribution, median, and outliers in data
Useful in statistics for comparing distribution across multiple groups,
spotting outliers, and observing the spread and skewness of data
Ex : Exam scores of students from different classes, income levels across
different regions.
import matplotlib.pyplot as plt
import numpy as np

# Sample data
data = [np.random.normal(50, 5, 100), np.random.normal(55, 10, 100),
np.random.normal(60, 15, 100)]

# Plot
plt.boxplot(data, patch_artist=True, labels=['Group 1', 'Group 2', 'Group 3'])
plt.title('Distribution of Test Scores by Group')
plt.ylabel('Test Scores')
plt.show()
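The values a box plot draws can be computed by hand. The sketch below uses a made-up list of ten scores and the 1.5 × IQR rule, which is the default whisker setting (whis=1.5) in plt.boxplot; points beyond the fences are drawn as outliers.

```python
import numpy as np

# Hypothetical exam scores with one unusually high value.
scores = np.array([52, 55, 57, 58, 60, 61, 63, 65, 68, 95])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, median, q3)
print(outliers)
```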
Pie chart
To show parts of a whole.
Used for representing data as proportions or percentages of a whole, best
when the segments are limited to a few categories

Ex: budget allocation across departments, market share distribution among
companies
import matplotlib.pyplot as plt

# Sample data
sizes = [25, 35, 20, 20]
labels = ['Category A', 'Category B', 'Category C', 'Category D']

# Plot
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.show()

A histogram is a type of bar chart used in statistics to show the frequency
distribution of a continuous variable by grouping data into ranges, or "bins."
Each bin represents a range of values, and the height of each bar in the
histogram reflects the number of observations (frequency) within that range
Key Aspects of Histograms:
Purpose:
Histograms are used to understand the distribution, spread, and shape of a
dataset. They are particularly helpful for identifying the central tendency,
variability, skewness, and the presence of any outliers.
They provide a visual summary of data, making it easy to spot patterns such as
normal distribution, skewed distribution, or bimodal distribution.
Histogram
import matplotlib.pyplot as plt
import numpy as np

# Generate random data
data = np.random.normal(0, 1, 1000)

# Plot
plt.hist(data, bins=30, color='purple', edgecolor='black')
plt.title('Distribution of Random Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
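To see what binning means in numbers, np.histogram returns the counts and bin edges that a histogram would draw. The ages and the three bins below are made up for illustration.

```python
import numpy as np

# Hypothetical ages grouped into three bins: 20-30, 30-40 and 40-50.
ages = [21, 22, 25, 29, 31, 34, 35, 38, 42, 47]
counts, edges = np.histogram(ages, bins=[20, 30, 40, 50])

# Each count is the height of one bar in the histogram.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.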
