pandas (1)
Pandas is a powerful, open-source Python library for data manipulation and
analysis. It provides data structures and functions for performing efficient
operations on data.
Pandas is well-suited for working with tabular data, such as spreadsheets or SQL
tables.
The Pandas library is an essential tool for data analysts, scientists, and engineers
working with structured data in Python.
It is built on top of the NumPy library, which means that many NumPy structures
are used or replicated in Pandas.
The data produced by Pandas is often used as input for plotting functions in
Matplotlib, statistical analysis in SciPy, and machine learning algorithms in
Scikit-learn.
Pandas is used throughout the data analysis workflow. With pandas, you can:
Import datasets from databases, spreadsheets, comma-separated values (CSV)
files, and more.
Clean datasets, for example, by dealing with missing values.
Tidy datasets by reshaping their structure into a suitable format for analysis.
Aggregate data by calculating summary statistics such as the mean of columns,
correlation between them, and more.
Visualize datasets and uncover insights.
Pandas also contains functionality for time series analysis and for analyzing
text data.
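The import, clean, and aggregate steps listed above can be sketched in a few lines. This is only an illustration: the CSV text, the column names (city, temp_c, humidity), and the choice to fill the missing value with the column mean are assumptions made for the example, not part of the original notes.

```python
import io

import pandas as pd

# Hypothetical CSV content; it stands in for a real file or database export.
csv_text = """city,temp_c,humidity
Oslo,4.0,80
Lima,19.5,
Cairo,28.0,35
"""

# Import: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Clean: fill the missing humidity value with the column mean.
df["humidity"] = df["humidity"].fillna(df["humidity"].mean())

# Aggregate: a column mean and the correlation between two columns.
mean_temp = df["temp_c"].mean()
corr = df["temp_c"].corr(df["humidity"])

print(df)
print("mean temperature:", mean_temp)
print("correlation:", corr)
```

Filling with the mean is just one way to handle missing values; dropping the affected rows is another.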
• Data Structures in Pandas Library
• Pandas provides two main data structures for manipulating data. They
are:
• Series
• DataFrame
• A DataFrame is a 2-dimensional data structure that can store data of
different types (including characters, integers, floating point values,
categorical data and more) in columns. It is similar to a spreadsheet, a
SQL table, or the data.frame in R.
• Each column in a DataFrame is a Series.
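A short sketch of the relationship between the two structures: selecting a single column from a DataFrame gives back a Series that shares the DataFrame's row index. The column names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# Selecting one column returns a Series, not a DataFrame.
col = df["calories"]

print(type(df).__name__)
print(type(col).__name__)

# The Series keeps the same row labels as the DataFrame it came from.
print(col.index.equals(df.index))
```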
Installing pandas
pip install pandas
import pandas
# Build a one-dimensional Series from a Python list
data = [1, 2, 3, 4]
ser = pandas.Series(data)
print(ser)
Example 1
import pandas as pd
mydataset = { 'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2]}
myvar = pd.DataFrame(mydataset)
print(myvar)
import pandas as pd
data = { "calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
print(df)
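Pandas uses the loc attribute to return one or more specified row(s).
import pandas as pd
data = { "calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
# Return the row with index label 1 (the result is a Series)
print(df.loc[1])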
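As a small extension of the loc example, the sketch below shows how the return type depends on what is passed in: a single label gives a Series, a list of labels gives a DataFrame, and a row label plus a column label gives one value.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[1]            # single label -> Series
two_rows = df.loc[[0, 1]]      # list of labels -> DataFrame
cell = df.loc[1, "duration"]   # row label and column label -> single value

print(one_row)
print(two_rows)
print(cell)
```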
The maximum number of rows that pandas prints is set by
pd.options.display.max_rows. On my system the number is 60, which means that if
the DataFrame contains more than 60 rows, the print(df) statement will show only
the headers and the first and last 5 rows.
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
Viewing Data
df.head()
Shows the first 5 rows by default.
df.tail()
Shows the last 5 rows by default.
df.shape
Gives the dimensions as a (rows, columns) tuple.
Inspecting Columns
df.columns
Lists all column names
df.dtypes
Shows data types for each column
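The viewing and inspecting calls above can be tried together on a small DataFrame. The data is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390, 410, 400, 395],
    "duration": [50, 40, 45, 60, 30, 55],
})

first = df.head()    # first 5 rows (the default)
last = df.tail(2)    # last 2 rows; both methods accept a row count

print(first)
print(last)
print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names
print(df.dtypes)         # data type of each column
```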
Condition-based Selection
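The notes give no example under this heading, so the following is one possible sketch of condition-based selection using a boolean mask. The DataFrame and the thresholds are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# A comparison on a column produces a boolean Series (the mask).
mask = df["calories"] > 385

# Indexing with the mask keeps only the rows where it is True.
high = df[mask]

# Combine conditions with & (and) or | (or); each one needs parentheses.
both = df[(df["calories"] > 385) & (df["duration"] < 50)]

print(high)
print(both)
```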
Updating Values:
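No example survives under this heading either. A minimal sketch of three common ways to update values, assuming the same small DataFrame as before (the new column name duration_hours is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# Update a single cell by row label and column label.
df.loc[1, "duration"] = 42

# Update every row that matches a condition.
df.loc[df["calories"] > 400, "calories"] = 400

# Add a new column computed from an existing one.
df["duration_hours"] = df["duration"] / 60

print(df)
```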
Explanation of Parameters (DataFrame.drop):
labels: Specifies the row or column labels to drop. It is an alternative to index or columns; the axis
parameter decides whether the labels refer to rows or columns.
axis: Defines which axis to drop from. Use 0 for rows and 1 for columns. The default is 0.
index: Specifies row labels to drop.
columns: Specifies column labels to drop (equivalent to passing labels with axis=1).
level: Useful when working with MultiIndex (hierarchical) DataFrames to select labels at a specific level.
inplace: If set to True, it performs the operation in place and returns None instead of a new DataFrame.
Default is False.
errors: If set to 'raise' (the default), an error is raised if labels aren’t found. If set to 'ignore', no error is raised and only the labels that exist are dropped.
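The parameters above can be seen in action in the sketch below. The DataFrame and the pulse column are illustrative assumptions, not taken from the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
    "pulse": [110, 117, 103],
})

# Drop a column by name; the original df is left unchanged.
no_pulse = df.drop(columns=["pulse"])

# The same operation written with labels and axis.
same = df.drop(labels=["pulse"], axis=1)

# Drop a row by its index label.
no_first = df.drop(index=[0])

# errors="ignore" skips labels that do not exist instead of raising.
safe = df.drop(columns=["missing"], errors="ignore")

print(no_pulse)
print(no_first)
```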
# Remove every row that contains a missing value, modifying df in place
df.dropna(inplace=True)
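To make the effect of dropna visible, here is a small sketch with made-up data that contains missing values. It also shows the subset parameter, which limits the check to chosen columns.

```python
import pandas as pd

# None becomes NaN (missing) in a numeric column.
df = pd.DataFrame({
    "calories": [420, None, 390],
    "duration": [50, 40, None],
})

# Returns a new DataFrame without rows that have any missing value.
cleaned = df.dropna()

# Only look at the calories column when deciding which rows to drop.
by_calories = df.dropna(subset=["calories"])

# inplace=True modifies df itself and returns None.
df.dropna(inplace=True)

print(cleaned)
print(by_calories)
```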
Visualization in Python
1. Plotting Graphs
Basic Line Plot: Start with a basic plt.plot() to show how Matplotlib handles line
graphs.
Used to display trends or change over time. Ideal for continuous data or time
series where you want to track movement or trends over intervals (e.g., stock
prices over days, temperature changes over hours, monthly sales revenue).
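A minimal line plot in the spirit of the monthly sales example. The month labels and revenue figures are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales revenue (in thousands).
months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [12.0, 13.5, 12.8, 15.2, 16.1]

# plt.plot returns a list with one Line2D object per plotted line.
lines = plt.plot(months, revenue, marker="o")
plt.title("Monthly Sales Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (thousands)")
plt.show()
```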
# Controlling Graph
Axis Limits: Use plt.xlim() and plt.ylim() to set specific axis limits.
Line Styles and Colors: Customize line styles and colors using parameters in plt.plot()
(e.g., color, linestyle, linewidth).
Grid and Background: Add grid lines with plt.grid() and background color using
plt.gca().set_facecolor().
# Customizing plot
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], color='red', linestyle='--', linewidth=2)
plt.xlim(0, 5)
plt.ylim(0, 20)
plt.grid(True)
plt.title("Controlled Graph")
plt.show()
# Adding Text
Title and Labels: Show how to set the title (plt.title()) and axis
labels (plt.xlabel() and plt.ylabel()).
Annotations: Use plt.annotate() to add annotations directly on specific data
points.
Legend: Demonstrate plt.legend() to label different series on a graph.
# Adding text to the plot
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], label="Data")
plt.title("Plot with Text and Annotations")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
# Draw an arrow from the text at (3, 12) to the data point at (4, 16)
plt.annotate('Highest Point', xy=(4, 16), xytext=(3, 12),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.legend()
plt.show()
Scatter Plot:
Use plt.scatter() for scatter plots, helpful for showing the relationship
between two variables and how the data points are distributed.
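A short scatter plot sketch. The data is randomly generated for illustration, with y made roughly proportional to x so that a relationship is visible; the seed and the parameters are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: y follows x with some random noise added.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 2, 50)

# plt.scatter draws one marker per (x, y) pair.
points = plt.scatter(x, y, color="teal", alpha=0.7)
plt.title("Scatter Plot of y against x")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```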
Histogram
To show the distribution of a continuous variable.
Helps in understanding the frequency distribution, range, and shape of data.
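import matplotlib.pyplot as plt
import numpy as np
# Sample data: three groups of 100 normally distributed scores
data = [np.random.normal(50, 5, 100), np.random.normal(55, 10, 100),
        np.random.normal(60, 15, 100)]
# Plot (note: this draws a box plot with one box per group, not a histogram)
# Newer Matplotlib releases (3.9+) prefer tick_labels over labels
plt.boxplot(data, patch_artist=True, labels=['Group 1', 'Group 2', 'Group 3'])
plt.title('Distribution of Test Scores by Group')
plt.ylabel('Test Scores')
plt.show()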
Pie chart
To show parts of a whole.
Used for representing data as proportions or percentages of a whole. Works best
when the segments are limited to a few categories.
import matplotlib.pyplot as plt
# Sample data
sizes = [25, 35, 20, 20]
labels = ['Category A', 'Category B', 'Category C', 'Category D']
# Plot: autopct prints each slice's percentage; startangle=90 rotates the start to the top
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.show()
A histogram is a type of bar chart used in statistics to show the frequency
distribution of a continuous variable by grouping data into ranges, or "bins."
Each bin represents a range of values, and the height of each bar in the
histogram reflects the number of observations (frequency) within that range.
Key Aspects of Histograms:
Purpose:
Histograms are used to understand the distribution, spread, and shape of a
dataset. They are particularly helpful for identifying the central tendency,
variability, skewness, and the presence of any outliers.
They provide a visual summary of data, making it easy to spot patterns such as
normal distribution, skewed distribution, or bimodal distribution.
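Histogram
import matplotlib.pyplot as plt
import numpy as np
# Sample data (assumed here, since none is defined): 1000 normally distributed values
data = np.random.randn(1000)
# Plot: bins=30 groups the values into 30 equal-width ranges
plt.hist(data, bins=30, color='purple', edgecolor='black')
plt.title('Distribution of Random Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()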