Unit 3_Numpy_VP
Before using NumPy, make sure it is installed. You can install it using pip:
pip install numpy
Then import it at the top of your script:
import numpy as np
By convention, NumPy is imported as np for brevity.
print(new_arr)
Output: [3 4 5 6 7]
VISHNU PRIYA P M 3
Creating NumPy Arrays
You can create NumPy arrays using various methods:
range_arr = np.arange(0, 10, 2)  # Creates an array with values [0, 2, 4, 6, 8]
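A few other common ways to create arrays are sketched below. The variable names and values are illustrative and not taken from the slides.

```python
import numpy as np

list_arr = np.array([1, 2, 3, 4, 5])   # from a Python list
zeros_arr = np.zeros((2, 3))           # 2x3 array filled with 0.0
ones_arr = np.ones(4)                  # 1D array of four 1.0 values
lin_arr = np.linspace(0, 1, 5)         # 5 evenly spaced values: 0, 0.25, 0.5, 0.75, 1

print(list_arr)         # [1 2 3 4 5]
print(zeros_arr.shape)  # (2, 3)
```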
BASIC ARRAY OPERATIONS
Once you have NumPy arrays, you can perform various operations on them:
1. Element-wise Operations:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b # Element-wise addition: [5, 7, 9]
d = a * b # Element-wise multiplication: [4, 10, 18]
2. Indexing and Slicing:
Slicing: Slicing allows you to access a range or subset of elements from
an array. It is done using the syntax arr[start:end], where start is the
index where the slice begins (inclusive), and end is where it stops
(exclusive).
Slicing with Steps: You can also specify a step value, which tells how
many elements to skip in the slice. The syntax is arr[start:end:step].
Example:
index 5.
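The indexing and slicing rules described above can be sketched as follows. The array contents are an assumption chosen for illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60, 70])

first = arr[0]        # indexing: 10
last = arr[-1]        # a negative index counts from the end: 70
middle = arr[1:4]     # index 1 up to (not including) index 4: [20 30 40]
stepped = arr[0:6:2]  # every second element from index 0 up to index 5: [10 30 50]
```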
3. Array Shape and Reshaping:
The shape of an array tells us how many elements it contains along each
dimension (or axis). You can check the shape of an array using
the .shape attribute.
Reshaping:
Reshaping allows you to change the shape of an array without changing
its data. You can convert a 1D array to a 2D array, or a 2D array to a 3D
array, etc., as long as the total number of elements stays the same.
Example:
# Creating a 1D array with 6 elements
arr = np.array([1, 2, 3, 4, 5, 6])
# Reshaping it into 2 rows x 3 columns (one valid choice; 3 x 2 would also work)
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
Reshape Rules:
When reshaping an array, the new shape must contain the same total number of
elements as the original array. For example, if you have an array with 12 elements,
you could reshape it to:
A 2x6 array (2 rows x 6 columns)
A 3x4 array (3 rows x 4 columns)
A 4x3 array (4 rows x 3 columns)
Example
arr = np.arange(1, 13)  # 1D array with 12 elements: 1 to 12
reshaped_arr = arr.reshape(3, 4)
print(reshaped_arr)
Flattening an Array: If you want to convert a multi-dimensional array back into
a 1D array, you can flatten it using the .flatten() method.
Example
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
flat_arr = arr_2d.flatten()
print(flat_arr)
O/P
[1 2 3 4 5 6]
AGGREGATION FUNCTIONS
Aggregation functions are used to perform calculations on an entire array or along a specific
axis (e.g., summing all elements, finding the maximum, etc.). These functions are essential
for data analysis and numerical computations.
Common Aggregation Functions: Here are some of the most commonly used aggregation
functions in NumPy:
1. Sum: The sum() function adds all the elements of an array.
2. Mean: The mean() function calculates the average of the elements.
3. Maximum and Minimum: max() gives the maximum value in the array. min() gives the
minimum value in the array.
4. Product: The prod() function returns the product of all elements in the array (i.e.,
multiplies all elements together).
5. Standard Deviation and Variance: std() calculates the standard deviation (how spread out
the numbers are). var() calculates the variance (the square of the standard deviation).
6. Cumulative Sum and Product: cumsum() gives the cumulative sum (the sum of the
elements up to each index). cumprod() gives the cumulative product (the product of
elements up to each index).
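The functions listed above can be tried on a small array. This is a minimal sketch; the input values are assumed for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr.sum())             # 10
print(arr.mean())            # 2.5
print(arr.max(), arr.min())  # 4 1
print(arr.prod())            # 24
print(arr.var())             # 1.25
print(arr.std())             # square root of 1.25, about 1.118
print(arr.cumsum())          # running totals: 1, 3, 6, 10
print(arr.cumprod())         # running products: 1, 2, 6, 24

# Aggregating along an axis of a 2D array
matrix = np.array([[1, 2], [3, 4]])
col_sums = matrix.sum(axis=0)  # sum down each column: [4 6]
row_sums = matrix.sum(axis=1)  # sum across each row: [3 7]
```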
NumPy allows you to perform operations on entire arrays, making code more concise and
efficient. Here's how you can achieve the same result using NumPy:
import numpy as np
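As a minimal sketch, assuming the task is to square every number in a list, the loop-based and NumPy versions could look like this. The task and variable names are assumptions made for illustration.

```python
import numpy as np

numbers = [1, 2, 3, 4, 5]

# Plain Python: loop over the list and square each value
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

# NumPy: one expression operates on the whole array
arr = np.array(numbers)
squares_np = arr ** 2

print(squares_loop)  # [1, 4, 9, 16, 25]
print(squares_np)    # [ 1  4  9 16 25]
```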
INTRODUCTION TO PANDAS DATA STRUCTURES
Pandas is a popular Python library for data manipulation and analysis. It provides two
primary data structures: the DataFrame and the Series. These data structures are
designed to handle structured data, making it easier to work with datasets in a tabular
format.
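DataFrame:
A DataFrame is a two-dimensional, labeled data structure with rows and columns, similar
to a spreadsheet or SQL table. Different columns can hold different data types, and each
column of a DataFrame is a Series.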
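Here's a basic example of how to create a DataFrame using
Pandas:
import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [24, 27]}  # illustrative data: one key per column
df = pd.DataFrame(data)
print(df)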
Series:
A Series is a one-dimensional labeled array that can hold data of any data type.
It is like a column in a DataFrame or a single variable in statistics.
Series objects are commonly used for time series data, as well as other one-dimensional
data.
Key characteristics of a Pandas Series:
Homogeneous Data: Unlike a Python list, and like a NumPy array, a Pandas Series has a single
data type (dtype) shared by all of its values. For example, if you create a Series from
integer values, its dtype is int64. If you mix types, Pandas falls back to a common dtype
such as float64 or object.
Labeled Data: Series have two parts: the data itself and an associated index. The index
provides labels or names for each data point in the Series. By default, Series have a numeric
index starting from 0, but you can specify custom labels if needed.
Size and Shape: A Series has a size (the number of elements) and shape (1-dimensional) but
does not have columns or rows like a DataFrame.
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
# Display the Series
print(series)
Output:
0    10
1    20
2    30
3    40
4    50
dtype: int64
Some common tasks you can perform with Pandas:
Data Loading: Pandas can read data from various sources, including CSV files, Excel
spreadsheets, SQL databases, and more.
Data Cleaning: You can clean and preprocess data by handling missing values, removing
duplicates, and transforming data types.
Data Selection: Easily select specific rows and columns of interest using various indexing
techniques.
Data Aggregation: Perform groupby operations, calculate statistics, and aggregate data
based on specific criteria.
Data Visualization: You can use Pandas in conjunction with visualization libraries like
Matplotlib and Seaborn to create informative plots and charts.
DataFrame
Here's how you can work with DataFrames in Python using Pandas:
1. Import Pandas:
First, you need to import the Pandas library.
import pandas as pd
2. Creating a DataFrame:
You can create a DataFrame in several ways. Here
are a few common methods:
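From a dictionary:
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})  # illustrative data
From a CSV file:
df = pd.read_csv('file.csv')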
3. Viewing Data:
You can use various methods to view and explore your DataFrame:
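A minimal sketch of the usual inspection calls follows. The DataFrame contents are assumed for illustration.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'Score': [85, 90, 88, 95],
})

print(df.head(2))            # first 2 rows (the default is 5)
print(df.tail(2))            # last 2 rows
print(df.shape)              # (4, 3): 4 rows, 3 columns
print(df.columns.tolist())   # ['Name', 'Age', 'Score']
df.info()                    # prints column names, non-null counts, and dtypes
```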
6. Data Analysis:
Pandas provides various functions for data
analysis, such as describe(), groupby(), agg(), and
more.
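The three functions named above can be sketched on a small table. The column names and values are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Score': [85, 90, 88, 95],
})

summary = df.describe()                           # count, mean, std, min, quartiles, max
team_means = df.groupby('Team')['Score'].mean()   # A: 87.5, B: 91.5
team_stats = df.groupby('Team')['Score'].agg(['min', 'max'])  # several statistics per group

print(summary)
print(team_means)
print(team_stats)
```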
7. Saving Data:
You can save the DataFrame to a CSV file or other formats:
df.to_csv('output.csv', index=False)
INDEX OBJECTS-INDEXING, SELECTION, AND FILTERING
• Label-based indexing:
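As a sketch, label-based access uses .loc, while the position-based counterpart is .iloc. The index labels and column names below are assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [24, 27, 22], 'Score': [85, 90, 88]},
    index=['Alice', 'Bob', 'Charlie'],
)

bob_row = df.loc['Bob']          # label-based: the row labeled 'Bob'
bob_age = df.loc['Bob', 'Age']   # label-based: a single value, 27
first_row = df.iloc[0]           # position-based: the first row (Alice)
ages = df['Age']                 # selecting a column returns a Series
```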
2. Selection:
You can use various methods to select specific data based on conditions or criteria.
df[(df['Column1'] > 5) & (df['Column2'] < 10)] # Rows where 'Column1' > 5 and 'Column2' < 10
3. Filtering:
Filtering allows you to create a boolean mask based on a
condition and then apply that mask to your DataFrame to
select rows meeting the condition.
filtered_df = df[condition]
A boolean mask is like a checklist that goes through each row in your DataFrame and
marks whether it meets the condition (True) or not (False).
Boolean Mask Example:
Name     Age  Score  Meets Condition? (Age > 25)
Alice    24   85     False
Bob      27   90     True
Charlie  22   88     False
David    32   95     True
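The table above can be reproduced in code. This sketch builds the same four rows and applies the Age > 25 mask.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'Score': [85, 90, 88, 95],
})

condition = df['Age'] > 25    # boolean mask: False, True, False, True
filtered_df = df[condition]   # keeps only the rows for Bob and David
print(filtered_df)
```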
4. Setting the Index:
You can set a column as the index using the .set_index() method.
df.set_index('Column_Name', inplace=True)
5. Resetting the Index:
If you've set a column as the index and want to revert to the default integer-based index, you
can use the .reset_index() method.
df.reset_index(inplace=True)
6. Multi-level Indexing:
You can create DataFrames with multi-level indexes, allowing you to work with more complex
hierarchical data structures.
Index objects in Pandas are versatile and powerful for working with data because
they enable you to access and manipulate your data in various ways, whether it's for
data retrieval, filtering, or restructuring.
ARITHMETIC AND DATA ALIGNMENT IN PANDAS
Arithmetic and data alignment in Pandas refer to how mathematical operations are performed
between Series and DataFrames when they have different shapes or indices. Pandas
automatically aligns data based on the labels of the objects involved in the operation, which
ensures that the result of the operation maintains data integrity and is aligned correctly. Here are
some key aspects of arithmetic and data alignment in Pandas:
1. Automatic Alignment:
When you perform mathematical operations (e.g., addition, subtraction, multiplication, division)
between two Series or DataFrames, Pandas aligns the data based on their labels (index or column
names). It aligns the data based on common labels and performs the operation only on matching
labels.
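A minimal sketch of label alignment between two Series follows. The labels and values are assumed for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = s1 + s2
# 'b' and 'c' exist in both Series, so they are added: b -> 12.0, c -> 23.0
# 'a' and 'd' exist in only one Series, so the result is NaN there

filled = s1.add(s2, fill_value=0)  # treat a missing label as 0 instead
# a -> 1.0, b -> 12.0, c -> 23.0, d -> 30.0
```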
3. DataFrame Alignment:
The same principles apply to DataFrames when performing operations between them. The
alignment occurs both for rows (based on the index) and columns (based on column names).
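The same behavior for DataFrames can be sketched as below, with alignment on both row and column labels. The frames are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=['y', 'z'])

result = df1 + df2
# Only row 'y' and column 'B' exist in both frames: 4 + 10 = 14.0
# Every other cell in the union of labels is NaN
print(result)
```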
Automatic alignment in Pandas is a powerful feature that simplifies data manipulation and
allows you to work with datasets of different shapes without needing to manually align them. It
ensures that operations are performed in a way that maintains the integrity and structure of
your data.
ARITHMETIC AND DATA ALIGNMENT IN NUMPY
NumPy, like Pandas, performs arithmetic and data alignment when working with arrays.
However, unlike Pandas, NumPy is primarily focused on numerical computations with
homogeneous arrays (arrays of the same data type). Here's how arithmetic and data alignment
work in NumPy:
Automatic Alignment:
NumPy arrays perform element-wise operations. NumPy has no labels to align on, so
alignment is based purely on the shapes of the arrays. If you perform an operation between
two NumPy arrays with different but compatible shapes, NumPy broadcasts the smaller array
across the larger one so that the operation can still be applied element-wise.
import numpy as np
Broadcasting Rules:
1. If the arrays have a different number of dimensions, pad the smaller shape with ones on
the left side.
2. Compare the shapes element-wise, starting from the right. Two dimensions are compatible
if they are equal or one of them is 1.
3. If the dimensions are incompatible, NumPy raises a "ValueError: operands could not be
broadcast together" error.
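The broadcasting rules above can be sketched with one compatible and one incompatible pair of shapes. The arrays are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,), treated as (1, 3)

result = matrix + row            # row is broadcast across both rows of matrix
# [[11 22 33]
#  [14 25 36]]

bad = np.array([1, 2])           # shape (2,) is not compatible with (2, 3)
try:
    matrix + bad
except ValueError as err:
    print(err)                   # message starts with "operands could not be broadcast together"
```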
Element-Wise Operations:
NumPy performs arithmetic operations element-wise by default. This means that each element in
the resulting array is the result of applying the operation to the corresponding elements in the
input arrays.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
result = arr1 + arr2  # Element-wise addition: [5, 7, 9]
APPLYING FUNCTIONS AND MAPPING
In NumPy, you can apply functions and perform element-wise operations on arrays using various
techniques, including built-in vectorized functions, np.apply_along_axis(), and np.vectorize(),
which is also convenient for mapping a custom Python function over an array. Here's an overview
of these approaches:
Vectorized Functions:
NumPy is designed to work efficiently with vectorized operations, meaning you can apply
functions to entire arrays or elements of arrays without the need for explicit loops. NumPy
provides built-in functions that can be applied element-wise to arrays.
import numpy as np
arr = np.array([1, 4, 9, 16])  # illustrative input
# Applying a function element-wise
result = np.sqrt(arr)  # [1. 2. 3. 4.]
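np.apply_along_axis() applies a function to each 1D slice of an array along a chosen axis. The sketch below uses a hypothetical helper, value_range, defined only for this illustration.

```python
import numpy as np

matrix = np.array([[1, 5, 3],
                   [8, 2, 6]])

def value_range(values):
    """Hypothetical helper: difference between the largest and smallest value."""
    return values.max() - values.min()

row_ranges = np.apply_along_axis(value_range, 1, matrix)  # one result per row: [4 6]
col_ranges = np.apply_along_axis(value_range, 0, matrix)  # one result per column: [7 3 3]
```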
np.vectorize():
The np.vectorize() function allows you to create a vectorized version of a Python function, which
can then be applied element-wise to NumPy arrays.
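import numpy as np
# Illustrative scalar function (not from the original): replace negative values with zero
clip_negative = np.vectorize(lambda x: x if x > 0 else 0)
print(clip_negative(np.array([-2, 3, -1, 5])))  # [0 3 0 5]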
SORTING AND RANKING
Sorting and ranking are common data manipulation operations in data analysis and are widely
supported in Python through libraries like NumPy and Pandas. These operations help organize
data in a desired order or rank elements based on specific criteria. Here's how to perform
sorting and ranking in both libraries:
Sorting in NumPy:
In NumPy, you can sort NumPy arrays using the np.sort() and np.argsort() functions.
np.sort(): This function returns a new sorted array without modifying the original array.
import numpy as np
arr = np.array([3, 1, 2])  # illustrative input
sorted_arr = np.sort(arr)  # [1 2 3]; arr itself is unchanged
np.argsort(): This function returns the indices that would sort the array. You can use these
indices to sort the original array.
import numpy as np
arr = np.array([3, 1, 2])  # illustrative input
indices = np.argsort(arr)  # [1 2 0]
sorted_arr = arr[indices]  # [1 2 3]
Sorting in Pandas:
In Pandas, you can sort Series and DataFrames using the sort_values() method. You can specify
the column(s) to sort by and the sorting order.
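import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}  # illustrative data
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Age', ascending=False)  # sort by 'Age', highest first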
Ranking in NumPy:
NumPy doesn't have a built-in ranking function, but you can apply np.argsort() twice to get the
rank of each element. Note that this simple approach does not give tied values the same rank.
import numpy as np
arr = np.array([30, 10, 20])  # illustrative input
indices = np.argsort(arr)
ranked_arr = np.argsort(indices) + 1 # Add 1 to start ranking from 1 instead of 0: [3 1 2]
Ranking in Pandas:
In Pandas, you can rank data using the rank() method. You can specify the sorting order and
how to handle ties (e.g., assigning the average rank to tied values).
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [24, 27, 22, 27]}  # illustrative data
df = pd.DataFrame(data)
# Rank by 'Age' column in descending order and assign average rank to tied values
df['Rank'] = df['Age'].rank(ascending=False, method='average')
SUMMARIZING AND COMPUTING DESCRIPTIVE STATISTICS
1. Summary Statistics:
NumPy provides functions to compute summary statistics directly on arrays.
import numpy as np
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative data
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)
2. Percentiles and Quartiles:
You can compute specific percentiles and quartiles using the np.percentile() function.
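A minimal sketch of computing quartiles with np.percentile() follows. The data values are assumed for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = np.percentile(data, 25)   # first quartile: 3.0
q2 = np.percentile(data, 50)   # median: 5.0
q3 = np.percentile(data, 75)   # third quartile: 7.0
iqr = q3 - q1                  # interquartile range: 4.0

quartiles = np.percentile(data, [25, 50, 75])  # several percentiles in one call
```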
CORRELATION AND COVARIANCE
In NumPy, you can compute correlation and covariance between arrays using the np.corrcoef()
and np.cov() functions, respectively. These functions are useful for analyzing relationships and
dependencies between variables. Here's how to use them:
import numpy as np
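As a sketch of np.corrcoef(), the example below uses two assumed arrays where y is exactly twice x, so the correlation coefficient should be 1.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])   # y = 2 * x, a perfect positive linear relationship

corr_matrix = np.corrcoef(x, y)  # 2x2 matrix; each diagonal entry is a variable's correlation with itself
corr_xy = corr_matrix[0, 1]      # correlation between x and y, 1.0 here (up to rounding)
print(corr_matrix)
```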
Computing Covariance:
Covariance measures the degree to which two variables change together. Positive values
indicate a positive relationship (both variables increase or decrease together), while negative
values indicate an inverse relationship (one variable increases as the other decreases).
import numpy as np
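A minimal sketch of np.cov() on two assumed arrays follows. By default np.cov() returns the sample covariance, dividing by n - 1.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

cov_matrix = np.cov(x, y)   # 2x2 covariance matrix
var_x = cov_matrix[0, 0]    # variance of x: 2.5
cov_xy = cov_matrix[0, 1]   # covariance of x and y: 5.0
print(cov_matrix)
```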
Both np.corrcoef() and np.cov() can accept multiple arrays as input, allowing you to compute
correlations and covariances for multiple variables simultaneously. For example, if you have a
dataset with multiple columns, you can compute the correlation matrix or covariance matrix for
all pairs of variables.
HANDLING MISSING DATA
Handling missing data in NumPy is an important aspect of data analysis and manipulation.
NumPy provides several ways to work with missing or undefined values, typically represented
as NaN (Not-a-Number). Here are some common techniques for handling missing data in
NumPy:
Using np.nan: NumPy represents missing data using np.nan. You can create arrays with missing
values like this:
import numpy as np
arr = np.array([1.0, 2.0, np.nan, 4.0])  # np.nan marks the missing value
Checking for Missing Data: You can check for missing values using the np.isnan() function,
which returns True where a value is NaN. For example:
arr[~np.isnan(arr)] # Returns an array without NaN values.
Replacing Missing Data: You can replace missing values with a specific value using
np.nan_to_num(), or with a statistic such as the mean computed by np.nanmean(). For example:
mean = np.nanmean(arr)
arr[np.isnan(arr)] = mean
Ignoring Missing Data: Sometimes, you may want to perform operations while ignoring missing
values. You can use functions like np.nanmax(), np.nanmin(), np.nansum(), etc., which ignore
NaN values when computing the result.
Interpolation: If you have a time series or ordered data, you can use interpolation methods to
fill missing values. NumPy provides functions like np.interp() for this purpose.
Masked Arrays: NumPy also supports masked arrays (numpy.ma) that allow you to work with
missing data more explicitly by creating a mask that specifies which values are missing. This
can be useful for certain computations.
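The three techniques above can be sketched together. The array and the choice of linear interpolation over integer positions are assumptions made for illustration.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# NaN-aware aggregations skip the missing entries
total = np.nansum(arr)     # 9.0
largest = np.nanmax(arr)   # 5.0

# Linear interpolation: estimate the missing points from the known ones
positions = np.arange(len(arr))
known = ~np.isnan(arr)
filled = np.interp(positions, positions[known], arr[known])  # 1, 2, 3, 4, 5

# Masked array: mark invalid entries instead of removing them
masked = np.ma.masked_invalid(arr)
masked_mean = masked.mean()  # 3.0, computed over the unmasked values only
```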
HIERARCHICAL INDEXING
Hierarchical indexing, often referred to as "MultiIndexing", is a Pandas feature (NumPy has no
equivalent class) that lets an axis of a Series or DataFrame carry multiple levels of labels.
This is particularly useful when you want to represent higher-dimensional data with more
complex hierarchical structures in a one- or two-dimensional object.
You can create a MultiIndex in Pandas using the pd.MultiIndex class. Here's a basic example:
import pandas as pd
You can access data from this DataFrame using hierarchical indexing. For example:
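A minimal sketch of creating and querying a hierarchical index with the pandas pd.MultiIndex class follows. The level names, labels, and values are made up for illustration.

```python
import pandas as pd

# Illustrative labels and values
index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
    names=['group', 'number'],
)
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

group_a = df.loc['A']                # all rows in group 'A', indexed by 'number'
single = df.loc[('B', 2), 'value']   # one cell: group 'B', number 2 -> 40
print(df)
```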
Some common operations with hierarchical indexing include:
Slicing: You can perform slices at each level of the index, allowing you to select specific subsets
of the data.
Stacking and Unstacking: You can stack or unstack levels to convert between a wide and long
format, which can be useful for different types of analyses.
Swapping Levels: You can swap levels to change the order of the levels in the index.
Grouping and Aggregating: You can group data based on levels of the index and perform
aggregation functions like mean, sum, etc.
Reordering Levels: You can change the order of levels in the index.
Resetting Index: You can reset the index to move the hierarchical index levels back to columns.
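Several of the operations listed above can be sketched on a small pandas Series with a two-level index. The level names and values are assumed for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['group', 'number'])
s = pd.Series([10, 20, 30, 40], index=index)

group_a = s.loc['A']                          # slicing on the outer level
wide = s.unstack()                            # 'number' level becomes the columns
long_again = wide.stack()                     # back to a Series with two index levels
swapped = s.swaplevel('group', 'number')      # index order is now (number, group)
group_sums = s.groupby(level='group').sum()   # A: 30, B: 70
flat = s.reset_index(name='value')            # index levels become ordinary columns
```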
Hierarchical indexing is especially valuable when dealing with multi-dimensional
data, such as panel data or data with multiple categorical variables. It allows for
more expressive data organization and manipulation. The pd.MultiIndex class from
the pandas library provides the functionality for working with hierarchical data
structures, including various methods for creating and manipulating MultiIndex
objects.