
COVID-19 Data Analysis with Pandas and NumPy

This tutorial uses the Novel Corona Virus 2019 Dataset (from Kaggle) as a case study to illustrate data
cleaning, transformation, and analysis of COVID-19 time-series data [1][2]. The main CSV
(covid_19_data.csv) contains daily cumulative counts (Confirmed, Deaths, Recovered) by country/region
from January 22, 2020 onward [1][2]. We will load this data into pandas, clean the columns and dates,
engineer useful time-series features, and perform analysis (top countries, trends, etc.) using pandas
and NumPy.

Dataset Loading & Cleaning


First, load the Kaggle data and inspect its columns. According to the data description, ObservationDate
is in MM/DD/YYYY format, and Confirmed, Deaths, and Recovered are cumulative counts for that
date [1][2]. We normalize column names, handle missing values (e.g. empty “Province/State”), and parse
dates.

• Load data: Read the CSV into a pandas DataFrame. (Assume covid_19_data.csv is downloaded
locally.)
• Rename columns: Remove spaces/slashes (e.g. rename Province/State → Province_State)
and standardize case.
• Handle missing: For example, fill empty provinces with NaN or drop them if not needed.
• Parse dates: Convert ObservationDate (and Last Update) to datetime objects. The Kaggle
data's ObservationDate strings follow a fixed MM/DD/YYYY pattern (e.g. 01/22/2020), so we pass an
explicit format [1].
• Standardize names: If needed, map variant country names to a standard form (e.g. “Iran (Islamic
Republic of)” → “Iran”) as part of cleaning [3].

import pandas as pd
import numpy as np

# Load and inspect the raw data
df = pd.read_csv('covid_19_data.csv')
print(df.columns.tolist())
print(df.head(2))

# Clean column names: remove slashes/spaces, convert to simple names
df.columns = [col.strip().replace('/', '_').replace(' ', '_') for col in df.columns]

# Handle missing Province/State (standardize empty strings to NaN)
df['Province_State'] = df['Province_State'].replace('', np.nan)

# Parse dates (ObservationDate is MM/DD/YYYY; Last_Update has mixed formats)
df['ObservationDate'] = pd.to_datetime(df['ObservationDate'], format='%m/%d/%Y', errors='coerce')
df['Last_Update'] = pd.to_datetime(df['Last_Update'], errors='coerce')

# Example of standardizing country names (as needed)
df['Country_Region'] = df['Country_Region'].replace({
"Iran (Islamic Republic of)": "Iran",
"Republic of Korea": "South Korea",
"Republic of Ireland": "Ireland",
"Holy See": "Vatican City",
"('St. Martin',)": "St. Martin",
"occupied Palestinian territory": "Palestine"
})

The cleaned DataFrame now has standard column names and date fields in datetime form. Missing
provinces are handled and country names normalized [1][3]. This prepares the data for aggregation and
analysis.
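
A quick sanity check confirms the cleaning took effect; this is a minimal sketch, and the column
names follow the renaming above:

# Verify the cleaning: datetime dtypes and missing-value counts
print(df.dtypes[['ObservationDate', 'Last_Update']])
print(df['Province_State'].isna().sum(), 'rows without a province')
print(df['Country_Region'].nunique(), 'distinct country/region names')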

Data Transformation
Next, we transform the cleaned data into useful time-series aggregates. Since the original data is
cumulative by country/date, we will group by country (and date) and then create new time-series features
such as daily new cases and rolling averages.

• Aggregate by country/date: Group the data by ObservationDate and Country_Region,
summing the cumulative counts. This yields one row per country per date with total Confirmed,
Deaths, Recovered [4].
• Time-series format: Optionally pivot or index by date and country to create a time-series DataFrame
for each country (see the sketch after the code below).
• New cases: Compute daily new cases by differencing the cumulative Confirmed counts per country
(e.g. df.groupby('Country')['Confirmed'].diff()). This converts cumulative totals into
daily increments.
• 7-day rolling average: Compute a 7-day rolling mean of the daily new cases to smooth out weekly
fluctuations [5].

# Aggregate cumulative counts by country and date
country_daily = df.groupby(['ObservationDate', 'Country_Region'],
                           as_index=False)[['Confirmed', 'Deaths', 'Recovered']].sum()
country_daily.rename(columns={'Country_Region': 'Country'}, inplace=True)

# Compute daily new cases per country (first day falls back to the cumulative count)
country_daily['NewConfirmed'] = (country_daily.groupby('Country')['Confirmed'].diff()
                                 .fillna(country_daily['Confirmed']))

# Compute 7-day rolling average of new cases
country_daily['7day_avg_new'] = country_daily.groupby('Country')['NewConfirmed'] \
    .transform(lambda x: x.rolling(7, min_periods=1).mean())
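
The optional pivot mentioned above turns this long format into a wide time series with one column
per country, which is convenient for plotting and cross-country comparison. A minimal sketch (the
column labels such as 'US' assume the names used in this dataset):

# Wide format: one row per date, one column per country (cumulative confirmed)
confirmed_wide = country_daily.pivot(index='ObservationDate',
                                     columns='Country',
                                     values='Confirmed')
print(confirmed_wide[['US', 'India', 'Brazil']].tail())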

After these transformations, each row has Confirmed, Deaths, Recovered (cumulative), as well as
NewConfirmed (daily cases) and 7day_avg_new (smoothed daily cases). We have essentially turned the
cumulative time series into a daily time series with additional features. The concept of using a 7-day
rolling average is common in COVID analysis to dampen reporting noise [5].

Data Analysis
With the data prepared, we can analyze key trends:

• Top countries by cases/deaths: Identify the countries with the largest totals. For example,
summing up to the latest date often shows that countries like the US, India, and Brazil had the
highest confirmed cases [6]. We can sort by Confirmed or Deaths to get the top 10.
• Continents/regions: If continent info is available or added (via a country-to-continent mapping),
group totals by continent. In general, Asia has the highest total cases (due to population) followed by
Europe, then the Americas [7].
• Peak dates and trends: We can examine the time series to find when each country's new cases
peaked. For example, many countries saw a first peak in early to mid-2020 and additional waves
later. Using our daily counts, we can identify the dates with maximum NewConfirmed for each
country or overall (this often corresponds to well-known waves; see the sketch after the code below).

# Example: Top 10 countries by total confirmed cases (as of last date in data)
latest_date = country_daily['ObservationDate'].max()
latest_data = country_daily[country_daily['ObservationDate'] == latest_date]
top_countries = latest_data.sort_values('Confirmed', ascending=False).head(10)
print(top_countries[['Country','Confirmed','Deaths','Recovered']])
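
As a minimal sketch of the peak-date idea, idxmax on the NewConfirmed column built earlier locates
each country's worst reported day:

# Date of the maximum daily new cases for each country
peak_idx = country_daily.groupby('Country')['NewConfirmed'].idxmax()
peaks = country_daily.loc[peak_idx, ['Country', 'ObservationDate', 'NewConfirmed']]
print(peaks.sort_values('NewConfirmed', ascending=False).head(10))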

The analysis confirms global trends: for example, the United States often leads in case count, followed by
India, Brazil, and others [6]. We can also compute total cases per continent to see that Asia and Europe
dominate the counts [7]. This high-level analysis helps identify which regions were hardest hit and when.
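
The dataset itself carries no continent column, so a country-to-continent mapping must be supplied.
The stub below is a hypothetical, partial mapping used only to show the groupby pattern; a real
analysis would use a complete lookup table or a library such as pycountry-convert:

# Hypothetical, partial country-to-continent mapping (illustrative stub only)
continent_map = {'US': 'North America', 'Brazil': 'South America',
                 'India': 'Asia', 'Mainland China': 'Asia',
                 'Italy': 'Europe', 'Russia': 'Europe'}
by_continent = (latest_data.assign(Continent=latest_data['Country'].map(continent_map))
                .groupby('Continent')['Confirmed'].sum()
                .sort_values(ascending=False))
print(by_continent)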

Visualization
We visualize the results using matplotlib or seaborn. Typical plots include:

• Line plots of daily new cases: For each of the top affected countries, plot the daily new cases
(possibly on the same figure for comparison). Overlay the 7-day rolling average to show smoothed
trends [5].
• Bar charts for comparisons: A bar chart of total confirmed cases or deaths per country (e.g. top 10)
highlights the scale of the epidemic by country.
• Heatmaps or pivot charts: We can create a pivot table (e.g. countries × dates) and use a heatmap to
show intensity of cases over time, or a choropleth world map if geographic data is available. These
visualizations make patterns and differences clearer.

import matplotlib.pyplot as plt

# Example: Line plot of daily new cases (7-day avg) for top 3 countries
top3 = top_countries['Country'].iloc[:3].tolist()
plt.figure(figsize=(8,5))
for country in top3:
    data = country_daily[country_daily['Country'] == country]
    plt.plot(data['ObservationDate'], data['7day_avg_new'], label=country)
plt.legend()
plt.title('Daily New COVID-19 Cases (7-day avg) for Top Countries')
plt.xlabel('Date'); plt.ylabel('New Cases (7-day avg)')
plt.tight_layout()
plt.show()

Similarly, a bar chart of total confirmed cases or deaths for the top countries highlights the scale of
the epidemic by country.

Figure: Heatmap of COVID-19 intensity worldwide (for illustration). In practice, we would generate our own charts
(bar plots, line plots) from the processed data.
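
A minimal sketch of such a bar chart, reusing the top_countries table from the analysis step:

# Bar chart: total confirmed cases for the top-10 countries
plt.figure(figsize=(8, 5))
plt.bar(top_countries['Country'], top_countries['Confirmed'])
plt.xticks(rotation=45, ha='right')
plt.title('Total Confirmed COVID-19 Cases: Top 10 Countries')
plt.ylabel('Cumulative confirmed cases')
plt.tight_layout()
plt.show()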

Export
Finally, we save the cleaned data and analysis results for future use:

• Save cleaned data: Export the cleaned daily country-level data to CSV (e.g.
country_daily.to_csv('covid_country_daily.csv', index=False)). This preserves our
aggregation and feature columns.
• Save summaries: We can also save summary tables (e.g. top 10 countries by cases) to CSV.
• Notebook export (optional): To share the analysis, one could convert the Jupyter notebook to PDF
(e.g. using nbconvert) as a report.

# Example: Save cleaned country-day data to CSV
country_daily.to_csv('covid_country_daily.csv', index=False)
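
The summary tables and notebook export mentioned above work the same way; a minimal sketch, with
placeholder filenames:

# Save the top-10 summary table alongside the cleaned data
top_countries.to_csv('covid_top10_countries.csv', index=False)

# Notebook-to-PDF export runs from the shell, e.g.:
#   jupyter nbconvert --to pdf covid_analysis.ipynb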

This completes the end-to-end tutorial: we loaded the Kaggle COVID-19 dataset, cleaned and transformed it
using pandas/NumPy, performed country- and continent-level analysis (with citations for context [1][6]),
and visualized the key trends (smoothing with a 7-day average [5]). The final cleaned data and results are
saved for easy sharing and further modeling.

Sources: The dataset description is available on Kaggle [1][2], and similar analyses have been
documented in public resources [3][5][6][7], which align with the methods shown here.

[1][2] Data Source Recommendation: Novel Coronavirus 2019 Dataset – Research
https://research.binus.ac.id/2020/03/data-source-recommendation-novel-coronavirus-2019-dataset/

[3][4] Analyzing Novel Corona Virus COVID-19 Dataset » Loren on the Art of MATLAB - MATLAB & Simulink
https://blogs.mathworks.com/loren/2020/03/16/analyzing-novel-corona-virus-covid-19-dataset/

[5] Coronavirus (COVID-19) Cases - Our World in Data
https://ourworldindata.org/covid-cases

[6][7] Global Pandemic in Numbers: A COVID-19 Exploratory Data Analysis | by Vansh Mahindra | Apr 2025 | Medium
https://medium.com/@vanshmahindra/global-pandemic-in-numbers-a-covid-19-exploratory-data-analysis-9198867a6225
