
COVID-19 Data Analysis with Pandas and NumPy

This tutorial uses the Novel Corona Virus 2019 Dataset (from Kaggle) as a case study to illustrate data
cleaning, transformation, and analysis of COVID-19 time-series data [1][2]. The main CSV
(covid_19_data.csv) contains daily cumulative counts (Confirmed, Deaths, Recovered) by country/region
from January 22, 2020 onward [1][2]. We will load this data into pandas, clean the columns and dates,
engineer useful time-series features, and perform analysis (top countries, trends, etc.) using pandas
and NumPy.

Dataset Loading & Cleaning


First, load the Kaggle data and inspect its columns. According to the data description, ObservationDate
is in MM/DD/YYYY format, and Confirmed, Deaths, and Recovered are cumulative counts for that
date [1][2]. We normalize column names, handle missing values (e.g. empty “Province/State”), and parse
dates.

• Load data: Read the CSV into a pandas DataFrame. (Assume covid_19_data.csv is downloaded
locally.)
• Rename columns: Remove spaces/slashes (e.g. rename Province/State → Province_State)
and standardize case.
• Handle missing: For example, fill empty provinces with NaN or drop them if not needed.
• Parse dates: Convert ObservationDate (and Last Update) to datetime objects. The Kaggle
data's ObservationDate strings follow a fixed MM/DD/YYYY pattern (e.g. 01/22/2020), so we pass an
explicit format [1].
• Standardize names: If needed, map variant country names to a standard form (e.g. “Iran (Islamic
Republic of)” → “Iran”) as part of cleaning [3].

import pandas as pd
import numpy as np

# Load and inspect the raw data
df = pd.read_csv('covid_19_data.csv')
print(df.columns.tolist())
print(df.head(2))

# Clean column names: remove slashes/spaces, convert to simple names
df.columns = [col.strip().replace('/', '_').replace(' ', '_') for col in df.columns]

# Handle missing Province/State (standardize empty strings to NaN)
df['Province_State'] = df['Province_State'].replace('', np.nan)

# Parse dates (ObservationDate is MM/DD/YYYY; Last_Update has mixed formats)
df['ObservationDate'] = pd.to_datetime(df['ObservationDate'], format='%m/%d/%Y', errors='coerce')
df['Last_Update'] = pd.to_datetime(df['Last_Update'], errors='coerce')

# Example of standardizing country names (as needed)
df['Country_Region'] = df['Country_Region'].replace({
"Iran (Islamic Republic of)": "Iran",
"Republic of Korea": "South Korea",
"Republic of Ireland": "Ireland",
"Holy See": "Vatican City",
"('St. Martin',)": "St. Martin",
"occupied Palestinian territory": "Palestine"
})

The cleaned DataFrame now has standard column names and date fields in datetime form. Missing
provinces are handled and country names normalized [1][3]. This prepares the data for aggregation and
analysis.
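
A quick sanity check confirms the cleaning took effect; this is a minimal sketch, and the column
names follow the renaming above:

# Verify the cleaning: datetime dtypes and missing-value counts
print(df.dtypes[['ObservationDate', 'Last_Update']])
print(df['Province_State'].isna().sum(), 'rows without a province')
print(df['Country_Region'].nunique(), 'distinct country/region names')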

Data Transformation
Next, we transform the cleaned data into useful time-series aggregates. Since the original data is
cumulative by country/date, we will group by country (and date) and then create new time-series features
such as daily new cases and rolling averages.

• Aggregate by country/date: Group the data by ObservationDate and Country_Region,
summing the cumulative counts. This yields one row per country per date with total Confirmed,
Deaths, Recovered [4].
• Time-series format: Optionally pivot or index by date and country to create a time-series DataFrame
for each country (see the sketch after the code below).
• New cases: Compute daily new cases by differencing the cumulative Confirmed counts per country
(e.g. df.groupby('Country')['Confirmed'].diff()). This converts cumulative totals into
daily increments.
• 7-day rolling average: Compute a 7-day rolling mean of the daily new cases to smooth out weekly
fluctuations [5].

# Aggregate cumulative counts by country and date
country_daily = df.groupby(['ObservationDate', 'Country_Region'],
                           as_index=False)[['Confirmed', 'Deaths', 'Recovered']].sum()
country_daily.rename(columns={'Country_Region': 'Country'}, inplace=True)

# Compute daily new cases per country (first day falls back to the cumulative count)
country_daily['NewConfirmed'] = (country_daily.groupby('Country')['Confirmed'].diff()
                                 .fillna(country_daily['Confirmed']))

# Compute 7-day rolling average of new cases
country_daily['7day_avg_new'] = country_daily.groupby('Country')['NewConfirmed'] \
    .transform(lambda x: x.rolling(7, min_periods=1).mean())
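
The optional pivot mentioned above turns this long format into a wide time series with one column
per country, which is convenient for plotting and cross-country comparison. A minimal sketch (the
column labels such as 'US' assume the names used in this dataset):

# Wide format: one row per date, one column per country (cumulative confirmed)
confirmed_wide = country_daily.pivot(index='ObservationDate',
                                     columns='Country',
                                     values='Confirmed')
print(confirmed_wide[['US', 'India', 'Brazil']].tail())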

After these transformations, each row has Confirmed, Deaths, Recovered (cumulative), as well as
NewConfirmed (daily cases) and 7day_avg_new (smoothed daily cases). We have essentially turned the
cumulative time series into a daily time series with additional features. The concept of using a 7-day
rolling average is common in COVID analysis to dampen reporting noise [5].

Data Analysis
With the data prepared, we can analyze key trends:

• Top countries by cases/deaths: Identify the countries with the largest totals. For example,
summing up to the latest date often shows that countries like the US, India, and Brazil had the
highest confirmed cases [6]. We can sort by Confirmed or Deaths to get the top 10.
• Continents/regions: If continent info is available or added (via a country-to-continent mapping),
group totals by continent. In general, Asia has the highest total cases (due to population) followed by
Europe, then the Americas [7].
• Peak dates and trends: We can examine the time series to find when each country's new cases
peaked. For example, many countries saw a first peak in early to mid-2020 and additional waves
later. Using our daily counts, we can identify the dates with maximum NewConfirmed for each
country or overall (this often corresponds to well-known waves; see the sketch after the code below).

# Example: Top 10 countries by total confirmed cases (as of last date in data)
latest_date = country_daily['ObservationDate'].max()
latest_data = country_daily[country_daily['ObservationDate'] == latest_date]
top_countries = latest_data.sort_values('Confirmed', ascending=False).head(10)
print(top_countries[['Country','Confirmed','Deaths','Recovered']])
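
As a minimal sketch of the peak-date idea, idxmax on the NewConfirmed column built earlier locates
each country's worst reported day:

# Date of the maximum daily new cases for each country
peak_idx = country_daily.groupby('Country')['NewConfirmed'].idxmax()
peaks = country_daily.loc[peak_idx, ['Country', 'ObservationDate', 'NewConfirmed']]
print(peaks.sort_values('NewConfirmed', ascending=False).head(10))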

The analysis confirms global trends: for example, the United States often leads in case count, followed by
India, Brazil, and others [6]. We can also compute total cases per continent to see that Asia and Europe
dominate the counts [7]. This high-level analysis helps identify which regions were hardest hit and when.
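
The dataset itself carries no continent column, so a country-to-continent mapping must be supplied.
The stub below is a hypothetical, partial mapping used only to show the groupby pattern; a real
analysis would use a complete lookup table or a library such as pycountry-convert:

# Hypothetical, partial country-to-continent mapping (illustrative stub only)
continent_map = {'US': 'North America', 'Brazil': 'South America',
                 'India': 'Asia', 'Mainland China': 'Asia',
                 'Italy': 'Europe', 'Russia': 'Europe'}
by_continent = (latest_data.assign(Continent=latest_data['Country'].map(continent_map))
                .groupby('Continent')['Confirmed'].sum()
                .sort_values(ascending=False))
print(by_continent)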

Visualization
We visualize the results using matplotlib or seaborn. Typical plots include:

• Line plots of daily new cases: For each of the top affected countries, plot the daily new cases
(possibly on the same figure for comparison). Overlay the 7-day rolling average to show smoothed
trends [5].
• Bar charts for comparisons: A bar chart of total confirmed cases or deaths per country (e.g. top 10)
highlights the scale of the epidemic by country.
• Heatmaps or pivot charts: We can create a pivot table (e.g. countries × dates) and use a heatmap to
show intensity of cases over time, or a choropleth world map if geographic data is available. These
visualizations make patterns and differences clearer.

import matplotlib.pyplot as plt

# Example: Line plot of daily new cases (7-day avg) for top 3 countries
top3 = top_countries['Country'].iloc[:3].tolist()
plt.figure(figsize=(8,5))
for country in top3:
    data = country_daily[country_daily['Country'] == country]
    plt.plot(data['ObservationDate'], data['7day_avg_new'], label=country)
plt.legend()
plt.title('Daily New COVID-19 Cases (7-day avg) for Top Countries')
plt.xlabel('Date'); plt.ylabel('New Cases (7-day avg)')
plt.tight_layout()
plt.show()

Similarly, a bar chart of total confirmed cases or deaths for the top countries highlights the scale of
the epidemic by country.

Figure: Heatmap of COVID-19 intensity worldwide (for illustration). In practice, we would generate our own charts
(bar plots, line plots) from the processed data.
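
A minimal sketch of such a bar chart, reusing the top_countries table from the analysis step:

# Bar chart: total confirmed cases for the top-10 countries
plt.figure(figsize=(8, 5))
plt.bar(top_countries['Country'], top_countries['Confirmed'])
plt.xticks(rotation=45, ha='right')
plt.title('Total Confirmed COVID-19 Cases: Top 10 Countries')
plt.ylabel('Cumulative confirmed cases')
plt.tight_layout()
plt.show()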

Export
Finally, we save the cleaned data and analysis results for future use:

• Save cleaned data: Export the cleaned daily country-level data to CSV (e.g.
country_daily.to_csv('covid_country_daily.csv', index=False)). This preserves our
aggregation and feature columns.
• Save summaries: We can also save summary tables (e.g. top 10 countries by cases) to CSV.
• Notebook export (optional): To share the analysis, one could convert the Jupyter notebook to PDF
(e.g. using nbconvert) as a report.

# Example: Save cleaned country-day data to CSV
country_daily.to_csv('covid_country_daily.csv', index=False)
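
The summary tables and notebook export mentioned above work the same way; a minimal sketch, with
placeholder filenames:

# Save the top-10 summary table alongside the cleaned data
top_countries.to_csv('covid_top10_countries.csv', index=False)

# Notebook-to-PDF export runs from the shell, e.g.:
#   jupyter nbconvert --to pdf covid_analysis.ipynb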

This completes the end-to-end tutorial: we loaded the Kaggle COVID-19 dataset, cleaned and transformed it
using pandas/NumPy, performed country- and continent-level analysis (with citations for context [1][6]),
and visualized the key trends (smoothing with a 7-day average [5]). The final cleaned data and results are
saved for easy sharing and further modeling.

Sources: The dataset description is available on Kaggle [1][2], and similar analyses have been
documented in public resources [3][5][6][7], which align with the methods shown here.

[1][2] Data Source Recommendation: Novel Coronavirus 2019 Dataset – Research
https://research.binus.ac.id/2020/03/data-source-recommendation-novel-coronavirus-2019-dataset/

[3][4] Analyzing Novel Corona Virus COVID-19 Dataset » Loren on the Art of MATLAB - MATLAB & Simulink
https://blogs.mathworks.com/loren/2020/03/16/analyzing-novel-corona-virus-covid-19-dataset/

[5] Coronavirus (COVID-19) Cases - Our World in Data
https://ourworldindata.org/covid-cases

[6][7] Global Pandemic in Numbers: A COVID-19 Exploratory Data Analysis | by Vansh Mahindra | Apr 2025 | Medium
https://medium.com/@vanshmahindra/global-pandemic-in-numbers-a-covid-19-exploratory-data-analysis-9198867a6225
