COVID-19 Data Analysis With Pandas and NumPy
This tutorial uses the Novel Corona Virus 2019 Dataset (from Kaggle) as a case study to illustrate data
cleaning, transformation, and analysis of COVID-19 time-series data 1 2. The main CSV
(covid_19_data.csv) contains daily cumulative counts (Confirmed, Deaths, Recovered) by country/region
from January 22, 2020 onward 1 2. We will load this data into pandas, clean the columns and
dates, engineer useful time-series features, and perform analysis (top countries, trends, etc.) using pandas
and NumPy.
Data Loading and Cleaning
• Load data: Read the CSV into a pandas DataFrame. (Assume covid_19_data.csv is downloaded
locally.)
• Rename columns: Remove spaces/slashes (e.g. rename Province/State → Province_State)
and standardize case.
• Handle missing values: For example, fill empty provinces with a placeholder such as "Unknown", or
drop the column if it is not needed.
• Parse dates: Convert ObservationDate (and Last Update) to datetime objects. The Kaggle
data's ObservationDate strings use the MM/DD/YYYY form with four-digit years (e.g. 01/22/2020),
so we pass the matching format explicitly 1.
• Standardize names: If needed, map variant country names to a standard form (e.g. "Iran (Islamic
Republic of)" → "Iran") as part of cleaning 3.
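import pandas as pd
import numpy as np

# Load the raw CSV and rename columns so they contain no spaces or slashes
# (Country/Region → Country_Region is inferred from the column used below)
df = pd.read_csv('covid_19_data.csv')
df = df.rename(columns={
    'Province/State': 'Province_State',
    'Country/Region': 'Country_Region',
    'Last Update': 'Last_Update',
})

# Example of standardizing country names (as needed)
df['Country_Region'] = df['Country_Region'].replace({
    "Iran (Islamic Republic of)": "Iran",
    "Republic of Korea": "South Korea",
    "Republic of Ireland": "Ireland",
    "Holy See": "Vatican City",
    "('St. Martin',)": "St. Martin",
    "occupied Palestinian territory": "Palestine"
})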
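The date-parsing and missing-province steps from the list above can be sketched on a small in-memory sample. This is a minimal illustration, not the tutorial's original code: the helper name parse_dates_and_fill, the "Unknown" placeholder, and the sample rows are assumptions, and the columns are assumed to be already renamed to Province_State and Last_Update.

```python
import pandas as pd

def parse_dates_and_fill(df):
    """Return a copy with datetime columns and no missing provinces (illustrative helper)."""
    out = df.copy()
    # ObservationDate is assumed to be MM/DD/YYYY, e.g. 01/22/2020
    out['ObservationDate'] = pd.to_datetime(out['ObservationDate'], format='%m/%d/%Y')
    # Last_Update may not parse cleanly everywhere; unparseable values become NaT
    out['Last_Update'] = pd.to_datetime(out['Last_Update'], errors='coerce')
    # "Unknown" is an assumed placeholder for rows without a province
    out['Province_State'] = out['Province_State'].fillna('Unknown')
    return out

# Made-up sample rows that mimic the renamed columns
sample = pd.DataFrame({
    'ObservationDate': ['01/22/2020', '01/23/2020'],
    'Province_State': ['Hubei', None],
    'Country_Region': ['Mainland China', 'Japan'],
    'Last_Update': ['1/22/2020 17:00', '1/23/2020 17:00'],
    'Confirmed': [444.0, 1.0],
})
cleaned = parse_dates_and_fill(sample)
print(cleaned.dtypes)
```

Passing an explicit format for ObservationDate avoids day/month ambiguity, while errors='coerce' on Last_Update trades strictness for robustness; whether that trade-off suits the real file is a judgment call.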
The cleaned DataFrame now has standard column names and date fields in datetime form. Missing
provinces are handled and country names normalized 1 3 . This prepares the data for aggregation and
analysis.
Data Transformation
Next, we transform the cleaned data into useful time-series aggregates. Since the original data reports
cumulative counts per province/state and date, we first sum them by country and date, and then create new
time-series features such as daily new cases and rolling averages.
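A minimal sketch of this transformation follows. The names country_daily, Country, NewConfirmed, and 7day_avg_new match those used later in this tutorial; the helper build_country_daily, the sample numbers, and the choice to clip negative daily values to zero are assumptions made for illustration.

```python
import pandas as pd

def build_country_daily(df):
    """Aggregate province rows to country/date and add daily features (illustrative helper)."""
    country_daily = (
        df.groupby(['Country_Region', 'ObservationDate'], as_index=False)[
            ['Confirmed', 'Deaths', 'Recovered']
        ]
        .sum()
        .rename(columns={'Country_Region': 'Country'})
        .sort_values(['Country', 'ObservationDate'])
        .reset_index(drop=True)
    )
    # Daily new cases = difference of the cumulative series within each country;
    # the first day has no predecessor, so it keeps its cumulative value.
    new_cases = country_daily.groupby('Country')['Confirmed'].diff()
    new_cases = new_cases.fillna(country_daily['Confirmed'])
    # Assumption: downward revisions of cumulative counts are clipped to zero
    country_daily['NewConfirmed'] = new_cases.clip(lower=0)
    # 7-day rolling mean per country; min_periods=1 keeps the first six days
    country_daily['7day_avg_new'] = (
        country_daily.groupby('Country')['NewConfirmed']
        .transform(lambda s: s.rolling(7, min_periods=1).mean())
    )
    return country_daily

# Made-up sample: one country with two provinces, one without provinces
dates = pd.to_datetime(['2020-01-22', '2020-01-23', '2020-01-24'])
sample = pd.DataFrame({
    'Country_Region': ['China'] * 6 + ['Italy'] * 3,
    'Province_State': ['Hubei'] * 3 + ['Beijing'] * 3 + ['Unknown'] * 3,
    'ObservationDate': list(dates) * 3,
    'Confirmed': [10, 15, 30, 1, 2, 2, 0, 2, 5],
    'Deaths': [0, 1, 2, 0, 0, 0, 0, 0, 1],
    'Recovered': [0, 0, 1, 0, 0, 1, 0, 0, 0],
})
country_daily = build_country_daily(sample)
print(country_daily)
```

Grouping before calling diff and rolling matters: without it, the first day of one country would be differenced against the last day of the previous one.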
After these transformations, each row has Confirmed , Deaths , Recovered (cumulative), as well as
NewConfirmed (daily cases) and 7day_avg_new (smoothed daily cases). We have essentially turned the
cumulative time series into a daily time series with additional features. The concept of using a 7-day rolling
average is common in COVID analysis to dampen reporting noise 5 .
Data Analysis
With the data prepared, we can analyze key trends:
• Top countries by cases/deaths: Identify the countries with the largest totals. Because the counts are
cumulative, the totals are simply the values on the latest date; for example, they often show that
countries like the US, India, and Brazil had the highest confirmed cases 6. We can sort by Confirmed
or Deaths to get the top 10.
• Continents/regions: If continent info is available or added (via a country-to-continent mapping),
group totals by continent. In general, Asia has the highest total cases (due to population) followed by
Europe, then the Americas 7 .
• Peak dates and trends: We can examine the time-series to find when each country’s new cases
peaked. For example, many countries saw a first peak in early to mid-2020 and additional waves
later. Using our daily counts, we can identify the dates with maximum NewConfirmed for each
country or overall (this often corresponds to well-known waves).
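Finding each country's peak date can be sketched with a grouped idxmax. The frame below is a made-up stand-in for country_daily with only the columns needed; the values are placeholders, not real case counts.

```python
import pandas as pd

# Made-up stand-in for country_daily (placeholder countries and counts)
dates = pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04'])
country_daily = pd.DataFrame({
    'Country': ['A'] * 4 + ['B'] * 4,
    'ObservationDate': list(dates) * 2,
    'NewConfirmed': [5, 20, 12, 3, 0, 1, 7, 9],
})

# idxmax returns, per country, the row label of the largest NewConfirmed value
peak_idx = country_daily.groupby('Country')['NewConfirmed'].idxmax()
peaks = (
    country_daily.loc[peak_idx, ['Country', 'ObservationDate', 'NewConfirmed']]
    .reset_index(drop=True)
)
print(peaks)
```

On noisy real data it may be preferable to take the peak of the 7-day average instead of the raw daily counts, since single-day reporting spikes can otherwise dominate.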
# Example: Top 10 countries by total confirmed cases (as of last date in data)
latest_date = country_daily['ObservationDate'].max()
latest_data = country_daily[country_daily['ObservationDate'] == latest_date]
top_countries = latest_data.sort_values('Confirmed', ascending=False).head(10)
print(top_countries[['Country','Confirmed','Deaths','Recovered']])
The analysis confirms global trends: for example, the United States often leads in case count, followed by
India, Brazil, and others 6. We can also compute total cases per continent to see that Asia and Europe
dominate the counts 7. This high-level analysis helps identify which regions were hardest hit and when.
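Continent totals require a country-to-continent mapping, which the dataset does not provide. The sketch below uses a hypothetical, deliberately incomplete dictionary and placeholder counts; a real analysis would need a complete mapping from an external source.

```python
import pandas as pd

# Hypothetical, incomplete mapping; the dataset itself has no continent column
continent_of = {
    'US': 'Americas',
    'Brazil': 'Americas',
    'India': 'Asia',
    'Italy': 'Europe',
}

# Placeholder latest-date totals (not real figures)
latest_data = pd.DataFrame({
    'Country': ['US', 'Brazil', 'India', 'Italy', 'Unlisted'],
    'Confirmed': [100, 60, 80, 40, 5],
})

# Countries missing from the mapping are kept under an explicit "Unmapped" label
latest_data['Continent'] = latest_data['Country'].map(continent_of).fillna('Unmapped')
continent_totals = (
    latest_data.groupby('Continent')['Confirmed'].sum().sort_values(ascending=False)
)
print(continent_totals)
```

Keeping an "Unmapped" bucket makes gaps in the mapping visible instead of silently dropping those countries from the totals.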
Visualization
We visualize the results using matplotlib or seaborn. Typical plots include:
• Line plots of daily new cases: For each of the top affected countries, plot the daily new cases
(possibly on the same figure for comparison). Overlay the 7-day rolling average to show smoothed
trends 5 .
• Bar charts for comparisons: A bar chart of total confirmed cases or deaths per country (e.g. top 10)
highlights the scale of the epidemic by country.
• Heatmaps or pivot charts: We can create a pivot table (e.g. countries × dates) and use a heatmap to
show intensity of cases over time, or a choropleth world map if geographic data is available. These
visualizations make patterns and differences clearer.
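The countries × dates pivot mentioned in the last bullet can be sketched with pandas alone. The frame is a made-up stand-in for country_daily; the resulting matrix could then be handed to a heatmap function such as seaborn's heatmap, which is not shown here.

```python
import pandas as pd

# Made-up stand-in for country_daily (placeholder countries and counts)
dates = pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'])
country_daily = pd.DataFrame({
    'Country': ['A'] * 3 + ['B'] * 3,
    'ObservationDate': list(dates) * 2,
    'NewConfirmed': [5, 20, 12, 0, 1, 7],
})

# Rows are countries, columns are dates, cells are daily new cases;
# fill_value=0 covers country/date combinations with no report
heat = country_daily.pivot_table(
    index='Country',
    columns='ObservationDate',
    values='NewConfirmed',
    aggfunc='sum',
    fill_value=0,
)
print(heat)
```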
import matplotlib.pyplot as plt

# Example: Line plot of daily new cases (7-day avg) for top 3 countries
top3 = top_countries['Country'].iloc[:3].tolist()
plt.figure(figsize=(8, 5))
for country in top3:
    # Select this country's rows and plot its smoothed daily cases
    data = country_daily[country_daily['Country'] == country]
    plt.plot(data['ObservationDate'], data['7day_avg_new'], label=country)
plt.legend()
plt.title('Daily New COVID-19 Cases (7-day avg) for Top Countries')
plt.xlabel('Date')
plt.ylabel('New Cases (7-day avg)')
plt.tight_layout()
plt.show()
Similarly, a bar chart of the top countries compares total cases or deaths by country.
Figure: Heatmap of COVID-19 intensity worldwide (for illustration). In practice, we would generate our own charts
(bar plots, line plots) from the processed data.
Export
Finally, we save the cleaned data and analysis results for future use:
• Save cleaned data: Export the cleaned daily country-level data to CSV (e.g.
country_daily.to_csv('covid_country_daily.csv', index=False) ). This preserves our
aggregation and feature columns.
• Save summaries: We can also save summary tables (e.g. top 10 countries by cases) to CSV.
• Notebook export (optional): To share the analysis, one could convert the Jupyter notebook to PDF
(e.g. using nbconvert ) as a report.
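The first two export steps can be sketched as follows. The output directory is a temporary folder and the file name covid_top_countries.csv is an assumption for illustration; the frame is a made-up stand-in for country_daily.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for country_daily (placeholder countries and counts)
country_daily = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'ObservationDate': pd.to_datetime(['2020-03-01', '2020-03-02'] * 2),
    'Confirmed': [5, 25, 0, 1],
})

out_dir = tempfile.mkdtemp()  # temporary folder so the sketch leaves no files behind

# Save the cleaned daily country-level data
daily_path = os.path.join(out_dir, 'covid_country_daily.csv')
country_daily.to_csv(daily_path, index=False)

# Save a summary table: top countries on the latest date
latest = country_daily[country_daily['ObservationDate'] == country_daily['ObservationDate'].max()]
top_countries = latest.sort_values('Confirmed', ascending=False).head(10)
top_path = os.path.join(out_dir, 'covid_top_countries.csv')  # assumed file name
top_countries.to_csv(top_path, index=False)

# Reading the file back with parse_dates restores the datetime column
reloaded = pd.read_csv(daily_path, parse_dates=['ObservationDate'])
print(reloaded.dtypes)
```

CSV stores dates as text, so passing parse_dates on reload is needed to get datetime values back.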
This completes the end-to-end tutorial: we loaded the Kaggle COVID-19 dataset, cleaned and transformed it
using pandas/NumPy, performed country- and continent-level analysis (with citations for context 1 6 ),
and visualized the key trends (smoothing with a 7-day average 5 ). The final cleaned data and results are
saved for easy sharing and further modeling.
Sources: The dataset description is available on Kaggle 1 2 , and similar analyses have been
documented in public resources 3 6 7 5 , which align with the methods shown here.
3 4 Analyzing Novel Corona Virus COVID-19 Dataset » Loren on the Art of MATLAB - MATLAB & Simulink
https://blogs.mathworks.com/loren/2020/03/16/analyzing-novel-corona-virus-covid-19-dataset/
5 Coronavirus (COVID-19) Cases - Our World in Data
https://ourworldindata.org/covid-cases
6 7 Global Pandemic in Numbers: A COVID-19 Exploratory Data Analysis | by Vansh Mahindra | Apr, 2025 | Medium
https://medium.com/@vanshmahindra/global-pandemic-in-numbers-a-covid-19-exploratory-data-analysis-9198867a6225