This repository contains a Jupyter notebook that demonstrates the impact of comprehensive data preprocessing on predictive modeling of COVID-19 mortality. The project leverages advanced preprocessing techniques to improve the reliability and accuracy of predictions, and compares the performance of various machine learning models trained with a custom preprocessing pipeline against those trained with a standard one.
- Explore Computational Dependencies: Examine relationships among dataset columns to accurately fill in missing values and address inconsistencies through custom computations.
- Custom Preprocessing Pipeline: Construct a tailored preprocessing pipeline that includes various custom transformers to clean, normalize, and enhance the raw data (a rough illustrative sketch follows this list).
- Impact Analysis: Assess and compare the performance of multiple linear and non-linear models using custom preprocessing versus standard preprocessing, highlighting the benefits of detailed preprocessing steps.
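For illustration, a custom step in such a pipeline might be written as a scikit-learn-style transformer roughly as follows. This is a minimal sketch, not the notebook's code: scikit-learn is assumed to be available, and the class name, imputation rule, and column names (`WeekdayMeanImputer`, `new_cases`, `new_deaths`) are hypothetical.

```python
# Illustrative only: a scikit-learn-style custom transformer of the kind a custom
# pipeline might use. The class name, column names, and imputation rule are
# hypothetical, not taken from the notebook.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class WeekdayMeanImputer(BaseEstimator, TransformerMixin):
    """Fill missing daily values with the mean of the same weekday (hypothetical rule)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        weekday = pd.to_datetime(X["date"]).dt.dayofweek
        # Learn a per-weekday mean for each column to be imputed.
        self.weekday_means_ = {
            col: X[col].groupby(weekday).mean() for col in self.columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        weekday = pd.to_datetime(X["date"]).dt.dayofweek
        for col in self.columns:
            X[col] = X[col].fillna(weekday.map(self.weekday_means_[col]))
        return X


# Custom steps such as this can be chained with further cleaning steps in a Pipeline.
custom_pipeline = Pipeline([
    ("weekday_impute", WeekdayMeanImputer(columns=["new_cases", "new_deaths"])),
    # ... outlier handling, computed-column fixes, scaling, etc. would follow ...
])
```

Chaining steps in a `Pipeline` keeps the fitted imputation statistics consistent between training and test splits.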
The dataset used in this project is sourced from:
- Our World in Data (OWID) COVID-19 Dataset: available at Our World in Data (https://ourworldindata.org/coronavirus). The dataset provides comprehensive global data on COVID-19 cases, deaths, vaccinations, and other related metrics, serving as the foundation for the predictive modeling and analysis conducted in this notebook. When using this data, please cite as follows:
Edouard Mathieu, Hannah Ritchie, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Saloni Dattani, Diana Beltekian, Esteban Ortiz-Ospina and Max Roser (2020) - "Coronavirus Pandemic (COVID-19)". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource]
To run the notebook and reproduce the results, install the required Python packages:
```
!pip install pandas matplotlib seaborn ipywidgets plotly qgrid
```
The repository is organized as follows:

```
├── data/
│   ├── original_india_data.csv                 # Original data for India from OWID
│
├── standard/
│   ├── models/                                 # Trained models using the standard pipeline
│   ├── plots/                                  # Plots for learning curves and global outlier processing
│   ├── tables/
│   │   ├── feature_importance/                 # Feature importance tables from round 0 to 21
│   │   ├── feature_correlation_with_target.csv # Correlation of features with the target variable
│   │   ├── standard_processed_india_data.csv   # Processed data after applying the standard preprocessing pipeline
│   │   ├── standard_results.csv                # Results from models trained with standard preprocessing
│
├── custom/
│   ├── models/                                 # Trained models using the custom pipeline
│   ├── plots/                                  # Plots for learning curves, computation and local outlier processing, and weekly pattern imputation
│   ├── tables/
│   │   ├── feature_importance/                 # Feature importance tables from round 0 to 18
│   │   ├── feature_correlation_with_target.csv # Correlation of features with the target variable
│   │   ├── custom_processed_india_data.csv     # Processed data after applying the custom preprocessing pipeline
│   │   ├── custom_results.csv                  # Results from models trained with custom preprocessing
│
├── Enhanced_predictive.ipynb                   # Main notebook for preprocessing and modeling
├── README.md                                   # Project overview and instructions
└── LICENSE                                     # License information
```
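As a quick sanity check after installing the packages, the bundled India subset can be loaded with pandas. Only the file path comes from the layout above; the inspection steps are illustrative:

```python
# Minimal sketch: load the original OWID data for India shipped in data/.
# Only the file path comes from the repository layout; the checks are illustrative.
import pandas as pd

df = pd.read_csv("data/original_india_data.csv")

print(df.shape)                                                # rows and columns available
print(df.isna().mean().sort_values(ascending=False).head(10))  # columns with the most missing values
```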
The notebook is organized into the following sections:

- Introduction
  - Problem Overview
  - Objectives
- Setup
  - Installation of required packages
  - Loading data
- Data Preprocessing
  - Standard preprocessing pipeline
    - Handling missing values and outliers
    - Iterative feature selection
  - Custom preprocessing pipeline
    - Weekly pattern imputation
    - Local outlier processing
    - Computation processing
    - Iterative feature selection
- Modeling
  - Training multiple machine learning models
  - Comparing models trained on custom vs. standard preprocessing
- Results and Analysis
  - Performance metrics: Test RMSE, Test R², RMSE Variance (see the illustrative sketch after this outline)
  - Impact of preprocessing on model performance
- Conclusion
  - Summary of findings
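For reference, metrics of the kind reported in the results section can be computed as in the sketch below. The model, the synthetic data, and the fold-based definition of RMSE variance are assumptions for illustration, not the notebook's actual choices; scikit-learn is assumed to be available.

```python
# Illustrative only: computing comparison metrics like those reported in the notebook
# (Test RMSE, Test R², RMSE variance). The model, data, and the exact definition of
# "RMSE variance" here are assumptions, not the notebook's code.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data so the snippet runs on its own.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

test_rmse = np.sqrt(mean_squared_error(y_test, pred))
test_r2 = r2_score(y_test, pred)

# One way to gauge stability: variance of RMSE across cross-validation folds.
cv_rmse = -cross_val_score(model, X_train, y_train,
                           scoring="neg_root_mean_squared_error", cv=5)
rmse_variance = np.var(cv_rmse)

print(f"Test RMSE: {test_rmse:.3f}  Test R²: {test_r2:.3f}  RMSE variance: {rmse_variance:.5f}")
```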
The notebook demonstrates that comprehensive data preprocessing significantly improves the accuracy and reliability of the predictive models. Custom preprocessing pipelines that account for dataset-specific characteristics yield better results than standard approaches.
If you use this code in your research, please cite the following paper:
Das, S., & Maji, S. (2024). Impact of Comprehensive Data Preprocessing on Predictive Modelling of COVID-19 Mortality. arXiv preprint arXiv:2408.08142. Retrieved from https://arxiv.org/abs/2408.08142
To use the notebook:
- Clone the repository.
- Install the necessary packages using the provided `pip` command.
- Open the notebook in Jupyter and execute the cells to reproduce the results.
Contributions are welcome! If you have suggestions or improvements, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.