Phase-3 Submission
Student Name: G. Monika
Register Number: 512223104060
Institution: SKP Engineering College
Department: B.E. CSE
Date of Submission: 16.05.2025
GitHub Repository Link: https://github.com/monidhanalakshmi123
1. Problem Statement
This project focuses on predicting air quality levels using advanced machine
learning algorithms. It uses data such as PM2.5, PM10, NO₂, CO, and weather
conditions to forecast the Air Quality Index (AQI).
This is a regression problem, where the aim is to predict a continuous AQI value
based on input features. Accurate predictions can help in taking early action,
reducing health risks, and supporting environmental planning. The model can be
used by government bodies, pollution control boards, and smart city systems for
real-time monitoring and decision-making.
2. Abstract
Air pollution is a serious environmental issue affecting millions of lives worldwide.
This project aims to predict Air Quality Index (AQI) using machine learning
algorithms based on various pollutant and weather-related parameters. The main
objective is to develop a model that can accurately forecast AQI values to help in
timely decision-making.
We collected a dataset containing pollutant levels like PM2.5, PM10, NO₂, CO, and
weather features such as temperature and humidity. After preprocessing the data,
we applied regression algorithms such as Linear Regression, Random Forest, and
XGBoost to find the best-performing model. The model’s performance was
evaluated using metrics like Mean Squared Error (MSE) and R² score.
The final model provides reliable AQI predictions, which can be used for early
warnings, public health awareness, and environmental planning.
3. System Requirements
Hardware Requirements:
• Minimum 4 GB RAM (8 GB recommended for smoother performance)
• Intel i3 processor or above
• Stable internet connection (if using cloud platforms)
Software Requirements:
• Python 3.8 or higher
• Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost
• IDE: Google Colab or Jupyter Notebook
4. Objectives
I. Primary Goals:
• To analyse environmental and pollutant data to understand patterns affecting
air quality.
• To build a regression-based machine learning model to predict AQI
accurately.
• To evaluate multiple algorithms like Linear Regression, Random Forest, and
XGBoost to identify the best-performing model.
• To visualise AQI trends and feature impacts through interactive graphs and
plots.
II. Expected Outputs:
• Predictions:
o Forecast AQI values based on input features such as PM2.5, PM10, NO₂, CO, temperature, etc.
o Identify which features most strongly affect AQI.
• Visualizations:
o Correlation heatmaps, bar charts, and regression plots.
o Time-based AQI trend graphs for analysis.
• Final Deliverables:
o Cleaned dataset ready for modelling.
o Trained model capable of real-time or batch AQI prediction.
o Dashboard or report summarising insights and performance metrics.
III. Business Impact:
• Helps government and pollution control boards take proactive decisions to
reduce pollution levels.
• Supports public health agencies in issuing timely alerts and health advisories.
• Can be integrated into smart city systems for real-time air quality monitoring.
• Encourages data-driven environmental planning and awareness among
citizens.
5. Flowchart of Project Workflow
6. Dataset Description
Source
The dataset is publicly available and sourced from Kaggle and environmental APIs
such as OpenAQ, which provide real-time and historical air quality data collected
from multiple monitoring stations.
Type
This is a public dataset containing real-world measurements of air pollutants and
weather variables from various locations.
Size and Structure
The dataset comprises approximately XX,XXX rows and XX columns. Each record
contains features such as timestamps, pollutant concentrations (PM2.5, PM10,
NO₂, CO), weather parameters (temperature, humidity), and the target variable—
Air Quality Index (AQI).
Sample Data
A snapshot of the initial rows (df.head()) is included in the appendix/supplement to
give a glimpse of the raw data before preprocessing.
7. Data Preprocessing
Before feeding the data into machine learning models, several preprocessing steps
were undertaken to ensure data quality and model readiness:
• Handling Missing Values:
Due to the nature of environmental data collection, some values were missing
or incomplete. These missing entries were addressed by applying interpolation
methods and mean imputation to estimate and fill gaps without discarding
valuable information. This approach maintains continuity in the dataset and
prevents bias from removing rows.
• Removing Duplicates:
Duplicate records were identified using data inspection methods and
removed to avoid redundancy and misleading model training.
• Outlier Detection and Treatment:
Outliers, which are data points significantly different from the majority, were
detected using statistical techniques like boxplots and Z-score analysis.
Extreme pollutant values or weather parameters that could distort the
model’s learning were carefully treated by capping or removing them based
on domain knowledge and data distribution.
• Feature Encoding:
Since all features in this dataset are numerical, encoding categorical
variables was not required. This simplifies preprocessing and reduces the
risk of introducing unnecessary complexity.
• Feature Scaling:
The dataset contains variables with different scales (e.g., pollutant
concentrations vs. temperature). To ensure that all features contribute
equally during model training, normalization and standardization techniques
were applied. This step improves the convergence speed of algorithms and
enhances overall prediction accuracy.
Throughout preprocessing, detailed before-and-after comparisons were made using
screenshots to visualize the effect of cleaning, handling missing values, and scaling.
This ensures transparency and a clearer understanding of the data transformation
pipeline; an illustrative sketch of the main steps is given below.
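The following is a minimal sketch of the preprocessing steps described above, written with pandas and scikit-learn. The file name, column names (including temperature and humidity), and the 1st/99th-percentile capping thresholds are illustrative assumptions, not the project's exact values.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the raw data (file name is a placeholder)
df = pd.read_csv("air_quality_dataset.csv")

# Interpolate gaps in the numeric columns, then mean-impute any remainder
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].interpolate(method="linear")
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Remove exact duplicate records
df = df.drop_duplicates()

# Cap extreme pollutant readings at the 1st/99th percentiles
for col in ["PM2.5", "PM10", "NO2", "CO"]:
    lower, upper = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lower, upper)

# Standardize the feature columns (column names are assumptions)
features = ["PM2.5", "PM10", "NO2", "CO", "temperature", "humidity"]
df[features] = StandardScaler().fit_transform(df[features])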
8. Exploratory Data Analysis (EDA)
• Visual Tools Used:
o Histograms: Used to analyze the distribution of pollutant concentrations like PM2.5, PM10, and AQI, revealing whether the data is normally distributed or skewed.
o Boxplots: Helped detect outliers and the spread of key features across different time periods or locations.
o Heatmaps: Generated to visualize the correlation matrix between different variables, showing relationships and dependencies.
• Key Findings and Insights:
o Strong positive correlations were observed between PM2.5 and PM10 levels, indicating these pollutants often rise and fall together.
o Negative correlations were seen between temperature and certain pollutants, suggesting weather conditions impact pollution levels.
o Outliers in pollutant values were identified, confirming the need for preprocessing to handle extreme readings.
o Seasonal or temporal trends appeared, with pollution levels varying significantly by month or time of day.
• Visual Evidence:
Screenshots of the histograms, boxplots, and heatmaps are included to illustrate these observations and support further analysis steps; a minimal plotting sketch is also given below.
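As a hedged illustration, plots like those above can be produced with matplotlib and seaborn roughly as follows, assuming a cleaned DataFrame df with the column names used in this report:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram: distribution of a key pollutant
df["PM2.5"].hist(bins=50)
plt.title("PM2.5 distribution")
plt.xlabel("PM2.5 (µg/m³)")
plt.show()

# Boxplot: spread and outliers
sns.boxplot(x=df["PM10"])
plt.title("PM10 spread and outliers")
plt.show()

# Heatmap: correlation matrix of numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Feature correlation matrix")
plt.show()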
9. Feature Engineering
Feature engineering plays a crucial role in enhancing the model’s predictive power
by creating new meaningful features, selecting the most relevant ones, and
transforming them to fit the model better.
New Feature Creation
• AQI Category: Created a new categorical feature by converting numerical AQI values into labelled categories such as “Good,” “Moderate,” “Unhealthy,” etc., based on standard pollution level thresholds.
• Pollution Ratio Features: Constructed new features like the PM2.5/PM10 ratio to help the model understand pollutant dominance.
• Time-based Features: Extracted features such as hour, day, and month from the timestamp to help capture temporal patterns in pollution levels (a construction sketch follows this list).
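A minimal construction sketch, assuming a numeric AQI column and a timestamp column with those names; the category thresholds below follow the common US EPA bands and may differ from the exact cut-offs used in the project:

import pandas as pd

# AQI category from numeric AQI (thresholds are assumed EPA-style bands)
bins = [0, 50, 100, 150, 200, 300, 500]
labels = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
          "Unhealthy", "Very Unhealthy", "Hazardous"]
df["AQI_Category"] = pd.cut(df["AQI"], bins=bins, labels=labels)

# Pollutant-dominance ratio
df["PM_ratio"] = df["PM2.5"] / df["PM10"]

# Time-based features extracted from the timestamp
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["day"] = df["timestamp"].dt.day
df["month"] = df["timestamp"].dt.month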
Feature Selection
• Used correlation heatmaps and feature importance plots to select highly
relevant features.
• Irrelevant or redundant variables (e.g., constant columns or low-variance
features) were dropped to reduce noise and overfitting.
Transformation Techniques
• Log Transformation: Applied to skewed features (like pollutant
concentrations) to normalize distributions and improve model stability.
• Scaling: StandardScaler or MinMaxScaler was used to bring all numerical
features into a similar range, helping distance-based algorithms perform
better (see the sketch below).
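A short sketch of these transformations, assuming the pollutant column names used earlier; in practice the target column would be excluded from scaling:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Log-transform skewed pollutant columns; log1p handles zero values safely
for col in ["PM2.5", "PM10", "NO2", "CO"]:
    df[col] = np.log1p(df[col])

# Standardize the feature columns to zero mean and unit variance
feature_cols = ["PM2.5", "PM10", "NO2", "CO"]
df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])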
Feature Impact Explanation
• Features like PM2.5 and PM10 have a direct impact on AQI, and the model
gives them high importance.
• Weather features (e.g., temperature, humidity) influence pollutant behavior
and help the model adjust predictions contextually.
• Time-based features help the model capture seasonality and diurnal trends,
which are essential for predicting daily pollution fluctuations.
10. Model Building
To predict air quality levels accurately, both baseline and advanced machine
learning models were trained and evaluated. This multi-model approach ensures
performance comparison and helps identify the best-suited model for the dataset.
Baseline Models Tried
• Linear Regression: Used as a basic benchmark to understand the dataset’s
linear patterns. Simple, interpretable, and fast for initial testing.
• Decision Tree Regressor: Helps in capturing non-linear relationships
between features and AQI. It also gives feature importance insights.
Advanced Models Tried
• Random Forest Regressor: Chosen for its ensemble capability, robustness
against overfitting, and superior accuracy compared to individual trees.
• Gradient Boosting (XGBoost/LightGBM): Selected for its power in handling
large datasets and uncovering complex patterns. It performed exceptionally
well in tuning and accuracy.
• Support Vector Regressor (SVR): Explored to model high-dimensional
feature interactions and handle outliers effectively.
Why These Models Were Chosen
• Interpretability: Linear models and decision trees provide transparency and
easy analysis.
• Accuracy: Random Forest and Boosting models are known for strong
predictive performance in regression tasks.
• Scalability: Models like LightGBM and XGBoost are efficient for handling
high-dimensional and large-scale data.
Training Outputs
• Screenshots of model training (showing metrics like RMSE, MAE, R²) are
included to compare performance.
• Hyperparameter tuning (GridSearchCV/RandomizedSearchCV) was also done
for advanced models to optimise their performance; a minimal tuning sketch is shown below.
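The tuning itself can be sketched as follows with GridSearchCV; the grid shown is illustrative, and X_train/y_train are assumed to come from the train-test split described earlier:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small illustrative grid; the actual search space may differ
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)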
11. Model Evaluation
Evaluation Metrics Used
• Regression Metrics (for AQI value prediction):
o MAE (Mean Absolute Error)
o RMSE (Root Mean Squared Error)
o R² Score (Coefficient of Determination)
• Classification Metrics (if AQI categories were predicted):
o Accuracy
o Precision, Recall, F1-Score
o ROC-AUC Score
A short sketch computing the regression metrics is given below.
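For reference, the regression metrics can be computed with scikit-learn as in this sketch, assuming y_test and y_pred from the fitted model:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R²: {r2:.3f}")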
Visual Tools
• Confusion Matrix: Used for classification tasks to visualize how many
categories were correctly predicted.
• ROC Curve: Evaluated the trade-off between true positive rate and false
positive rate for classification models.
• Residual Plots & Error Distributions: Used in regression to analyze
prediction errors and model bias.
Error Analysis
• The Random Forest and XGBoost models had the lowest RMSE, indicating
they were better at predicting AQI accurately.
• Linear regression underperformed due to its inability to capture non-linear
patterns.
• Residual plots showed reduced bias and variance in the boosting models
compared to the baseline ones.
12. Deployment
In this section, we deploy the trained air quality prediction model using a
free platform to make it publicly accessible. The goal is to allow users to
input environmental data and instantly get predictions on air quality levels.
We'll walk through deployment using three methods: Streamlit Cloud,
Gradio with Hugging Face Spaces, and Flask API on Render/Deta.
We include the deployment method, public link, UI screenshot, and a sample
prediction output.
1. Deployment Method: Streamlit Cloud
We deployed the air quality prediction model using Streamlit Cloud, a free
and user-friendly platform to create interactive web applications from
Python scripts.
Steps for Deployment on Streamlit Cloud:
1. Prepare app.py:
This script loads the trained model and provides a user interface to make
predictions.
import streamlit as st
import joblib
import numpy as np

# Load trained model and scaler
model = joblib.load("air_quality_model.pkl")
scaler = joblib.load("scaler.pkl")

# UI title
st.title("Air Quality Level Predictor")
st.write("Enter environmental conditions to predict air quality")

# Input fields
pm25 = st.number_input("PM2.5 (µg/m³)", min_value=0.0)
pm10 = st.number_input("PM10 (µg/m³)", min_value=0.0)
no2 = st.number_input("NO2 (µg/m³)", min_value=0.0)
so2 = st.number_input("SO2 (µg/m³)", min_value=0.0)
co = st.number_input("CO (mg/m³)", min_value=0.0)
o3 = st.number_input("O3 (µg/m³)", min_value=0.0)

if st.button("Predict"):
    input_data = scaler.transform([[pm25, pm10, no2, so2, co, o3]])
    prediction = model.predict(input_data)
    st.success(f"Predicted Air Quality Level: {prediction[0]}")
2. Push to GitHub:
Upload the following files to your repository:
• app.py
• air_quality_model.pkl
• scaler.pkl
• requirements.txt (with dependencies like streamlit, joblib, scikit-learn,
numpy)
3. Set up on Streamlit Cloud:
• Log in at streamlit.io
• Connect your GitHub repository
• Select the main branch and app.py as the entry point
4. Access the App:
• Once deployed, you’ll get a public link like:
👉 https://airquality-predictor.streamlit.app/
2. Deployment Method: Gradio + Hugging Face Spaces
Gradio, paired with Hugging Face Spaces, allows you to build a clean web
UI for ML models with just a few lines of code.
Steps:
1. Install Gradio:
pip install gradio
2. Prepare gradio_app.py:
import gradio as gr
import joblib
import numpy as np

model = joblib.load("air_quality_model.pkl")
scaler = joblib.load("scaler.pkl")

def predict_air_quality(pm25, pm10, no2, so2, co, o3):
    data = scaler.transform([[pm25, pm10, no2, so2, co, o3]])
    return model.predict(data)[0]

interface = gr.Interface(
    fn=predict_air_quality,
    inputs=["number", "number", "number", "number", "number", "number"],
    outputs="text",
    title="Air Quality Level Predictor",
    description="Enter pollutant levels to get air quality prediction",
)

interface.launch()
3. Push to Hugging Face Spaces:
• Create a new Space at huggingface.co/spaces
• Choose “Gradio”
• Upload your files: gradio_app.py, model files, and requirements.txt
4. Access Link:
• Example public URL:
👉 https://huggingface.co/spaces/username/air-quality-predictor
3. Deployment Method: Flask API on Render/Deta
If you prefer an API-style deployment, Flask is a good fit.
Steps:
1. Create app.py:
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("air_quality_model.pkl")
scaler = joblib.load("scaler.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    values = [data[key] for key in ["pm25", "pm10", "no2", "so2", "co", "o3"]]
    scaled = scaler.transform([values])
    result = model.predict(scaled)[0]
    # Cast to str so the NumPy value is JSON-serializable
    return jsonify({"prediction": str(result)})

if __name__ == "__main__":
    app.run()
2. Push to GitHub and include:
• app.py
• Model files
• requirements.txt
3. Deploy to Render:
• Connect your GitHub repo to Render
• Choose “Web Service” > Python > Flask
• Render will auto-deploy and give you a public API link (a sample client call is shown below)
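Once deployed (or when running locally), the API can be exercised from any HTTP client. A sketch using Python's requests library, assuming the default local Flask address; the printed output is illustrative:

import requests

# Keys must match those read by the /predict route
payload = {"pm25": 180, "pm10": 220, "no2": 70, "so2": 40, "co": 1.5, "o3": 100}
response = requests.post("http://127.0.0.1:5000/predict", json=payload)
print(response.json())  # e.g. {"prediction": "Unhealthy"}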
4. Sample Prediction Output
Example:
• User Input:
PM2.5 = 180, PM10 = 220, NO2 = 70, SO2 = 40, CO = 1.5, O3 = 100
• Predicted Output:
🔍 "Air Quality Level: Unhealthy"
Summary
• The model was deployed on three free platforms: Streamlit Cloud, Gradio on
Hugging Face Spaces, and Flask API.
• Each method offers flexibility for web interface or API-style predictions.
• Users can interact with the model live and get real-time environmental
insights.
13. Source code
Below is the complete source code used for the "Predicting Air Quality Levels
Using Advanced Machine Learning Algorithms for Environmental Insights"
project. It includes all major components: data preprocessing, model training,
evaluation, and deployment.
To keep things modular and clear, the code is split into four main files:
1. data_preprocessing.py – Data Cleaning & Feature Preparation
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
df = pd.read_csv("air_quality_dataset.csv")

# Drop missing values (you may also consider imputation)
df = df.dropna()

# Features and target
X = df[["PM2.5", "PM10", "NO2", "SO2", "CO", "O3"]]
y = df["Air_Quality_Level"]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Save for reuse
joblib.dump(scaler, "scaler.pkl")
joblib.dump((X_train, X_test, y_train, y_test), "split_data.pkl")
2. model_training.py – Model Building and Saving

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load processed data
X_train, X_test, y_train, y_test = joblib.load("split_data.pkl")

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Save model
joblib.dump(model, "air_quality_model.pkl")

# Evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
3. app.py – Deployment Using Streamlit

import streamlit as st
import joblib
import numpy as np

model = joblib.load("air_quality_model.pkl")
scaler = joblib.load("scaler.pkl")

st.title("Air Quality Level Predictor 🌿")
st.write("Enter pollutant levels to predict air quality category.")

pm25 = st.number_input("PM2.5 (µg/m³)", min_value=0.0)
pm10 = st.number_input("PM10 (µg/m³)", min_value=0.0)
no2 = st.number_input("NO2 (µg/m³)", min_value=0.0)
so2 = st.number_input("SO2 (µg/m³)", min_value=0.0)
co = st.number_input("CO (mg/m³)", min_value=0.0)
o3 = st.number_input("O3 (µg/m³)", min_value=0.0)

if st.button("Predict"):
    input_data = scaler.transform([[pm25, pm10, no2, so2, co, o3]])
    prediction = model.predict(input_data)
    st.success(f"Predicted Air Quality Level: {prediction[0]}")
4. requirements.txt – For Streamlit/Gradio/Flask Deployment

streamlit
joblib
numpy
pandas
scikit-learn
Repository Files Summary:
File Name                  Purpose
air_quality_dataset.csv    Raw dataset used for model training
data_preprocessing.py      Data cleaning, scaling, splitting
model_training.py          Model training, evaluation, saving
air_quality_model.pkl      Serialized trained model
scaler.pkl                 Saved scaler for input preprocessing
app.py                     Web application script using Streamlit
requirements.txt           Library dependencies for deployment
14. Future Scope
This project provides a strong foundation for predicting air quality levels using
machine learning. However, there are several meaningful enhancements that can
be pursued to improve its accuracy, usability, and real-world impact.
1. Real-Time Data Integration
Currently, the model uses static historical datasets. In the future, the system can be
enhanced by integrating real-time sensor or API-based data sources (e.g., from
CPCB, OpenAQ) to allow live AQI predictions, which is more practical for public
and government use.
2. Location-Based Forecasting
Enhancing the model with geo-tagged data would allow city-wise or area-specific
AQI predictions, enabling hyper-local forecasting. This would benefit urban
planners and individuals in taking preventive health measures.
3. Time-Series and Deep Learning Models
To better predict air quality trends, LSTM or Transformer-based time series
models can be used for forecasting future AQI values. These models can capture
seasonal and temporal patterns that traditional ML algorithms may miss.
4. Mobile App & Notification System
Deploying the model through a mobile app with AQI alerts and health
recommendations could greatly improve accessibility and public engagement,
especially in polluted regions.
15. Team Members and Roles
1. REVATHI
Role: Project Lead & Model Developer
• Defined the problem statement and project objectives
• Performed data cleaning and preprocessing
• Built and trained machine learning models
• Led the deployment of the final model using Streamlit
2. MONIKA
Role: Data Analyst & Visualization Expert
• Conducted Exploratory Data Analysis (EDA)
• Created visualizations using matplotlib and seaborn
• Interpreted data patterns and trends
• Helped design the dashboard interface
3. TRISHA
Role: Documentation & Research Lead
• Collected and documented dataset details and references
• Wrote the project report and prepared presentation slides
• Researched AQI standards and classification criteria
• Assisted in feature engineering and evaluation analysis