# State Housing and Income Analysis: ETL, Automation & Visualization

## Overview

This project replicates and extends the "Analysis of States" Google Sheet using a robust Python ETL pipeline and data visualization. The goal is to automate the production of the OUTPUT tab, generate insights, and demonstrate scalable data engineering and analytics skills.

- **Data sources:** the provided CSV and Excel files (see below)
- **Output:** an automated, reproducible CSV and SQLite database matching the OUTPUT tab, plus additional analysis and visualizations

## Data Sources

- `CENSUS_MHI_STATE.csv`: median household income by state
- `CENSUS_POPULATION_STATE.csv`: population by state
- `REDFIN_MEDIAN_SALE_PRICE.csv`: median home sale price by state and month
- `KEYS.csv`: region/state mapping and metadata
- `Analysis of States.xlsx`: reference Google Sheet and additional demographic data

## ETL Pipeline

### 1. Extract

- Reads all raw data files (CSV and Excel)
- Cleans and normalizes columns for all 52 regions (the 50 states, DC, and Puerto Rico)
- Handles special cases (e.g., the missing Puerto Rico sale price)
- Outputs staged CSVs for population, income, and sale price (see the sketch below)
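A minimal sketch of the extract step, assuming pandas and the file layout described above (column names, paths, and cleaning rules here are illustrative; the real logic lives in `etl/extract.py`):

```python
from pathlib import Path

import pandas as pd

def extract():
    """Read the raw files, normalize the join key, and write staged CSVs."""
    raw = Path("data/raw")
    staging = Path("data/staging")
    staging.mkdir(parents=True, exist_ok=True)

    population = pd.read_csv(raw / "CENSUS_POPULATION_STATE.csv")
    income = pd.read_csv(raw / "CENSUS_MHI_STATE.csv")
    prices = pd.read_csv(raw / "REDFIN_MEDIAN_SALE_PRICE.csv")

    # Normalize region names so all three sources join on the same key.
    # (A "region" column is assumed here for illustration.)
    for df in (population, income, prices):
        df["region"] = df["region"].str.strip().str.title()

    # Puerto Rico has no Redfin sale price; the row is kept with a NaN
    # price so the transform step can handle it explicitly.
    population.to_csv(staging / "population.csv", index=False)
    income.to_csv(staging / "income.csv", index=False)
    prices.to_csv(staging / "sale_price.csv", index=False)

if __name__ == "__main__":
    extract()
```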

### 2. Transform

- Merges the staged data on region
- Calculates population, income, and sale price ranks
- Calculates a house affordability ratio (sale price / income)
- Generates natural-language blurbs for each metric (matching the OUTPUT tab)
- Produces `output.csv`, matching the Google Sheet OUTPUT tab (see the sketch below)
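A sketch of the transform step under the same assumptions (column names and blurb wording are illustrative; the real blurbs match the OUTPUT tab):

```python
import pandas as pd

def transform():
    """Merge staged data, then derive ranks, the affordability ratio, and blurbs."""
    population = pd.read_csv("data/staging/population.csv")
    income = pd.read_csv("data/staging/income.csv")
    prices = pd.read_csv("data/staging/sale_price.csv")

    df = (population
          .merge(income, on="region")
          .merge(prices, on="region", how="left"))  # left join keeps Puerto Rico

    # Rank 1 = largest population, highest income, highest sale price.
    df["population_rank"] = df["population"].rank(ascending=False).astype(int)
    df["income_rank"] = df["median_income"].rank(ascending=False).astype(int)
    # Sale price rank stays float because Puerto Rico's price is NaN.
    df["sale_price_rank"] = df["median_sale_price"].rank(ascending=False)

    # Affordability ratio: years of median income per median home.
    df["affordability_ratio"] = df["median_sale_price"] / df["median_income"]

    # One natural-language blurb per metric (wording illustrative).
    df["population_blurb"] = df.apply(
        lambda r: f"{r['region']} ranks #{r['population_rank']} by population.",
        axis=1,
    )

    df.to_csv("data/output/output.csv", index=False)

if __name__ == "__main__":
    transform()
```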

### 3. Load

- Loads the final output into a SQLite database (`state_analysis.db`)
- Table: `state_analysis` (all processed state metrics and blurbs)

**Production note:** the load script can easily be adapted to PostgreSQL or MySQL by changing the SQLAlchemy connection string, which makes the pipeline scalable and ready for enterprise environments. A sketch follows below.
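A sketch of the load step, assuming the `output.csv` produced above; only the connection string (see the comment) changes for PostgreSQL or MySQL:

```python
import pandas as pd
from sqlalchemy import create_engine

def load():
    """Write the processed output to the state_analysis table in SQLite."""
    df = pd.read_csv("data/output/output.csv")

    # For production, swap in e.g.
    # "postgresql+psycopg2://user:password@host:5432/dbname".
    engine = create_engine("sqlite:///data/output/state_analysis.db")
    df.to_sql("state_analysis", engine, if_exists="replace", index=False)

if __name__ == "__main__":
    load()
```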

## ETL Automation with Apache Airflow

To ensure the pipeline runs reliably and without manual intervention, the entire ETL process is automated using Apache Airflow. The ETL steps (extract, transform, load) are defined as tasks in an Airflow Directed Acyclic Graph (DAG) in `etl_pipeline.py`:

- **Scheduling:** the DAG runs automatically at regular intervals (e.g., monthly or daily), so the pipeline always processes the latest available data.
- **Task orchestration:** Airflow manages dependencies, ensuring each step runs in sequence and only after the previous steps succeed.
- **Monitoring:** Airflow's web UI provides real-time visibility into pipeline runs, logs, and task statuses.
- **Extensibility:** the workflow can easily be extended with sensors (for new data), notifications, or integration with production databases like PostgreSQL/MySQL (see the DAG sketch below).
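A minimal sketch of what such a DAG can look like, assuming the ETL scripts expose `extract`, `transform`, and `load` callables (the DAG ID, schedule, and start date are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumes the ETL steps are importable as plain functions.
from etl.extract import extract
from etl.transform import transform
from etl.load import load

with DAG(
    dag_id="state_analysis_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",  # or "@daily"
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Each task runs only after the previous one succeeds.
    extract_task >> transform_task >> load_task
```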

## Additional Analysis & Visualizations

To demonstrate further skills, the project includes a Jupyter notebook with:

- Descriptive statistics for all key metrics
- Correlation heatmaps (population, income, sale price, affordability)
- Histograms of household income and sale price distributions
- Bar plots of top/bottom states by affordability and income
- Scatterplots (income vs. sale price)
- Time series line plots of Redfin median sale price trends by state
- Boxplots of the affordability ratio (outlier detection)
- Equity analysis: median income by race (from the race/ethnicity tab)

All code and visualizations are available in `analysis_and_visualization.ipynb`.
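As a flavor of the notebook, the income-vs-sale-price scatterplot takes only a few lines (column names are assumed to match `output.csv`; the notebook contains the actual code):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/output/output.csv")

# One point per region: does higher income track higher home prices?
sns.scatterplot(data=df, x="median_income", y="median_sale_price")
plt.title("Median Household Income vs. Median Sale Price by State")
plt.tight_layout()
plt.show()
```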

## Project Summary

This project automates the extraction, transformation, and loading (ETL) of state-level housing and income data using Python and Airflow. Raw data from multiple sources, including Census population, household income, and Redfin median sale prices, is ingested, cleaned, merged, and enriched to produce a comprehensive output matching the "Analysis of States" Google Sheet. The pipeline calculates key metrics such as population, income, and sale price ranks and affordability ratios, and generates natural-language blurbs for each region. Results are saved as both CSV and SQLite database files, and the workflow is fully automated and scheduled using Airflow to ensure reliability and reproducibility. An accompanying Jupyter notebook provides additional analysis and visualizations, offering deeper insights into trends and disparities across states.

## How to Run the Code

### Clone the Repository

```bash
git clone https://github.com/Arshad-1907/state-analysis-etl.git
cd state-analysis-etl
```

### Set Up the Python Environment

Create and activate a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate   # on macOS/Linux
.venv\Scripts\activate      # on Windows
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

### Run the ETL Pipeline

From the project root, run the scripts in order:

```bash
python etl/extract.py
python etl/transform.py
python etl/load.py
```

This reads the raw data, processes it, and saves the final output files to `data/output/output.csv` and `data/output/state_analysis.db`.

### Run the Pipeline with Airflow

If you want to use Airflow for automation, place `etl_pipeline.py` in your Airflow `dags/` folder, start the Airflow scheduler and webserver, and trigger the DAG from the Airflow UI.

### Explore the Analysis and Visualizations

Start Jupyter Notebook:

```bash
jupyter notebook
```

Open `analysis_and_visualization.ipynb` and run the notebook to see exploratory analysis and visualizations of the processed data.

### Order of Execution

1. Extract raw data (`etl/extract.py`)
2. Transform and merge data (`etl/transform.py`)
3. Load results to database/CSV (`etl/load.py`)
4. Explore and visualize results (`analysis_and_visualization.ipynb`)
