Chapter 2: Data Acquisition, Cleaning, and Exploration
Data Sources and Types:
o Structured Data: Relational databases (SQL), spreadsheets (CSV, Excel).
o Unstructured Data: Text (documents, emails), images, audio, video, social media posts.
o Semi-structured Data: XML, JSON.
o Real-time vs. Batch Data.
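
As a rough illustration of the difference between structured and semi-structured data, the sketch below parses a small CSV table and a small JSON document using only the Python standard library. The sample records and field names (id, name, age, tags) are made up for this example.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical sample data).
csv_text = "id,name,age\n1,Ana,34\n2,Ben,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records share a loose shape, but fields can nest or be absent.
json_text = '[{"id": 1, "name": "Ana", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'
records = json.loads(json_text)

# CSV values arrive as strings, so numeric columns need explicit conversion.
ages = [int(row["age"]) for row in rows]

# JSON fields may be missing, so read them with a default value.
tags = [record.get("tags", []) for record in records]

print(ages)
print(tags)
```

The CSV rows can be loaded straight into a fixed set of columns, while the JSON records need a per-field decision about what to do when a key is absent.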
Data Acquisition Methods:
o Database queries (SQL).
o APIs (Application Programming Interfaces).
o Web scraping.
o Extracts from data warehouses and data lakes.
o IoT sensors and streaming data.
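
Of the acquisition methods listed above, a database query is the simplest to show end to end. The following is a minimal sketch using an in-memory SQLite database from the Python standard library; the orders table, its columns, and the helper total_by_customer are hypothetical names chosen for the example.

```python
import sqlite3

# An in-memory database stands in for a real relational source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 120.0), ("ben", 35.5), ("ana", 60.0)],
)


def total_by_customer(connection, min_total):
    """Return (customer, total) pairs whose total is at least min_total."""
    query = """
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        HAVING SUM(amount) >= ?
        ORDER BY total DESC
    """
    # The ? placeholder keeps the parameter out of the SQL string itself.
    return connection.execute(query, (min_total,)).fetchall()


result = total_by_customer(conn, 50.0)
print(result)
conn.close()
```

Passing values through placeholders rather than string formatting is the usual way to avoid SQL injection when query parameters come from outside the program.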
Data Cleaning (Data Wrangling/Munging):
o Handling Missing Values: Imputation (mean, median, mode), deletion.
o Outlier Detection and Treatment: Statistical methods (Z-score, IQR), visualization.
o Data Transformation: Normalization (min-max scaling), standardization (z-score scaling), log transformation.
o Dealing with Noisy Data: Smoothing, binning.
o Removing Duplicates.
o Correcting Inconsistent Formats: Dates, spellings.
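
Several of the cleaning steps above can be sketched with the Python standard library alone. The helper names, the sample values, and the list of accepted date formats below are assumptions made for illustration; in practice a library such as pandas is commonly used for the same tasks.

```python
import statistics
from datetime import datetime


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (assumes non-constant data)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]


def min_max(values):
    """Rescale to the range [0, 1] (assumes max > min)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def normalize_date(text, formats=("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")):
    """Convert a date string in one of the assumed formats to ISO 8601."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


raw = [10, 12, None, 11, 13, 95]          # hypothetical sensor readings
filled = impute_median(raw)
outliers = iqr_outliers(filled)
scaled = min_max(filled)
z_scores = standardize(filled)
unique_rows = drop_duplicates([("ana", 1), ("ben", 2), ("ana", 1)])
iso_dates = [normalize_date(d) for d in ("2024-03-05", "05/03/2024")]
```

The median is used for imputation here because it is less sensitive to extreme values than the mean. Whether an outlier such as 95 should be removed, capped, or kept depends on the domain and is not decided by the code.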
Exploratory Data Analysis (EDA):
o Descriptive Statistics: Mean, median, mode, standard deviation, variance, quartiles.
o Data Visualization:
    o Univariate: Histograms, box plots, density plots.
    o Bivariate: Scatter plots, bar plots, line plots.
    o Multivariate: Heatmaps, pair plots.
o Correlation Analysis: Understanding relationships between variables.
o Hypothesis Generation: Forming initial ideas about patterns and relationships in the data.
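
The descriptive statistics and correlation analysis listed above can be sketched in a few lines of standard-library Python. The function names (describe, pearson, text_histogram) and the study-hours data are hypothetical; the text histogram is only a stand-in for a plotting library such as matplotlib.

```python
import statistics
from collections import Counter


def describe(values):
    """Summarize one numeric variable with common descriptive statistics."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
        "quartiles": (q1, q2, q3),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def text_histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


hours = [1, 2, 3, 4, 5]           # hypothetical study hours
scores = [52, 55, 61, 64, 70]     # hypothetical exam scores

summary = describe([1, 2, 2, 3, 4])
r = pearson(hours, scores)
bins = text_histogram([1, 2, 2, 3, 4, 11], bin_width=5)

for edge, count in bins.items():
    print(f"{edge:>3} | {'#' * count}")
```

A correlation close to 1 for the hours and scores data would suggest a hypothesis worth testing, for example that more study time is associated with higher scores. It does not by itself establish a causal relationship.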