Data Analysis CheatSheet
TABLE OF CONTENTS
1. Introduction to Data Analysis
What is Data Analysis?
Types of Data Analysis
Descriptive Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
Data Analysis Process Overview
Data Collection
Data Cleaning
Data Exploration
Data Modeling
Data Interpretation and Reporting
Where $x_i$ are the data points and $\mu$ is the mean.
Standard Deviation (SD):
The standard deviation is the square root of the variance. It is easier to interpret because it is
in the same units as the data. A small SD means data points cluster near the mean; a large
SD means they are more spread out.
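As a minimal sketch of the relationship described above, the following computes variance and standard deviation for two small made-up datasets. It assumes the population form (dividing by the number of points), since the text refers to the mean as $\mu$; the function names and sample values are illustrative only.

```python
import math

def population_variance(values):
    """Average squared distance of each point from the mean (assumed population form)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def population_sd(values):
    """Standard deviation: the square root of the variance, in the data's own units."""
    return math.sqrt(population_variance(values))

# Two hypothetical datasets with the same mean (10) but different spread.
tight = [9, 10, 10, 11]
spread = [2, 6, 14, 18]

print(population_sd(tight))   # small SD: points cluster near the mean
print(population_sd(spread))  # large SD: points are more spread out
```

For sample data, the usual adjustment is to divide by the number of points minus one instead.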
It’s widely used in medical testing, spam filtering, and machine learning.
Both tests compare means to check if there is a statistically significant difference between
groups.
Used for flexible, semi-structured or unstructured data. Examples include MongoDB and Cassandra.
Ideal for handling large volumes of varied data formats.
6.2.1 Interpretation
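A 95% confidence interval means that if the sampling were repeated many times, about 95% of
the intervals built this way would contain the true value. Informally, we are 95% confident
the true value lies within that range.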
Where:
$\bar{x}$ is the sample mean
$Z$ is the Z-score (e.g., 1.96 for 95%)
$\sigma$ is the standard deviation
$n$ is the sample size
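The symbols above can be combined in a short sketch. It assumes the standard Z-based interval for a mean, $\bar{x} \pm Z \cdot \sigma / \sqrt{n}$, which matches the listed terms but is not reproduced from the source; the function name and the input numbers are hypothetical.

```python
import math

def mean_confidence_interval(sample_mean, sigma, n, z=1.96):
    """Return (low, high) for a mean, assuming the interval x_bar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)  # margin of error
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: mean 50, standard deviation 10, 100 observations, 95% level.
low, high = mean_confidence_interval(sample_mean=50.0, sigma=10.0, n=100)
print(low, high)
```

A larger sample size shrinks the margin of error, because $n$ appears under the square root in the denominator.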
For proportions:
Where:
$y$ = predicted output
$x$ = input variable
$m$ = slope (effect of $x$ on $y$)
$b$ = intercept
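To make the slope and intercept concrete, here is a small sketch that fits a line of the form $y = mx + b$ by ordinary least squares. Both the line form and the fitting method are assumptions, since the source formula is not shown here. The helper name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x  # the fitted line passes through the point of means
    return m, b

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, b = fit_line(xs, ys)
prediction = m * 5 + b  # predicted output for a new input x = 5
print(m, b, prediction)
```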
Additive Model:
Used when the magnitude of the seasonal and trend effects stays constant over time.
Example: Sales = 100 + 20(season) + noise.
Multiplicative Model:
Used when the seasonal swings grow in proportion to the level of the trend.
Example: Sales = 100 × 1.2(season) × noise.
Decomposition helps isolate patterns and remove noise for better analysis.
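The difference between the two models can be seen in a toy sketch. It builds one additive and one multiplicative series from the same rising trend, then removes the trend to recover the seasonal part. Noise is left out for clarity, and every number here is invented for illustration.

```python
# A steadily rising level shared by both toy series.
trend = [100 + 10 * t for t in range(8)]

season_add = [20, -20] * 4   # fixed-size seasonal swing (additive)
season_mul = [1.2, 0.8] * 4  # swing as a fraction of the level (multiplicative)

additive = [t + s for t, s in zip(trend, season_add)]
multiplicative = [t * s for t, s in zip(trend, season_mul)]

# Decomposition with a known trend: subtract it (additive) or divide by it (multiplicative).
recovered_add = [y - t for y, t in zip(additive, trend)]
recovered_mul = [y / t for y, t in zip(multiplicative, trend)]

# Size of the seasonal swing around the trend at each time step.
swing_add = [abs(y - t) for y, t in zip(additive, trend)]
swing_mul = [abs(y - t) for y, t in zip(multiplicative, trend)]

print(swing_add)  # stays the same size over time
print(swing_mul)  # grows as the trend level grows
```

In practice the trend is not known in advance and has to be estimated first, for example with a moving average.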
Example: ARIMA(2,1,1), i.e. 2 autoregressive terms, 1 order of differencing, and 1 moving-average term.
7. ADVANCED DATA ANALYSIS TECHNIQUES
7.1.4 Time-Series Cross-Validation
Unlike random splits, time-series validation splits the data by time: each fold trains on
earlier observations and tests on the ones that follow.
This approach respects time order and gives a more reliable forecast evaluation.
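One common way to do this is an expanding-window split, sketched below. The source does not say which scheme it has in mind, so this is an assumption; the function name and parameters are hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Return (train_indices, test_indices) pairs where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    splits = []
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))  # everything before the test block
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits

# Hypothetical series of 10 time-ordered observations, 3 folds of 2 test points each.
splits = expanding_window_splits(n_samples=10, n_splits=3, test_size=2)
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each fold adds the previous test block to the training set, so the model is never evaluated on data that comes before what it was trained on.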
Logistic Regression
Predicts probabilities for binary classification (e.g., 0 or 1).
Uses the sigmoid function, which maps any real number to a value between 0 and 1:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
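A minimal sketch of the sigmoid, assuming the standard definition $1 / (1 + e^{-z})$, shows how a linear score becomes a probability. The one-feature `predict_proba` helper, its weight, and its bias are made-up values for illustration, not a fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z that avoids overflow in exp().
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, weight, bias):
    """Probability of class 1 for a single input feature (hypothetical parameters)."""
    return sigmoid(weight * x + bias)

p = predict_proba(x=3.0, weight=1.5, bias=-4.0)
label = 1 if p >= 0.5 else 0  # a 0.5 threshold is a common default, not a rule
print(p, label)
```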
8. DATA VISUALIZATION BEST PRACTICES
8.1 Designing Effective Charts
Data visualization is the art of converting raw data into visual formats like graphs and charts,
so insights are easier to understand and communicate. Choosing the right type of chart is
key.
Pie Chart
Used to show part-to-whole relationships.
Best when you want to highlight how a category contributes to a whole.
Example: Market share of 4 brands.
Avoid using for more than 5 categories, as it becomes hard to read.
Line Chart
Used to display trends over time.
Great for time-series data like daily temperatures or stock prices.
Connects data points with a line, making it easy to spot increases/decreases.
10.1.1 Pandas
A library used for working with structured data (tables).
Key features:
DataFrame and Series for storing data.
Easy to read CSV, Excel, JSON files.
Functions for filtering, grouping, merging, reshaping data.
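The features listed above can be illustrated with a short sketch, assuming pandas is installed. The table contents and column names are invented for the example.

```python
import pandas as pd

# Two small hypothetical tables.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 5, 7, 3],
})
managers = pd.DataFrame({
    "region": ["north", "south"],
    "manager": ["Ana", "Ben"],
})

# Filtering: keep rows where more than 6 units were sold.
big_orders = sales[sales["units"] > 6]

# Grouping: total units per region.
totals = sales.groupby("region", as_index=False)["units"].sum()

# Merging: attach the manager for each region.
report = totals.merge(managers, on="region")
print(report)
```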
10.1.2 NumPy
Stands for Numerical Python.
Provides support for multi-dimensional arrays and matrices.
Useful for:
Mathematical operations
Linear algebra, random number generation, statistics
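A brief sketch of these uses, assuming NumPy is installed. The matrix, vector, and seed are arbitrary illustrative values.

```python
import numpy as np

# A 2x2 matrix and a vector.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

# Mathematical operations apply element-wise across the whole array.
doubled = a * 2

# Linear algebra: solve the system a @ x = b.
x = np.linalg.solve(a, b)

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# Statistics: mean of each column.
col_means = a.mean(axis=0)
print(x, col_means, samples.std())
```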
10.1.3 Matplotlib
A basic plotting library in Python.
Used to create line charts, bar charts, scatter plots, etc.
Highly customizable for publication-ready graphs.
10.1.4 Seaborn
Built on top of Matplotlib.
Makes statistical plots look better and easier to build.
Good for heatmaps, boxplots, violin plots, pairplots, etc.
10.1.5 Plotly
Used for interactive plots in dashboards and web apps.
Supports zoom, hover, and other user interactions.
Ideal for real-time data and visual storytelling.
10.1.6 SciPy
Focuses on scientific and technical computing.
Includes modules for optimization, integration, signal processing, etc.
10.1.7 Statsmodels
Used for statistical tests and models.
Helps perform regression, hypothesis testing, ANOVA, and time series analysis.
10.2.1 dplyr
Part of the tidyverse.
Used for data manipulation: filtering, selecting, mutating, summarizing.
Syntax like filter(), select(), group_by(), summarize().
10.2.2 ggplot2
Most popular data visualization library in R.
Based on the Grammar of Graphics.
Allows creation of high-quality graphs and plots with minimal code.
10.2.3 tidyr
Helps in tidying messy data (e.g., from wide to long format).
Functions like pivot_longer() and pivot_wider() make reshaping easy.
10.2.4 caret
Stands for Classification and Regression Training.
Used for machine learning workflows.
Handles preprocessing, model training, and evaluation.
10.2.5 RMarkdown
Combine code, analysis, and report writing in one file.
Export to HTML, PDF, or Word.
Very useful for automated reports and documentation.