BA THEORY
• Data is the foundation of all analytical processes. It represents raw facts that are
collected for reference or analysis. It can be:
o Structured (organized in tables, e.g., Excel, databases),
o Unstructured (text, images, audio), or
o Semi-structured (like XML, JSON).
• Data Science is the process of analyzing data to extract meaningful insights. It
combines:
o Mathematics & Statistics (for modeling),
o Programming (to handle and analyze data), and
o Domain Knowledge (to make insights actionable).
3. Classification of Analytics
Some examples:
5. Types of Data
Understanding data types is essential for selecting the right statistical or analytical technique.
Big Data refers to datasets so large or complex that they can’t be processed using traditional tools. It
includes social media data, transaction logs, sensor readings, etc.
UNIT 2
1. Data Preparation and Cleaning
This is the first step in data analysis: removing errors and formatting data so that it is ready for analysis.
• Examples: fixing typos, converting text to numbers, removing extra spaces, handling
missing values.
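The cleaning examples above can be sketched in base R. This is a minimal illustration, not a standard recipe: the data frame `raw`, its columns, and the choice to fill the missing value with the mean are all assumptions made for the example.

```r
# Hypothetical raw data with common problems: extra spaces, a typo,
# numbers stored as text, and a missing value.
raw <- data.frame(
  region = c(" North", "South ", "Suoth", "East"),
  sales  = c("1200", "950", NA, "1,100"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$region <- trimws(clean$region)                    # remove extra spaces
clean$region[clean$region == "Suoth"] <- "South"        # fix a known typo
clean$sales  <- as.numeric(gsub(",", "", clean$sales))  # convert text to numbers

# Handle the missing value (assumption: replace it with the mean of the rest)
clean$sales[is.na(clean$sales)] <- mean(clean$sales, na.rm = TRUE)

print(clean)
```

Mean imputation is only one option; dropping the row or using the median may suit other datasets better.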
2. Sort and Filter
• Sort: Arranges data in ascending or descending order (e.g., sort sales from highest to
lowest).
• Filter: Displays only data that meets specific conditions (e.g., sales in one region).
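For comparison, the same two operations can be sketched in base R. The `sales` data frame and its values are hypothetical.

```r
# Hypothetical sales records
sales <- data.frame(
  region = c("North", "South", "North", "East"),
  amount = c(1200, 950, 700, 1100),
  stringsAsFactors = FALSE
)

# Sort: order rows by amount, highest to lowest
sorted <- sales[order(sales$amount, decreasing = TRUE), ]

# Filter: keep only the rows for one region
north_only <- sales[sales$region == "North", ]

print(sorted)
print(north_only)
```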
3. Conditional Formatting
4. Text to Columns
Splits text from one column into multiple columns using a delimiter (like commas or spaces).
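The same idea, splitting one column into several on a delimiter, can be sketched in base R. The names in `full_names` and the output column names are made up for illustration.

```r
# Hypothetical column holding "first,last" in a single text field
full_names <- c("Asha,Verma", "Ravi,Kumar", "Meera,Nair")

# Split each entry on the comma delimiter
parts <- strsplit(full_names, ",", fixed = TRUE)

# Collect the pieces into separate columns
split_df <- data.frame(
  first = sapply(parts, `[`, 1),
  last  = sapply(parts, `[`, 2),
  stringsAsFactors = FALSE
)

print(split_df)
```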
5. Removing Duplicates
6. Data Validation
9. Moving Averages
11. Summarisation
A dashboard combines charts, slicers, and KPIs into a single view for real-time insights.
Visualization helps to explore and present data effectively. Here are the common types:
Box Plot        Displays the spread and outliers of numerical data using quartiles.
B. Measures of Dispersion
Measure Explanation
UNIT 5
📈 1. Simple Linear Regression Model
• Models the relationship between two variables: one independent (X) and one dependent
(Y).
• Equation:
Y = a + bX + ε
o a = intercept
o b = slope (regression coefficient)
o ε = error term
Goal: Predict Y based on the value of X (e.g., predicting sales based on advertising).
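A minimal sketch of fitting this model in R with `lm()`. The advertising and sales figures are hypothetical (in ₹ thousands), chosen so the arithmetic is easy to check by hand.

```r
# Hypothetical data: advertising spend (X) and sales (Y), in ₹ thousands
ads   <- c(1, 2, 3, 4, 5)
sales <- c(2, 4, 5, 4, 5)

# Fit Y = a + bX by least squares
fit <- lm(sales ~ ads)

a <- unname(coef(fit)[1])  # intercept: 2.2 for this data
b <- unname(coef(fit)[2])  # slope: 0.6 for this data

# Predict sales for an advertising spend of 6
predicted <- a + b * 6     # 2.2 + 0.6 * 6 = 5.8

print(c(intercept = a, slope = b, predicted = predicted))
```

By hand: the means are X̄ = 3 and Ȳ = 4, so b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 6 / 10 = 0.6 and a = Ȳ − bX̄ = 4 − 1.8 = 2.2.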
• Confidence Interval: Range expected to contain the mean value of Y for a given X.
o Example: “We are 95% confident that average sales at ₹10k of advertising are between ₹20k and
₹25k.”
• Prediction Interval: Wider range expected to contain a single new observation of Y for a given X,
because it also accounts for the variation of individual outcomes around the mean.
o Example: Predicting next month's sales when ₹10k is spent on advertising.
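The difference between the two intervals can be seen with `predict()` on a fitted `lm` model. This is a sketch on hypothetical data (₹ thousands); the variable names are illustrative.

```r
# Hypothetical data: advertising spend (X) and sales (Y), in ₹ thousands
ads   <- c(1, 2, 3, 4, 5)
sales <- c(2, 4, 5, 4, 5)
fit   <- lm(sales ~ ads)

new_x <- data.frame(ads = 3)

# 95% confidence interval: range for the MEAN sales at ads = 3
ci_mean <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# 95% prediction interval: range for ONE new sales figure at ads = 3
pi_new <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

# Both share the same point estimate ("fit"); "lwr" and "upr" differ
print(ci_mean)
print(pi_new)
```

Both results have the same `fit` column, but the prediction interval's `lwr` and `upr` bounds lie further out than those of the confidence interval.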
• Example: Predicting house price based on size, location, and number of bedrooms.
• Each coefficient (b₁, b₂, etc.) shows the effect of that variable on Y, holding other variables
constant.
• Example: In housing data, if b₁ (the coefficient on size) = 5000, each extra sq. ft. is associated with
a ₹5000 higher price, holding the other variables constant.
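A minimal sketch of a multiple regression in R. The `houses` data is hypothetical and is built without noise from price = 100000 + 5000 × size + 200000 × bedrooms, so the fitted coefficients can be checked against known values. Location is left out to keep the example short.

```r
# Hypothetical housing data (no noise, so the true coefficients are known)
houses <- data.frame(
  size_sqft = c(800, 1000, 1200, 1500, 1800, 2000),
  bedrooms  = c(2, 2, 3, 3, 4, 4)
)
houses$price <- 100000 + 5000 * houses$size_sqft + 200000 * houses$bedrooms

# Fit price = b0 + b1 * size_sqft + b2 * bedrooms
fit <- lm(price ~ size_sqft + bedrooms, data = houses)

b1 <- unname(coef(fit)["size_sqft"])  # effect of one extra sq. ft., bedrooms held constant
b2 <- unname(coef(fit)["bedrooms"])   # effect of one extra bedroom, size held constant

print(coef(fit))
```

With real data the fit would not be exact, and `summary(fit)` would be used to inspect standard errors and significance.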
⚠️ 5. Heteroscedasticity
🔀 6. Multicollinearity
• Text data (like reviews, tweets, emails) is unstructured and requires preprocessing before
analysis.
• Common steps: Tokenization, cleaning, removing stop words, stemming.
• Popular packages:
o tm (Text Mining),
o tidytext (tidy text processing),
o textclean,
o syuzhet (for sentiment analysis)
Example:
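library(tidytext)
library(dplyr)
# Sample analysis: tokenize two illustrative reviews and remove stop words
reviews <- tibble(id = 1:2, text = c("The product is great", "I did not like the service"))
data("stop_words")
reviews %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop common words such as "the", "is"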
🧰 10. Methods and Techniques of Textual Analysis
Method          Description
Text Mining     Extracting patterns from large text data (keywords, frequency, word clouds)
Categorization  Classifying text into predefined categories (e.g., spam vs. non-spam)
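Both methods can be illustrated with a small base R sketch. The messages and the keyword list are hypothetical, and the keyword rule is a deliberately simple stand-in for a real classifier.

```r
# Hypothetical messages
messages <- c(
  "Win a free prize now",
  "Meeting moved to Monday",
  "Free offer, claim your prize"
)

# Preprocess: strip punctuation, lowercase, split into words
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", messages)), "\\s+")

# Text mining: word frequencies across all messages
freq <- sort(table(unlist(tokens)), decreasing = TRUE)

# Categorization: flag a message as spam if it contains any keyword
# (assumption: a hand-picked keyword list, not a trained model)
spam_keywords <- c("free", "prize", "win")
is_spam <- vapply(tokens, function(tk) any(tk %in% spam_keywords), logical(1))
labels  <- ifelse(is_spam, "spam", "non-spam")

print(head(freq))
print(labels)
```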