BA THEORY

The document provides an overview of data science, data analysis, and analytics, outlining key concepts such as data types, big data characteristics, and various analytics classifications. It also discusses data preparation techniques, visualization methods, and regression models, emphasizing their applications in business and challenges faced in data analytics. Additionally, it covers textual data analysis, its significance, and methods for extracting insights from unstructured text.

Uploaded by

Sanjana

UNIT 1

1. Data and Data Science

• Data is the foundation of all analytical processes. It represents raw facts that are
collected for reference or analysis. It can be:
o Structured (organized in tables, e.g., Excel, databases),
o Unstructured (text, images, audio), or
o Semi-structured (like XML, JSON).
• Data Science is the process of analyzing data to extract meaningful insights. It
combines:
o Mathematics & Statistics (for modeling),
o Programming (to handle and analyze data), and
o Domain Knowledge (to make insights actionable).

2. Data Analysis vs. Data Analytics

• Data Analysis examines datasets to discover trends, patterns, or summaries.
o It is mostly retrospective: it tells what happened.
o Example: Analyzing last month's sales to find top-selling products.
• Data Analytics includes analysis but goes further – it uses advanced techniques to
predict future outcomes and suggest actions.
o Involves predictive and prescriptive techniques.
o Example: Forecasting next month’s sales using machine learning models.

3. Classification of Analytics

1. Descriptive Analytics – Summarizes historical data to understand trends.
o E.g., Monthly revenue reports.
2. Diagnostic Analytics – Explores data to find the reason behind outcomes.
o E.g., Why sales dropped in a region.
3. Predictive Analytics – Uses historical data to forecast future outcomes.
o E.g., Predicting customer churn.
4. Prescriptive Analytics – Suggests actions based on data-driven predictions.
o E.g., Recommending marketing strategies to boost retention.

4. Applications of Analytics in Business

Analytics helps companies:

• Make informed decisions (e.g., which product to launch),
• Improve efficiency (e.g., reduce delivery time),
• Understand customers better (e.g., personalized offers), and
• Gain a competitive edge.

Some examples:

• Marketing: Targeted advertising using customer data.
• Finance: Predicting loan defaults.
• Operations: Automating supply chain management.

5. Types of Data

Type | Explanation | Example
Nominal | Categorical with no order | Eye color, Country, Gender
Ordinal | Ordered categories | Customer satisfaction: Poor, Fair, Good
Scale (Interval/Ratio) | Numeric data with measurable difference | Temperature (Interval), Income (Ratio)

Understanding data types is essential for selecting the right statistical or analytical technique.

6. Big Data and Its Characteristics

Big Data refers to extremely large datasets that can’t be processed using traditional tools. It
includes social media data, transaction logs, sensors, etc.

Key Characteristics (5 Vs):

• Volume – Massive data size.
• Velocity – High speed of data generation.
• Variety – Structured, semi-structured, unstructured.
• Veracity – Accuracy and trustworthiness.
• Value – Potential to extract insights.

7. Applications of Big Data

• Healthcare: Monitoring patient health in real-time using wearables.
• Retail: Tailoring offers using purchase history.
• Banking: Real-time fraud detection.
• Agriculture: Predicting crop yield using climate data.
• Government: Analyzing traffic patterns for smart cities.

8. Challenges in Data Analytics

Despite its benefits, analytics faces several challenges:

1. Data Quality: Incomplete or incorrect data leads to bad insights.
2. Integration: Combining data from various sources (e.g., CRM, ERP).
3. Security and Privacy: Ensuring compliance with laws like GDPR.
4. Scalability: Managing growing data efficiently.
5. Talent Shortage: Need for skilled data professionals.

UNIT 2

1. Data Preparation and Cleaning

This is the first step in data analysis: removing errors and putting data into a consistent
format for analysis.

• Examples: fixing typos, converting text to numbers, removing extra spaces, handling
missing values.

2. Sort and Filter

• Sort: Arranges data in ascending/descending order (e.g., sort sales from highest to
lowest).
• Filter: Displays only data that meets specific conditions (e.g., sales in one region).
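
The same two operations can be sketched outside a spreadsheet. This is a minimal illustration in plain Python; the records and the field names (region, amount) are made up for the example.

```python
# Hypothetical sales records; field names and values are assumptions.
sales = [
    {"region": "North", "amount": 12000},
    {"region": "South", "amount": 8000},
    {"region": "North", "amount": 15000},
    {"region": "East", "amount": 9500},
]

# Sort: arrange rows from highest to lowest sales.
sorted_sales = sorted(sales, key=lambda row: row["amount"], reverse=True)

# Filter: keep only the rows that meet a condition (one region).
north_sales = [row for row in sales if row["region"] == "North"]

print([row["amount"] for row in sorted_sales])
print(north_sales)
```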

3. Conditional Formatting

Applies visual formatting rules to highlight data automatically.

• Example: Highlight sales below ₹10,000 in red.

4. Text to Columns

Splits text from one column into multiple columns using a delimiter (like commas or spaces).

• Example: Splitting "John,Smith" into First Name and Last Name.
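
As a rough equivalent in code, the split can be written with Python's str.split. The function name and the column headers below are illustrative assumptions, not part of any tool.

```python
def text_to_columns(value, delimiter=",", headers=("first_name", "last_name")):
    """Split one text value into named columns on a delimiter."""
    parts = [part.strip() for part in value.split(delimiter)]
    return dict(zip(headers, parts))

row = text_to_columns("John,Smith")
print(row)
```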

5. Removing Duplicates

Eliminates repeated rows or values to ensure data uniqueness.

• Useful in cleaning customer lists or transaction records.
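
A minimal sketch of the idea in Python, assuming each row is a tuple and the customer list is made up. It keeps the first occurrence of every row and preserves the original order.

```python
def remove_duplicates(rows):
    """Return rows with exact repeats dropped, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical customer list with one repeated row.
customers = [("C001", "Asha"), ("C002", "Ravi"), ("C001", "Asha"), ("C003", "Meera")]
unique_customers = remove_duplicates(customers)
print(unique_customers)
```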

6. Data Validation

Sets rules for what data can be entered in a cell.

• Example: Restricting values to dates only or dropdown menus for departments.

7. Identifying Outliers in the Data

Outliers are data points that differ significantly from others.

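• Detected using charts (box plots) or statistical methods (Z-score, IQR). Common rules of
thumb flag values more than 3 standard deviations from the mean, or more than 1.5 × IQR
below Q1 or above Q3.
• Important because they can skew results such as the mean, correlations, and regression
estimates.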
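
The two statistical methods can be sketched with Python's statistics module. The cutoffs (1.5 × IQR and a z-score threshold) are common conventions rather than fixed rules, and the sales figures are invented for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold SDs."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical daily sales with one suspicious value.
sales = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(sales))
print(zscore_outliers(sales, threshold=2.5))
```

In this small sample the extreme value inflates the standard deviation, so its z-score is only about 2.7 and a cutoff of 3 would miss it. That is one reason the IQR rule is often preferred for small datasets.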

8. Covariance and Correlation Matrix

• Covariance shows how two variables change together (direction). Its size depends on the
units of the variables, so it does not measure strength.
• Correlation shows the strength and direction of a linear relationship between
variables (range: -1 to 1).
• Correlation Matrix is a table showing correlations between multiple variables.
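
The three ideas above can be sketched in plain Python. This assumes the sample (n − 1) form of covariance and Pearson correlation; the column names and numbers are invented.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: correlation(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical data: ad spend, sales, and product returns.
data = {
    "ads": [1, 2, 3, 4, 5],
    "sales": [2, 4, 5, 4, 5],
    "returns": [5, 4, 3, 2, 1],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```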

9. Moving Averages

Smooths out short-term fluctuations to reveal trends over time.

• Used in forecasting and time-series analysis.
• Example: 3-month moving average of sales.
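
A 3-month moving average can be sketched as follows. This is a simple trailing average over made-up monthly figures; the first value appears only once a full window is available.

```python
def moving_average(values, window=3):
    """Trailing moving average; one output per full window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical monthly sales.
monthly_sales = [100, 120, 90, 130, 110, 150]
smoothed = moving_average(monthly_sales, window=3)
print([round(v, 2) for v in smoothed])
```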

10. Finding the Missing Value from Data

• Methods: Replace with mean/median, forward fill, or use statistical models.
• Important for maintaining dataset completeness and avoiding errors in analysis.
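
Two of the methods above, sketched in Python. Missing entries are represented here as None, which is an assumption for the example; real tools use their own markers (blank cells, NA, NaN).

```python
import statistics

def fill_with_mean(values):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    """Replace each missing entry with the last observed value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing was observed yet
    return filled

# Hypothetical sales series with gaps.
sales = [100, None, 120, None, 140]
print(fill_with_mean(sales))
print(forward_fill(sales))
```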

11. Summarisation

Reduces large datasets into meaningful summaries (totals, averages, counts).

• Tools: Excel functions (SUM, AVERAGE), group-by in Python/Pandas, or SQL aggregation.
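
The group-by idea can be sketched without any library. This example uses plain Python rather than Pandas and invented order data; it computes the total, average, and count per region.

```python
from collections import defaultdict

def summarise(rows):
    """Group (region, amount) rows by region and compute summary figures."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return {
        region: {
            "total": sum(amounts),
            "average": sum(amounts) / len(amounts),
            "count": len(amounts),
        }
        for region, amounts in groups.items()
    }

# Hypothetical orders as (region, amount) pairs.
orders = [("North", 1200), ("South", 800), ("North", 1500), ("South", 700), ("East", 950)]
summary = summarise(orders)
print(summary)
```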

12. Visualisation Techniques

Helps in understanding and communicating patterns in data.

Chart Type | Use Case
Scatter Plot | Shows relationship between two variables (e.g., sales vs. ads)
Line Chart | Displays trends over time (e.g., monthly revenue)
Histogram | Shows distribution of data (e.g., age of customers)

13. Pivot Tables

Summarise data dynamically by rows and columns.

• Example: Total sales by product and region.
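
What a pivot table computes can be sketched in a few lines of Python. The data is invented; rows are products, columns are regions, and each cell is the summed sales.

```python
from collections import defaultdict

def pivot_sum(rows):
    """Build a product-by-region table of summed amounts."""
    table = defaultdict(lambda: defaultdict(int))
    for product, region, amount in rows:
        table[product][region] += amount
    return {product: dict(regions) for product, regions in table.items()}

# Hypothetical transactions as (product, region, amount).
transactions = [
    ("Laptop", "North", 50000),
    ("Laptop", "South", 30000),
    ("Phone", "North", 20000),
    ("Phone", "South", 25000),
    ("Laptop", "North", 10000),
]
pivot = pivot_sum(transactions)
print(pivot)
```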

14. Pivot Charts

Graphs based on pivot table data.

• Automatically update when pivot table changes.

15. Interactive Dashboards

Combines charts, slicers, and KPIs into a single view for real-time insights.

• Created in Excel, Power BI, or Tableau.
• Example: A sales dashboard with region-wise and monthly performance.

UNIT 3

UNIT 4

1. Importing Data File

You can import data into your analysis tool from:

• CSV (Comma-Separated Values): Most common, works in Excel, R, Python, etc.
o In R: read.csv("file.csv")
o In Python (Pandas, after import pandas as pd): pd.read_csv("file.csv")
• Excel (.xlsx):
o In R: readxl::read_excel("file.xlsx")
o In Python: pd.read_excel("file.xlsx")
• Other formats: JSON, SQL databases, text files.
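
When Pandas is not available, a CSV can also be read with Python's built-in csv module. The sketch below uses an in-memory string in place of a real file so that it runs as written; the column names are assumptions.

```python
import csv
import io

# Stand-in for the contents of a CSV file (hypothetical columns).
csv_text = "product,region,sales\nLaptop,North,50000\nPhone,South,25000\n"

def load_rows(file_obj):
    """Read CSV rows as dictionaries and convert sales to a number."""
    rows = []
    for row in csv.DictReader(file_obj):
        row["sales"] = int(row["sales"])  # CSV values arrive as text
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(csv_text))
print(rows)
```

With a real file, the same function can be called as load_rows(open("file.csv", newline="")).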

2. Data Visualisation Using Charts

Visualization helps to explore and present data effectively. Here are the common types:

Chart Type | Purpose / Use Case
Histogram | Shows frequency distribution of a single variable. Useful for understanding data distribution.
Bar Chart | Compares categories (e.g., sales by product).
Box Plot | Displays spread and outliers of numerical data using quartiles.
Line Graph | Shows trends over time (e.g., monthly revenue).
Scatter Plot | Displays relationship between two continuous variables (e.g., height vs. weight).
Pie Chart (less used in data science) | Shows parts of a whole – better alternatives usually exist.

3. Data Description – Descriptive Statistics

A. Measures of Central Tendency

These show the center of a data set:

• Mean: Average value.
• Median: Middle value when sorted.
• Mode: Most frequent value.

B. Measures of Dispersion

These describe how spread out the data is:

• Range: Max – Min.
• Variance: Average squared deviation from the mean.
• Standard Deviation (SD): Square root of variance; shows the typical distance of values from
the mean, in the data's original units.
• Interquartile Range (IQR): Spread of the middle 50% of data (Q3 − Q1); used in box plots.
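
The measures in parts A and B can be computed with Python's statistics module. The ages are invented. The population forms (pvariance, pstdev) are used to match the "average squared deviation" definition above; sample versions divide by n − 1 instead. Quartile values depend on the interpolation method, so other tools may report a slightly different IQR.

```python
import statistics

# Hypothetical customer ages.
ages = [22, 25, 25, 28, 30, 35, 40]

q1, _, q3 = statistics.quantiles(ages, n=4)

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "range": max(ages) - min(ages),
    "variance": statistics.pvariance(ages),  # average squared deviation
    "sd": statistics.pstdev(ages),           # square root of the variance
    "iqr": q3 - q1,
}

for name, value in summary.items():
    print(name, round(value, 2))
```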

4. Relationship Between Variables

Understanding how two variables move together:

Measure | Explanation
Covariance | Indicates direction of the linear relationship (positive/negative), but not strength.
Correlation (r) | Ranges from -1 to +1. Shows strength and direction of a linear relationship.
Coefficient of Determination (R²) | Proportion of variance in one variable explained by another (0 to 1). Higher R² = stronger fit in regression models.

Tip: Correlation is standardized (unit-free), while covariance depends on the units of the variables.
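
For reference, the three measures can be written as formulas. This uses the sample (n − 1) form of covariance; the symbols s_X and s_Y (sample standard deviations of X and Y) are notation introduced here, not taken from the notes.

```latex
\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y},
\qquad
R^2 = r^2 \quad \text{(simple linear regression with one predictor)}
```

Dividing the covariance by both standard deviations is what removes the units and confines r to the range from -1 to +1.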

UNIT 5

📈 1. Simple Linear Regression Model

• Models the relationship between two variables: one independent (X) and one dependent
(Y).
• Equation:

Y = a + bX + ε
o a = intercept
o b = slope (regression coefficient)
o ε = error term

Goal: Predict Y based on the value of X (e.g., predicting sales based on advertising).
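
A minimal sketch of how a and b can be estimated by ordinary least squares, in plain Python. The data is invented and lies exactly on the line Y = 1 + 2X so the result is easy to check; real data would include the error term ε.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates of intercept a and slope b in Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b

# Hypothetical advertising (X) and sales (Y) figures.
ads = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

a, b = fit_simple_regression(ads, sales)
predicted = a + b * 6  # predicted sales when advertising is 6
print(a, b, predicted)
```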

📊 2. Confidence and Prediction Intervals

• Confidence Interval: Range of plausible values for the mean of Y at a given X.
o Example: “We are 95% confident the average sales at ₹10k of advertising is between ₹20k
and ₹25k.”
• Prediction Interval: Wider range for a single new value of Y at a given X, because it also
covers the scatter of individual outcomes around the mean.
o Example: Predicting next month's sales for ₹10k of advertising.
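
For simple linear regression, the two intervals at a chosen value x_0 are usually written as below. This is the standard textbook form under the usual assumptions (independent, normally distributed errors with constant variance). The symbols x_0, s, S_xx and the t critical value are notation introduced here; a and b are the fitted intercept and slope.

```latex
\hat{y}_0 = a + b x_0, \qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}, \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2

\text{Confidence interval: } \hat{y}_0 \pm t_{\alpha/2,\, n-2} \; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

\text{Prediction interval: } \hat{y}_0 \pm t_{\alpha/2,\, n-2} \; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root is the variance of a single new observation, which is why the prediction interval is always the wider of the two.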

🧮 3. Multiple Linear Regression

• Extends simple regression to include more than one independent variable.
• Equation:

Y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ + ε

• Example: Predicting house price based on size, location, and number of bedrooms.

📌 4. Interpretation of Regression Coefficients

• Each coefficient (b₁, b₂, etc.) shows the effect of that variable on Y, holding other variables
constant.
• Example: In housing data, if b₁ = 5000 for size in sq. ft., each extra sq. ft. is associated
with a ₹5000 higher predicted price, holding the other variables constant.
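
To make the coefficients concrete, here is a small sketch that fits a multiple regression by solving the normal equations with Gaussian elimination in plain Python. The housing data is invented and generated exactly from price = 50000 + 5000 × size + 200000 × bedrooms, so the fit should recover those values. In practice a statistics package would be used instead.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_multiple_regression(rows, y):
    """Return [a, b1, b2, ...] minimising squared error (normal equations)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * target for d, target in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical houses as (size in sq. ft., bedrooms).
houses = [(1000, 2), (1500, 3), (1200, 2), (1800, 4), (2000, 3)]
prices = [50000 + 5000 * size + 200000 * beds for size, beds in houses]

a, b_size, b_beds = fit_multiple_regression(houses, prices)
print(round(a), round(b_size), round(b_beds))
```

Here b_size is the change in predicted price for one extra sq. ft. with the number of bedrooms held fixed, which is the "holding other variables constant" reading described above.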

⚠️ 5. Heteroscedasticity

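• When the variance of errors is not constant across values of X.
• Problem: Violates the constant-variance assumption of linear regression. Coefficient
estimates generally remain unbiased, but standard errors, confidence intervals, and
significance tests become unreliable.
• Detected using residual plots; corrected using transformations or robust standard errors.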

🔀 6. Multicollinearity

• When independent variables are highly correlated with each other.
• Problem: Makes it hard to interpret coefficients; inflates standard errors.
• Detected using VIF (Variance Inflation Factor), where values above roughly 5 to 10 are a
common warning sign; fixed by removing or combining variables.
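
In general, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. With only two predictors this reduces to 1 / (1 − r²), with r the correlation between them. The sketch below covers that two-predictor case only, using invented housing variables.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors (undefined if r = ±1)."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors for a house-price model.
size_sqft = [1000, 1500, 1200, 1800, 2000]
rooms = [4, 6, 5, 7, 8]            # moves almost in step with size
age_years = [12, 3, 20, 8, 15]     # only weakly related to size

vif_size_rooms = vif_two_predictors(size_sqft, rooms)
vif_size_age = vif_two_predictors(size_sqft, age_years)
print(round(vif_size_rooms, 1), round(vif_size_age, 2))
```

The size and rooms pair gives a very large VIF, which suggests keeping only one of them or combining them; size and age give a VIF close to 1, which indicates no problem.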

📚 7. Basics of Textual Data Analysis

• Text data (like reviews, tweets, emails) is unstructured and requires preprocessing before
analysis.
• Common steps: Tokenization, cleaning, removing stop words, stemming.
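
The steps above can be sketched in Python with only the standard library. The stop-word list is a tiny illustrative sample and the suffix-stripping "stemmer" is deliberately crude; real projects would use a full stop-word list and an established stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "was", "it", "this"}

def simple_stem(word):
    """Crude stemming: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z']+", text.lower())   # cleaning + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]

review = "The delivery was delayed and the packaging is damaged!"
print(preprocess(review))
```

Stems such as "packag" are not dictionary words; that is normal for stemming, which only needs related word forms to map to the same token.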

✅ Significance and Applications of Textual Analysis

• Significance: Helps extract insights from non-numeric data.
• Applications:
o Sentiment analysis (e.g., customer feedback)
o Topic detection (e.g., trending topics on Twitter)
o Spam filtering, chatbot training, etc.

🧠 8. Challenges in Textual Data Analysis

• Noise in text: slang, abbreviations, misspellings.
• Context understanding: The same word can have different meanings depending on context
(e.g., "bank" of a river vs. a financial bank).
• Language diversity: Multilingual or regional variations.

🧪 9. Introduction to Textual Analysis Using R

• Popular packages:
o tm (Text Mining),
o tidytext (tidy text processing),
o textclean,
o syuzhet (for sentiment analysis)

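Example:

library(tidytext)
library(dplyr)

# Sample analysis: my_data is assumed to be a data frame with a text column named text_column
data("stop_words")

# Split the text into one word per row, then drop common stop words
tokens <- my_data %>%
  unnest_tokens(word, text_column) %>%
  anti_join(stop_words, by = "word")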

🧰 10. Methods and Techniques of Textual Analysis

Method | Description
Text Mining | Extracting patterns from large text data (keywords, frequency, word clouds)
Categorization | Classifying text into predefined categories (e.g., spam vs. non-spam)
Sentiment Analysis | Detecting positive, negative, or neutral emotion in text
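
As a rough illustration of sentiment analysis, the sketch below counts words from two small word lists. The lists are invented for the example; real lexicons (such as those used by the syuzhet package) are far larger, and this simple count ignores negation and context.

```python
import re

# Tiny illustrative lexicons (assumptions, not a published word list).
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "slow", "broken"}

def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ["Great product, fast delivery",
               "Terrible support and slow refund",
               "The parcel arrived on Monday"]:
    print(review, "->", sentiment(review))
```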
