Data Analytics – Important Questions and Answers
b) What is an Outlier?
An outlier is a data point that is significantly different from other observations in a dataset, often
indicating errors or rare events.
g) What is Classification?
Classification is a supervised learning technique that categorizes data into predefined classes, such
as spam detection (spam or not spam).
b) Define Tokenization.
Tokenization is the process of breaking text into smaller parts called "tokens," such as words or
sentences, for easier analysis.
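A minimal Python sketch of tokenization. The split() call is built in; the NLTK lines are an optional assumption that nltk and its 'punkt' data are installed:

# Simple whitespace tokenization with Python's built-in split()
text = "Tokenization breaks text into smaller parts."
tokens = text.split()
print(tokens)   # ['Tokenization', 'breaks', 'text', 'into', 'smaller', 'parts.']

# NLTK offers a more robust word tokenizer (assumes nltk + 'punkt' data are available):
# import nltk
# tokens = nltk.word_tokenize(text)   # also separates the trailing period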
d) What is clustering?
Clustering is a technique in data analytics that groups similar data points together based on common
characteristics.
e) What is Frequent Itemset?
A Frequent Itemset is a set of items that appear together frequently in a dataset, often used in market
basket analysis.
g) What is outlier?
An outlier is a data point that is very different from other data points in a dataset, often due to errors
or rare events.
Applications of Supervised Learning:
1. Spam Detection – Email services use supervised learning models to classify emails as spam
or non-spam based on labeled data.
2. Fraud Detection – Banks and financial institutions use it to identify fraudulent transactions
by analyzing past transaction data.
Applications of NLP:
1. Chatbots & Virtual Assistants – NLP powers AI-based chatbots like Siri and Alexa to
understand and respond to human queries.
2. Sentiment Analysis – Businesses use NLP to analyze customer reviews and social media
comments to determine public opinion about their products.
i) Define classification.
Classification is a type of supervised learning where the model categorizes data into predefined
groups or labels. Examples include classifying emails as spam or not spam and identifying
handwritten digits.
j) Define Recall.
Recall, also known as Sensitivity, measures how well a model identifies all actual positive cases. It
is calculated as:
Recall = (True Positives) / (True Positives + False Negatives)
A high recall value means the model successfully detects most of the actual positive cases, making
it useful in applications like medical diagnosis and fraud detection.
d) What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used in text analysis to
measure the importance of a word in a document relative to a collection of documents. It helps in
identifying relevant keywords.
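A rough illustration with scikit-learn's TfidfVectorizer, assuming scikit-learn is installed (the documents are toy examples):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data science is fun",
        "data analytics uses data",
        "science needs good data"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # documents x terms matrix of TF-IDF weights

# A word like "data" appears in every document, so its IDF (and weight) is lower
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))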
Types of Logistic Regression:
1. Binary Logistic Regression – Used when the outcome has two categories (e.g., spam or not
spam).
2. Multinomial Logistic Regression – Used when there are three or more categories without
order (e.g., types of fruits).
3. Ordinal Logistic Regression – Used when categories have a meaningful order (e.g., ratings:
low, medium, high).
j) Define Precision.
Precision is a metric that measures how many of the predicted positive cases are actually correct.
Formula:
Precision = (True Positives) / (True Positives + False Positives)
A high precision means fewer false positives, making it important in applications like spam
detection.
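A small worked Python sketch computing precision and recall from assumed confusion-matrix counts (the numbers are made up for illustration):

# Hypothetical counts from a classifier's confusion matrix
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # 80 / 90  = 0.89 -> few false positives
recall    = tp / (tp + fn)   # 80 / 100 = 0.80 -> most actual positives were found

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")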
Q2
a) Explain the Term n-gram with Example
An n-gram is a contiguous sequence of n items (usually words) taken from a piece of text. Sequences of one, two, and three words are called unigrams, bigrams, and trigrams respectively.
Example:
For the sentence "I love data science", the n-grams are:
• Unigrams (n = 1): "I", "love", "data", "science"
• Bigrams (n = 2): "I love", "love data", "data science"
• Trigrams (n = 3): "I love data", "love data science"
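A minimal pure-Python sketch that generates the n-grams listed above:

def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I love data science"
print(ngrams(sentence, 1))   # ['I', 'love', 'data', 'science']
print(ngrams(sentence, 2))   # ['I love', 'love data', 'data science']
print(ngrams(sentence, 3))   # ['I love data', 'love data science']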
POS (Part-of-Speech) Tagging is a process in NLP where each word in a sentence is labeled with
its grammatical category (noun, verb, adjective, etc.).
Example:
For the sentence "She plays football", the POS tags are:
• "She" – Pronoun
• "plays" – Verb
• "football" – Noun
POS tagging helps in speech recognition, text summarization, and machine translation.
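A short sketch with NLTK's pre-trained tagger; it assumes nltk is installed and that the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded:

import nltk
# One-time setup (commented out here):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("She plays football")
print(nltk.pos_tag(tokens))   # e.g. [('She', 'PRP'), ('plays', 'VBZ'), ('football', 'NN')]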
Clustering is an unsupervised learning technique used to group similar data points together based
on patterns and similarities.
Types of Clustering:
1. Partitioning Clustering – divides the data into a fixed number of groups (e.g., K-Means).
2. Hierarchical Clustering – builds a tree of nested clusters (agglomerative or divisive).
3. Density-Based Clustering – groups points in dense regions and treats sparse points as noise (e.g., DBSCAN).
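As an illustration of partitioning clustering, a minimal K-Means sketch with scikit-learn (toy 2-D points; assumes scikit-learn is installed):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 10]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # coordinates of the two cluster centres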
Applications of Machine Learning:
1. Spam Detection – Machine learning is used in email filtering systems to classify emails as
spam or non-spam based on past data patterns.
2. Fraud Detection – ML models analyze transaction patterns to identify suspicious or
fraudulent activities in banking and e-commerce.
• Supervised Learning – A type of machine learning where the model is trained using labeled
data, meaning input-output pairs are provided (e.g., spam detection, image classification).
• Unsupervised Learning – A type of machine learning where the model finds hidden
patterns in data without labeled outputs (e.g., customer segmentation, anomaly detection).
• Underfitting occurs when a machine learning model is too simple and fails to capture
patterns in the data, leading to poor performance on both training and test data.
• Overfitting happens when a model is too complex and learns noise along with actual
patterns, performing well on training data but poorly on new test data.
A balanced model is needed to generalize well to unseen data.
b) What is Linear Regression? What type of Machine Learning applications can be solved
with Linear Regression?
Linear Regression is a supervised learning algorithm that finds a relationship between a dependent
variable (Y) and one or more independent variables (X) using a straight-line equation:
Y = mX + c (for simple linear regression).
Applications:
1. Predicting House Prices – Uses factors like area, number of rooms, and location.
2. Sales Forecasting – Predicts future sales based on historical data.
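A rough house-price sketch with scikit-learn's LinearRegression (toy, made-up numbers; assumes scikit-learn is installed):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: area in square feet vs. price (illustrative values only)
area  = np.array([[800], [1000], [1200], [1500]])
price = np.array([100000, 125000, 150000, 185000])

model = LinearRegression().fit(area, price)
print(model.coef_, model.intercept_)   # learned slope m and intercept c
print(model.predict([[1100]]))         # predicted price for an 1100 sq ft house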
Advantages of the FP-Growth Algorithm:
1. Efficient – It processes large datasets faster than Apriori by using a tree structure.
2. No Candidate Generation – Unlike Apriori, FP-Growth does not generate candidate
itemsets, making it memory-efficient.
3. Scalable – It works well with large databases and handles complex association rule mining
effectively.
• Dependent Variable – The variable being predicted or affected in an analysis (e.g., house
price in a price prediction model).
• Independent Variable – The variable(s) used to predict the dependent variable (e.g., area,
number of rooms, and location in a house price model).
Types of Machine Learning:
1. Supervised Learning – Uses labeled data to train the model (e.g., spam detection).
2. Unsupervised Learning – Finds hidden patterns in data without labels (e.g., customer
segmentation).
3. Reinforcement Learning – Uses rewards and penalties to learn optimal actions (e.g., game-
playing AI).
Example: In a supermarket, if customers frequently buy "Bread" and "Butter" together, the
association rule can be:
Bread → Butter (Support: 30%, Confidence: 80%)
This means 30% of transactions contain both items, and when bread is bought, there is an 80%
chance butter is also bought.
This process helps in decision-making and is widely used in fields like business, healthcare, and
finance.
Q3
The Data Analytics Life Cycle consists of structured steps used to process and analyze data to
extract meaningful insights. The main phases are:
1. Data Collection:
• Gather raw data from sources like databases, IoT devices, social media, or surveys.
• Ensure data is relevant and reliable.
2. Data Exploration & Analysis:
• Perform Exploratory Data Analysis (EDA) using statistical methods and visualizations.
• Identify patterns, trends, and outliers in the dataset.
3. Interpretation & Communication:
• Derive insights from the analysis and use them for business decision-making.
• Present findings using reports, dashboards, and visualizations.
This process is widely used in business intelligence, healthcare, and finance for better decision-
making.
Trend Analytics is the process of analyzing data patterns over time to identify trends and make
future predictions. It helps in understanding market shifts and consumer behavior.
• Stock Market Analysis: Predicting stock price movements based on past trends.
• Social Media Monitoring: Identifying trending hashtags and topics.
• E-commerce & Marketing: Understanding customer preferences to improve sales
strategies.
Trend analytics helps businesses in decision-making, market research, and forecasting future
opportunities.
Linear Regression fits a straight-line equation Y = mX + c to the data, where:
• Y = Predicted value
• X = Independent variable
• m = Slope of the line
• c = Intercept
Example: Predicting house prices based on square footage. If larger houses generally cost more, the
model learns this pattern and predicts the price of new houses.
c) Describe Types of Data Analytics.
There are four main types of data analytics:
1. Descriptive Analytics – summarizes historical data to show what has happened (e.g., sales reports).
2. Diagnostic Analytics – examines data to explain why something happened (e.g., root-cause analysis of a drop in sales).
3. Predictive Analytics – uses past data and models to forecast what is likely to happen (e.g., demand forecasting).
4. Prescriptive Analytics – recommends the best actions to take based on predictions (e.g., optimal pricing).
These analytics types help businesses and organizations make better decisions based on data.
a) What are Frequent Itemsets & Association Rules? Describe with an example.
• Frequent Itemsets are groups of items that appear together frequently in a dataset. They are
commonly used in market basket analysis to find relationships between products.
• Association Rules are if-then statements that show the relationship between frequent
itemsets. They help in understanding patterns in data.
Example:
In a supermarket, if many customers buy "Milk" and "Bread" together, this forms a frequent itemset.
An association rule can be:
{Milk} → {Bread} (Support: 40%, Confidence: 70%)
This means 40% of transactions contain both items, and when milk is bought, there is a 70% chance
bread is also bought.
Community detection is the process of identifying groups of closely connected nodes within a large
network, such as social media, biological networks, or recommendation systems. These
communities help in understanding relationships, influence, and hidden structures in networks.
Example: In social media, community detection helps identify groups of users with similar
interests, such as sports fans or tech enthusiasts.
Common Methods:
• Girvan–Newman algorithm – repeatedly removes high-betweenness edges to split the network into communities.
• Louvain method – optimizes modularity to group nodes into communities.
• Label propagation – each node adopts the label most common among its neighbours.
Community detection is widely used in social media analysis, fraud detection, and recommendation
systems.
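A small sketch using NetworkX's modularity-based community detection on a toy graph (assumes networkx is installed):

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # friend group 1
                  ("D", "E"), ("E", "F"), ("D", "F"),   # friend group 2
                  ("C", "D")])                          # weak bridge between the groups

communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])   # e.g. [['A', 'B', 'C'], ['D', 'E', 'F']]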
b) Explain Apriori Algorithm
The Apriori Algorithm is a data mining algorithm used to find frequent itemsets and association
rules in large datasets. It works based on the principle that:
"If an itemset is frequent, then all of its subsets must also be frequent."
Steps:
1. Find Frequent Itemsets: Identify items that appear together frequently using a minimum
support threshold.
2. Generate Candidate Itemsets: Combine frequent itemsets to create larger sets.
3. Prune Uncommon Itemsets: Remove itemsets that do not meet the minimum support.
4. Generate Association Rules: Extract rules from frequent itemsets based on confidence and
support.
Example:
In a supermarket, the Apriori algorithm may find the frequent itemset:
{Diaper, Beer} → Support: 30%
This means 30% of transactions contain both items, helping businesses in cross-selling strategies.
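A compact sketch using the mlxtend library's apriori and association_rules functions on toy transactions (assumes mlxtend and pandas are installed):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["milk", "diaper", "beer"],
                ["bread", "milk", "diaper", "beer"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)   # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])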
Social Media Analytics (SMA) involves analyzing data from platforms like Twitter, Facebook, and
Instagram to gain insights. However, several challenges arise:
1. Data Overload – Massive volumes of data make it difficult to process and analyze
meaningful insights.
2. Fake News & Misinformation – Identifying and filtering out false information is
challenging.
3. Privacy Issues – User data must be handled responsibly to comply with regulations like
GDPR.
4. Sentiment Analysis Complexity – Understanding sarcasm, slang, and multiple languages
accurately is difficult.
5. Real-Time Processing – Analyzing social media trends in real time requires high
computational power.
Despite these challenges, SMA is widely used in marketing, politics, and customer sentiment
analysis.
Q4
Data analytics is classified into different types based on its purpose and approach. Two important
types are:
1. Descriptive Analytics:
o Summarizes historical data to show what has happened.
o Example: Monthly sales reports and performance dashboards.
2. Predictive Analytics:
o Uses past data and statistical or machine learning models to forecast what is likely to happen.
o Example: Predicting next quarter's sales or customer churn.
Expert finding is the process of identifying individuals who have deep knowledge and expertise in a
particular field. It is useful in research, organizations, and online communities.
Methods of Expert Finding:
1. Analyzing Publications & Research Papers – Experts are identified based on their work in
journals, patents, or conferences.
2. Social Media & Professional Networks – Platforms like LinkedIn and ResearchGate help
find domain experts.
3. Enterprise Knowledge Systems – Organizations maintain expert directories based on
employee skills.
4. Reputation-Based Ranking – Online forums (e.g., Stack Overflow) rank users based on
their contributions.
Social Media Analytics (SMA) involves analyzing user interactions, trends, and opinions on
platforms like Facebook, Twitter, and Instagram. However, several challenges arise:
1. Data Overload – Huge amounts of data are generated every second, making it difficult to
process and extract useful insights.
2. Fake News & Misinformation – Identifying and filtering false or misleading content is
challenging.
3. Privacy & Ethical Concerns – Handling personal user data must comply with regulations
like GDPR and ensure user privacy.
4. Sentiment Analysis Complexity – Understanding sarcasm, slang, and emojis accurately is
difficult for AI models.
5. Real-Time Processing – Analyzing and reacting to trends in real time requires powerful
computational resources.
6. Platform Algorithm Changes – Frequent updates in social media algorithms affect the
effectiveness of analytical models.
c) Explain Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make
decisions by interacting with an environment. It receives rewards for good actions and penalties
for bad ones, aiming to maximize long-term rewards.
Key Components:
• Agent – the learner or decision-maker.
• Environment – the world the agent interacts with.
• Action – a choice the agent can make.
• Reward – feedback that tells the agent how good an action was.
• Policy – the strategy the agent uses to choose actions.
Example:
A self-driving car is an RL agent that learns to drive by getting rewards for staying in lanes and
penalties for collisions. Over time, it improves its driving skills.
Applications of RL:
• Game-playing AI (e.g., chess and Go programs).
• Robotics and industrial automation.
• Self-driving vehicles.
• Recommendation systems that adapt to user feedback.
Bag of Words (BoW) is a text representation technique that describes a document by the count of each word it contains, ignoring grammar and word order.
Example:
o Two sentences:
1. "I love NLP."
2. "NLP is amazing."
o Vocabulary: {I, love, NLP, is, amazing}
o Bag of Words representation:
1. "I love NLP." → [1, 1, 1, 0, 0]
2. "NLP is amazing." → [0, 0, 1, 1, 1]
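The same two sentences with scikit-learn's CountVectorizer, shown as a sketch (assumes scikit-learn is installed; note that its default tokenizer drops one-letter words such as "I"):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love NLP.", "NLP is amazing."]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())   # e.g. ['amazing' 'is' 'love' 'nlp']
print(bow.toarray())                        # word counts per sentence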
Example:
For the sentence "The dog barks", the POS tags are: "The" – Determiner, "dog" – Noun, "barks" – Verb.
POS tagging is useful for chatbots, search engines, and sentiment analysis.
b) What is Logistic Regression? Explain it with Example.
Logistic Regression is a supervised learning algorithm used for binary classification problems,
where the output has only two possible values (e.g., Yes/No, True/False, 0/1). Unlike linear
regression, it uses the sigmoid function to predict probabilities instead of continuous values.
The sigmoid function used is:
P(Y = 1) = 1 / (1 + e^-(mX + c))
where:
• P(Y = 1) = predicted probability of the positive class
• X = independent variable(s), m = slope, c = intercept
Example:
A bank wants to predict whether a customer will default on a loan (Yes/No) based on factors like
income and credit score.
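A small loan-default sketch with scikit-learn's LogisticRegression (toy, made-up data; assumes scikit-learn is installed):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features: [income in thousands, credit score]; label 1 = defaulted, 0 = repaid
X = np.array([[25, 550], [40, 600], [60, 700], [80, 750], [30, 580], [90, 800]])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict([[50, 650]]))         # predicted class for a new customer
print(model.predict_proba([[50, 650]]))   # class probabilities from the sigmoid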
NLP involves several phases to process and understand human language. The key phases are:
1. Lexical Analysis:
o Breaks text into words and sentences (tokenization).
o Identifies parts of speech, root words, etc.
o Example: "Running" → "Run" (stemming).
2. Syntactic Analysis (Parsing):
o Checks the grammatical structure of a sentence.
o Identifies subject, verb, and object.
o Example: "He plays football" is correct, but "He play football" is incorrect.
3. Semantic Analysis:
o Extracts the meaning of words and sentences.
o Detects word relationships and meanings in context.
o Example: "Bank" can mean a financial institution or a riverbank.
4. Discourse Analysis:
o Understands sentences in relation to the previous ones.
o Example: "John went to the store. He bought milk." ("He" refers to John).
5. Pragmatic Analysis:
o Interprets sentences based on real-world knowledge.
o Example: "Can you open the door?" is a request, not a yes/no question.
b) Explain Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a process used to analyze datasets to find patterns, trends,
and relationships. It helps in understanding the data before applying machine learning models.
Steps in EDA:
1. Summarize the data using descriptive statistics (mean, median, standard deviation).
2. Check for missing values and outliers.
3. Visualize distributions and relationships using histograms, box plots, and scatter plots.
4. Examine correlations between variables.
Example:
Analyzing customer purchase data to find the most popular products and trends before building a
recommendation system.
EDA is crucial for data science, machine learning, and business analytics.
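A few typical pandas calls used during EDA, shown as a sketch (the file name customer_purchases.csv is hypothetical):

import pandas as pd

df = pd.read_csv("customer_purchases.csv")   # hypothetical dataset

print(df.head())                     # first rows: get a feel for the data
print(df.describe())                 # summary statistics per numeric column
print(df.isnull().sum())             # missing values per column
print(df.corr(numeric_only=True))    # correlations between numeric columns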
The Social Media Analytics (SMA) Life Cycle consists of steps to analyze data from social media
platforms for insights and decision-making. The main steps are:
1. Data Capture – collect posts, comments, likes, and shares from social platforms.
2. Data Understanding – clean the data and analyze it (e.g., sentiment and trend analysis).
3. Presentation – report the insights through dashboards and visualizations to support decisions.
Example:
A brand analyzes social media comments to understand customer sentiment and improve its
product.
Linear Regression is a statistical method used to find the relationship between a dependent
variable and one or more independent variables. It helps predict outcomes based on input
values. The equation for simple linear regression is:
Y = mX + C
where Y is the dependent variable, X is the independent variable, m is the slope (how much
Y changes with X), and C is the intercept. It is widely used in data analytics for trend
analysis, forecasting, and making predictions.
Natural Language Processing (NLP) is a branch of artificial intelligence that helps computers
understand, interpret, and generate human language. It is used in applications like chatbots,
voice assistants (like Siri, Alexa), translation tools (Google Translate), and sentiment
analysis. NLP involves techniques such as tokenization, stemming, and machine learning
models to process text data efficiently.
Text Analytics is the process of analyzing and extracting meaningful insights from large amounts of text
data. It involves techniques like text mining, sentiment analysis, and keyword extraction to understand
patterns, trends, and sentiments in text. Businesses use text analytics for customer feedback analysis, spam
detection, and social media monitoring. It helps in making data-driven decisions by converting unstructured
text into useful information.
i) Confusion Matrix:
A confusion matrix is a table used to evaluate the performance of a classification model. It shows
the actual vs. predicted classifications and includes four values:
• True Positives (TP) – positive cases correctly predicted as positive.
• True Negatives (TN) – negative cases correctly predicted as negative.
• False Positives (FP) – negative cases wrongly predicted as positive.
• False Negatives (FN) – positive cases wrongly predicted as negative.
ii) Accuracy:
Accuracy measures how well a classification model predicts correct results. It is calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
iii) Precision:
Precision tells how many of the predicted positive cases are actually positive. It is calculated as:
Precision = TP / (TP + FP)
A higher precision means fewer false positives.
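A short sketch evaluating made-up predictions with scikit-learn's metric functions (assumes scikit-learn is installed):

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))          # [[TN, FP], [FN, TP]]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))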
b) What is Machine Learning? Explain its Types.
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn
from data and make predictions without being explicitly programmed. It is used in applications like
recommendation systems, image recognition, and fraud detection.
1. Supervised Learning:
o The model learns from labeled data (input-output pairs).
o Example: Spam email detection (spam or not spam).
2. Unsupervised Learning:
o The model finds patterns in unlabeled data without predefined categories.
o Example: Customer segmentation in marketing.
3. Reinforcement Learning:
o The model learns by trial and error using rewards and penalties.
o Example: Self-driving cars learning to navigate.
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and
regression tasks. It works by finding the best boundary (hyperplane) that separates different classes in the
data. SVM uses support vectors (data points closest to the boundary) to define this hyperplane. It can also
handle non-linear data using the kernel trick, which transforms data into a higher dimension for better
separation. SVM is widely used in image classification, text categorization, and medical diagnosis.
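A minimal SVC sketch on the classic Iris dataset (assumes scikit-learn is installed):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf")    # RBF kernel = the "kernel trick" for non-linear boundaries
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))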
The Data Analytics life cycle consists of several steps to analyze data and gain insights. The key stages are:
1. Data Collection:
o Gathering raw data from different sources like databases, sensors, or files.
2. Data Cleaning:
o Removing errors, duplicates, and missing values to ensure data quality.
3. Data Exploration:
o Analyzing and summarizing data using visualization and statistical methods.
4. Data Modeling:
o Applying machine learning or statistical models to find patterns and make predictions.
5. Data Interpretation & Visualization:
o Presenting the results using charts, graphs, and reports to help decision-making.
6. Deployment & Monitoring:
o Implementing the model in real-world applications and monitoring its performance.