DATA analytics previous solved

The document outlines a data analytics examination for T.Y.B.Sc.(CS) students, covering definitions and concepts in data analytics, machine learning, and natural language processing. It includes questions on topics like tokenization, clustering, confusion matrices, and types of data analytics, along with practical applications and challenges in the field. The exam consists of multiple-choice and descriptive questions aimed at assessing students' understanding and application of data analytics concepts.

Uploaded by

borsesumit02
T.Y.B.Sc. (CS)
CS-364: Data Analytics
(2019 Credit Pattern) (Semester VI)

[max. marks: 35]


Q1. Attempt any eight of the following. [8x1=8]
a) Define data analytics.
➢ Data analytics refers to the process of examining, interpreting, and drawing insights from large
sets of data to uncover patterns, trends, and meaningful information.

b) Define tokenization.
➢ Tokenization is the process of breaking down a stream of text or data into smaller units called
tokens.

c) Define machine learning.
➢ Machine learning is a type of artificial intelligence that enables computer systems to learn and
improve from experience without being explicitly programmed.

d) What is clustering?
➢ Clustering is an unsupervised machine learning technique that groups similar data points together.

e) What is a frequent itemset?
➢ A frequent itemset is a set of items that occur together in at least a minimum number of
transactions in a dataset (the minimum support threshold).

f) What is data characterization?
➢ Data characterization, also known as data profiling or data summarization, refers to the process
of analyzing and understanding the main features, properties, and structure of a dataset.

g) What is an outlier?
➢ An outlier is an observation that lies an abnormal distance from other values in a random sample
from a population.

h) What is Bag of words?
➢ Bag of words is a natural language processing technique used for text classification and
document analysis. It involves counting the frequency of each word in a document, ignoring word
order, and using these counts as features for further analysis.
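
As a rough illustration of tokenization and bag of words together, the sketch below builds count vectors for two made-up sentences. The function names `tokenize` and `bag_of_words`, the regular expression, and the sample documents are assumptions chosen for this example, not part of the exam answer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors); each vector holds per-word counts."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["The cat sat on the mat", "The dog sat"]  # illustrative documents
vocab, vecs = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vecs[0])  # [1, 0, 1, 1, 1, 2]
print(vecs[1])  # [0, 1, 0, 0, 1, 1]
```

Each document becomes a fixed-length vector of counts, so word order is lost but the vectors can be fed to a classifier.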

i) What is text analytics?
➢ Text analytics is the process of analyzing unstructured text data to extract meaningful insights
and patterns. It involves techniques such as natural language processing, machine learning, and
statistical analysis.

j) Define trend analytics.
➢ Trend analytics is the process of analyzing data over time to identify patterns and trends. It
involves techniques such as time series analysis, forecasting, and anomaly detection.
Q2. Attempt any FOUR of the following. [4x2=8]
a) What is a confusion matrix?
➢ A confusion matrix is a table that summarizes the performance of a classification model. It shows
the counts or proportions of true positive, true negative, false positive, and false negative
predictions. It helps evaluate the model's accuracy, precision, recall, specificity, and F1 score. The
matrix provides insights into the model's ability to correctly classify instances and identify any
biases or errors. It is particularly useful for assessing performance in imbalanced datasets and
making informed decisions about model improvements.
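
The metrics named above can be derived from the four counts as in the following minimal sketch for a binary classifier. The helper names (`confusion_counts`, `safe_div`, `evaluate`) and the sample labels are hypothetical, chosen only for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # predicted positive, actually positive
        elif a != positive and p != positive:
            tn += 1          # predicted negative, actually negative
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        else:
            fn += 1          # predicted negative, actually positive
    return tp, tn, fp, fn

def safe_div(num, den):
    """Avoid division by zero when a class is never predicted or present."""
    return num / den if den else 0.0

def evaluate(tp, tn, fp, fn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # illustrative ground truth
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # illustrative model output
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 3 5 1 1
scores = evaluate(tp, tn, fp, fn)
```

For these labels accuracy is 8/10, while precision, recall, and F1 are all 3/4, which shows how the matrix separates the two kinds of error that a single accuracy figure hides.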

b) Define support and confidence in association rule mining.
➢ Support:
The number of transactions that contain all the items in both the {A} and {B} parts of the rule,
as a fraction of the total number of transactions. It measures how frequently the items occur
together across all transactions.
Formula:
Support(A -> B) = (number of transactions containing A and B) / (total number of transactions).

Confidence:
The ratio of the number of transactions that contain all items in both {A} and {B} to the number
of transactions that contain all items in {A}. It measures how often B appears in transactions
that contain A.
Formula:
Confidence(A -> B) = Support(A -> B) / Support(A).
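
A small sketch of both formulas on a made-up basket dataset follows. The transactions and the function names `support` and `confidence` are assumptions introduced only for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support(A -> B) / Support(A) for the rule A -> B."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Illustrative transactions (not taken from the exam paper).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

s = support(baskets, {"bread", "milk"})       # 3 of 5 baskets
c = confidence(baskets, {"bread"}, {"milk"})  # 3 of the 4 bread baskets
print(round(s, 2), round(c, 2))               # 0.6 0.75
```

Here bread and milk appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 baskets containing bread also contain milk (confidence 0.75).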

c) Explain any two machine learning applications.
➢ Two machine learning applications are:
o 1. Recommender systems: Recommender systems use machine learning algorithms to
suggest products, services, or content to users based on their past behavior, preferences,
and interests. They are used in e-commerce, social media, and entertainment platforms
to personalize user experiences and increase engagement and sales.
o 2. Image recognition: Image recognition is a type of computer vision that uses machine
learning algorithms to identify objects, people, and scenes in digital images or videos. It
has numerous applications in security, healthcare, transportation, and entertainment
industries. For example, it can be used to detect faces in photos, diagnose medical
conditions from X-rays, or identify traffic signs in self-driving cars.

d) Write a short note on stop words.
➢ Stop words are common words that are often removed from text data during natural language
processing to improve the efficiency and accuracy of the analysis. Examples of stop words
include "the," "and," "a," "an," "in," and "to." These words do not carry significant meaning and
can be safely ignored without losing the essence of the text. However, some stop words may be
important in certain contexts, and their removal may affect the accuracy of the analysis.
Therefore, it is important to choose an appropriate stop word list based on the specific needs of
the analysis.
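
Stop word removal can be sketched in a few lines. The list below contains only the six examples quoted in the answer, and the function name `remove_stop_words` and the sample sentences are illustrative assumptions; practical systems usually rely on a much longer, task-specific list.

```python
# Tiny stop word list built from the examples in the text (an assumption;
# real lists are far longer and depend on the task).
STOP_WORDS = {"the", "and", "a", "an", "in", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

print(remove_stop_words("The cat ran to a tree in the park"))
# ['cat', 'ran', 'tree', 'park']

# Context matters: here the stop word "to" carries part of the meaning.
print(remove_stop_words("to be or not to be"))
# ['be', 'or', 'not', 'be']
```

The second call shows why the list should be chosen per task: stripping "to" from a quotation changes what a phrase-matching system would see.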
e) Define supervised learning and unsupervised learning.

o Supervised learning: Supervised learning is a type of machine learning where the
algorithm learns to make predictions from labeled data. The labeled data contains both
input features and the desired output, which is used to train the model. The goal of
supervised learning is to learn a mapping function from input variables to output
variables, so that the model can make accurate predictions on new, unseen data.

o Unsupervised learning: Unsupervised learning is a type of machine learning where the
algorithm learns to identify patterns and relationships in unlabeled data. The data does
not contain any predefined output, so the model must find the underlying structure on
its own. The goal of unsupervised learning is to discover hidden or latent variables that
explain the observed data, and to group similar data points together. Clustering and
dimensionality reduction are common examples of unsupervised learning.

Q3. Attempt any two of the following. [2x4=8]


a) What is prediction? Explain any one regression model in detail.
➢ Prediction is the process of using machine learning algorithms to make informed guesses about
the value of a new, unseen data point based on the patterns and relationships learned from a
labeled training dataset. Regression is a type of machine learning algorithm that is used for
predicting continuous numeric values based on input features.

One popular regression model is Linear Regression, which models the relationship between a
dependent variable and one or more independent variables by fitting a linear equation to the
observed data. The goal of linear regression is to find the best-fitting line that minimizes the sum
of squared errors between the predicted values and the actual values. The line is defined by the
slope and intercept, which are estimated from the training data using the method of least
squares.

The equation of a simple linear regression model is: y = b0 + b1*x, where y is the dependent
variable, x is the independent variable, b0 is the intercept, and b1 is the slope. The slope
represents the change in y for every one-unit increase in x, and the intercept represents the
value of y when x is zero.
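
The least squares estimates for the simple model have a closed form: b1 is the covariance of x and y divided by the variance of x, and b0 = mean(y) - b1*mean(x). The sketch below applies these formulas to made-up data; the function name `fit_simple_linear_regression` and the sample points are assumptions for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

xs = [1, 2, 3, 4, 5]     # illustrative inputs
ys = [3, 5, 7, 9, 11]    # generated from y = 1 + 2x, so the fit is exact
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)            # 1.0 2.0
print(b0 + b1 * 6)       # prediction for x = 6 -> 13.0
```

Because the sample points lie exactly on a line, the fitted intercept and slope recover it; with noisy data the same formulas give the line with the smallest squared error.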

The model can be extended to multiple linear regression, where there is more than one
independent variable. The equation becomes: y = b0 + b1*x1 + b2*x2 + ... + bn*xn, where n is
the number of independent variables. The slope coefficients b1 to bn represent the change in y
for every one-unit increase in the corresponding x variable, holding all other variables constant.

Linear regression is widely used in fields such as economics, finance, engineering, and social
sciences to model and predict various phenomena, such as stock prices, housing prices, sales,
and customer behavior.
b) Differentiate between stemming and lemmatization.

Stemming | Lemmatization
Reduces words to a base or root form by removing suffixes and prefixes. | Reduces words to their base form (known as the lemma) based on the word's context and part of speech.
Uses simple and fast rule-based approaches. | Uses more advanced linguistic and language-specific algorithms.
Stemmed words may not always be actual words. | Lemmatized words are valid words found in a dictionary.
Can result in loss of meaning or ambiguity due to aggressive truncation. | Aims to preserve the meaning and context of words.
Example: reduces "running" and "runs" to the common root "run", but a rule-based stemmer typically leaves the irregular form "ran" unchanged. | Example: reduces "running," "runs," and "ran" to the base form "run" when they are treated as verbs.

c) Describe the types of data analytics.
➢ There are three main types of data analytics: descriptive, predictive, and prescriptive.

o Descriptive analytics: Descriptive analytics is the simplest type of analytics that
summarizes the historical data to provide insights into what happened in the past. It
answers questions such as "What happened?" and "How many?" Examples of descriptive
analytics include summary statistics, frequency distributions, and data visualization.

o Predictive analytics: Predictive analytics is the type of analytics that uses statistical
models and machine learning algorithms to analyze historical data and make predictions
about future events. It answers questions such as "What is likely to happen?" and "How
likely?" Examples of predictive analytics include regression analysis, time series
forecasting, and classification.

o Prescriptive analytics: Prescriptive analytics is the most advanced type of analytics that
uses optimization and simulation techniques to recommend actions that will achieve the
best possible outcome. It answers questions such as "What should we do?" and "How
can we optimize?" Examples of prescriptive analytics include linear programming,
decision trees, and Monte Carlo simulation.

Each type of analytics has its own strengths and limitations, and the choice of which type to use
depends on the specific business problem and the available data.

Q4. Attempt any two of the following. [2x4=8]


a) Consider the following transactional database and find out the frequent itemsets using the Apriori
algorithm with minimum support count = 2.
TID  List_of_item_IDs
T1   I1,I2,I5
T2   I2,I4
T3   I2,I3
T4   I1,I2,I4
T5   I1,I3
T6   I2,I3
T7   I1,I3
T8   I1,I2,I3,I5
T9   I1,I2,I3
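
A minimal Python sketch of the Apriori procedure applied to this table is given below. It assumes the last item of T9 is I3 (the scanned table shows a bare "3"); the function name `apriori` and its parameters are illustrative, not taken from the paper.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]   # C1: single items
    frequent = {}
    k = 1
    while candidates:
        # Scan the database and count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        prev = list(level)
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

database = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},   # T9, assuming the bare "3" means I3
]

result = apriori(database, min_support_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Under that assumption the sketch yields 13 frequent itemsets. The 1-itemsets are I1 (6), I2 (7), I3 (6), I4 (2), and I5 (2). The 2-itemsets are {I1,I2} (4), {I1,I3} (4), {I1,I5} (2), {I2,I3} (4), {I2,I4} (2), and {I2,I5} (2). The 3-itemsets are {I1,I2,I3} (2) and {I1,I2,I5} (2). The only 4-item candidate, {I1,I2,I3,I5}, is pruned because its subset {I1,I3,I5} is not frequent.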

b) Which are the challenges in social media analytics?
➢ Social media analytics faces several challenges due to the unique characteristics of social media
data and the dynamic nature of online platforms. Some key challenges include:
o 1. Volume and Velocity: The vast amount of data generated on social media platforms
presents a challenge in terms of data collection, storage, and processing. The high
velocity of data, with constant updates and real-time interactions, requires efficient and
scalable analytics solutions.
o 2. Data Quality and Noise: Social media data can be noisy, containing spam, irrelevant
content, and user-generated noise. Ensuring data quality and filtering out noise are
crucial for accurate analysis and insights.
o 3. Data Privacy and Ethics: Social media analytics raises concerns about data privacy,
consent, and ethical considerations. Balancing the need for data access and analysis with
user privacy rights and ethical guidelines is an ongoing challenge.
o 4. Textual Analysis and Natural Language Processing: Analyzing unstructured text data
from social media poses challenges in understanding language nuances, sentiment
analysis, and dealing with slang, abbreviations, and informal language.
o 5. User Bias and Representativeness: Social media data may have biases due to self-
selection, algorithmic filtering, or the characteristics of active users. Ensuring the
representativeness of data and mitigating biases is essential for drawing accurate
conclusions and avoiding skewed insights.
o 6. Multi-Modality and Multimedia Content: Social media platforms include various types
of content, including text, images, videos, and audio. Analyzing and extracting insights
from multi-modal and multimedia content adds complexity to the analytics process.
o 7. Real-Time Monitoring and Crisis Management: Social media analytics often involves
monitoring and managing brand reputation, crisis situations, and emerging trends in
real time. Quickly identifying and responding to online events or sentiments is critical
but challenging.
o 8. Data Integration and Platform Heterogeneity: Integrating data from multiple social
media platforms and sources, each with its own APIs, data formats, and access
restrictions, can be complex. Dealing with platform heterogeneity and ensuring data
consistency pose integration challenges.

Addressing these challenges requires a combination of technical expertise, domain
knowledge, data processing capabilities, and ethical considerations to extract valuable
insights from social media data while navigating the complexities and limitations of the
platforms.

c) Explain reinforcement learning.
➢ Reinforcement learning is a type of machine learning that is used to train an agent to make
decisions in an environment. The agent learns by interacting with the environment and receiving
feedback in the form of rewards or penalties. The goal of reinforcement learning is to find the
optimal policy that maximizes the cumulative reward over time.
The reinforcement learning process starts with the agent in a particular state of the environment.
The agent takes an action in response to the state, and the environment transitions to a new
state and provides a reward to the agent based on the action taken. The agent then updates its
policy based on the feedback received and repeats the process.
The key components of reinforcement learning are the policy, the reward function, and the value
function. The policy determines the action to take given the current state of the environment.
The reward function provides feedback to the agent in the form of a scalar value that indicates
how good or bad the action was. The value function estimates the expected cumulative reward
from a particular state.
Reinforcement learning has been successfully applied to a wide range of applications, including
robotics, game playing, and autonomous vehicles. However, it can be challenging to apply in
practice due to the need for extensive training and the difficulty of designing a reward function
that accurately reflects the desired behavior.
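
The state, action, reward, and update loop described above can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the answer itself does not name a specific algorithm). Everything concrete here is an assumption for illustration: a five-state corridor with a reward of 1 at the right end, a uniformly random behavior policy, and the values of `alpha`, `gamma`, and `episodes`.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the terminal goal (toy environment)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # reward function: 1 only on reaching the goal
    return next_state, reward, done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values Q[state][action] from interaction with step()."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: explore uniformly at random. Q-learning is
            # off-policy, so it still learns the values of the greedy policy.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Target: immediate reward plus discounted best future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
# Greedy policy: in each non-terminal state pick the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1] -> always move right, toward the reward
```

In this sketch the table `q_table` plays the role of the value function (the value of a state is its best action value), and the greedy choice over it is the learned policy.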

Q5. Attempt any one of the following. [1x3=3]

a) Write a short note on support vector machine.
➢ Support Vector Machine (SVM) is a popular supervised machine learning algorithm used for
classification and regression analysis. In its basic form it is a binary classifier that separates data
points into two classes based on their features. SVM finds the best hyperplane that separates the
classes by maximizing the margin between the closest data points from each class. The data points
that are closest to the hyperplane are called support vectors.
SVM is a powerful algorithm that can handle both linear and non-linear data by using a
technique called the kernel trick. The kernel trick implicitly maps the data points into a
higher-dimensional space where they can be separated by a hyperplane. SVM is widely used in various
applications such as image classification, bioinformatics, text classification, and fraud detection.
However, SVM can be computationally expensive when dealing with large datasets, and it may not
perform well when the classes are overlapping or the data is noisy.
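
To make the margin idea concrete, here is a rough sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. This is a simplification swapped in for the quadratic-programming solvers that SVM libraries normally use, and it does not use the kernel trick. The function names, the toy data, and the hyperparameters `lr`, `lam`, and `epochs` are all assumptions.

```python
def train_linear_svm(points, labels, epochs=100, lr=0.1, lam=0.01):
    """Fit w, b for the hyperplane w.x + b = 0; labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified:
                # step along the hinge-loss subgradient plus the regularizer.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only shrink w (regularization),
                # which corresponds to widening the margin.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (illustrative only).
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print([predict(w, b, p) for p in points])  # [1, 1, 1, -1, -1, -1]
```

Only points with a margin below 1 change the weights in a meaningful way, which mirrors the role of support vectors: the points nearest the boundary determine where the hyperplane ends up.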
b) Explain the lifecycle of data analytics.

Phase 1: Discovery –
The data science team learns about and investigates the problem.
The team develops context and understanding.
The team identifies the data sources needed and available for the project.
The team formulates initial hypotheses that can later be tested with data.

Phase 2: Data Preparation –
Steps to explore, preprocess, and condition data prior to modeling and analysis.
It requires an analytic sandbox; the team performs extract, load, and transform (ELT) operations
to get data into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Tools commonly used for this phase include Hadoop, Alpine Miner, OpenRefine, etc.

Phase 3: Model Planning –
The team explores the data to learn about relationships between variables and subsequently selects
key variables and the most suitable models.
Tools commonly used for this phase include Matlab and STATISTICA.

Phase 4: Model Building –
The team develops datasets for testing, training, and production purposes.
The team builds and executes models based on the work done in the model planning phase.
The team also considers whether its existing tools will suffice for running the models or whether it
needs a more robust environment for executing models.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – Matlab, STATISTICA.

Phase 5: Communicate Results –
After executing the model, the team compares the outcomes of modeling to the criteria established
for success and failure.
The team considers how best to articulate findings and outcomes to the various team members and
stakeholders, taking into account warnings and assumptions.
The team should identify key findings, quantify business value, and develop a narrative to summarize
and convey findings to stakeholders.

Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy
the work in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model
in a production environment on a small scale, and to make adjustments before full deployment.
The team delivers final reports, briefings, and code.
Free or open-source tools – Octave, WEKA, SQL, MADlib.
