PRACTICE EXAM
DLMBDSA01-01 1433
EXAMID: 1221789
MASTERSOLUTION
QUESTION 1 OF 18
Marked out of 3.00
In which dimensions are data science activities typically conducted?
Select one:
data prediction and business analytics
descriptive dimension, prescriptive dimension and diagnostic dimension
feature engineering, model validation and hyperparameter tuning
data flow, data curation, and data analytics
The correct answer is: data flow, data curation, and data analytics
QUESTION 2 OF 18
Marked out of 3.00
Which of the following defines a false positive output for a classification model?
Select one:
The classifier labels a “No” data record as “Yes”.
The classifier labels a “No” data record as “No”.
The classifier labels a “Yes” data record as “Yes”.
The classifier labels a “Yes” data record as “No”.
The correct answer is: The classifier labels a “No” data record as “Yes”.
14.02.2024 1/11
QUESTION 3 OF 18
Marked out of 3.00
Which of the following is a type of correlation analysis?
Select one:
LDA correlation
Bayesian correlation
Pearson correlation
PCA correlation
The correct answer is: Pearson correlation
QUESTION 4 OF 18
Marked out of 3.00
E-mail spam detection is a ...
Select one:
reinforcement problem.
categorization problem.
classification problem.
regression problem.
The correct answer is: classification problem.
QUESTION 5 OF 18
Marked out of 3.00
What is one main goal of principal component analysis (PCA)?
Select one:
high-dimensionality clustering
dimensionality reduction
de-biasing
outlier detection
The correct answer is: dimensionality reduction
QUESTION 6 OF 18
Marked out of 3.00
In which mathematical technique does the elbow method play an important role?
Select one:
principal component analysis (PCA)
K-means clustering
time-series forecasting
linear regression
The correct answer is: K-means clustering
QUESTION 7 OF 18
Marked out of 3.00
In which sub-activity of data flow do accessibility, transparency and security play an important
role?
Select one:
data collection
data preservation
data storage
data access
The correct answer is: data storage
QUESTION 8 OF 18
Marked out of 3.00
The correlation between the amount of exercise per week and body weight is -0.45. What can
be deduced about the relationship between weekly exercise and body weight?
Select one:
The majority of individuals do only a few hours of exercise per week.
There is no relationship between the amount of exercise and the body weight.
The majority of individuals have a low body weight.
Individuals who engage in more weekly exercise tend to have lower body weights.
The correct answer is: Individuals who engage in more weekly exercise tend to have lower body weights.
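As a rough illustration of how such a coefficient is obtained, the following Python sketch computes Pearson's r for a small sample. The data values and the helper name pearson_r are invented for illustration and are not part of the exam; a negative result indicates that higher exercise values tend to go with lower weights.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample: weekly exercise hours and body weight in kg.
exercise_hours = [0, 1, 2, 3, 4, 5, 6]
body_weight_kg = [82, 90, 74, 85, 70, 79, 68]

r = pearson_r(exercise_hours, body_weight_kg)
print(round(r, 2))  # negative: more exercise tends to go with lower weight
```

The sign of r gives the direction of the linear relationship and its magnitude gives the strength. It says nothing about how many individuals exercise little or weigh little, which is why the other answer options do not follow.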
QUESTION 9 OF 18
Marked out of 3.00
Which type of data format is the following?
{
  "student": {
    "name": "name_of_student",
    "grade": 85,
    "enrolled": true
  }
}
Select one:
Apache Parquet
Protobuf
JSON
XML
The correct answer is: JSON
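One way to check that the snippet is valid JSON is to parse it. The sketch below uses Python's standard json module; the variable names are chosen here for illustration only.

```python
import json

# The record from the question, written as a JSON string.
raw = '{"student": {"name": "name_of_student", "grade": 85, "enrolled": true}}'

record = json.loads(raw)  # raises json.JSONDecodeError if the text is not valid JSON
student = record["student"]

# JSON objects become dicts, numbers become int/float, true/false become bool.
print(type(student).__name__, type(student["grade"]).__name__, type(student["enrolled"]).__name__)
```

The key/value pairs in curly braces, the double-quoted keys, and the lowercase literal true are the features that distinguish JSON from XML (tag-based) and from binary formats such as Apache Parquet and Protobuf.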
QUESTION 10 OF 18
Marked out of 3.00
What do the parameters {S, A, R, P, and V} represent in a Markov decision process?
Select one:
states, actions, rewards, policy, values
strategy, agent, reinforcement, prediction, verification
simulation, alternatives, returns, planning, visualization
scenarios, alternatives, restrictions, patterns, validation
The correct answer is: states, actions, rewards, policy, values
QUESTION 11 OF 18
Marked out of 3.00
In a normal distribution, 68% of the values are within …
Select one:
four standard deviations.
two standard deviations.
one standard deviation.
three standard deviations.
The correct answer is: one standard deviation.
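The 68% figure can be checked numerically. The following sketch uses NormalDist from Python's standard statistics module (available from Python 3.8); the helper name share_within is an assumption made for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of values lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The three printed shares correspond to the familiar 68-95-99.7 rule for one, two, and three standard deviations around the mean.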
QUESTION 12 OF 18
Marked out of 3.00
Which of the following do you need to calculate the model's recall?
Select one:
only TN and FN
only TN and FP
only TP and FN
only TP and FP
The correct answer is: only TP and FN
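Recall is defined as TP / (TP + FN). A minimal Python sketch with invented labels shows that only these two counts enter the calculation; the function name and the example data are assumptions, and the sketch presumes that at least one actual positive record exists.

```python
def recall(y_true, y_pred, positive="Yes"):
    """Recall = TP / (TP + FN), computed from actual and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    # True positives: actual positive, predicted positive.
    tp = sum(1 for actual, predicted in pairs if actual == positive and predicted == positive)
    # False negatives: actual positive, predicted negative.
    fn = sum(1 for actual, predicted in pairs if actual == positive and predicted != positive)
    return tp / (tp + fn)


# Hypothetical labels: 4 actual "Yes" records, 3 of which the classifier finds.
y_true = ["Yes", "Yes", "Yes", "No", "No", "Yes"]
y_pred = ["Yes", "No", "Yes", "Yes", "No", "Yes"]

print(recall(y_true, y_pred))  # 3 TP and 1 FN give 0.75
```

The fourth record is a false positive and the fifth a true negative; neither changes the result, which is why TN and FP are not needed for recall.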
QUESTION 13 OF 18
Marked out of 3.00
Which of the following statements is correct for a histogram?
Answer option 1: A smaller dataset is best visualized with a larger number of bins for
increased accuracy.
Answer option 2: The number of bins has a direct relationship with the size of each bin.
Select one:
None of the answer options are correct.
Both answer options are correct.
Only answer option 1 is correct.
Only answer option 2 is correct.
The correct answer is: None of the answer options are correct.
QUESTION 14 OF 18
Marked out of 3.00
In the context of support vector machines (SVM), what role do support vectors and the margin
play?
Select one:
Support vectors are the outlier data points, and the margin measures the regularization strength in
SVM.
Support vectors are the data records closest to the classification line, and the margin is the
distance between the two sides.
Support vectors are the features used for classification, and the margin represents the error in the
model.
Support vectors are the parameters defining the decision boundary, and the margin indicates the
lowest and highest value of the dataset.
The correct answer is: Support vectors are the data records closest to the classification line, and the
margin is the distance between the two sides.
QUESTION 15 OF 18
Marked out of 6.00
Briefly explain the three main aspects used to identify a data science use case (DSUC) in a
business context.
In a business context, the identification of a DSUC revolves around the following three key aspects:
First, the achieved value refers to the potential benefits and return on investment (ROI) that a data
science initiative can bring to a business. Managers often assess the worth of a new project by
considering how much it could enhance operational efficiency or positively impact the organization's
financial standing. (2 points)
The second aspect, effort, encompasses the resources required to execute a DSUC. This includes the
investment of time, manpower, and other resources necessary for the successful implementation of the
data science initiative. (2 points)
Lastly, the identification of a DSUC involves considering the potential risks and uncertainties associated
with the initiative. Risks can arise from various factors, such as changes in market conditions,
advancements in technology, or alterations in regulatory landscapes. (2 points)
QUESTION 16 OF 18
Marked out of 6.00
Suppose you have an attribute with values ranging from -1290 to 4870, and you want to
normalize these values using decimal scaling. Explain the required steps.
1. Identify the maximum absolute value within the attribute range. This is 4870. (2 points)
2. Determine the scaling factor: the smallest power of ten that makes the maximum absolute value
less than 1. As the maximum absolute value is 4870 (four digits), the factor is 10^4 = 10000. (2 points)
3. Divide all values in the attribute by the scaling factor, i.e. by 10000. The normalized values then
range from -0.129 to 0.487. (2 points)
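The three steps can be sketched in Python as follows. The function name decimal_scale and the sample list are assumptions made for illustration; only the two range limits, -1290 and 4870, come from the question.

```python
def decimal_scale(values):
    """Normalize values by the smallest power of ten that brings max |v| below 1."""
    # Step 1: maximum absolute value of the attribute.
    max_abs = max(abs(v) for v in values)
    # Step 2: smallest exponent j such that max_abs / 10**j < 1.
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    factor = 10 ** j
    # Step 3: divide every value by the scaling factor.
    return [v / factor for v in values], factor


# Hypothetical attribute values that span the range given in the question.
scaled, factor = decimal_scale([-1290, 0, 4870])
print(factor, scaled)
```

For the range in the question the loop stops at j = 4, so every value is divided by 10000 and the result lies strictly between -1 and 1.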
QUESTION 17 OF 18
Marked out of 18.00
Sophia and Daniel are working on a data science project for an e-commerce company aiming
to improve customer satisfaction and sales. The project involves analyzing customer data to
identify patterns, preferences, and potential areas for enhancement in the online shopping
experience. The dataset includes various customer-related variables such as purchase
history, browsing behavior, demographic information, and customer feedback.
However, Sophia and Daniel are facing challenges related to data quality in this setting. One
of the customer variables in the dataset is "time spent on website", which represents the
amount of time a customer spends browsing the e-commerce platform during a session.
Sophia and Daniel notice that there are missing values and outliers in this variable, potentially
affecting their analysis of customer engagement.
a) Briefly explain what is meant by duplicates and outliers. Illustrate your answer by referring
to the above-described dataset.
b) Name four methods that are commonly utilized to resolve missing values.
c) Evaluate these methods in terms of their advantages and disadvantages for handling
missing values related to the variable "time spent on website."
a)
Duplicate entries refer to instances where identical or highly similar records appear more than once in the
dataset. For instance, there might be duplicate entries of customers in the customer database due to
technical issues or because some customers unintentionally created multiple accounts with the same or
very similar information. (3 points)
Outliers are data points that significantly deviate from the rest of the dataset. Outliers might be found in
the time spent on the website (as explained in the example). Some customers may, for example, engage
in exceptionally long sessions. (3 points)
b)
1. The removal of the records where there are missing values (1 point)
2. The replacement of the missing value with an interpolated value from the neighboring records (1 point)
3. The replacement of the missing value with the average value of its variable along all data records
(mean imputation) (1 point)
4. The replacement of the missing value with the most frequently observed value along all data records
(mode imputation) (1 point)
c)
The removal of records is easy but might reduce the dataset size considerably if a significant number of
entries of the "time spent on website" have missing values. (2 points)
Interpolation preserves the dataset, but linear interpolation might not accurately capture the patterns that
can be found when analysing the time spent on a website. (2 points)
Mean imputation is simple and easy to implement but it may not be suitable if extreme outliers (e.g.,
extremely long sessions) heavily influence the mean. (2 points)
Mode imputation is not directly suitable for continuous variables like "time spent on website." (2 points)
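The four methods can be compared on a toy example. The following Python sketch uses invented session durations (in minutes) with None marking a missing value; the data, the variable names, and the helper interpolate are assumptions made for illustration, and the interpolation handles only gaps that have observed neighbours on both sides.

```python
from statistics import mean, mode

# Hypothetical "time spent on website" values in minutes; None = missing.
sessions = [5.0, None, 7.0, 6.0, None, 120.0, 6.0]
observed = [v for v in sessions if v is not None]

# 1. Removal: keep only the records with an observed value.
removed = observed


# 2. Linear interpolation between the nearest observed neighbours.
def interpolate(values):
    result = list(values)
    for i, v in enumerate(values):
        if v is None:
            left = next(j for j in range(i - 1, -1, -1) if values[j] is not None)
            right = next(j for j in range(i + 1, len(values)) if values[j] is not None)
            weight = (i - left) / (right - left)
            result[i] = values[left] + weight * (values[right] - values[left])
    return result


interpolated = interpolate(sessions)

# 3. Mean imputation: the 120-minute outlier pulls the fill value upwards.
mean_value = mean(observed)
mean_filled = [mean_value if v is None else v for v in sessions]

# 4. Mode imputation: works here only because 6.0 happens to repeat.
mode_value = mode(observed)
mode_filled = [mode_value if v is None else v for v in sessions]

print(len(removed), interpolated, mean_value, mode_value)
```

In this toy sample the single 120-minute session raises the mean fill value to 28.8 minutes, far above the typical session, and also distorts the interpolated value next to it. This mirrors the sensitivity to outliers noted above. With truly continuous measurements, exact repeats are rare, so the mode is usually not a meaningful fill value.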
QUESTION 18 OF 18
Marked out of 18.00
Emily and David are environmental researchers working on a project to analyze various types
of environmental data for a conservation organization. They are tasked with categorizing
different environmental datasets into structured, unstructured, and semi-structured data.
However, they have different interpretations of these categories. Help them gain a clear
understanding.
Categorize the following environmental data into the three categories: structured,
unstructured, and semi-structured data. Briefly explain why you assigned the data to a certain
categorization.
(1) Satellite imagery capturing deforestation patterns
(2) Research papers discussing climate change trends
(3) Excel spreadsheet containing bird species observations
(4) Field notes detailing observations of animal behavior
(5) Monthly rainfall log
(6) Drones capturing real-time footage of wildlife habitats
Structured data:
(3): The data is organized in a tabular format with predefined columns like Species, Location, and
ObservationDate. (3 points)
(5): This log is structured as it likely contains organized columns such as Month, Year, and Rainfall. Each
row represents data for a specific month, providing a clear and predefined structure. (3 points)
Unstructured data:
(2): Research papers are typically unstructured as they contain narrative text, references, and may lack a
specific predefined format. The information is presented in a more free-form manner. (3 points)
(4): Field notes are considered unstructured data as they often consist of qualitative descriptions and
observational notes, lacking a standardized format. (3 points)
Semi-structured data:
(1): While the imagery itself may be unstructured, the accompanying metadata (such as location, date,
and resolution) provides a level of structure. This combination makes it semi-structured data. (3 points)
(6): The video footage from drones is unstructured, but metadata such as GPS coordinates and
timestamps provides a partial structure. Therefore, it falls into the category of semi-structured data. (3
points)