
SYNAPSE - The AI & ML Club
Machine Learning Interview Preparation Q&A

1. What is Machine Learning? What are its applications?

Machine learning is a field of study that allows machines to learn and improve themselves from data without being explicitly programmed.

Here are some applications of machine learning -

- Analyzing images of products on a production line to automatically classify them
- Automatically classifying news articles as fake or real
- Automatically flagging offensive comments on discussion forums
- Creating a chatbot or a personal assistant

2. What are bias and variance in the context of machine learning?

Bias: Bias is the average difference between the model's average prediction and the true values. In other words, bias reflects the inability of a model to learn and capture the relationships in the training data. High bias can lead to underfitting.

Variance: Variance is the variability of a model's predictions across different subsets of the training data. A model with high variance pays too much attention to the training data, capturing noise along with the underlying patterns. High variance can lead to overfitting.

3. What is the bias-variance tradeoff?


The bias-variance trade-off is a fundamental concept in machine learning that describes the balance between a model's complexity and its predictive performance. Understanding this trade-off is crucial for judging whether a model will generalize well to unseen data: decreasing bias (by making the model more complex) tends to increase variance, and vice versa.
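For example, the trade-off can be observed by varying the complexity of a model and comparing training and test error. The sketch below is a minimal, hypothetical illustration using scikit-learn; the synthetic data, decision-tree model and depth values are assumptions chosen only for demonstration.

```python
# Hypothetical sketch: observing the bias-variance trade-off by varying model complexity.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy non-linear data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, 10]:  # shallow = high bias, very deep = high variance
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"max_depth={depth}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

A very shallow tree typically shows high error on both sets (underfitting), while a very deep tree fits the training set almost perfectly but does worse on the test set (overfitting).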

4. What are some of the techniques used for sampling? What is the
main advantage of sampling?

Based on the use of statistics, there are two main categories of sampling techniques:

- Probability sampling techniques: simple random sampling, stratified sampling, clustered sampling.
- Non-probability sampling techniques: quota sampling, convenience sampling, snowball sampling, etc.

Data analysis cannot always be performed on the whole volume of data at once, especially with larger datasets. The main advantage of sampling is that it lets us take a smaller set of data that represents the whole population and perform the analysis on that. While doing this, it is essential to draw the sample carefully so that it truly represents the entire dataset.
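As a small, hypothetical sketch of two probability sampling techniques mentioned above, using pandas and scikit-learn (the column names and group sizes are made up for illustration):

```python
# Hypothetical sketch: simple random sampling vs. stratified sampling.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "income": range(1000),
    "segment": ["A"] * 900 + ["B"] * 100,   # imbalanced strata
})

# Simple random sampling: every row has the same chance of selection.
simple_sample = df.sample(n=100, random_state=42)

# Stratified sampling: preserve the A/B proportions in the sample.
strat_sample, _ = train_test_split(
    df, train_size=100, stratify=df["segment"], random_state=42
)

print(simple_sample["segment"].value_counts(normalize=True))
print(strat_sample["segment"].value_counts(normalize=True))
```

The stratified sample keeps the 90/10 proportion of the population exactly, whereas the simple random sample only approximates it.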

5. List down the conditions for Overfitting and Underfitting.

Overfitting: The model performs well only on the sample training data. When new data is given as input, it fails to generalize and produces poor results. This occurs due to low bias and high variance in the model. Decision trees are more prone to overfitting.

Underfitting: Here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it does not perform well even on the training data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.

6. What are Eigenvectors and Eigenvalues?

Eigenvectors of a square matrix are non-zero vectors whose direction is left unchanged when the matrix is applied to them; they are usually normalized to unit length (magnitude 1) and are also called right eigenvectors. Eigenvalues are the coefficients by which the corresponding eigenvectors are scaled, giving them different lengths or magnitudes.

A matrix can be decomposed into its eigenvectors and eigenvalues, a process called eigendecomposition. These are used in machine learning methods such as PCA (Principal Component Analysis) to extract valuable insights from a given matrix.
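For instance, NumPy can perform this eigendecomposition directly; a minimal sketch with an arbitrary 2x2 matrix:

```python
# Hypothetical sketch: eigendecomposition of a small matrix with NumPy.
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # columns of `eigenvectors` are the eigenvectors
print("eigenvalues:", eigenvalues)
print("eigenvectors (as columns):\n", eigenvectors)

# Verify A v = lambda v for the first eigenpair.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True
```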

7. What is Cross-Validation?

Cross-validation is a statistical technique used to estimate how well a model generalizes to unknown data. The training data is split into several groups (folds), and the model is trained and validated against these groups in rotation, so that every sample is used for both training and validation across the different runs.
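A minimal sketch of k-fold cross-validation with scikit-learn; the dataset and model are placeholders chosen for illustration:

```python
# Hypothetical sketch: 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # train/validate on 5 rotating splits
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```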

8. What are the differences between correlation and covariance?

Although these two terms are used for establishing a relationship and
dependency between any two random variables, the following are the
differences between them:

- Correlation: This measures the strength and direction of the relationship between two variables; it quantifies how strongly the variables are related on a standardized scale.
- Covariance: This represents the extent to which the variables change together. It captures the systematic relationship between a pair of variables, where changes in one are associated with changes in the other.

Mathematically, consider two random variables X and Y with means μX and μY, standard deviations σX and σY, and let E denote the expected value operator. Then:

covariance(X, Y) = E[(X − μX)(Y − μY)]
correlation(X, Y) = E[(X − μX)(Y − μY)] / (σX σY)

so that correlation(X, Y) = covariance(X, Y) / (σX σY).

Based on the above formulas, we can deduce that correlation is dimensionless, whereas covariance is expressed in units obtained from multiplying the units of the two variables.
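These quantities can be computed directly with NumPy; a small sketch with made-up data:

```python
# Hypothetical sketch: covariance vs. correlation with NumPy.
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # y depends on x

cov_xy = np.cov(x, y)[0, 1]          # units: units(x) * units(y)
corr_xy = np.corrcoef(x, y)[0, 1]    # dimensionless, in [-1, 1]

print("covariance:", cov_xy)
print("correlation:", corr_xy)
print("check:", np.isclose(corr_xy, cov_xy / (x.std(ddof=1) * y.std(ddof=1))))
```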


9. How do you approach solving any data analytics based project?

Generally, we follow the steps below:

- First, thoroughly understand the business requirement/problem.
- Next, explore the given data and analyze it carefully. If any data is missing, get the requirements clarified with the business.
- Perform data cleanup and preparation, which is then used for modeling. Here, missing values are handled and variables are transformed.
- Run the model against the data, build meaningful visualizations and analyze the results to get meaningful insights.
- Release the model implementation and track the results and performance over a specified period to analyze its usefulness.
- Perform cross-validation of the model.

10. What is selection bias and why does it matter?

Selection bias occurs when the part of the dataset picked for analysis is not chosen through proper randomization. The presence of this bias indicates that the sample analyzed does not represent the whole population that was meant to be analyzed.

For example, if the sample we select does not entirely represent the whole population we have, conclusions drawn from it may not generalize. Being aware of selection bias helps us question whether we have selected the right data for analysis or not.
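A tiny, hypothetical simulation of selection bias: sampling only from one subgroup skews the estimate of the population mean. The subgroup sizes and income values below are made up purely for illustration.

```python
# Hypothetical sketch: selection bias when the sample is not randomly drawn.
import numpy as np

rng = np.random.RandomState(0)
young = rng.normal(loc=30_000, scale=5_000, size=8_000)   # incomes of one subgroup
old = rng.normal(loc=60_000, scale=8_000, size=2_000)     # incomes of another subgroup
population = np.concatenate([young, old])

random_sample = rng.choice(population, size=500, replace=False)  # proper randomization
biased_sample = rng.choice(young, size=500, replace=False)       # only one subgroup selected

print("population mean:", population.mean())
print("random sample mean:", random_sample.mean())
print("biased sample mean:", biased_sample.mean())   # noticeably off
```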

11. Why is data cleaning crucial? How do you clean the data?

To gather proper insights when running an algorithm on any data, it is essential to have correct and clean data that contains only relevant information. Dirty data most often results in poor or incorrect insights and predictions, which can have damaging effects.

For example, suppose we are launching a big campaign to market a product, and our data analysis tells us to target a product that in reality has no demand. If the campaign is launched, it is bound to fail, resulting in a loss of the company's revenue. This is where the importance of proper and clean data comes into the picture.

- Cleaning data coming from different sources helps with data transformation and produces data that data scientists can actually work with.
- Properly cleaned data increases the accuracy of the model and leads to very good predictions.

- If the dataset is very large, it becomes cumbersome to work with directly. The data cleanup step can take a large share of project time (often cited as around 80%) when the data is huge, and it should not be mixed in with running the model. Cleaning the data before running the model therefore results in increased speed and efficiency of the modeling step.
- Data cleaning helps to identify and fix any structural issues in the data. It also helps in removing duplicates and maintaining the consistency of the data.
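A short, hypothetical pandas sketch of typical cleaning steps; the column names, values and rules are assumptions chosen only to illustrate the ideas above:

```python
# Hypothetical sketch: common data-cleaning steps with pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "product": ["tv", "tv", "phone", None, "laptop"],
    "price": [300, 300, -50, 800, np.nan],
})

df = df.drop_duplicates()                               # remove exact duplicate rows
df = df.dropna(subset=["product"])                      # drop rows missing a key field
df.loc[df["price"] < 0, "price"] = np.nan               # flag structurally invalid values
df["price"] = df["price"].fillna(df["price"].median())  # impute remaining gaps

print(df)
```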

12. What are the available feature selection methods for selecting the right variables for building efficient predictive models?

When using a dataset in data science or machine learning, it often happens that not all the variables are necessary or useful for building a model. Smart feature selection methods are needed to avoid redundant features and increase the efficiency of the model. The three main categories of feature selection methods are:

A) Filter Methods:

- These methods rely only on the intrinsic properties of features, measured via univariate statistics rather than cross-validated performance. They are straightforward, generally faster and require fewer computational resources than wrapper methods.
- Common filter methods include the Chi-Square test, Fisher's Score, Correlation Coefficient, Variance Threshold, Mean Absolute Difference (MAD) and Dispersion Ratios.
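A brief sketch of two of the filter methods listed above, using scikit-learn; the dataset and thresholds are placeholders:

```python
# Hypothetical sketch: filter-based feature selection (variance threshold + chi-square).
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

X_var = VarianceThreshold(threshold=0.2).fit_transform(X)       # drop near-constant features
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)  # keep 2 best by chi-square

print("original shape:", X.shape)
print("after variance threshold:", X_var.shape)
print("after chi-square selection:", X_chi2.shape)
```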

B) Wrapper Methods:

- These methods search greedily over possible feature subsets and assess the quality of each subset by training and evaluating a classifier with those features.
- The selection technique is built around the machine learning algorithm that the given dataset needs to fit.

There are three common types of wrapper methods:

- Forward Selection: Features are added one at a time, and new features keep being added until a good fit is obtained.
- Backward Selection: All features are included at the start, and the least useful ones are eliminated one by one while checking which subset works better.
- Recursive Feature Elimination: Features are recursively eliminated, and the remaining ones are repeatedly evaluated for how well they perform.

These methods are generally computationally intensive and require high-end resources for analysis, but they usually lead to better predictive models with higher accuracy than filter methods.
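Recursive Feature Elimination, one of the wrapper methods above, is available in scikit-learn; a minimal sketch in which the dataset, estimator and number of features to keep are illustrative choices:

```python
# Hypothetical sketch: Recursive Feature Elimination (a wrapper method).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # scaling helps the linear estimator converge

estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator, n_features_to_select=5)   # repeatedly drop the weakest feature
rfe.fit(X_scaled, y)

print("selected feature mask:", rfe.support_)
print("feature ranking (1 = kept):", rfe.ranking_)
```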

C) Embedded Methods:

- Embedded methods combine the advantages of both filter and wrapper methods by taking feature interactions into account while keeping computational costs reasonable.
- These methods are iterative: in each model iteration, they carefully extract the features that contribute most to the training of that iteration.
- Examples of embedded methods: LASSO Regularization (L1), Random Forest Importance.
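A minimal sketch of the two embedded examples mentioned above (L1/LASSO and random-forest importances); the dataset and the regularization strength alpha are assumptions for illustration:

```python
# Hypothetical sketch: embedded feature selection via L1 regularization and tree importances.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# LASSO (L1) can drive the coefficients of weak features to exactly zero.
lasso = Lasso(alpha=2.0).fit(X_scaled, y)
print("features kept by LASSO:", (lasso.coef_ != 0).sum(), "of", X.shape[1])

# Random-forest importance ranks features by how much they reduce impurity.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("top 3 features by importance:", forest.feature_importances_.argsort()[::-1][:3])
```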

13. How will you treat missing values during data analysis?

The impact of missing values can be assessed after identifying what kind of variables have the missing values.

- If the data analyst finds a pattern in these missing values, there is a chance of finding meaningful insights from it.
- If no pattern is found, the missing values can either be ignored or replaced with default values such as the mean, minimum, maximum or median.
- If the missing values belong to categorical variables, they are assigned a default category. If the data follows a normal distribution, mean values are assigned to the missing entries.
- If around 80% of the values are missing, it is up to the analyst to either replace them with default values or drop the variable entirely.
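A short pandas sketch of the strategies described above; the column names, values and the 80% threshold rule are assumptions used only to demonstrate the mechanics:

```python
# Hypothetical sketch: treating missing values with pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "city": ["Pune", None, "Delhi", None, "Mumbai"],
    "score": [np.nan] * 4 + [10],          # mostly missing
})

df["age"] = df["age"].fillna(df["age"].mean())   # numeric, roughly normal -> mean
df["city"] = df["city"].fillna("Unknown")        # categorical -> default value

# Drop columns where 80% or more of the values are missing (analyst's choice).
df = df.loc[:, df.isna().mean() < 0.8]

print(df)
```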

14. What are the differences between univariate, bivariate and multivariate analysis?

Statistical analyses are classified based on the number of variables processed at a given time.

Univariate Analysis: This analysis deals with only one variable at a time.
Example - A pie chart of sales by territory.

Bivariate Analysis: This analysis deals with the statistical study of two variables at a given time.
Example - A scatter plot of sales versus advertising spend.

Multivariate Analysis: This analysis deals with the statistical study of more than two variables and their joint responses.
Example - A study of the relationship between people's social media habits and their self-esteem, which depends on multiple factors such as age, number of hours spent online, employment status, relationship status, etc.
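A compact, hypothetical pandas sketch of the three levels of analysis; the column names and distributions are made up:

```python
# Hypothetical sketch: univariate, bivariate and multivariate views of a dataset.
import pandas as pd
import numpy as np

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "sales": rng.normal(100, 20, 200),
    "ad_spend": rng.normal(50, 10, 200),
    "hours_on_social_media": rng.normal(3, 1, 200),
})

print(df["sales"].describe())              # univariate: one variable at a time
print(df[["sales", "ad_spend"]].corr())    # bivariate: relationship between two variables
print(df.corr())                           # multivariate: all variables studied together
```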

15. What is the difference between the Test set and validation set?

The test set is used to evaluate the performance of the trained model; it measures the model's predictive power on unseen data and is used only once, at the end. The validation set is a part of the training data that is held out to tune hyperparameters and select the model, helping to avoid overfitting.
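A minimal sketch of carving a dataset into train, validation and test sets; the split ratios and dataset are assumptions for illustration:

```python
# Hypothetical sketch: train / validation / test split with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set (touched only once, at the very end).
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Carve a validation set out of the remaining training data for tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```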

16. What do you understand by the kernel trick?

Kernel functions are generalized dot-product functions used to compute the dot product of vectors x and y in a high-dimensional feature space. The kernel trick solves a non-linear problem with a linear classifier by implicitly transforming linearly inseparable data into a higher-dimensional space where it becomes separable.
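A small sketch of the kernel trick in practice: an RBF-kernel SVM separates data that no straight line can. The synthetic "circles" dataset and kernel choices here are illustrative assumptions.

```python
# Hypothetical sketch: kernel trick with an SVM on linearly inseparable data.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)  # kernel computes dot products in a higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # near chance level
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))        # close to 1.0
```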

17. How will you balance/correct imbalanced data?

There are different techniques to correct or balance imbalanced data. This can be done by increasing the number of samples for the minority classes, or by decreasing the number of samples for classes with an extremely large number of data points. The following approaches are commonly used:

Use the right evaluation metrics: With imbalanced data, it is very important to use evaluation metrics that provide valuable information beyond plain accuracy.

- Precision: the proportion of selected instances that are actually relevant.
- Sensitivity (Recall): the proportion of relevant instances that are selected.
- Specificity: the proportion of actual negatives that are correctly identified.
- F1 score: the harmonic mean of precision and sensitivity.
- MCC (Matthews correlation coefficient): the correlation coefficient between the observed and predicted binary classifications.
- AUC (Area Under the Curve): summarizes the relationship between the true positive rate and the false positive rate.

For example, consider a training set in which 99.9% of the samples belong to class "0". A model that always predicts "0" would achieve 99.9% accuracy yet provide no valuable information. In such cases, the evaluation metrics listed above are far more informative than accuracy.
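The metrics listed above are all available in scikit-learn; a small sketch with placeholder labels and predictions chosen to mimic an imbalanced problem:

```python
# Hypothetical sketch: evaluation metrics that are informative on imbalanced data.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = [0] * 95 + [1] * 5                          # heavily imbalanced ground truth
y_pred = [0] * 94 + [1] + [0] * 3 + [1] * 2          # a model that misses most positives
y_score = [0.1] * 94 + [0.6] + [0.2] * 3 + [0.8] * 2 # predicted probabilities

print("precision:", precision_score(y_true, y_pred))
print("recall (sensitivity):", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_score))
```

Plain accuracy here would be 96% even though most of the positive class is missed, which is exactly the accuracy paradox described above.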

Training set resampling: It is also possible to balance the data by constructing a different training dataset through resampling. There are two approaches, chosen based on the use case and requirements:

- Under-sampling: This balances the data by reducing the size of the abundant class and is used when the overall data quantity is sufficient. A new, balanced dataset is obtained and used for further modeling.

- Over-sampling: This is used when the data quantity is not sufficient. It balances the dataset by increasing the size of the rare class: instead of discarding extra samples from the abundant class, new samples are generated for the rare class using methods such as repetition and bootstrapping.
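A minimal sketch of both approaches using scikit-learn's `resample` utility; the class labels and sizes are made up for demonstration:

```python
# Hypothetical sketch: balancing classes by resampling the training set.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(1000),
                   "label": [0] * 950 + [1] * 50})   # 95% majority, 5% minority
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Under-sampling: shrink the abundant class.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

# Over-sampling: grow the rare class by sampling with replacement (bootstrapping).
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

print(balanced_down["label"].value_counts())
print(balanced_up["label"].value_counts())
```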

Perform K-fold cross-validation correctly: Cross-validation must be applied properly when using over-sampling. The data should be split into folds before over-sampling, and the over-sampling applied only to the training folds; if the whole dataset is over-sampled first, copies of the same samples can leak into the validation folds and the evaluation effectively overfits to a specific result. To make the results robust, resampling can be repeated with different ratios.

18. What are some examples where a false positive has proven more important than a false negative?

Before citing instances, let us understand what false positives and false negatives are.
- False positives are cases that were wrongly identified as an event even though they were not. They are called Type I errors.
- False negatives are cases that were wrongly identified as non-events despite being an event. They are called Type II errors.

Some examples where false positives are more important than false negatives:

In the medical field: Consider a lab report that predicts cancer for a patient who does not actually have cancer. This is a false positive. Starting chemotherapy for such a patient is dangerous, as it damages healthy cells and exposes the patient to serious, unnecessary harm.

In the e-commerce field: Suppose a company starts a campaign giving $100 gift vouchers to customers identified as having purchased $10,000 worth of items, expecting at least a 20% profit on those sales. If vouchers are mistakenly given to customers who haven't purchased anything but were wrongly marked as having spent $10,000, the campaign loses money. This is a false-positive error.

19. What are some examples where a false negative has proven more important than a false positive?

Some examples where false negatives are more important than false positives:

Criminal justice system: Letting a guilty person go free (a false negative) can allow them to cause further harm, which is why investigation and evidence gathering aim to minimize such misses.

Drug testing: It is more important to catch actual drug users (minimize false negatives) than to worry about someone being falsely accused of drug use (a false positive).

20. Give one example where false positives and false negatives are equally important.

Banking: Lending is one of the main sources of income for banks, but if the repayment rate isn't good, there is a risk of huge losses instead of profits. Giving out loans is therefore a gamble: banks can't risk losing good customers, but at the same time they can't afford to acquire bad customers. This is a classic example where false positives and false negatives are equally important.

21. What is the importance of dimensionality reduction?

Dimensionality reduction is the process of reducing the number of features in a dataset to avoid overfitting and reduce variance. It has four main advantages:

- It reduces the storage space required and the time needed for model execution.
- It removes the issue of multicollinearity, thereby improving the interpretability of the ML model's parameters.
- It makes data easier to visualize when the dimensions are reduced to two or three.
- It avoids the curse of dimensionality.
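A short PCA sketch with scikit-learn, continuing the eigendecomposition idea from question 6; the dataset and the 95% variance target are illustrative assumptions:

```python
# Hypothetical sketch: dimensionality reduction with PCA.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=0.95)                   # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X_scaled)

print("original features:", X.shape[1])
print("components kept:", X_reduced.shape[1])
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
```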

22. How is the grid search tuning strategy different from random search?

Tuning strategies are used to find the right set of hyperparameters. Hyperparameters are model-specific properties that are fixed before the model is trained on the dataset rather than learned from it. Both grid search and random search are optimization strategies for finding efficient hyperparameters.

Grid Search:
- Every combination from a preset list of hyperparameter values is tried out and evaluated.
- The search pattern is like searching a grid: the values form a matrix, each parameter combination is tried and its accuracy is tracked, and after every combination has been evaluated, the model with the highest accuracy is chosen as the best one.
- The main drawback is that the technique suffers as the number of hyperparameters grows: the number of evaluations can increase exponentially with each additional hyperparameter. This is the curse of dimensionality in grid search.

Random Search:
- In this technique, random combinations of hyperparameter values are tried and evaluated to find the best solution; the function is tested at random configurations in the parameter space.
- Because the sampling is random, there is a good chance of finding near-optimal parameters without exhaustively evaluating every combination.
- Random search works best when the number of dimensions is relatively low, as it takes less time to find a good set of values.
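A minimal sketch contrasting the two strategies with scikit-learn; the model, parameter grid and sampling distributions are placeholders chosen for illustration:

```python
# Hypothetical sketch: grid search vs. randomized search for hyperparameter tuning.
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: every combination in the preset grid is evaluated (3 x 3 = 9 candidates).
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Random search: a fixed budget of random configurations is sampled from distributions.
rand = RandomizedSearchCV(SVC(), {"C": uniform(0.1, 10), "gamma": uniform(0.01, 1)},
                          n_iter=9, cv=5, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_, rand.best_score_)
```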
