1. Dataset
* A dataset is a particular instance of data
that is used for analysis or model building at
any given time.
* A dataset comes in different flavors such as
numerical data, categorical data, text data,
image data, voice data, and video data.
* For beginner data science projects, the
most common type of dataset contains
numerical data, typically stored in a
comma-separated values (CSV) file.
2. Data Wrangling
* Data wrangling is the process of converting
data from its raw form to a tidy form ready for
analysis.
* Data wrangling is an important step in data
preprocessing and includes several processes
like data importing, data cleaning, data
structuring, string processing, HTML parsing,
handling dates and times, handling missing
data, and text mining.
3. Data Visualization
* It is one of the main tools used to analyze
and study relationships between different
variables.
* Data visualization (e.g., scatter plots, line
graphs, bar plots, histograms, qqplots, smooth
densities, boxplots, pair plots, heat maps, etc.)
can be used for descriptive analytics.
* Data visualization is also used in machine
learning for data preprocessing and analysis,
feature selection, model building, model
testing, and model evaluation.
4. Outliers
* An outlier is a data point that is very
different from the rest of the dataset.
* Outliers are very common and are expected
in large datasets.
* One common way to detect outliers in a
dataset is by using a box plot.
* Outliers can significantly degrade the
predictive power of a machine learning model.
* Advanced methods for dealing with outliers
include the RANSAC method.
5. Data Imputation
* Most datasets contain missing values.
However, removing the affected samples or
dropping entire feature columns is often not
feasible, because we might lose too much
valuable data.
* Instead, we can use imputation techniques
to estimate the missing values from the other
training samples in our dataset.
* One of the most common imputation
techniques is mean imputation, where we
simply replace the missing value with the mean
value of the entire feature column.
6. Data Scaling
* Scaling your features can improve the
quality and predictive power of your model.
* Without scaling, a model that relies on
distances or gradient-based optimization may
be dominated by the features with the largest
numeric ranges.
* In order to bring features to the same
scale, we could decide to use either
normalization or standardization of features.
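The two options can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the function names `normalize` and `standardize` and the `ages` column are made up for the example, normalization is taken to mean min-max rescaling to [0, 1], and standardization uses the population standard deviation.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to zero mean and scale to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column, invented for illustration.
ages = [20, 30, 40, 50, 60]
ages_normalized = normalize(ages)      # smallest value maps to 0, largest to 1
ages_standardized = standardize(ages)  # result has mean 0 and standard deviation 1
```

Both functions divide by zero on a constant column, so a real pipeline would need to guard against that case.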
7. Data Partitioning
* In machine learning, the dataset is often
partitioned into training and testing sets.
* The model is trained on the training
dataset and then tested on the testing dataset.
* The testing dataset thus acts as the unseen
dataset, which can be used to estimate a
generalization error (the error expected when
the model is applied to a real-world dataset
after the model has been deployed).
# Import the splitting helper from scikit-learn
from sklearn.model_selection import train_test_split
# X: feature matrix, y: target vector; hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
8. Supervised Learning
* These are machine learning algorithms that
perform learning by studying the relationship
between the feature variables and the known
target variable.
Supervised learning has two subcategories:
a) Continuous target variables (regression)
* Linear Regression, KNeighbors Regression
(KNR), and Support Vector Regression (SVR).
b) Discrete target variables (classification)
* Logistic Regression classifier, Support Vector
Machines (SVM), and Decision Tree classifier.
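As a small illustration of the continuous-target case, the sketch below fits a one-feature linear regression by ordinary least squares in plain Python. The helper `fit_line` and the `hours`/`scores` data are hypothetical, invented for the example; in practice a library implementation would normally be used instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up training data: feature = hours studied, known target = exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

# "Learning" here means estimating the relationship between feature and target.
slope, intercept = fit_line(hours, scores)

# Use the learned relationship to predict the target for an unseen feature value.
predicted_score = slope * 6 + intercept
```

The discrete-target case follows the same pattern, except that the model outputs a class label instead of a number.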
[Figure: annotated training examples ("These are apples") and the resulting prediction]
9. Unsupervised Learning
* In unsupervised learning, we deal with
unlabeled data or data of unknown structure.
* Using unsupervised learning techniques, we are
able to explore the structure of our data to extract
meaningful information without the guidance of a
known outcome variable or reward function.
* K-means clustering is an example of an
unsupervised learning algorithm.
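A rough sketch of K-means on one-dimensional data is shown below. It is a simplified illustration, not a production implementation: the function `kmeans_1d`, the sample `points`, and the hand-picked starting centroids are all assumptions made for the example, whereas real implementations usually choose the starting centroids randomly.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled, made-up data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```

No target variable is supplied anywhere: the grouping comes only from the distances between the points.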
[Figure: input data passed to a model]
10. Reinforcement Learning
* Reinforcement learning (RL) is a type of
machine learning technique that enables an agent
to learn in an interactive environment by trial and
error using feedback from its own actions and
experiences.
* Reinforcement learning uses rewards and
punishments as signals for positive and negative
behavior.
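The trial-and-error loop can be illustrated with tabular Q-learning, one common RL algorithm, chosen here only as an example. Everything in the sketch is an assumption made for illustration: a toy corridor of five cells, a reward of +1 for reaching the last cell, a small punishment of -0.01 for every other step, and hand-picked learning parameters.

```python
import random

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal
ACTIONS = ["left", "right"]   # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else -0.01
    return next_state, reward

def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Trial and error: explore at random with probability EPSILON,
            # otherwise take the action that currently looks best.
            if rng.random() < EPSILON:
                action = rng.randrange(2)
            else:
                action = max(range(2), key=lambda a: q[state][a])
            next_state, reward = step(state, action)
            # Feedback: move the estimate toward reward + discounted future value.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = train()
# Greedy policy for the non-goal cells after training.
policy = [ACTIONS[max(range(2), key=lambda a: q_table[s][a])]
          for s in range(N_STATES - 1)]
```

After training, the greedy policy should move right in every cell, since that is the shortest path to the reward.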
[Figure: the agent sends an action to the environment, which returns a state and a reward]
11. Cross-validation
* Cross-validation is a method of evaluating a
machine learning model’s performance across
random samples of the dataset.
* In k-fold cross-validation, the dataset is
randomly partitioned into k folds of roughly equal size.
* In each round, the model is trained on k-1 folds and
evaluated on the remaining fold. The process is repeated
k times, so that each fold serves as the testing set once.
* The average training and testing scores are then
calculated by averaging over the k folds.
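The procedure can be sketched in plain Python as follows. To keep the example self-contained, the "model" is a deliberately trivial placeholder that always predicts the mean of its training targets, and the score is the mean squared error. The helper names and the `targets` data are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def mse(ys, idx, prediction):
    """Mean squared error of a constant prediction on the given samples."""
    return sum((ys[j] - prediction) ** 2 for j in idx) / len(idx)

def cross_validate(ys, k=5, seed=0):
    folds = k_fold_indices(len(ys), k, seed)
    train_scores, test_scores = [], []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i; fold i is the testing set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        # Placeholder model: always predicts the mean of the training targets.
        prediction = sum(ys[j] for j in train_idx) / len(train_idx)
        train_scores.append(mse(ys, train_idx, prediction))
        test_scores.append(mse(ys, test_idx, prediction))
    # Average the scores over the k folds.
    return sum(train_scores) / k, sum(test_scores) / k, folds

# Made-up target values for illustration.
targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0, 12.0, 11.0]
avg_train_mse, avg_test_mse, folds = cross_validate(targets, k=5)
```

Every sample lands in exactly one testing fold, so each data point is used for evaluation once and for training k-1 times.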