
Framing: Key ML Terminology

What is (supervised) machine learning? Concisely put, it is the following:


ML systems learn how to combine input to produce useful predictions on never-before-seen data.
Labels
A label is the thing we're predicting—the y variable in simple linear regression. The label could be the
future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about
anything.

Features
A feature is an input variable—the x variable in simple linear regression. A simple machine learning
project might use a single feature, while a more sophisticated machine learning project could use
millions of features, specified as:
x1,x2,...xN
In the spam detector example, the features could include the following:
words in the email text
sender's address
time of day the email was sent
email contains the phrase "one weird trick."
Examples
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We
break examples into two categories:
labeled examples
unlabeled examples
A labeled example includes both feature(s) and the label. That is:
labeled examples: {features, label}: (x, y)
Use labeled examples to train the model. In our spam detector example, the labeled examples would be
individual emails that users have explicitly marked as "spam" or "not spam."
For example, the following table shows 5 labeled examples from a data set containing information about
housing prices in California:
housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature) | medianHouseValue (label)
15 | 5612 | 1283 | 66900
19 | 7650 | 1901 | 80100
17 | 720 | 174 | 85700
14 | 1501 | 337 | 73400
20 | 1454 | 326 | 65500
An unlabeled example contains features but not the label. That is:
unlabeled examples: {features, ?}: (x, ?)
Here are 3 unlabeled examples from the same housing dataset, which exclude medianHouseValue:
housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature)
42 | 1686 | 361
34 | 1226 | 180
33 | 1077 | 271
Once we've trained our model with labeled examples, we use that model to predict the label on
unlabeled examples. In the spam detector, unlabeled examples are new emails that humans haven't yet
labeled.
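As a minimal sketch in plain Python (the data types are chosen only for illustration), the rows from the tables above could be written as labeled and unlabeled examples like this:

```python
# Labeled examples from the housing table: ((features), label) pairs, where the
# features are (housingMedianAge, totalRooms, totalBedrooms) and the label is
# medianHouseValue.
labeled_examples = [
    ((15, 5612, 1283), 66900),
    ((19, 7650, 1901), 80100),
    ((17, 720, 174), 85700),
]

# Unlabeled examples contain only the features; the trained model predicts the label.
unlabeled_examples = [
    (42, 1686, 361),
    (34, 1226, 180),
    (33, 1077, 271),
]
```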
Models
A model defines the relationship between features and label. For example, a spam detection model
might associate certain features strongly with "spam". Let's highlight two phases of a model's life:
• Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
• Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y'). For example, during inference, you can predict medianHouseValue for new unlabeled examples.

Regression vs. classification


A regression model predicts continuous values. For example, regression models make predictions that
answer questions like the following:
• What is the value of a house in California?
• What is the probability that a user will click on this ad?
A classification model predicts discrete values. For example, classification models make predictions that
answer questions like the following:
• Is a given email message spam or not spam?
• Is this an image of a dog, a cat, or a hamster?
Descending into ML: Linear Regression
It has long been known that crickets (an insect species) chirp more frequently on hotter days than on
cooler days. For decades, professional and amateur scientists have cataloged data on chirps-per-minute
and temperature. As a birthday gift, your Aunt Ruth gives you her cricket database and asks you to learn
a model to predict this relationship. Using this data, you want to explore this relationship.
First, examine your data by plotting it:

Figure 1. Chirps per Minute vs. Temperature in Celsius.


As expected, the plot shows the temperature rising with the number of chirps. Is this relationship
between chirps and temperature linear? Yes, you could draw a single straight line like the following to
approximate this relationship:

Figure 2. A linear relationship.


True, the line doesn't pass through every dot, but the line does clearly show the relationship between
chirps and temperature. Using the equation for a line, you could write down this relationship as follows:
y = mx + b
where:
y is the temperature in Celsius—the value we're trying to predict.
m is the slope of the line.
x is the number of chirps per minute—the value of our input feature.
b is the y-intercept.
By convention in machine learning, you'll write the equation for a model slightly differently:
y′ = b + w1x1
where:
y′ is the predicted label (a desired output).
b is the bias (the y-intercept), sometimes referred to as w0.
w1 is the weight of feature 1. Weight is the same concept as the "slope" m in the traditional equation of
a line.
x1 is a feature (a known input).
To infer (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the x1 value
into this model.
Although this model uses only one feature, a more sophisticated model might rely on multiple features,
each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might
look as follows: y′ = b + w1x1 + w2x2 + w3x3
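As a minimal sketch in plain Python (no ML library assumed; the weight and bias values below are made up for illustration), the single-feature and three-feature models can be written directly from those equations:

```python
def predict_one_feature(x1, b, w1):
    """The single-feature model: y' = b + w1*x1."""
    return b + w1 * x1

def predict_three_features(x, b, w):
    """The three-feature model: y' = b + w1*x1 + w2*x2 + w3*x3."""
    x1, x2, x3 = x
    w1, w2, w3 = w
    return b + w1 * x1 + w2 * x2 + w3 * x3

# Inferring a temperature for a new chirps-per-minute value of 30,
# using illustrative (not learned) parameter values b = 3.0 and w1 = 0.25:
print(predict_one_feature(x1=30, b=3.0, w1=0.25))  # 10.5
```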

Descending into ML: Training and Loss


Training a model simply means learning (determining) good values for all the weights and the bias from
labeled examples. In supervised learning, a machine learning algorithm builds a model by examining
many examples and attempting to find a model that minimizes loss; this process is called empirical risk
minimization.
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's
prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the
loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on
average, across all examples. For example, Figure 3 shows a high loss model on the left and a low loss
model on the right. Note the following about the figure:
• The arrows represent loss.
• The blue lines represent predictions.

Figure 3. High loss in the left model; low loss in the right model.

Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.
You might be wondering whether you could create a mathematical function—a loss function—that would aggregate the individual losses in a meaningful fashion.
Squared loss: a popular loss function
The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:
= the square of the difference between the label and the prediction
= (observation - prediction(x))²
= (y - y')²
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
MSE = (1/N) Σ(x,y)∈D (y - prediction(x))²
where:
(x,y) is an example in which
x is the set of features (for example, chirps/minute, age, gender) that the model uses to make
predictions.
y is the example's label (for example, temperature).
prediction(x) is a function of the weights and bias in combination with the set of features x.
D is a data set containing many labeled examples, which are (x,y) pairs.
N is the number of examples in D.
Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the
best loss function for all circumstances.
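The definitions above translate directly into a short, library-free sketch (the example data and model below are invented for illustration):

```python
def squared_loss(y, y_prime):
    """Squared (L2) loss for a single example: (y - y')^2."""
    return (y - y_prime) ** 2

def mse(dataset, prediction):
    """Mean square error: average squared loss over a dataset D of (x, y) pairs."""
    return sum(squared_loss(y, prediction(x)) for x, y in dataset) / len(dataset)

# Example usage with a made-up one-feature model y' = 3.0 + 0.25 * x.
dataset = [(30, 10.0), (40, 13.0)]          # (chirps per minute, temperature) pairs
prediction = lambda x: 3.0 + 0.25 * x
print(mse(dataset, prediction))             # 0.125: average of (10 - 10.5)^2 and (13 - 13)^2
```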
Reducing Loss
To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely
used method for reducing loss, and is as easy and efficient as walking down a hill.
Reducing Loss: An Iterative Approach
The previous module introduced the concept of loss. Here, in this module, you'll learn how a machine
learning model iteratively reduces loss.
Iterative learning might remind you of the "Hot and Cold" kids' game for finding a hidden object
like a thimble. In this game, the "hidden object" is the best possible model. You'll start with a
wild guess ("The value of w1 is 0.") and wait for the system to tell you what the loss is. Then,
you'll try another guess ("The value of w1 is 0.5.") and see what the loss is. Aah, you're getting
warmer. Actually, if you play this game right, you'll usually be getting warmer. The real trick to
the game is trying to find the best possible model as efficiently as possible.
The following figure suggests the iterative trial-and-error process that machine learning
algorithms use to train a model:
Figure 1. An iterative approach to training a model.
We'll use this same iterative approach throughout the Machine Learning Crash Course, detailing
various complications, particularly within that stormy cloud labeled "Model (Prediction
Function)." Iterative strategies are prevalent in machine learning, primarily because they scale
so well to large data sets.
The "model" takes one or more features as input and returns one prediction (y′) as output. To
simplify, consider a model that takes one feature and returns one prediction:
y′=b+w1x1
What initial values should we set for b and w1? For linear regression problems, it turns out that
the starting values aren't important. We could pick random values, but we'll just take the
following trivial values instead:
b=0
w1 = 0
Suppose that the first feature value is 10. Plugging that feature value into the prediction
function yields:
y′=0+0⋅10=0
The "Compute Loss" part of the diagram is the loss function that the model will use. Suppose
we use the squared loss function. The loss function takes in two input values:
y′: The model's prediction for features x.
y: The correct label corresponding to features x.
At last, we've reached the "Compute parameter updates" part of the diagram. It is here that the
machine learning system examines the value of the loss function and generates new values
for b and w1. For now, just assume that this mysterious box devises new values and then the
machine learning system re-evaluates all those features against all those labels, yielding a new
value for the loss function, which yields new parameter values. And the learning continues
iterating until the algorithm discovers the model parameters with the lowest possible loss.
Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When
that happens, we say that the model has converged.
Key Point:
A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively
adjusting those guesses until learning the weights and bias with the lowest possible loss.
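A minimal sketch of that loop in plain Python, assuming squared loss and the one-feature model y′ = b + w1x1; the `compute_update` function is deliberately left abstract here, since the next section replaces the "mysterious box" with gradient descent:

```python
def train(examples, compute_update, b=0.0, w1=0.0, tolerance=1e-9, max_steps=100_000):
    """Iterate: predict, compute loss, compute parameter updates, repeat until converged."""
    previous_loss = float("inf")
    for _ in range(max_steps):
        # Model (prediction function): y' = b + w1 * x1 for every example.
        losses = [(y - (b + w1 * x)) ** 2 for x, y in examples]
        loss = sum(losses) / len(losses)            # compute loss (MSE)
        if abs(previous_loss - loss) < tolerance:   # loss stopped changing: converged
            break
        previous_loss = loss
        b, w1 = compute_update(examples, b, w1)     # "compute parameter updates" box
    return b, w1
```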

Reducing Loss: Gradient Descent


The iterative approach diagram (Figure 1) contained a green hand-wavy box entitled "Compute
parameter updates." We'll now replace that algorithmic fairy dust with something more substantial.
Suppose we had the time and the computing resources to calculate the loss for all possible values of w1.
For the kind of regression problems we've been examining, the resulting plot of loss vs. w1 will always
be convex. In other words, the plot will always be bowl-shaped, kind of like this:
Figure 2. Regression problems yield convex loss vs. weight plots.

Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That
minimum is where the loss function converges.
Calculating the loss function for every conceivable value of w1 over the entire data set would be an
inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in
machine learning—called gradient descent.
The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point
doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value. The
following figure shows that we've picked a starting point slightly greater than 0:
Figure 3. A starting point for gradient descent.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here
in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which
way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial
derivatives with respect to the weights.
Note that a gradient is a vector, so it has both of the following characteristics:
a direction
a magnitude
The gradient always points in the direction of steepest increase in the loss function. The gradient
descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly
as possible.
Figure 4. Gradient descent relies on negative gradients.
To determine the next point along the loss function curve, the gradient descent algorithm adds some
fraction of the gradient's magnitude to the starting point as shown in the following figure:
Figure 5. A gradient step moves us to the next point on the loss curve.
The gradient descent then repeats this process, edging ever closer to the minimum.
Note: When performing gradient descent, we generalize the above process to tune all the model
parameters simultaneously. For example, to find the optimal values of both w1 and the bias b, we
calculate the gradients with respect to both w1 and b. Next, we modify the values of w1 and b based on
their respective gradients. Then we repeat these steps until we reach minimum loss.
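Filling in the "compute parameter updates" box, here is a minimal gradient descent sketch for the squared-loss, one-feature model; it computes the gradient with respect to both w1 and b and then steps in the negative gradient direction (the step-size constant used here is discussed as the learning rate in the next section):

```python
def gradient_step(examples, b, w1, learning_rate=0.0001):
    """One gradient descent step on MSE loss for the model y' = b + w1*x1."""
    n = len(examples)
    # Partial derivatives of MSE = (1/N) * sum((b + w1*x - y)^2) with respect to b and w1.
    grad_b  = sum(2 * ((b + w1 * x) - y)     for x, y in examples) / n
    grad_w1 = sum(2 * ((b + w1 * x) - y) * x for x, y in examples) / n
    # Move a small step in the direction of the negative gradient to reduce loss.
    return b - learning_rate * grad_b, w1 - learning_rate * grad_w1

# One update from the trivial starting values b = 0, w1 = 0 (data invented for illustration):
examples = [(30, 10.0), (40, 13.0)]   # (chirps per minute, temperature)
b, w1 = gradient_step(examples, b=0.0, w1=0.0)
```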

Reducing Loss: Learning Rate


As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms
multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to
determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01,
then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine
learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate
that is too small, learning will take too long:

Figure 6. Learning rate is too small.


Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce
haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:
Figure 7. Learning rate is too large.
There's a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how
flat the loss function is. If you know the gradient of the loss function is small then you can safely try a
larger learning rate, which compensates for the small gradient and results in a larger step size.

Figure 8. Learning rate is just right.


The ideal learning rate in one dimension is 1/f″(x) (the inverse of the second derivative of f(x) at x).
The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial
derivatives).
The story for general convex functions is more complex.
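A quick sanity check of the one-dimensional claim, using a simple quadratic loss chosen for illustration: stepping with learning rate 1/f″(x) lands exactly on the minimum in a single step.

```python
# f(w) = 3 * (w - 5)**2 has gradient f'(w) = 6 * (w - 5) and constant second derivative f''(w) = 6.
def f_prime(w):
    return 6 * (w - 5)

ideal_learning_rate = 1 / 6          # 1 / f''(w)

w = 0.0                              # arbitrary starting point
w = w - ideal_learning_rate * f_prime(w)
print(w)                             # 5.0, the exact minimum, reached in one step
```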

Reducing Loss: Stochastic Gradient Descent


In gradient descent, a batch is the total number of examples you use to calculate the gradient
in a single iteration. So far, we've assumed that the batch has been the entire data set. When
working at Google scale, data sets often contain billions or even hundreds of billions of
examples. Furthermore, Google data sets often contain huge numbers of features.
Consequently, a batch can be enormous. A very large batch may cause even a single iteration to
take a very long time to compute.
A large data set with randomly sampled examples probably contains redundant data. In fact,
redundancy becomes more likely as the batch size grows. Some redundancy can be useful to
smooth out noisy gradients, but enormous batches tend not to carry much more predictive
value than large batches.
What if we could get the right gradient on average for much less computation? By choosing
examples at random from our data set, we could estimate (albeit, noisily) a big average from a
much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses
only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is
very noisy. The term "stochastic" indicates that the one example comprising each batch is
chosen at random.
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch
iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random.
Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
To simplify the explanation, we focused on gradient descent for a single feature. Rest assured
that gradient descent also works on feature sets that contain multiple features.
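A minimal sketch of a mini-batch SGD step for the one-feature, squared-loss model (plain Python; the batch size and data are illustrative): the gradient is estimated from a small random sample of examples rather than from the full batch.

```python
import random

def minibatch_sgd_step(examples, b, w1, learning_rate=0.0001, batch_size=32):
    """One mini-batch SGD step for y' = b + w1*x1 with squared loss.

    batch_size=1 is stochastic gradient descent (SGD); 10-1,000 is typical for
    mini-batch SGD; batch_size=len(examples) recovers full-batch gradient descent.
    """
    batch = random.sample(examples, min(batch_size, len(examples)))
    n = len(batch)
    grad_b  = sum(2 * ((b + w1 * x) - y)     for x, y in batch) / n
    grad_w1 = sum(2 * ((b + w1 * x) - y) * x for x, y in batch) / n
    return b - learning_rate * grad_b, w1 - learning_rate * grad_w1
```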
