Google ML
Features
A feature is an input variable—the x variable in simple linear regression. A simple machine learning
project might use a single feature, while a more sophisticated machine learning project could use
millions of features, specified as:
x1, x2, ... xN
In the spam detector example, the features could include the following:
words in the email text
sender's address
time of day the email was sent
email contains the phrase "one weird trick."
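As a rough illustration (not part of the original text), the spam-detector features above might be collected into a single feature vector, here sketched as a plain Python dictionary with invented field names and values:

    # Hypothetical feature vector for one email in the spam-detector example.
    # Field names and values are made up for illustration only.
    email_features = {
        "words_in_body": ["limited", "offer", "one", "weird", "trick"],
        "sender_address": "promo@example.com",
        "hour_sent": 2,                       # time of day the email was sent (0-23)
        "contains_one_weird_trick": True,     # the phrase appears in the body
    }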
Examples
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We
break examples into two categories:
labeled examples
unlabeled examples
A labeled example includes both feature(s) and the label. That is:
labeled examples: {features, label}: (x, y)
Use labeled examples to train the model. In our spam detector example, the labeled examples would be
individual emails that users have explicitly marked as "spam" or "not spam."
For example, the following table shows 5 labeled examples from a data set containing information about
housing prices in California:
housingMedianAge   totalRooms   totalBedrooms   medianHouseValue
(feature)          (feature)    (feature)       (label)
15                 5612         1283            66900
19                 7650         1901            80100
17                 720          174             85700
14                 1501         337             73400
20                 1454         326             65500
An unlabeled example contains features but not the label. That is:
unlabeled examples: {features, ?}: (x, ?)
Here are 3 unlabeled examples from the same housing dataset, which exclude medianHouseValue:
housingMedianAge   totalRooms   totalBedrooms
(feature)          (feature)    (feature)
42                 1686         361
34                 1226         180
33                 1077         271
Once we've trained our model with labeled examples, we use that model to predict the label on
unlabeled examples. In the spam detector, unlabeled examples are new emails that humans haven't yet
labeled.
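To make the {features, label} notation concrete, here is a small sketch (not from the original text) that stores the housing rows above as labeled (x, y) pairs and as unlabeled examples with no label yet:

    # Labeled examples from the first table: (features, label) pairs, where the
    # features are (housingMedianAge, totalRooms, totalBedrooms) and the label
    # is medianHouseValue.
    labeled_examples = [
        ((15, 5612, 1283), 66900),
        ((19, 7650, 1901), 80100),
        ((17, 720, 174), 85700),
        ((14, 1501, 337), 73400),
        ((20, 1454, 326), 65500),
    ]

    # Unlabeled examples from the second table: the same features, label unknown.
    unlabeled_examples = [
        (42, 1686, 361),
        (34, 1226, 180),
        (33, 1077, 271),
    ]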
Models
A model defines the relationship between features and label. For example, a spam detection model
might associate certain features strongly with "spam". Let's highlight two phases of a model's life:
Training means creating or learning the model. That is, you show the model labeled examples
and enable the model to gradually learn the relationships between features and label.
Inference means applying the trained model to unlabeled examples. That is, you use the trained
model to make useful predictions (y'). For example, during inference, you can
predict medianHouseValue for new unlabeled examples.
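As a minimal sketch of these two phases (using scikit-learn, which is an assumption; the course text here does not name a library), training fits a model on the labeled housing examples and inference predicts medianHouseValue for the unlabeled ones:

    from sklearn.linear_model import LinearRegression

    # Training: learn the relationship between features and label
    # from the labeled examples.
    X_train = [[15, 5612, 1283], [19, 7650, 1901], [17, 720, 174],
               [14, 1501, 337], [20, 1454, 326]]
    y_train = [66900, 80100, 85700, 73400, 65500]
    model = LinearRegression().fit(X_train, y_train)

    # Inference: apply the trained model to unlabeled examples to get
    # predictions y' for medianHouseValue.
    X_new = [[42, 1686, 361], [34, 1226, 180], [33, 1077, 271]]
    predictions = model.predict(X_new)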
Figure 3. High loss in the left model; low loss in the right model.
Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.
You might be wondering whether you could create a mathematical function (a loss function) that would aggregate the individual losses in a meaningful fashion.
Squared loss: a popular loss function
The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:
= the square of the difference between the label and the prediction
= (observation - prediction(x))²
= (y - y')²
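In code, the squared loss for a single example is just that difference squared (a straightforward sketch, not taken from the course):

    def squared_loss(y, y_prime):
        """L2 loss for one example: the squared difference between the
        true label y (the observation) and the model's prediction y'."""
        return (y - y_prime) ** 2

    # Example: a label of 85700 predicted as 80000 gives a loss of 5700**2.
    loss = squared_loss(85700, 80000)   # 32490000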
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
MSE = (1/N) Σ(x,y)∈D (y - prediction(x))²
where:
(x,y) is an example in which
x is the set of features (for example, chirps/minute, age, gender) that the model uses to make
predictions.
y is the example's label (for example, temperature).
prediction(x) is a function of the weights and bias in combination with the set of features x.
D is a data set containing many labeled examples, which are (x,y) pairs.
N is the number of examples in D.
Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
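A direct translation of that definition into code might look like this (a sketch; prediction stands for any function that maps a feature set x to a predicted value):

    def mse(dataset, prediction):
        """Mean squared error over a dataset of (features, label) pairs.

        dataset: a list of (x, y) pairs, where x is the feature set and
                 y is the example's label.
        prediction: a function of the weights and bias applied to x.
        """
        losses = [(y - prediction(x)) ** 2 for x, y in dataset]
        return sum(losses) / len(losses)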
Reducing Loss
To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely
used method for reducing loss, and is as easy and efficient as walking down a hill.
Reducing Loss: An Iterative Approach
The previous module introduced the concept of loss. Here, in this module, you'll learn how a machine
learning model iteratively reduces loss.
Iterative learning might remind you of the "Hot and Cold" kid's game for finding a hidden object
like a thimble. In this game, the "hidden object" is the best possible model. You'll start with a
wild guess ("The value of w1 is 0.") and wait for the system to tell you what the loss is. Then,
you'll try another guess ("The value of w1 is 0.5.") and see what the loss is. Aah, you're getting
warmer. Actually, if you play this game right, you'll usually be getting warmer. The real trick to
the game is trying to find the best possible model as efficiently as possible.
The following figure suggests the iterative trial-and-error process that machine learning
algorithms use to train a model:
Figure 1. An iterative approach to training a model.
We'll use this same iterative approach throughout the Machine Learning Crash Course, detailing
various complications, particularly within that stormy cloud labeled "Model (Prediction
Function)." Iterative strategies are prevalent in machine learning, primarily because they scale
so well to large data sets.
The "model" takes one or more features as input and returns one prediction (y′) as output. To
simplify, consider a model that takes one feature and returns one prediction:
y′ = b + w1x1
What initial values should we set for b and w1? For linear regression problems, it turns out that
the starting values aren't important. We could pick random values, but we'll just take the
following trivial values instead:
b = 0
w1 = 0
Suppose that the first feature value is 10. Plugging that feature value into the prediction
function yields:
y′ = 0 + 0 ⋅ 10 = 0
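The same walkthrough in code (trivial, but it mirrors the arithmetic above):

    b, w1 = 0.0, 0.0          # trivial starting values

    def predict(x1):
        """Linear model with one feature: y' = b + w1 * x1."""
        return b + w1 * x1

    print(predict(10))        # 0.0, because both b and w1 start at 0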
The "Compute Loss" part of the diagram is the loss function that the model will use. Suppose
we use the squared loss function. The loss function takes in two input values:
y′: The model's prediction for features x
y: The correct label corresponding to features x.
At last, we've reached the "Compute parameter updates" part of the diagram. It is here that the
machine learning system examines the value of the loss function and generates new values
for b and w1. For now, just assume that this mysterious box devises new values and then the
machine learning system re-evaluates all those features against all those labels, yielding a new
value for the loss function, which yields new parameter values. And the learning continues
iterating until the algorithm discovers the model parameters with the lowest possible loss.
Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When
that happens, we say that the model has converged.
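The loop below sketches that diagram for the one-feature model, with a deliberately naive "Hot and Cold" update rule standing in for the mysterious "Compute parameter updates" box (gradient descent, introduced next, is the mechanism real systems use):

    def train_hot_and_cold(examples, steps=1000, step_size=0.01):
        """Iteratively adjust w1 (bias fixed at 0 for simplicity) by trying a
        small step in each direction and keeping whichever guess lowers the
        loss, i.e. whichever is "warmer"."""
        w1 = 0.0                                   # wild initial guess

        def loss(w):
            # MSE of the one-feature model y' = w * x1 over the examples.
            return sum((y - w * x1) ** 2 for x1, y in examples) / len(examples)

        for _ in range(steps):
            candidates = [w1, w1 + step_size, w1 - step_size]
            w1 = min(candidates, key=loss)         # keep the warmer guess
        return w1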
Key Point:
A Machine Learning model is trained by starting with an initial guess for the weights and bias and iteratively
adjusting those guesses until learning the weights and bias with the lowest possible loss.
Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That
minimum is where the loss function converges.
Calculating the loss function for every conceivable value of w1 over the entire data set would be an
inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in
machine learning—called gradient descent.
The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point
doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value. The
following figure shows that we've picked a starting point slightly greater than 0:
Figure 3. A starting point for gradient descent.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here
in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which
way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial
derivatives with respect to the weights.
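For the squared loss on a single example with the one-feature model y' = b + w1x1, those partial derivatives can be written out explicitly (a standard calculus step, added here for reference):

    L(w_1, b) = \bigl(y - (b + w_1 x_1)\bigr)^2
    \frac{\partial L}{\partial w_1} = -2\,x_1\bigl(y - (b + w_1 x_1)\bigr)
    \frac{\partial L}{\partial b} = -2\,\bigl(y - (b + w_1 x_1)\bigr)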
Note that a gradient is a vector, so it has both of the following characteristics:
a direction
a magnitude
The gradient always points in the direction of steepest increase in the loss function. The gradient
descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly
as possible.
Figure 4. Gradient descent relies on negative gradients.
To determine the next point along the loss function curve, the gradient descent algorithm moves the starting point by some fraction of the gradient's magnitude, in the direction of the negative gradient, as shown in the following figure:
Figure 5. A gradient step moves us to the next point on the loss curve.
The gradient descent then repeats this process, edging ever closer to the minimum.
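One such step in numbers (illustrative values only; the scaling factor is usually called the learning rate):

    learning_rate = 0.05      # the fraction of the gradient applied per step
    w1 = 0.0                  # current point on the loss curve
    gradient = -6.0           # slope of the loss curve at w1 (illustrative)

    w1 = w1 - learning_rate * gradient   # step against the gradient -> 0.3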
Note: When performing gradient descent, we generalize the above process to tune all the model
parameters simultaneously. For example, to find the optimal values of both w1 and the bias b, we
calculate the gradients with respect to both w1 and b. Next, we modify the values of w1 and b based on
their respective gradients. Then we repeat these steps until we reach minimum loss.
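Putting the note into code, a bare-bones gradient descent over both parameters might look like this (a sketch that uses the MSE gradients written out earlier, with an assumed learning rate and step count):

    def gradient_descent(examples, learning_rate=1e-4, steps=1000):
        """Tune w1 and b simultaneously to reduce MSE over (x1, y) examples."""
        w1, b = 0.0, 0.0
        n = len(examples)
        for _ in range(steps):
            # Gradients of MSE with respect to w1 and b.
            grad_w1 = sum(-2 * x1 * (y - (b + w1 * x1)) for x1, y in examples) / n
            grad_b = sum(-2 * (y - (b + w1 * x1)) for x1, y in examples) / n
            # Step in the direction of the negative gradient. Too large a
            # learning rate can make the loss diverge instead of shrink.
            w1 -= learning_rate * grad_w1
            b -= learning_rate * grad_b
        return w1, b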