
SFU NatLangLab

CMPT 413/713: Natural Language Processing

Neural Network Basics II

Fall 2021

2021-09-28

Adapted from slides from Danqi Chen and Karthik Narasimhan

Neural networks for NLP
• Feed-forward NNs
• Recurrent NNs
• Recursive NNs
• Convolutional NNs
• Transformers
• Graph NNs

Always coupled with word embeddings…


An artificial neuron
• A neuron is a computational unit that has scalar inputs and an output
• Each input has an associated weight
• The neuron multiplies each input by its weight, sums the results, applies a
nonlinear function to the sum, and passes it to its output
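As a concrete illustration (a sketch added here, not part of the original slides), a single neuron with a sigmoid nonlinearity takes only a few lines of NumPy; the input and weight values below are arbitrary:

import numpy as np

def neuron(x, w, b):
    """A single artificial neuron: weighted sum of inputs, then a nonlinearity."""
    z = np.dot(w, x) + b                 # multiply inputs by weights, sum, add bias
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([0.5, -1.0, 2.0])           # example inputs
w = np.array([0.1, 0.8, -0.3])           # one weight per input
print(neuron(x, w, b=0.2))               # the neuron's output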

Neural networks
• The neurons are connected to each other, forming a network
• The output of a neuron may feed into the inputs of other neurons

• Feed-forward network (FFN)

• Fully connected network (FCN)


Multiple neurons for XOR
h1 = g(x1 + x2)
h2 = g(x1 + x2 − 1)
y1 = g(h1 − 2h2)

where g is the hard-threshold step function:

g(z) = { 0 if z ≤ 0
       { 1 if z > 0
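As a quick check (an illustration added here, not from the slides), the construction can be verified exhaustively over all four boolean inputs:

def g(z):
    """Hard-threshold step function: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = g(x1 + x2)       # fires if at least one input is on
        h2 = g(x1 + x2 - 1)   # fires only if both inputs are on
        y1 = g(h1 - 2 * h2)   # h2 suppresses the output in the (1,1) case
        print(f"x=({x1},{x2}) -> y={y1}")   # matches x1 XOR x2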
Why nonlinearities
Learn to classify whether points should belong to the blue curve or red curve

[Figure: a dataset that is not linearly separable in the original space. Left: a linear decision boundary fails. Right: a non-linear decision boundary, which becomes a linear one in a transformed space.]

https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Expressiveness of neural networks
• Multilayer feed-forward neural nets with nonlinear activation
functions are universal approximators
• True for both shallow networks (infinitely wide) and (infinitely) deep
networks.
• Consider a network with just one hidden layer (with hard-threshold
activation functions) and a linear output. By having 2^D hidden units,
each of which responds to just one input configuration, it can model
any boolean function with D inputs.

• Deep networks can provide a more compact representation


Activation functions
sigmoid: f(z) = 1 / (1 + e^(−z))
tanh (zero centered): f(z) = (e^(2z) − 1) / (e^(2z) + 1)
ReLU (rectified linear unit): f(z) = max(0, z)

Their derivatives:
sigmoid: f′(z) = f(z) × (1 − f(z))
tanh: f′(z) = 1 − f(z)^2
ReLU: f′(z) = { 1 if z > 0
              { 0 if z < 0

Advantages of ReLU?
Activation functions
Problems of ReLU? “dead neurons”

Leaky ReLU:
f(z) = { z      if z ≥ 0
       { 0.01z  if z < 0
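To make these concrete, here is a small NumPy sketch (added for illustration, not from the slides) of the activations above and their derivatives. Note that the ReLU derivative is zero for all negative inputs, which is the source of the "dead neuron" problem, while Leaky ReLU keeps a small nonzero gradient:

import numpy as np

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)

def tanh(z):      return np.tanh(z)             # same as (e^(2z) − 1)/(e^(2z) + 1)
def d_tanh(z):    return 1.0 - np.tanh(z) ** 2

def relu(z):      return np.maximum(0.0, z)
def d_relu(z):    return (z > 0).astype(float)  # zero gradient for z <= 0

def leaky_relu(z, a=0.01):   return np.where(z >= 0, z, a * z)
def d_leaky_relu(z, a=0.01): return np.where(z >= 0, 1.0, a)

z = np.linspace(-3, 3, 7)
print(d_relu(z), d_leaky_relu(z))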
What is the best activation function?
• Depends on the problem!
• ReLU/Leaky ReLU is often a good choice
• Research into families of activation functions

Parametric rectified linear unit (PReLU):
f(α, x) = { x    if x ≥ 0
          { αx   if x < 0

Exponential linear unit (ELU):
f(α, x) = { x            if x > 0
          { α(e^x − 1)   if x ≤ 0

GELU: f(x) = (x/2) (1 + erf(x/√2))

Swish: f(x) = x · σ(βx)
  β = 0 → f(x) = x/2
  β → ∞ → f(x) = ReLU

α and β can either be constants or trainable parameters
Matrix Notation

• Learn parameters θ = {W^(1), b^(1), W^(2), b^(2), w^(o), b^(o)}

Input: x ∈ R^d

Hidden layers:
h^(1) = tanh(W^(1) x + b^(1)),      W^(1) ∈ R^(d1×d),   b^(1) ∈ R^(d1)
h^(2) = tanh(W^(2) h^(1) + b^(2)),  W^(2) ∈ R^(d2×d1),  b^(2) ∈ R^(d2)

Output:
y = σ(w^(o)⊤ h^(2) + b^(o)),        w^(o) ∈ R^(d2),     b^(o) ∈ R
Matrix Notation

• Learn parameters θ = {W1, b1, W2, b2, w, b}

Input: x ∈ R^d

Hidden layers:
h1 = tanh(W1 x + b1),    W1 ∈ R^(d1×d),   b1 ∈ R^(d1)
h2 = tanh(W2 h1 + b2),   W2 ∈ R^(d2×d1),  b2 ∈ R^(d2)

Output:
y = σ(w⊤ h2 + b),        w ∈ R^(d2),      b ∈ R
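This two-layer network maps directly onto standard PyTorch modules. The sketch below is an illustration added here (the dimensions d = 4, d1 = 8, d2 = 8 are arbitrary choices), not code from the course:

import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """h1 = tanh(W1 x + b1); h2 = tanh(W2 h1 + b2); y = σ(w⊤h2 + b)."""
    def __init__(self, d=4, d1=8, d2=8):
        super().__init__()
        self.layer1 = nn.Linear(d, d1)    # W1 ∈ R^(d1×d), b1 ∈ R^(d1)
        self.layer2 = nn.Linear(d1, d2)   # W2 ∈ R^(d2×d1), b2 ∈ R^(d2)
        self.out = nn.Linear(d2, 1)       # w ∈ R^(d2), b ∈ R

    def forward(self, x):
        h1 = torch.tanh(self.layer1(x))
        h2 = torch.tanh(self.layer2(h1))
        return torch.sigmoid(self.out(h2)).squeeze(-1)

net = TwoLayerNet()
x = torch.randn(3, 4)   # a batch of 3 inputs, each in R^4
print(net(x))           # 3 outputs in (0, 1)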
Loss functions
• Binary classification:
  y = σ(w · h2 + b)
  L(y, y*) = −y* log y − (1 − y*) log(1 − y)

• Regression:
  y = w · h2 + b
  L_MSE(y, y*) = (y − y*)^2

• Multi-class classification (C classes):
  y_i = softmax_i(W h2 + b),    W ∈ R^(C×d2),   b ∈ R^C
  L(y, y*) = −Σ_{i=1}^C y*_i log y_i

(y* is the ground truth label)

How to compute ∇_θ L(θ) for θ = {W1, b1, W2, b2, w, b}?
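All three losses are available in PyTorch; a small sketch (added here, with toy values of my own choosing):

import torch
import torch.nn.functional as F

# Binary classification: sigmoid output + binary cross-entropy
score = torch.tensor([0.7])                  # w·h2 + b for one example
y_star = torch.tensor([1.0])                 # ground-truth label
bce = F.binary_cross_entropy(torch.sigmoid(score), y_star)

# Regression: mean squared error
pred, gold = torch.tensor([2.3]), torch.tensor([2.0])
mse = F.mse_loss(pred, gold)                 # (y − y*)^2 averaged over the batch

# Multi-class classification with C = 3 classes
logits = torch.tensor([[1.0, 0.2, -0.5]])    # W h2 + b
label = torch.tensor([0])                    # index of the true class
ce = F.cross_entropy(logits, label)          # fuses log-softmax and NLL

print(bce.item(), mse.item(), ce.item())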
Optimization
θ^(t+1) = θ^(t) − η ∇_θ J(θ)

• Logistic regression is convex: one global minimum

• Neural networks are non-convex and not easy to optimize

• A class of more sophisticated “adaptive” optimizers scale the parameter
adjustment by an accumulated gradient:
• Adam
• Adagrad
• RMSprop
• …

(Ruder 2016): An overview of gradient descent optimization algorithms


(https://ruder.io/optimizing-gradient-descent/)
Using SGD
• Decay the learning rate over time

• Mini-batch (update after seeing m samples); see the training-loop sketch below
  • There is a cost to updating weights
  • Less variance than pure SGD
  • More efficient than updating the weights after every sample

• Randomize/shuffle samples in each mini-batch
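A minimal mini-batch SGD loop in PyTorch (a sketch under assumptions of my own: random toy data, batch size m = 32, a small two-layer model):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(256, 4)                      # toy inputs
Y = torch.randint(0, 2, (256,)).float()      # toy binary labels
loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)  # shuffled mini-batches

model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1), nn.Sigmoid())
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)   # decay the learning rate

for epoch in range(10):
    for xb, yb in loader:                    # one update per mini-batch of m samples
        loss = nn.functional.binary_cross_entropy(model(xb).squeeze(-1), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()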


Optimization
• Standard/vanilla SGD: update with only the gradients

• Momentum: update with a running average of the gradients
  • Prevents instability from sudden changes

• Adagrad: a different learning rate for each parameter
  • Updates are down-weighted for parameters with large accumulated
    (high-variance) gradients

• Adam: update with a running average of the gradients, down-weighted by a
  running average of the gradient variance

(A NumPy sketch of these update rules follows below.)

More details: TA tutorial on optimization


Blog: https://ruder.io/optimizing-gradient-descent/

Adapted from slide by Graham Neubig
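The core update rules can be written out in a few lines of NumPy (a simplified sketch of my own; the hyperparameters are typical defaults):

import numpy as np

def sgd(theta, g, lr=0.1):
    return theta - lr * g                        # vanilla SGD: gradient only

def momentum(theta, g, v, lr=0.1, beta=0.9):
    v = beta * v + g                             # running average of gradients
    return theta - lr * v, v

def adagrad(theta, g, s, lr=0.1, eps=1e-8):
    s = s + g**2                                 # accumulated squared gradients
    return theta - lr * g / (np.sqrt(s) + eps), s   # per-parameter learning rate

def adam(theta, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                    # running average of gradients
    s = b2 * s + (1 - b2) * g**2                 # running average of squared gradients
    m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)   # bias correction at step t
    return theta - lr * m_hat / (np.sqrt(s_hat) + eps), m, s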


Optimization

Gradient Descent

θ^(t+1) = θ^(t) − η ∇_θ L(θ)

How to compute the gradient?

Neural networks as computational graphs

Computational graph
[Diagram: the computational graph of a two-layer network with a cross-entropy
loss. x, W1, b1 feed z1 = W1 x + b1, then h1 = f1(z1); h1, W2, b2 feed
z2 = W2 h1 + b2, then h2 = f2(z2); h2, Wo, bo feed zo = Wo h2 + bo, then
y = σ(zo); finally y and the label y* feed L = L_CE(y, y*).]

Focus on computation:
• Nodes represent operations
• Edges are values from one operator to the next

Forward pass: compute function value


Backward pass: gradient using chain rule
Simplified example
Forward pass: compute function value
[Diagram: x, W, b feed z = Wx + b, then y = σ(z), then L = f_L(y, y*).]

Backward pass: gradient using chain rule


[Diagram: the same graph traversed in reverse, computing ∂L/∂y, ∂L/∂z,
∂L/∂(Wx), and finally ∂L/∂W and ∂L/∂b.]

• Good news: modern automatic differentiation tools do all of this for you!
• Implementing backprop by hand is like programming in assembly language.
Backpropagation: single node

h = f(z)

∂L/∂z = (∂L/∂h) · (∂h/∂z),   with local gradient ∂h/∂z = ∂f(z)/∂z

downstream gradient = upstream gradient × local gradient

(The downstream gradient is the gradient with respect to the node's input.)
Backpropagation
Multiple inputs: z = f(x1, x2). Compute gradients for each input:

∂L/∂x1 = (∂L/∂z) · (∂z/∂x1)
∂L/∂x2 = (∂L/∂z) · (∂z/∂x2)

Multiple output branches: z1 = f1(x), z2 = f2(x), … Sum the gradients of the
branches:

∂L/∂x = Σ_{i=1}^n (∂L/∂zi) · (∂zi/∂x),   where {z1, …, zn} = successors of x
Backpropagation API
Each node (operator) implements a local forward/backward API:

• forward(inputs): compute the output f(x1, …, xk) from the inputs
• backward(upstream gradient): compute the local gradients
  ∂f/∂x1, ∂f/∂x2, …, ∂f/∂xk and multiply each by the upstream gradient
  (e.g. for a node with inputs x1, x2 and output z:
  ∂L/∂x1 = (∂L/∂z) · (∂z/∂x1), ∂L/∂x2 = (∂L/∂z) · (∂z/∂x2))

• Chain them together to form the computation graph

• Reuse derivatives to minimize computation
Example: MultiplyGate

http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes03-neuralnets.pdf
Credits: Chris Manning (Stanford cs224n)
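The linked notes walk through a multiply gate; a minimal sketch of such a gate under this forward/backward API (an illustrative implementation of my own, not code from the notes):

class MultiplyGate:
    """z = x * y, so the local gradients are ∂z/∂x = y and ∂z/∂y = x."""
    def forward(self, x, y):
        self.x, self.y = x, y         # cache the inputs for the backward pass
        return x * y

    def backward(self, dz):
        # downstream gradient = upstream gradient × local gradient
        return dz * self.y, dz * self.x

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)           # z = -12.0
print(gate.backward(dz=1.0))          # (-4.0, 3.0)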
Backpropagation in general computational graph

• Forward propagation: visit nodes in topological sort order


• Compute value of node given predecessors
• Backward propagation:
• Initialize output gradient as 1
• Visit nodes in reverse order and compute gradient wrt each node
using gradient wrt successors
∂L/∂x = Σ_{i=1}^n (∂L/∂zi) · (∂zi/∂x),   where {z1, …, zn} = successors of x

For more details see https://colah.github.io/posts/2015-08-Backprop/
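A skeletal version of this two-pass procedure (a sketch assuming node objects that expose inputs, forward, and backward as in the API slide above; these names are my own, not a real library's API):

def run_graph(nodes, loss_node):
    """nodes: operator nodes already listed in topological sort order."""
    # Forward pass: compute each node's value from its predecessors
    for node in nodes:
        node.value = node.forward(*[p.value for p in node.inputs])

    # Backward pass: initialize the output gradient as 1, then visit nodes
    # in reverse order, summing each node's gradient over its successors
    grads = {loss_node: 1.0}
    for node in reversed(nodes):
        upstream = grads.get(node, 0.0)
        for parent, down in zip(node.inputs, node.backward(upstream)):
            grads[parent] = grads.get(parent, 0.0) + down
    return grads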

Simplified example
Forward pass: compute the function value

z = Wx + b,   y = σ(z),   L = f_L(y, y*)
with f_L(y, y*) = −y* log y − (1 − y*) log(1 − y)

Backward pass: gradient using the chain rule, one local derivative at a time

∂L/∂L = 1

∂L/∂y = (∂L/∂L) · ∂f_L(y, y*)/∂y = −y*/y + (1 − y*)/(1 − y)

∂L/∂z = (∂L/∂y) · ∂σ(z)/∂z,   where ∂σ(z)/∂z = σ(z)(1 − σ(z))
      = y − y*   (after simplification)

∂L/∂b = (∂L/∂z) · ∂(Wx + b)/∂b = ∂L/∂z,         since ∂(Wx + b)/∂b = 1

∂L/∂(Wx) = (∂L/∂z) · ∂(Wx + b)/∂(Wx) = ∂L/∂z,   since ∂(Wx + b)/∂(Wx) = 1

∂L/∂W = (∂L/∂(Wx)) · ∂(Wx)/∂W = (∂L/∂(Wx)) · x⊤
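These hand-derived gradients can be checked against automatic differentiation; a small PyTorch sketch (with toy numbers of my own choosing):

import torch

x = torch.tensor([0.5, -1.0])
W = torch.tensor([[0.3, 0.7]], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
y_star = torch.tensor([1.0])

z = W @ x + b                 # forward pass
y = torch.sigmoid(z)
L = -(y_star * torch.log(y) + (1 - y_star) * torch.log(1 - y)).sum()
L.backward()                  # backward pass via autograd

print(b.grad, (y - y_star).detach())        # ∂L/∂b = y − y*
print(W.grad, (y - y_star).detach() * x)    # ∂L/∂W = (y − y*) x⊤, same values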
An example
Try to compute the gradients yourself!

http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes03-neuralnets.pdf
Designing classifiers with neural networks

• Input features: f(x) → [f1, f2, …, fm]
  • Need to determine features; with neural networks, feature design is
    partly eliminated and partly rolled into network design
• Output: estimate P(y = c | x) for each class c
  • Need to model P(y = c | x) with a family of functions; neural networks
    figure out the architecture
• Train phase: learn the parameters of the model to minimize the loss function
  • Need a loss function and an optimization algorithm; we still need to
    figure out the loss function, but auto-differentiation provides general
    methods for optimization
• Test phase: apply the parameters to predict the class given a new input
Rise of deep-learning frameworks
Pytorch, TensorFlow, Keras, Theano, …
Provide frequently used components that can be connected together, so you no
longer need to code all the pieces of your model and optimizer by yourself:

• Easy to build complex models: connect up neural building blocks
• Mix and match selection of loss functions, regularizers, and optimizers
• Optimize using auto-differentiation; no need to hand-code optimizers for
  specific models
• Deal with numerical stability issues
• Deal with efficient computation (e.g. batching, using GPUs)
• Provide (some) experiment logging and visualization tools

This allows researchers and developers to focus on modeling the problem and
designing the network. A short sketch of the "mix and match" style follows
below.
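For example, in PyTorch the model, loss, and optimizer are each a line or two and can be swapped independently (an illustrative snippet added here, with arbitrary sizes and hyperparameters):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))  # building blocks
loss_fn = nn.CrossEntropyLoss()                # swap in nn.MSELoss(), nn.BCELoss(), ...
opt = optim.Adam(model.parameters(), lr=1e-3,  # swap in optim.SGD, optim.Adagrad, ...
                 weight_decay=1e-4)            # L2 regularization as a single flag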
Resources
• There is a lot more to learn about neural networks!

• TA Tutorials

• Optimization

• Pytorch and backpropagation

• Debugging Tips and Tricks

• Classes: CMPT 728 (Deep Learning)

Resources
• Deep learning books

• Courville, Goodfellow, and Bengio: https://www.deeplearningbook.org/

• Yoav Goldberg’s Primer on NN models for NLP: http://u.cs.biu.ac.il/~yogo/nnlp.pdf

• Classes from other universities

• Stanford CS231n notes: https://cs231n.github.io/

• University of Toronto: https://csc413-2020.github.io/ (Jimmy Ba and Roger Grosse)

• University of Michigan: https://web.eecs.umich.edu/~justincj/teaching/eecs498/FA2020/ (Justin Johnson)
