## MIT Introduction to Deep Learning | 6.S191

Hi everyone, let’s get started. Good

afternoon and welcome to MIT 6.S191! This is really incredible to see the

turnout this year. This is the fourth year now we’re teaching this course and

every single year it just seems to be getting bigger and bigger. 6.S191 is a

one-week intensive boot camp on everything deep learning. In the past, at

this point I usually try to give you a synopsis about the course and tell you

all of the amazing things that you’re going to be learning. You’ll be gaining

fundamentals into deep learning and learning some practical knowledge about

how you can implement some of the algorithms of deep learning in your own

research and on some cool lab related software projects. But this year I

figured we could do something a little bit different and instead of me telling

you how great this class is I figured we could invite someone else from outside

the class to do that instead. So let’s check this out first. Hi everybody and

welcome to MIT 6.S191, the official introductory course on deep

learning taught here at MIT. Deep learning is revolutionizing so many

fields, from robotics to medicine and everything in between. You’ll learn the fundamentals of this field and how you can build some of these

incredible algorithms. In fact, this entire speech and video are not real and

were created using deep learning and artificial intelligence. And in this

class you’ll learn how. It has been an honor to speak with you today and I hope you enjoy the course! Alright, so as you can tell, deep learning

is an incredibly powerful tool. This was just an example of how we use deep

learning to perform voice synthesis and actually emulate someone else’s voice, in

this case Barack Obama, and also using video dialogue replacement to

actually create that video with the help of Canny AI. And of course, as

you’re watching this video you might raise some ethical concerns, which we’re

also very concerned about, and we’ll actually talk about some of those later

on in the class as well. But let’s start by taking a step back and actually

introducing some of these terms that we’ve talked about so far. Let’s start with the word intelligence. I like to define intelligence as the

ability to process information to inform future decisions. Now the field of

artificial intelligence is simply the field which focuses on building

algorithms, in this case artificial algorithms that can do this as well: process information to inform future

decisions. Now machine learning is just a subset of artificial intelligence

specifically that focuses on actually teaching an algorithm how to do this

without being explicitly programmed to do the task at hand.

Now deep learning is just a subset of machine learning which takes this idea

even a step further and asks: how can we automatically extract the useful pieces

of information needed to inform those future predictions or make a decision?

And that’s what this class is all about: teaching algorithms how to learn a task

directly from raw data. We want to provide you with a solid foundation for

understanding how these algorithms work under the

hood, but also provide you with the practical knowledge and practical skills

to implement state-of-the-art deep learning algorithms in TensorFlow, which

is a very popular deep learning toolbox. Now we have an amazing set of lectures

lined up for you this year, including today’s, which will cover neural networks

and deep sequential modeling. Tomorrow we’ll talk about computer vision and

also a little bit about generative modeling which is how we can generate

new data and finally I will talk about deep reinforcement learning and touch on

some of the limitations and new frontiers of where this field might be

going and how research might be heading in the next couple of years. We’ll spend

the final two days hearing about some of the guest lectures from top industry

researchers on some really cool and exciting projects. Every year these

happen to be really really exciting talks so we really encourage you to come

especially for those talks. The class will conclude with some final project

presentations which we’ll talk about in a little bit and also some

awards and a quick award ceremony to celebrate all of your hard work. Also, I

should mention that after each day of lectures (today, for example, we have two

lectures) we’ll have a software lab which tries to

focus and build upon all of the things that you’ve learned in that day. So

you’ll get the foundations during the

lectures and you’ll get the practical

knowledge during the software lab, so the two are kind of jointly coupled in that

sense. For those of you taking this class for credit, you have a couple of different

options to fulfill your credit requirement. First, you can

propose a project, optionally in groups

of two, three, or four people, and in these groups you’ll work to develop a cool new

deep learning idea. We realize that one week, which is the span of this

course, is an extremely short amount of time to not only think of an idea

but move that idea past the planning stage and try to implement something, so

we’re not going to be judging you on your results towards this idea but

rather just on the novelty of the idea itself. On Friday

each of these three teams will give a three-minute presentation on that idea,

and awards will be announced for the top winners, judged by a panel of judges. The second option, in my opinion, is a bit

more boring, but we like to give this option for people that don’t like to

give presentations. In this option, if you don’t want to work in a group or you

don’t want to give a presentation, you can write a one-page review of a

recent deep learning paper, or any paper of your choice, and

this will be due on the last day of class as well. Also, I should mention that

for the project presentations we give out all of these cool prizes,

especially these three NVIDIA GPUs, which are really crucial for doing any sort of

deep learning on your own, so we definitely encourage everyone to enter

this competition and have a chance to win these GPUs and these other cool

prizes like Google Home and SSD cards as well. Also, each of the three

labs will have corresponding prizes. Instructions to actually enter those

respective competitions will be within the labs themselves, and you can enter to

win these different prizes depending on the lab. Please

post to Piazza if you have questions, and check out the course website for slides.

There was a bug in the website, but we have fixed that, so

today’s slides are up now. Digital recordings of each of these lectures

will be up a few days after each class. This course has an incredible team of

TAs that you can reach out to if you have any questions, especially during the

software labs; they can help you answer any questions that you might have. And

finally, we really want to give a huge thank-you to all of our sponsors; without

their help and support this class would not have been possible. OK, so now with

all of that administrative stuff out of the way, let’s start with the fun

stuff that we’re all here for. Let’s start actually by asking ourselves a

question: why do we care about deep learning? Why do you all care about

deep learning, such that all of you came to this classroom today, and why

specifically do we care about deep learning now? Well, to answer that question we

actually have to go back and understand traditional machine learning at its core

first. Traditional machine learning algorithms typically try to define a

set of rules or features in the data, and these are usually hand-engineered, and

because they’re hand-engineered they often tend to be brittle in practice. So let’s

take a concrete example: if you want to perform facial detection, how might you

go about doing that? Well, first you might say, to classify a face the first thing

I’m gonna do is try to recognize if I see a mouth

in the image, the eyes, ears, and nose. If I see all of those things then maybe I can

say that there’s a face in that image. But then the question is, okay, but how do

I recognize each of those sub-things? How do I recognize an eye? How do I

recognize a mouth? And then you have to decompose that into: okay, to recognize a

mouth I maybe have to recognize these pairs of lines, lines oriented in a

certain direction, and then it keeps getting more

complicated, and at each of these steps you kind of have to define a set of features

that you’re looking for in the image. Now, the key idea of deep learning is that

you learn these features just from raw data. So what you’re going

to do is just take a bunch of images of faces, and then the

deep learning algorithm is going to develop some hierarchical representation:

first detecting lines and edges in the image, using these lines and edges to

detect corners and mid-level features like eyes, noses, mouths, and ears,

then composing these together to detect higher-level features like maybe jaw

lines, the side of the face, etc., which then can be used to detect the final face

structure. And actually, the fundamental building blocks of deep learning have

existed for decades, and the underlying algorithms for training these

models have also existed for many years. So why are we studying this now? Well, for

one, data has become much more pervasive. We’re living in the age of big data,

and these algorithms are hungry for huge amounts of data to succeed.

Secondly, these algorithms are massively parallelizable, which means that they

can benefit tremendously from modern GPU architectures and hardware acceleration

that simply did not exist when these algorithms were developed. And finally,

due to open-source toolboxes like TensorFlow, which you’ll get

experience with in this class, building and deploying these models has

become extremely streamlined, so much so that we can condense all this material

down into one week. So let’s start with the fundamental building block of a

neural network, which is a single neuron, or what’s also called a perceptron. The

idea of a perceptron, or a single neuron, is very basic, and I’ll try and keep it

as simple as possible and then we’ll try and work our way up from there. Let’s

start by talking about the forward propagation of information through a

neuron. We define a set of inputs to that neuron as x1 through xm, and each of

these inputs has a corresponding weight, w1

through wm. Now what we can do is, with each of these inputs and each of these

weights, multiply them correspondingly together and take a sum

of all of them. Then we take this single number, that summation, and we pass it

through what’s called a nonlinear activation function, and that produces

our final output y. Now this is actually not entirely correct: we also have what’s

called a bias term in this neuron, which you can see here in green. The

purpose of the bias term is really to allow you to shift your

activation function to the left and to the right regardless of your inputs,

right? So you can notice that the bias term is not affected by the x’s;

it’s just a bias associated with that neuron. Now on the right side you can see this

diagram illustrated mathematically as a single equation, and we can actually

rewrite this using linear algebra in terms of vectors and dot

products. So instead of having a summation over all of the x’s, I’m going

to collapse my x’s into a vector, capital X, which is now just a list or a vector of

numbers, a vector of inputs I should say, and you also have a vector of weights,

capital W. To compute the output of a single perceptron, all you have to do is

take the dot product of X and W, which represents that element-wise

multiplication and summation, add the bias, and then apply that non-linearity, which here is

denoted as g. So now you might be wondering what is

this nonlinear activation function? I’ve mentioned it a couple of times but I

haven’t really told you precisely what it is. One common example of this

activation function is what’s called a sigmoid function, and you can see an

example of a sigmoid function here on the bottom right. One thing to note is

that this function takes any real number as input on the x-axis and it transforms

that real number into a scalar output between 0 and 1.
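As a minimal sketch of the sigmoid described here, the function below uses only Python’s standard library rather than the TensorFlow snippets shown on the slides. The name `sigmoid` and the sample inputs are illustrative choices, not taken from the lecture.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: squashes any real number into the range between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Large negative inputs land near 0, large positive inputs near 1,
# and an input of exactly 0 gives 0.5.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.5f}")
```

Splitting on the sign of the input is just one way to avoid overflow for very negative inputs; the mathematical function is the same in both branches.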

It’s a bounded output between 0 and 1, so one very common use case of the sigmoid

function is when you’re dealing with probabilities, because probabilities

also have to be bounded between 0 and 1. So sigmoids are really useful when you want

to output a single number and represent that number as a

probability. In fact there are many common types of nonlinear activation

functions, not just the sigmoid but many others that you can use in neural

networks, and here are some common ones. Throughout this presentation you’ll

find these TensorFlow icons, like you can see all

across the bottom here, and these are just to illustrate how one could use

each of these topics in a practical setting. You’ll see these kind of

scattered throughout the slides. No need to really take furious notes on

these code blocks; like I said, all of the slides are published online, so

especially during your labs, if you want to refer back to any of the slides you

can always do that from the online lecture notes. Now why do we care

about activation functions? The point of an activation function is to introduce

nonlinearities into the network, and this is actually really important in real life

because in real life almost all of our data is nonlinear. Here’s a concrete

example: if I told you to separate the green points from the red points using a

linear function, could you do that? I don’t think so, right? You’d get

something like this. You could do it, but you wouldn’t do a very good job at it, and

no matter how deep or how large your network is, if you’re using a linear

activation function you’re just composing lines on top of lines and

you’re going to get another line, right? So this is the best you’ll be able to do

with a linear activation function. On the other hand, nonlinearities allow you

to approximate arbitrarily complex

functions by introducing these nonlinearities into your decision

boundary, and this is what makes neural networks extremely powerful. Let’s

understand this with a simple example, and let’s go back to this picture that

we had before. Imagine I give you a trained network with weights W, shown on the top right,

so W here is 3 and minus 2, and the network only has two inputs, x1 and x2. If

we want to get the output, it’s simply the same story as we had before: we

multiply our inputs by those weights, we take the sum, and pass it through a

non-linearity. But let’s take a look at what’s inside of that non-linearity

before we apply it. What we get, when we take this dot product of x1 times 3, x2

times minus 2, plus a bias of 1, is simply a 2D line. We can plot that: if we set

that equal to 0, for example, that’s a 2D line and it looks like this. On the x

axis is x1, on the y axis is x2, and we’re just

illustrating where this line equals 0, so anywhere on this line is where x1 and x2

correspond to a value of 0. Now if I feed in a new input, either a test

example, a training example, or whatever, and that input has

the coordinates minus 1 and 2, so it has an x1 value of minus 1 and

an x2 value of 2, I can see visually where this lies with respect to that

line. And in fact this idea can be generalized a little bit more. If we

compute that line at this point we get minus 6, right? So before we apply the

non-linearity we get minus 6. When we apply a sigmoid non-linearity, because

sigmoid collapses everything to between 0 and 1, anything greater than 0 is going

to be above 0.5 and anything below zero is going to be less than 0.5. So

because minus 6 is less than zero, we’re going to have a very low output, about

0.002. We can actually generalize this idea for

the entire feature space, let’s call it. For any point on this plot I can tell

you: if it lies on the left side of the line, that means that before we apply the

non-linearity, z, the state of that neuron, will be negative, less than zero.

After applying that non-linearity, the sigmoid will give it a probability of

less than 0.5. If it falls on the right side of the line

it’s the opposite story. If it falls right on the line, it means that z equals

zero exactly and the probability equals 0.5. Now, actually, before I move on: this

is a great example of actually visualizing and understanding what’s

going on inside of a neural network. The reason why it’s hard to do this with

deep neural networks is because you usually don’t have only two inputs and

usually don’t have only two weights either. This is a simple two-dimensional

problem, but as you scale up the size of your

network you could be dealing with hundreds or thousands or millions of

parameters and million-dimensional spaces, and then visualizing these types

of plots becomes extremely difficult, and it’s not practical or possible in practice.

So this is one of the challenges that we face when we’re training neural

networks and really understanding their internals, but we’ll talk about how we

can actually tackle some of those challenges in later lectures as well.
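The two-input example above can be checked numerically with a short sketch in plain Python. The weights 3 and minus 2 and the input (minus 1, 2) come from the lecture; the bias of 1 is an assumption, inferred from the stated pre-activation value of minus 6, and the function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, adequate for the small values used in this example."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b):
    """Return (z, y): the pre-activation state and the sigmoid output."""
    z = b + sum(xi * wi for xi, wi in zip(x, w))  # dot product plus bias
    return z, sigmoid(z)

w = [3.0, -2.0]  # weights from the lecture example
b = 1.0          # assumed bias, chosen so that the input below gives z = -6

z, y = perceptron([-1.0, 2.0], w, b)
print(z, round(y, 4))        # -6.0 0.0025

# A point exactly on the line 1 + 3*x1 - 2*x2 = 0 gives z = 0 and output 0.5.
z_line, y_line = perceptron([1.0, 2.0], w, b)
print(z_line, y_line)        # 0.0 0.5
```

Points with negative z fall on one side of the line and get an output below 0.5; points with positive z fall on the other side and get an output above 0.5.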

Okay, so now that we have that idea of a perceptron, a single neuron, let’s start

building up neural networks: how we can use that perceptron to create

full neural networks, and how all of this story comes together. Let’s

revisit this previous diagram of the perceptron. If there are only a few

things you remember from this class, try to take away this: how a perceptron

works. Just keep remembering this; I’m going to keep drilling it in. You take

your inputs, you apply a dot product with your weights, and you apply a

non-linearity. It’s that simple. Oh sorry, I missed a step: you take the dot

product with your weights, add a bias, and apply your non-linearity. So three steps.
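Written out, those three steps amount to the single equation below. The symbols are assumptions in one respect: the transcript names the inputs x1 through xm, the weights w1 through wm, and the non-linearity g, but it does not name the bias, so w_0 is used here as a label for it, and the hat on y simply marks the output as a prediction.

```latex
\hat{y} \;=\; g\!\left(w_0 + \sum_{i=1}^{m} x_i\, w_i\right)
        \;=\; g\!\left(w_0 + X^{\top} W\right),
\qquad
X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix},
\quad
W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}
```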

Now let’s simplify this type of diagram a little bit. I’m gonna remove the bias

just for simplicity, and I’m gonna remove all of the weight labels, so now you can

assume that every line has a weight associated with it. And let’s

say that z is the output of that dot product, so that’s the

element-wise multiplication of our inputs with our weights, summed up, and that’s what

gets fed into our activation function. So our final output y is just our

activation function applied to z. If we want to define a multi-output neural

network, we can simply add another one of these perceptrons to this picture.

Now we have two outputs: y1 is a normal perceptron and y2 is just

another normal perceptron, the same idea as before. They all connect to the previous

layer with a different set of weights, and because all inputs are densely

connected to all of the outputs, these types of layers are often called dense

layers. Let’s take an example of how one might actually go from this nice

illustration, which is very conceptual and nice and simple, to how you could

actually implement one of these dense layers from scratch by yourselves using

TensorFlow. What we can do is start off by first defining our two sets of parameters:

we have our actual weight vector, which is W, and we also have our bias vector,

right? Both of these parameters are governed by the output space, so

depending on how many neurons you have in that output layer, that will govern

the size of each of those weight and bias vectors. What we can do then is

simply define that forward propagation of information. Here I’m showing you

this in the call function in TensorFlow. Don’t get too caught up on the details

of the code; again, you’ll get a real walkthrough of this code inside of the

labs today, but I want to just show you some high-level understanding of

how you could actually take what you’re learning and apply the TensorFlow

implementations to it. Inside the call function it’s the same idea again: you

can compute z, which is the state; it’s that multiplication of your inputs with

the weights, and you add the bias, right, so that’s right there.

And once you have z you just pass it through your sigmoid and that’s your

output for that layer. Now TensorFlow is great because it’s

already implemented a lot of these layers for us, so we don’t have to do

what I just showed you from scratch. In fact, to implement a layer like this with

two outputs, a multi-output perceptron layer

with two outputs, we can simply call tf.keras.layers.Dense with units equal

to two, to indicate that we have two outputs on this layer. There is a

whole bunch of other parameters that you could input here, such as the activation

function, as well as many other things to customize how this layer behaves in

practice. So now let’s take a look at a single-layered neural network. This is

taking it one step beyond what we’ve just seen. This is where we have now a

single hidden layer that feeds into a single output layer, and I’m calling this

a hidden layer because, unlike our inputs and our outputs, the states of the

hidden layer are not directly enforced, or they’re not directly observable. We

can probe inside the network and see them, but we don’t actually enforce what

they are; these are learned, as opposed to the inputs, which are provided by us. Now

since we have a transformation between the inputs and the hidden layer, and between the

hidden layer and the output layer, each of those two transformations will have

its own weight matrix, which here I call W1 and W2, corresponding to

the first layer and the second layer. If we look at a single unit inside of that

hidden layer, take for example z2, which I’m showing here,

that’s just a single perceptron like we talked about before: it’s taking a

weighted sum of all of those inputs that feed into it, and it applies the

non-linearity and feeds it on to the next layer. Same story as before. This

picture actually looks a little bit messy, so what I want to do is actually

clean things up a little bit for you, and I’m gonna replace all of those lines

with just this symbolic representation, and we’ll just use this from now on

to denote dense layers or fully connected layers

between an input and an output, or between an input and a hidden layer. And again, if we wanted to implement this

in TensorFlow, the idea is pretty simple: we can just define two of these dense

layers, the first one our hidden layer with n outputs and the second one our

output layer with two outputs. We can join them together,

aggregate them together, into this wrapper which is called a tf.keras.Sequential

model. Sequential models are just this idea of composing neural networks

using a sequence of layers, so whenever you have a sequential message-passing

system, or are sequentially processing information throughout the network, you

can use sequential models and just define your layers as a sequence, and

it’s very nice for allowing information to propagate through that model. Now if we

want to create a deep neural network, the idea is basically the same thing, except

you just keep stacking on more of these layers to create more of

a hierarchical model, one where the final output is computed by going deeper

and deeper into this representation. And the code looks pretty similar again:

we have this tf.keras.Sequential model, and inside that model we just have a

list of all of the layers that we want to use, and they’re just stacked on top

of each other. Okay, so this is awesome. Hopefully now you have an understanding

of not only what a single neuron is, but how you can compose neurons together and

actually build complex hierarchical models with neural networks.
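To make the idea of stacking dense layers concrete without depending on TensorFlow, here is a rough sketch in plain Python. It is not the code from the slides. The class `DenseLayer`, the helper `sequential`, the layer sizes, and the random initialization are all illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class DenseLayer:
    """Fully connected layer: every input feeds every output unit."""

    def __init__(self, n_inputs, n_units, rng):
        # One weight per (input, unit) pair and one bias per unit.
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(n_units)]
                  for _ in range(n_inputs)]
        self.b = [0.0] * n_units

    def __call__(self, x):
        outputs = []
        for j in range(len(self.b)):
            # Dot product of the inputs with unit j's weights, plus its bias.
            z = self.b[j] + sum(x[i] * self.W[i][j] for i in range(len(x)))
            outputs.append(sigmoid(z))  # apply the non-linearity
        return outputs

def sequential(layers, x):
    """Pass x through each layer in order, feeding outputs forward."""
    for layer in layers:
        x = layer(x)
    return x

rng = random.Random(0)  # fixed seed so the run is repeatable
model = [DenseLayer(2, 3, rng),   # hidden layer: 2 inputs -> 3 units
         DenseLayer(3, 2, rng)]   # output layer: 3 hidden units -> 2 outputs
y = sequential(model, [4.0, 5.0])
print(y)
```

Making a deeper network in this sketch only means adding more `DenseLayer` entries to the list, as long as each layer’s input size matches the previous layer’s number of units.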

Now let’s take a look at how you can apply these neural networks in a very

real and applied setting to solve some problem, and actually train them to

accomplish some task. Here’s a problem that I believe any AI system should be

able to solve for all of you, and probably one that you care a lot about:

will I pass this class? To do this, let’s start with a very simple two-input model.

One feature, or one input, we’re gonna define is how many

lectures you attend during this class, and the second one is the number of

hours that you spend on your final project. I should say that the minimum

number of hours you can spend on your final project is 50 hours. No, I’m just joking.

Okay, so let’s take all of the data from previous years and plot it on this

feature space like we looked at before. Green points are students that have

passed the class in the past, and red points are people that have failed. We

can plot all of this data onto this two-dimensional grid like this, and we

can also plot you. So here you are: you have attended four lectures and you’ve

only spent five hours on your final project,

and the question is, are you going to pass the class? Given everyone around you

and how they’ve done in the past, how are you going to do? So let’s do it. We have

two inputs, we have a single hidden layer neural network, we

have three hidden units in that hidden layer, and we’ll see that the final

output probability when we feed in those two inputs of four and five is predicted

to be 0.1, or 10%. The probability of you passing this class is 10%. That’s not

great news. The actual value was one, so you did pass the class. Now does

anyone have an idea of why the network was so wrong in this case? Exactly: we

never told this network anything. The weights are wrong; we’ve just initialized

the weights. In fact it has no idea what it means to pass a class. It has no idea

of what each of these inputs means, how many lectures you’ve attended and the

hours you’ve spent on your final project. It’s just seeing some random numbers. It

has no concept of how other people in the class have done so far. So what we

have to do to this network first is train it; we have to teach it how to

perform this task. Until we teach it, it’s just like a baby that doesn’t know

anything: it just entered the world, and it has no concept or idea of how to

solve this task, and we have to teach it that. Now how do we do that? The idea here

is that first we have to tell the network when it’s wrong, so we have to

quantify what’s called its loss, or its error. To do that we actually just

take our prediction, what the network predicts, and we compare it to what the

true answer was. If there’s a big discrepancy between the

prediction and the true answer, we can tell the network: hey, you made a big

mistake, right? This is a big error, it’s a big loss, and you should try and

fix your answer to move closer towards the true answer. Okay.

Now you can imagine that you don’t have just one student but many

students. The total loss, let’s call it here the empirical risk or the objective

function (it has many different names), is just the average of all of

those individual losses. So the individual loss is a loss that takes as

input your prediction and your actual value; that’s telling you how wrong that single

example is. And then the total loss is just the average of all of those

individual student losses. If we look at the problem of binary classification,

which is the case that we actually care about in this example, we’re

asking a question, will I pass the class, yes or no: binary classification. We can

use what is called the softmax cross-entropy loss. For those of you

who aren’t familiar with cross-entropy, this was actually a formulation

introduced by Claude Shannon here at MIT during his master’s thesis, and

this was about 50 years ago. It’s still being used very prevalently today, and

the idea is that it just compares how different these two distributions are. So

you have a distribution of how likely you think the student is going to

pass, and you have the true distribution of whether the student passed or not. You can

compare the difference between those two distributions, and that tells you the

loss that the network incurs on that example. Now let’s assume that instead of

a classification problem we have a regression problem, where instead of

predicting if you’re going to pass or fail the class, you want to predict the

final grade that you’re going to get. So now it’s not a yes/no answer problem

anymore, but instead it’s: what’s the grade I’m

going to get, what’s the number? So it’s a full range of numbers that

are possible now, and now we might want to use a different

type of loss for this different type of problem. In this case we can use

what’s called a mean squared error loss. We take the

prediction of the network, we take the

actual true final grade that the student got, we subtract them, we take their

squared error, and we say that that’s the mean squared error; that’s the loss that

the network should try to optimize and try to minimize.
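A small sketch of the two losses just described, written in plain Python rather than with the TensorFlow loss functions. The function names and the grade values are hypothetical. The binary cross-entropy here is the two-class form of cross-entropy for a single probability output, which is a simplification of the "softmax cross-entropy" named in the lecture.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a 0/1 label and a predicted probability."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clip so log() never sees 0
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

def squared_error(y_true, y_pred):
    """Squared difference between the true value and the prediction."""
    return (y_true - y_pred) ** 2

def empirical_loss(loss_fn, targets, predictions):
    """Total loss: the average of the individual per-example losses."""
    losses = [loss_fn(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

# The pass/fail example: the network predicted 0.1 but the student passed (1).
bce = binary_cross_entropy(1.0, 0.1)
print(round(bce, 4))  # 2.3026, i.e. -log(0.1)

# Regression with hypothetical final grades and hypothetical predictions.
mse = empirical_loss(squared_error, [90.0, 80.0], [85.0, 70.0])
print(mse)            # 62.5, the mean of 25 and 100
```

A confident wrong prediction is penalized heavily by the cross-entropy: predicting 0.9 for the same student would give a loss of about 0.105 instead of about 2.303.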

Now that we have all this information about the loss function and how to

actually quantify the error of the neural network, let’s take this and

understand how to train our model to actually find those weights that it

needs to use for its prediction. So W is what we want to find out. W is the set

of weights, and we want to find the optimal set of weights that tries to

minimize this total loss over our entire test set. Our test set is this example

data set that we want to evaluate our model on, so in the class example the

test set is you. You want to understand how likely you are to pass

this class; you’re the test set. Now what this means is that we want to find the

W’s that minimize that total loss function, which we call the objective

function, J of W. Now remember that W is just an aggregation or a collection of

all of the individual weights in your network, so here this is just a way

for me to express this in clean notation, but W is a whole set of numbers,

it’s not just a single number, and you

want to find the value of each of those weights such that you can minimize this

entire loss function. It’s a very complicated problem. And remember that

our loss function is just a simple function in terms of those weights. So if

we plot, in the case again of a two-dimensional weight problem, one of

the weights on the x-axis, the other weight on this second axis, and on the z-axis

the loss, then for any value of W we can see what the loss

would be at that point. Now what do we want to do? We want to find the place on

this landscape, the values of W, where we get the minimum loss. Okay, so

what we can do is just pick a random W, pick a random place on

this landscape to start with, and from this random place let’s try to

understand how the landscape is changing: what’s the slope of the landscape? We can

take the gradient of the loss with respect to each of these weights to

understand the direction of maximum ascent. Okay, that’s what the gradient

tells us. Now that we know which way is up, we can take a step in the direction

that’s down: we know which way is up, we reverse the sign, so now we start

heading downhill and we can move towards that lowest point. Now we just keep

repeating this process over and over again until we’ve converged to a local

minimum now we can summarize this algorithm which is known as gradient

descent because you’re taking a gradient and you’re descending down down that

landscape by starting to initialize our rates wait randomly we compute the

gradient dJ/dW with respect to all of our weights, then we update our weights in

the opposite direction of that gradient, taking a small step which we call here

eta of that gradient. This eta is referred to as the learning rate, and

we’ll talk a little bit more about that later, but eta is just a scalar number

that determines how much of a step you want to take at each iteration, how

strongly or aggressively you want to step along that gradient. In code the

picture looks very similar. Implementing gradient descent is just a few

lines of code, just like the pseudocode. You can initialize your weights randomly

in the first line, you can compute your loss from your predictions and

your data, along with its gradient. Given that gradient, you just

update your weights in the opposite direction of that vector.
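As a rough illustration of the loop just described, here is a minimal NumPy sketch. The toy quadratic loss, the function names, and the values of `eta` and `steps` are assumptions chosen for the example; they are not the code shown on the lecture slide.

```python
import numpy as np

def loss(w):
    # Toy objective J(W) = (w0 - 3)^2 + (w1 + 1)^2, minimized at W = (3, -1).
    # This stands in for a real network loss (assumption for illustration).
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    # Analytic gradient dJ/dW of the toy objective above.
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

def gradient_descent(eta=0.1, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=2)        # 1. initialize the weights randomly
    for _ in range(steps):
        grad = gradient(w)        # 2. compute the gradient dJ/dW
        w = w - eta * grad        # 3. step in the opposite direction, scaled by eta
    return w

w_final = gradient_descent()
print(w_final, loss(w_final))
```

On this bowl-shaped loss the weights settle very close to (3, -1). A real network loss is not this well behaved, which is the subject of the later discussion on learning rates.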

Now the magic line here is actually: how

do you compute that gradient? That’s something I haven’t told you, and that’s

something that’s not easy at all. So the question is, given a loss and given all

of our weights in our network, how do we know which way is good, which way is a

good place to move, given all of this information? I never told you about

that, but that’s a process called backpropagation, and let’s talk about a very

simple example of how we can actually derive backpropagation using elementary

calculus. We’ll start with a very simple network with only one hidden

neuron and one output. This is probably the simplest neural network that you can

create, you can’t really get smaller than this. Computing the gradient of our loss

with respect to W2 here, which is that second weight between the hidden state and

our output, can tell us how much a small change in W2 will impact our loss. So

that’s what the gradient tells us, right? If we change W2 in a differential,

very minor manner, how does our loss change? Does it go up or

down, how does it change, and by how much? So that’s the gradient that we

care about: the gradient of our loss with respect to W2. Now to evaluate this we

can just apply the chain rule from calculus. We can split this up into

the gradient of our loss with respect to our output y, multiplied by the gradient

of our output y with respect to W2. Now if we want to repeat this

process for a different weight in the neural network, let’s say now W1, not W2,

we replace W2 with W1 on both sides and we also apply the chain rule. But now you’re

going to notice that the gradient of y with respect to W1 is also not directly

computable. We have to apply the chain rule again to evaluate this, so let’s

apply the chain rule again. We can break that second term up, now with respect to

the hidden state z. OK, and using that we can kind of backpropagate all of these

gradients from the output all the way back to the input. That allows our error

signal to really propagate from output to input and

allows these gradients to be computed in practice. Now, a lot of this is not really

important, or excuse me, it’s not as crucial that you understand the

nitty-gritty math here, because in a lot of popular deep learning frameworks we

have what’s called automatic differentiation, which does all of this

backpropagation for you under the hood, and you never even see it, which is

incredible. It made training neural networks so much easier. You don’t have

to implement backpropagation anymore, but it’s still important to understand

how these work at the foundation, which is why we’re going through it now.
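To make the chain-rule steps concrete, here is a small sketch for the one-hidden-neuron network. The tanh activation, the squared-error loss, and all function names are assumptions made for this example (the lecture does not fix them at this point). The hand-derived gradients are checked against a finite-difference estimate.

```python
import math

def forward(x, w1, w2):
    z = math.tanh(w1 * x)   # hidden state z (tanh is an assumed activation)
    y = w2 * z              # network output y
    return z, y

def loss(y, target):
    # Assumed loss for illustration: J = 0.5 * (y - target)^2
    return 0.5 * (y - target) ** 2

def backward(x, w1, w2, target):
    z, y = forward(x, w1, w2)
    dJ_dy = y - target             # dJ/dy
    dy_dw2 = z                     # dy/dW2
    dy_dz = w2                     # dy/dz
    dz_dw1 = (1.0 - z ** 2) * x    # dz/dW1, using d tanh(u)/du = 1 - tanh(u)^2
    dJ_dw2 = dJ_dy * dy_dw2            # chain rule applied once
    dJ_dw1 = dJ_dy * dy_dz * dz_dw1    # chain rule applied twice
    return dJ_dw1, dJ_dw2

def numerical_grads(x, w1, w2, target, eps=1e-6):
    # Central finite differences, used only to sanity-check the derivation.
    def J(a, b):
        return loss(forward(x, a, b)[1], target)
    g1 = (J(w1 + eps, w2) - J(w1 - eps, w2)) / (2 * eps)
    g2 = (J(w1, w2 + eps) - J(w1, w2 - eps)) / (2 * eps)
    return g1, g2

analytic = backward(x=0.5, w1=0.8, w2=-1.2, target=1.0)
numeric = numerical_grads(x=0.5, w1=0.8, w2=-1.2, target=1.0)
print(analytic, numeric)
```

The two pairs of numbers should agree to several decimal places. Automatic differentiation in a framework does the equivalent of `backward` for you, for every weight in the network.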

Obviously, then, you repeat this for every single weight in the network. Here we showed

it for just W1 and W2, which is every single weight in this network, but if you

have more you can just repeat it again, and keep applying the chain rule from output

to input to compute this. OK, and that’s the backprop algorithm. In theory it’s very

simple, it’s just an application of the chain rule in essence. But now let’s

touch on some of the insights from training and how you can use the backprop

algorithm to train these networks in practice. Optimization of neural

networks is incredibly tough in practice, so it’s not as simple as the picture I

showed you, the colorful one on the previous slide. Here’s an illustration

from a paper that came out about two or three years ago now, where the authors

tried to visualize the landscape of a neural network with millions of

parameters, but they collapsed that down onto just a two-dimensional space so that

we can visualize it. And you can see that the landscape is incredibly complex,

it’s not easy. There are many local minima where the gradient descent

algorithm could get stuck, and applying gradient descent in practice in

these types of environments, which are very standard in neural networks, can be a

huge challenge. Now, recall the update equation

that we defined previously with gradient descent. This is that same equation: we’re

going to update our weights in the opposite direction of

our gradient. I didn’t talk too much about this parameter eta. I pointed it

out: this is the learning rate. It determines

how much of a step we should take in the direction of that gradient, and in

practice setting this learning rate can have a huge impact on performance. So if

you set that learning rate too small, that means that you’re not really trusting

your gradient on each step. So if eta is super tiny,

that means on each step you’re only going to move a little bit

in the opposite direction of your gradient, just in little small increments,

and what can happen then is you can get stuck in these local minima because

you’re not being as aggressive as you should be to escape them. Now if you set

the learning rate too large, you can actually overshoot completely and

diverge, which is even more undesirable. So setting the learning rate can be very

challenging in practice. You want to pick a learning rate that’s large enough such

that you avoid the local minima, but small enough such that you still converge.
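A tiny experiment can show two of these failure modes. The sketch below assumes a one-dimensional toy loss J(w) = w^2 and three hand-picked values of eta; because this loss is a simple bowl with a single minimum, it illustrates slow progress and divergence but not getting stuck in a local minimum.

```python
def run_gradient_descent(eta, steps=50, w0=1.0):
    # Toy loss J(w) = w^2 with gradient dJ/dw = 2w; the minimum is at w = 0.
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w   # same update rule, only eta differs between runs
    return w

w_small = run_gradient_descent(eta=0.001)  # too small: barely moves from w0
w_good = run_gradient_descent(eta=0.1)     # reasonable: ends close to 0
w_large = run_gradient_descent(eta=1.1)    # too large: overshoots and diverges
print(w_small, w_good, w_large)
```

With the tiny learning rate the weight is still near its starting value after 50 steps, with the middle value it is essentially at the minimum, and with the large value its magnitude grows at every step.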

In practice, the question that you’re all probably asking now is: how do we set the

learning rate, then? Well, one option is that you can just try a bunch of

learning rates and see what works best. Another option is to do something a

little bit more clever and see if we can try to have an adaptive learning rate

that changes with respect to our loss landscape. Maybe it changes with respect

to how fast the learning is happening, or a range of other ideas within the

network optimization scheme itself. This means that the learning rate is no

longer fixed, but it can now increase or decrease throughout training. So as

training progresses your learning rate may speed up, you may take more

aggressive steps, or you may take smaller steps as you get closer to the local

minima so that you really converge on that point. There are many options

here for how you might want to design this adaptive algorithm, and this has

been a widely studied field in optimization theory for machine learning

and deep learning. There have been many published papers and

implementations within TensorFlow of these different types of adaptive

learning rate algorithms. So SGD is just that vanilla gradient descent that I

showed you before, that’s the first one. All of the others are

adaptive learning rates, which means that they change their learning rate during

training itself, so they can increase or decrease depending on how the

optimization is going. During your labs we really encourage you again to

try out some of these different optimization schemes and see what works and what

doesn’t work. A lot of it is problem dependent. There are some heuristics that

you can get, but we want you to really gain those heuristics yourselves

through the course of the labs. It’s part of building character. OK, so let’s put

this all together from the beginning. We can define our model, which is defined as

this Sequential wrapper. Inside of this Sequential wrapper we have all of our

layers. All of these layers are composed of perceptrons, or single neurons, which

we saw earlier. The second line defines our optimizer, which we saw on the

previous slide. This can be SGD, it can also be any of

those adaptive learning rates that we saw before. Now, what we want to do

during our training loop is the same story again as before,

nothing’s changing here. We forward pass all of our inputs through that model, we

get our predictions. Using those predictions we can evaluate them and

compute our loss. Our loss tells us how wrong our network was on that iteration.

It also tells us how we can compute the gradients and how we can change all of

the weights in the network to improve in the future. And then the final line there

takes those gradients and actually allows our optimizer to update the

weights and the trainable variables such that on the next iteration they do a

little bit better. Over time, if you keep looping, this will converge and

hopefully you should fit your data. Now I want to continue to talk about

some tips for training these networks in practice and focus on a very powerful

idea of batching your data into mini-batches. To do this let’s revisit the

gradient descent algorithm. This gradient is actually very computationally

expensive to compute in practice, so running the backprop algorithm over the whole data set is

a very expensive idea in practice. So what we want to do is actually not

compute this over all of the data points but actually compute it over just a single

data point in the data set. In most real-life applications it’s not actually

feasible to compute on your entire data set at every iteration, it’s just too

much data. So instead we pick a single point randomly, we compute our gradient

with respect to that point, and then on the next iteration we pick a different

point, and we can get a rough estimate of our gradient at each step, right? So

instead of using all of our data, now we just pick a single point i and we compute

our gradient with respect to that single point i. And what’s a middle ground here?

The downside of using a single point is that it’s going to be very noisy. The

downside of using all of the points is that it’s too computationally expensive.

Is there some middle ground that we can have in between? That middle

ground is actually very simple: instead of taking one point, and instead of

taking all of the points, let’s take a mini-batch of points, so maybe something on

the order of 10, 20, 30, or 100, depending on how rough or accurate you

want that approximation of your gradient to be and how much you want to trade off

speed and computational efficiency. Now the estimate of the true gradient is just obtained by

averaging the gradient from each of those B points, so B is the size of your

batch in this case. Now since B is normally not that large, like I said

maybe on the order of tens to hundreds, this is much faster to compute than full

gradient descent and much more accurate than stochastic gradient descent because

it’s using more than one point, more than one estimate. Now this increase in

gradient accuracy actually allows us to converge to our target much

quicker, because it means that our gradients are more accurate in practice.

It also means that we can increase our learning rate and trust each update more.
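The trade-off can be checked numerically. The following sketch uses a synthetic one-weight regression problem; the data, the model `y_hat = w * x`, the batch sizes, and the function names are all assumptions for illustration. It measures how far single-point and mini-batch gradient estimates land from the full-data gradient on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)   # synthetic data: y = 2x + noise

def batch_gradient(w, idx):
    # Gradient of the mean squared error of y_hat = w * x,
    # averaged over the points selected by idx.
    xb, yb = x[idx], y[idx]
    return np.mean(2.0 * (w * xb - yb) * xb)

w = 0.0
full_grad = batch_gradient(w, np.arange(n))   # gradient over the entire data set

def mean_abs_error(batch_size, trials=500):
    # Average distance between a random batch's gradient and the full gradient.
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errors.append(abs(batch_gradient(w, idx) - full_grad))
    return float(np.mean(errors))

err_single = mean_abs_error(1)    # one point per step: very noisy
err_batch = mean_abs_error(32)    # mini-batch of B = 32: much closer
print(full_grad, err_single, err_batch)
```

The mini-batch estimate costs 32 gradient evaluations instead of 10,000, yet its error is several times smaller than the single-point estimate, which is the middle ground described above.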

So if we’re very noisy in our gradient estimation, we probably want to lower our

learning rate a little more so we don’t fully step in the wrong direction if

we’re not totally confident in that gradient. If we have a larger batch of

data to estimate our gradients with, we can trust

that learning rate a little more and increase it so that it steps more

aggressively in that direction. What this means also is that we can now massively

parallelize this computation, because we can split up batches on multiple GPUs or

multiple computers even, to achieve even more significant speedups in this

training process. Now the last topic I want to address is that of overfitting,

and this is also known as the problem of generalization in machine learning.

It’s actually not unique to just deep learning, it’s a fundamental problem

of all of machine learning. Now ideally in machine learning we want a model that

will approximate or estimate our data, or accurately describe our data, let’s say

it like that. Said differently, we want to build models that can learn

representations from our training data that still generalize to unseen test

data. Now assume that you want to build a line that best describes these points

you can see on the screen. Underfitting describes the case where our model

does not match the complexity of this problem, where we

can’t really capture the true complexity of this problem, while overfitting, on the

right, starts to memorize certain aspects of our training data, and this is also

not desirable. We want the middle ground, where ideally we end up with a model in

the middle that is not so complex that it memorizes all of our training data but

also one that will continue to generalize when it sees new data. So to

address this problem in neural networks specifically, let’s

talk about the technique of regularization, which is another way that we can deal

with this. What this is doing is it’s trying to discourage complex information

from being learned, so we want to discourage the model from actually

learning to memorize the training data. We don’t want to learn very

specific pinpoints of the training data that don’t generalize well to test data.

Now as we’ve seen before, this is actually crucial for our models to be

able to generalize to our test data, so this is very important. The most popular

regularization technique in deep learning is this very basic idea of

dropout. Now the idea of dropout is, well, actually let’s start by

revisiting this picture of a neural network that we had introduced

previously. In dropout, during training we randomly set some of the

activations of the hidden neurons to zero with some probability. So let’s say

our probability is 0.5: we’re randomly going to set the

activations of some of our hidden neurons to 0, each with a probability of

0.5. The idea is extremely powerful because it allows the

network to lower its capacity. It also makes it such that the network can’t

build these memorization channels through the network where it tries to

just remember the data, because on every iteration 50% of that data, or

rather 50% of that memorization or memory, is going to be wiped out. So it’s going

to be forced not only to generalize better, but it’s going to be forced to

have multiple channels through the network and build a more robust

representation of its prediction. Now we just repeat this on every iteration. So

on the first iteration we drop out one 50% of the nodes, on the next

iteration we can drop out a different randomly sampled 50%, which may include

some of the previously sampled nodes as well, and this will allow the network to

generalize better to new test data. The second regularization technique that

we’ll talk about is the notion of early stopping. So what I want to do here is

just talk about two lines. During training, which is the x-axis here, we

have two lines; the y-axis is our loss. The first line is our training

loss, so that’s the green line. The green line tells us how

well our model is fitting to our training data. We expect this to be lower

than the second line, which is our testing loss.

So usually we expect to be doing better on our training data than our testing

data. As we train, and as this line moves forward into the future, both of these

lines should kind of decrease, go down, because we’re optimizing the network,

we’re improving its performance. Eventually, though, there comes a point

where the training loss starts to diverge from the testing loss. Now what

happens is that the training loss should always keep going down, or rather the

model should always continue to fit the training data, because it’s still seeing

all of the training data. It’s not being penalized for that, except for maybe if

you use dropout or other means. But the testing data it’s not seeing, so at some

point the network is going to start to do better on its training data than its

testing data, and what this means is basically that the network is starting

to memorize some of the training data, and that’s what you don’t want. So what

we can do is we can perform early stopping: we can identify this point,

this inflection point where the test loss starts to increase and diverge from

the training loss, so we can stop the network early and make sure that our

test loss is as low as possible. And of course if we actually look at

the sides of this line, if we look at the left side, that’s where a model is

underfit, so we haven’t reached the true capacity of our model yet, so we’d want

to keep training if we didn’t stop yet. And on the right

side is where we’ve overfit, where we’ve passed that early stopping point and

basically we’ve started to memorize some of our training data, and

that’s when we’ve gone too far. I’ll conclude this lecture by just

summarizing three main points that we’ve covered so far. First, we’ve learned about

the fundamental building block of neural networks, which is a single neuron, or perceptron.

Second, we’ve learned about stacking and composing these perceptrons together to

form complex hierarchical representations, and how we can

mathematically optimize these networks using a technique called backpropagation,

using their loss. And finally, we addressed the practical side of

training these models, such as mini-batching, regularization, and adaptive

learning rates as well. With that I’ll finish up. I can take a couple questions

and then we’ll move on to Ava’s lecture on deep sequential modeling. I’ll take

maybe a couple questions if there are any now. Thank you.
