Lecture 02 – Is Learning Feasible?

ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced
the learning problem. And if you have an application in your
domain that you wonder if machine learning is the right technique for
it, we found that there are three criteria that you should check. You should ask yourself: is
there a pattern to begin with that we can learn? And we realize that this condition can be
intuitively met in many applications, even if we don’t know mathematically
what the pattern is. The example we gave was the
credit card approval. There is clearly a pattern– if someone
has a particular salary, has been in a residence for so long, has
that much debt, and so on, that this is somewhat correlated to
their credit behavior. And therefore, we know that the pattern
exists in spite of the fact that we don’t know exactly
what the pattern is. The second item is that we cannot pin
down the pattern mathematically, like the example I just gave. And this is why we resort
to machine learning. The third one is that we have data
that represents that pattern. In the case of the credit application,
for example, there are historical records of previous customers, and we
have the data they wrote in their application when they applied, and we
have some years’ worth of record of their credit behavior. So we have data that are going to enable
us to correlate what they wrote in the application to their eventual credit
behavior, and that is what we are going to learn from. Now, if you look at the three criteria,
basically there are two that you can do without, and one that
is absolutely essential. What do I mean? Let’s say that you don’t
have a pattern. Well, if you don’t have a pattern,
then you can try learning. And the only problem is
that you will fail. That doesn’t sound very encouraging. But the idea here is that, when we develop
the theory of learning, we will realize that you can apply the technique
regardless of whether there is a pattern or not. And you are going to determine whether
there’s a pattern or not. So you are not going to be fooled and
think, I learned, and then give the system to your customer, and the
customer will be disappointed. There is something you can actually
measure that will tell you whether you learned or not. So if there’s no pattern, there is no
harm done in trying machine learning. The other one, also,
you can do without. Let’s say that we can pin the
thing down mathematically. Well, in that case, machine learning
is not the recommended technique. It will still work. It may not be the optimal technique. If you can outright program it, and
find the result perfectly, then why bother generate examples, and try to
learn, and go through all of that? But machine learning is
not going to refuse. It is going to learn, and it is
going to give you a system. It may not be the best system in this
case, but it’s a system nonetheless. The third one, I’m afraid
you cannot do without. You have to have data. Machine learning is about
learning from data. And if you don’t have data, there is
absolutely nothing you can do. So this is basically the picture about
the context of machine learning. Now, we went on to focus on one type,
which is supervised learning. And in the case of supervised learning,
we have a target function. The target function we
are going to call f. That is our standard notation. And this corresponds, for example,
to the credit application. x is your application, and f of x is
whether you are a good credit risk or not, for the bank. So if you look at the target function,
the main criterion about the target function is that it’s unknown. This is a property that we
are going to insist on. And obviously, unknown is a very generous
assumption, which means that you don’t have to worry about what
pattern you are trying to learn. It could be anything, and you will learn
it– if we manage to do that. There’s still a question
mark about that. But it’s a good assumption to have, or
lack of assumption, if you will, because then you know that you don’t
worry about the environment that generated the examples. You only worry about the system that you
use to implement machine learning. Now, you are going to be given data. And the reason it’s called supervised
learning is that you are not only given the input x’s, as
you can see here. You’re also given the output– the target outputs. So in spite of the fact that the target
function is generally unknown, it is known on the data
that I give you. This is the data that you are going to
use as training examples, and that you are going to use to figure out
what the target function is. So in the case of supervised learning,
you have the targets explicitly. In the other cases, you have less
information than the target, and we talked about it– like unsupervised
learning, where you don’t have anything, and reinforcement learning,
where you have partial information, which is just a reward or punishment
for a choice of a value of y that may or may not be the target. Finally, you have the solution tools. These are the things that you’re going
to choose in order to solve the problem, and they are called the learning
model, as we discussed. They are the learning algorithm
and the hypothesis set. And the learning algorithm will
produce a hypothesis– the final hypothesis, the one that you
are going to give your customer, and we give the symbol g for that. And hopefully g approximates f,
the actual target function, which remains unknown. And g is picked from a hypothesis set,
and the general the symbol for a member of the hypothesis
set is h. So h is a generic hypothesis. The one you happen to pick,
you are going to call g. Now, we looked at an example
of a learning algorithm. First, the learning model– the
perceptron itself, which is a linear function, thresholded. That happens to be the hypothesis set. And then, there is an algorithm that
goes with it that chooses which hypothesis to report
based on the data. And the hypothesis in this case is
represented by the purple line. Different hypotheses in the
set H will result in different lines. Some of them are good and some of them
are bad, in terms of separating correctly the examples which
are the pluses and minuses. And we found that there’s a very simple
rule to adjust the current hypothesis, while the algorithm is still
running, in order to get a better hypothesis. And once you have all the points
classified correctly, which is guaranteed in the case of the perceptron
learning algorithm if the data was linearly separable
in the first place, then you will get there, and that will
be the g that you are going to report. Now, we ended the lecture on sort of
a sad note, because after all of this encouragement about learning,
we asked ourselves: well, can we actually learn? So we said
it’s an unknown function. Unknown function is an attractive
assumption, as I said. But can we learn an unknown
function, really? And then we realized that if you look at
it, it’s really impossible. Why is it impossible? Because I’m going to give you a finite
data set, and I’m going to give you the value of the function on this set. Good. Now, I’m going to ask you what is
the function outside that set? How in the world are you going to tell
what the function is outside, if the function is genuinely unknown? Couldn’t it assume any value it wants? Yes, it can. I can give you 1000 points, a million
points, and on the million-and-first point, still the function can behave
any way it wants. So it doesn’t look like the statement
we made is feasible in terms of learning, and therefore we have
to do something about it. And what we are going to do about it
is the subject of this lecture. Now, the lecture is called
Is Learning Feasible? And I am going to address this question
in extreme detail from beginning to end. This is the only topic
for this lecture. Now, if you want an outline– it’s really a logical flow. But if you want to cluster
it into points– we are going to start with
a probabilistic situation, that is a very simple probabilistic situation. It doesn’t seem to relate to learning. But it will capture the idea– can we say something outside the
sample data that we have? So we’re going to answer it in a way
that is concrete, and where the mathematics is very friendly. And then after that, I’m going to be
able to relate that probabilistic situation to learning as we stated. It will take two stages. First, I will just translate the
expressions into something that relates to learning, and then we will
move forward and make it correspond to real learning. That’s the last one. And then after we do that, and we think
we are done, we find that there is a serious dilemma that we have. And we will find a solution to that
dilemma, and then declare victory– that indeed, learning is feasible
in a very particular sense. So let’s start with the experiment
that I talked about. Consider the following situation. You have a bin, and the
bin has marbles. The marbles are either red or green. That’s what it looks like. And we are going to do an experiment
with this bin. And the experiment is to pick
a sample from the bin– some marbles. Let’s formalize what the probability
distribution is. There is a probability of picking
a red marble, and let’s call it mu. So now you think of mu as the
probability of a red marble. Now, the bin is really just a visual
aid to make us relate to the experiment. You can think of this abstractly
as a binary experiment– two outcomes, red or green. Probability of red is mu,
independently from one point to another. If you want to stick to the bin, you can
say the bin has an infinite number of marbles and the fraction
of red marbles is mu. Or maybe it has a finite number of
marbles, and you are going to pick the marbles, but replace them. But the idea now is that every time you
reach in the bin, the probability of picking a red marble is mu. That’s the rule. Now, there’s a probability of
picking a green marble. And what might that be? That must be 1 minus mu. So that’s the setup. Now, the value of mu is unknown to us. So in spite of the fact that you can
look at this particular bin and see there’s less red than green,
so mu must be small. and all of that. You don’t have that advantage in real. The bin is opaque– it’s sitting there,
and I reach for it like this. So now that I declare mu is unknown,
you probably see where this is going. Unknown is a famous word from last lecture,
and that will be the link to what we have. Now, we pick N marbles independently. Capital N. And I’m using the same
notation for N, which is the number of data points in
learning, deliberately. So the sample will look like this. And it will have some
red and some green. It’s a probabilistic situation. And we are going to call the fraction
of marbles in the sample– this now is a probabilistic
quantity. mu is an unknown constant
sitting there. If you pick a sample, someone else picks
a sample, you will have a different frequency in sample from
the other person. And we are going to call it nu. Now, interestingly enough, nu also
should appear in the figure. So it says nu equals fraction
of red marbles. So that’s where it lies. Here is nu! For some reason that I don’t understand,
the app wouldn’t show nu in the figures. So I decided maybe the app is actually
a machine learning expert. It doesn’t like things in sample. It only likes things that are real. So it knows that nu is not important. It’s not an indication. We are really interested in
knowing what’s outside. So it kept the mu, but actually
deleted the nu. At least that’s what we are going to
believe for the rest of the lecture. Now, this is the bin. So now, the next step is to ask ourselves
the question we asked in machine learning. Does nu, which is the sample frequency,
tell us anything about mu, which is the actual frequency in the bin
that we are interested in knowing? The short answer– this is to remind you what it is. The short answer is no. Why? Because the sample can be mostly green,
while the bin is mostly red. Anybody doubts that? The thing could have 90% red,
and I pick 100 marbles, and all of them happen to be green. This is possible, correct? So if I ask you what is actually mu, you
really don’t know from the sample. You don’t know anything about the
marbles you did not pick. Well, that’s the short answer. The long answer is yes. Not because no and yes, but
this is more elaborate. We have to really discuss a lot
in order to get there. So why is it yes? Because if you know a little bit about
probability, you realize that if the sample is big enough, the sample
frequency, which is nu– the mysterious disappearing quantity here– that
is likely to be close to mu. Think of a presidential poll. There are maybe 100 million or more
voters in the US, and you make a poll of 3000 people. You have 3000 marbles, so to speak. And you look at the result in the
marbles, and you tell me how the 100 million will vote. How the heck did you know that? So now the statistics come in. That’s where the probability
plays a role. And the main distinction between
the two answers is possible versus probable. In science and in engineering, you go
a huge distance by settling for not absolutely certain, but
almost certain. It opens a world of possibilities,
and this is one of the possibilities that it opens. So now we know that, from
a probabilistic point of view, nu does tell me something about mu. The sample frequency tells me
something about the bin. So what does it exactly say? Now we go into a mathematical
formulation. In words, it says: in a big sample,
nu, the sample frequency, should be close to mu,
the bin frequency. So now, the symbols that go with
that– what is a big sample? Large N, our parameter N. And how do we say that
nu is close to mu? We say that they are within epsilon. That is our criterion. Now, with this in mind, we are
going to formalize this. The formula that I’m going to show
you is a formula that is going to stay with us for the
rest of the course. I would like you to pay attention. And I’m going to build it gradually. We are going to say that the probability
of something is small. So we’re going to say that it’s less
than or equal to, and hopefully the right-hand side will be
a small quantity. Now if I am claiming that the
probability of something is small, it must be that that thing is a bad event. I don’t want it to happen. So we have a probability of something
bad happening being small. What is a bad event in the context
we are talking about? It is that nu does not
approximate mu well. They are not within epsilon
of each other. And if you look at it, here you have
nu minus mu in absolute value, so that’s the difference
in absolute value. That happens to be bigger
than epsilon. So that’s bad, because that tells us
that they are further away from our tolerance epsilon. We don’t want that to happen. And we would like the probability
of that happening to be as small as possible. Well, how small can we guarantee it? Good news. It’s e to the minus N. It’s a negative exponential. That is great, because negative
exponentials tend to die very fast. So if you get a bigger sample, this
will be diminishingly small probability. So the probability of something bad
happening will be very small, and we can claims that, indeed, nu will be
within epsilon from mu, and we will be wrong for a very minute
amount of the time. But that’s the good news. Now the bad news– ouch! Epsilon is our tolerance. If you’re a very tolerant
person, you say: I just want nu and mu to be
within, let’s say, 0.1. That’s not very much to ask. Now, the price you pay for that is
that you plug in the exponent not epsilon, but epsilon squared. So that becomes 0.01. 0.01 will dampen N significantly, and
you lose a lot of the benefit of the negative exponential. And if you are more stringent and
you say, I really want nu to be close to mu. I am not fooling around here. So I am going to pick epsilon
to be 10 to the minus 6. Good for you. 10 to the minus 6? Pay the price for it. You go here, and now that’s
10 to the minus 12. That will completely kill any
N you will ever encounter. So the exponent now will
be around zero. So this probability will be around
1, if that was the final answer. That’s not yet the final answer. So now, you know that the probability
is less than or equal to 1. Congratulations! You knew that already.
[LAUGHTER] Well, this is almost the formula,
but it’s not quite. What we need is fairly trivial. We just put 2 here, and 2 there. Now, between you and me, I prefer
the original formula better, without the 2’s. However, the formula with the 2’s has the
distinct advantage of being: true. [LAUGHTER] So we have to settle for that. Now that inequality is called
Hoeffding’s Inequality. It is the main inequality we are going
to be using in the course. You can look for the proof. It’s a basic proof in mathematics. It’s not that difficult, but
definitely not trivial. And we are going to use it all the way–
and this is the same formula that will get us to prove something
about the VC dimension. If the buzzword ‘VC dimension’ means
anything to you, it will come from this after a lot of derivation. So this is the building block that
you have to really know cold. Now, if you want to translate the
Hoeffding Inequality into words, what we have been talking about is that
we would like to make the statement: mu equals nu. That would be the ultimate. I look at the in-sample frequency, that’s
the out-of-sample frequency. That’s the real frequency out there. But that’s not the case. We actually are making the statement
mu equals nu, but we’re not making the statement– we are making a PAC statement. And that stands for: this statement is
probably, approximately, correct. Probably because of this. This is small, so the probability
of violation is small. Approximately because of this. We are not saying that mu equals nu. We are saying that they are
close to each other. And that theme will remain
with us in learning. So we put the glorified Hoeffding’s
Inequality at the top, and we spend a viewgraph analyzing what it means. In case you forgot what nu and
mu are, I put the figure. So mu is the frequency within the bin. This is the unknown quantity
that we want to tell. And nu is the disappearing quantity
which happens to be the frequency in the sample you have. So what about the Hoeffding
Inequality? Well, one attraction of this
inequality is that it is valid for every N, positive integer, and every
epsilon which is greater than zero. Pick any tolerance you want, and
for any number of examples you want, this is true. It’s not an asymptotic result. It’s a result that holds for
every N and epsilon. That’s a very attractive proposition
for something that has an exponential in it. Now, Hoeffding Inequality belongs to
a large class of mathematical laws, which are called the Laws
of Large Numbers. So this is one law of large numbers,
one form of it, and there are tons of them. This happens to be one of the
friendliest, because it’s not asymptotic, and happens to have
an exponential in it. Now, one observation here is that if you
look at the left-hand side, we are computing this probability. This probability patently
depends on mu. mu appears explicitly in it, and
also mu affects the probability distribution of nu. Nu is the sample, in N
marbles you picked. That’s a very simple binomial
distribution. You can find the probability that
nu equals anything based on the value of mu. So the probability that this quantity,
which depends on mu, exceeds epsilon– the probability itself
does depend on mu. However, we are not interested
in the exact probability. We just want to bound it. And in this case, we are
bounding it uniformly. As you see, the right-hand side
does not have mu in it. And that gives us a great tool, because
now we don’t use the quantity that, we already declared, is unknown. mu is unknown. It would be a vicious cycle if I go
and say that it depends on mu, but I don’t know what mu is. Now you know uniformly, regardless of
the value of mu– mu could be anything between 0 and 1, and this will still
be bounding the deviation of the sample frequency from
the real frequency. That’s a good advantage. Now, the other point is that there is
a trade-off that you can read off the inequality. What is the trade-off? The trade-off is between
N and epsilon. In a typical situation, if we think of N
as the number of examples that are given to you– the amount of data– in
this case, the number of marbles out of the bin, N is usually dictated. Someone comes and gives you a certain
resource of examples. Epsilon is your taste in tolerance. You are very tolerant. You
pick epsilon equals 0.5. That will be very easy to satisfy. And if you are very stringent, you can
pick epsilon smaller and smaller. Now, because they get multiplied here,
the smaller the epsilon is, the bigger than N you need in order to compensate
for it and come up with the same level of probability bound. And that makes a lot of sense. If you have more examples, you are more
sure that nu and mu will be close together, even closer and
closer and closer, as you get larger N. So this makes sense. Finally, it’s a subtle point, but
it’s worth saying. We are making the statement that nu
is approximately the same as mu. And this implies that mu is
approximately the same as nu. What is this? The logic here is a little bit subtle. Obviously, the statement is a tautology,
but I’m just making a logical point, here. When you run the experiment,
you don’t know what mu is. mu is an unknown. It’s a constant. The only random fellow in this
entire operation is nu. That is what the probability
is with respect to. You generate different samples, and
you compute the probability. This is the probabilistic thing. This is a happy constant sitting
there, albeit unknown. Now, the way you are using the
inequality is to infer mu, the sample here, from nu. That is not the cause and effect
that actually takes place. The cause and effect is that mu affects
nu, not the other way around. But we are using it the
other way around. Lucky for us, the form of the
probability is symmetric. Therefore, instead of saying that nu
tends to be close to mu, which will be the accurate logical statement– mu
is there, and nu has a tendency to be close to it. We, instead of that, say that I know
already nu, and now mu tends to be close to nu. That’s the logic we are using. Now, I think we understand what the bin
situation is, and we know what the mathematical condition that
corresponds to it is. What I’d like to do, I’d like to connect that to the
learning problem we have. In the case of a bin, the unknown
quantity that we want to decipher is a number, mu. Just unknown. What is the frequency inside the bin. In the learning situation that we had,
the unknown quantity we would like to decipher is a full-fledged function. It has a domain, X, that could be
a 10th-order Euclidean space. Y could be anything. It could be binary, like
the perceptron. It could be something else. That’s a huge amount of information. The bin has only one number. This one, if you want to specify it,
that’s a lot of specification. So how am I going to be able to relate
the learning problem to something that simplistic? The way we are going to do it
is the following. Think of the bin as your input space
in the learning problem. That’s the correspondence. So every marble here is a point x. That is a credit card applicant. So if you look closely at the gray
thing, you will read: salary, years in residence, and whatnot. You can’t see it here because
it’s too small! Now the bin has all the points
in the space. Therefore, this is really the space. That’s the correspondence in our mind. Now we would like to give
colors to the marbles. So here are the colors. There are green marbles, and they
correspond to something in the learning problem. What do they correspond to? They correspond to your hypothesis
getting it right. So what does that mean? There is a target function
sitting there, right? You have a hypothesis. The hypothesis is a full function,
like the target function is. You can compare the hypothesis to the
target function on every point. And they either agree or disagree. If they agree, please color
the corresponding point in the input space– Color it green. Now, I’m not saying that you know which
ones are green and which ones are not, because you don’t know
the target function overall. I’m just telling you the mapping that
takes an unknown target function into an unknown mu. So both of them are unknown,
admittedly, but that’s the correspondence that maps it. And now you go, and there
are some red ones. And, you guessed it. You color the thing red if your
hypothesis got the answer wrong. So now I am collapsing the entire
thing into just agreement and disagreement between your hypothesis
and the target function, and that’s how you get to color the bin. Because of that, you have a mapping for
every point, whether it’s green or red, according to this rule. Now, this will add a component to
the learning problem that we did not have before. There is a probability associated
with the bin. There is a probability of
picking a marble, and independently, and all of that. When we talked about the learning
problem, there was no probability. I will just give you a sample set,
and that’s what you work with. So let’s see what is the addition we
need to do in order to adjust the statement of the learning problem to
accommodate the new ingredient. And the new ingredient is important,
because otherwise we cannot learn. It’s not like we have the luxury
of doing without it. So we go back to the learning
diagram from last time. Do you remember this one? Let me remind you. Here is your target function,
and it’s unknown. And I promised you last time that it
will remain unknown, and the promise will be fulfilled. We are not going to touch this box. We’re just going to add another box
to accommodate the probability. And the target function generates
the training examples. These are the only things that
the learning algorithm sees. It picks a hypothesis from the
hypothesis set, and produces it as the final hypothesis, which hopefully
approximates f. That’s the game. So what is the addition
we are going to do? In the bin analogy, this
is the input space. Now the input space
has a probability. So I need to apply this probability to
the points from the input space that are being generated. I am going to introduce a probability
distribution over the input space. Now the points in the input space–
let’s say the d-dimensional Euclidean space– are not just generic points now. There is a probability of picking
one point versus the other. And that is captured by the probability,
which I’m going to call capital P. Now the interesting thing is that I’m
making no assumptions about P. P can be anything. I just want a probability. So invoke any probability you want, and
I am ready with the machinery. I am not going to restrict the
probability distributions over X. That’s number one. So this is not as bad as it looks. Number two, I don’t even
need to know what P is. Of course, the probability choice will
affect the choice of the probability of getting a green marble or a red
marble, because now the probability of different marbles changed, so it
could change the value mu. But the good news with the Hoeffding is
that I could bound the performance independently of mu. So I can get away with not only any P,
but with a P that I don’t know, and I’ll still be able to make the
mathematical statement. So this is a very benign addition
to the problem. And it will give us very high
dividends, which is the feasibility of learning. So what do you do with
the probability? You use the probability to generate the
points x_1 up to x_N. So now x_1 up to x_N are assumed to be
generated by that probability independently. That’s the only assumption
that is made. If you make that assumption,
we are in business. But the good news is, as
I mentioned before, we did not compromise about
the target function. You don’t need to make assumptions about
the function you don’t know and you want to learn, which is good news. And the addition is almost technical. That there is a probability somewhere,
generating the points. If I know that, then I can make
a statement in probability. Obviously, you can make that statement
only to the extent that the assumption is valid, and we can discuss that
in later lectures when the assumption is not valid. So, OK. Happy ending. We are done, and we now have
the correspondence. Are we done? Well, not quite. Why are we not done? Because the analogy I gave
you requires a particular hypothesis in mind. I told you that the red and green marbles
correspond to the agreement between h and the target function. So when you tell me what h is,
you dictate the colors here. All of these colors. This is green not because it’s
inherently green, not because of anything inherent about
the target function. It’s because of the agreement between
the target function and your hypothesis, h. That’s fine, but what is the problem? The problem is that I know that
for this h, nu generalizes to mu. You’re probably saying, yeah,
but h could be anything. I don’t see the problem yet. Now here is the problem. What we have actually discussed is
not learning, it’s verification. The situation as I describe it– you have a single bin and you have red
and green marbles, and this and that, corresponds to the following. A bank comes to my office. We would like a formula
for credit approval. And we have data. So instead of actually taking the data,
and searching hypotheses, and picking one, like the perceptron learning
algorithm, here is what I do that corresponds to what I just described. You guys want a linear formula? OK. I guess the salary should
have a big weight. Let’s say 2. The outstanding debt is negative, so
that should be a weight minus 0.5. And years in residence are important,
but not that important. So let’s give them a 0.1. And let’s pick a threshold
that is high, in order for you not to lose money. Let’s pick a threshold of 0.5. Sitting down, improvising an h. Now, after I fix the h, I ask you for
the data and just verify whether the h I picked is good or bad. That I can do with the bin, because
I’m going to look at the data. If I miraculously agree with everything
in your data, I can definitely declare victory
by Hoeffding. But what are the chances that this
will happen in the first place? I have no control over whether I will
be good on the data or not. The whole idea of learning is that I’m
searching the space to deliberately find a hypothesis that
works well on the data. In this case, I just dictated
a hypothesis. And I was able to tell you for sure
what happens out-of-sample. But I have no control of what news
I’m going to tell you. You can come to my office. I improvise this. I go to the data. And I tell you, I have
a fantastic system. It generalizes perfectly, and
it does a terrible job. That’s what I have, because when
I tested it, nu was terrible. So that’s not what we are looking for. What we are looking for is
to make it learning. So how do we do that? No guarantee that nu will be small. And we need to choose the hypothesis
from multiple h’s. That’s the game. And in that case, you are going to go for
the sample, so to speak, generated by every hypothesis, and then you pick
the hypothesis that is most favorable, that gives you the least error. So now, that doesn’t look
like a difficult thing. It worked with one bin. Maybe I can have more than one bin, to
accommodate the situation where I have more than one hypothesis. It looks plausible. So let’s do that. We will just take multiple bins. So here is the first bin. Now you can see that
this is a bad bin. So that hypothesis is terrible. And the sample reflects
that, to some extent. But we are going to have other bins,
so let’s call this something. So this bin corresponds
to a particular h. And since we are going to have other
hypotheses, we are going to call this h_1 in preparation
for the next guy. The next guy comes in,
and you have h_2. And you have another mu_2. This one looks like a good hypothesis,
and it’s also reflected in the sample. And it’s important to look
at the correspondence. If you look at the top red point here
and the top green point here, this is the same point in the input space. It just was colored red here
and colored green here. Why did that happen? Because the target function disagrees
with this h, and the target function happens to agree with this h. That’s what got this the color green. And when you pick a sample, the sample
also will have different colors, because the colors depend
on which hypothesis. And these are different hypotheses. That looks simple enough. So let’s continue. And we can have M of them. I am going to consider a finite number
of hypotheses, just to make the math easy for this lecture. And we’re going to go more sophisticated
when we get into the theory of generalization. So now I have this. This is good. I have samples, and the samples
here are different. And I can do the learning, and the
learning now, abstractly, is to scan these samples looking
for a good sample. And when you find a good sample, you
declare victory, because of Hoeffding, and you say that it must be that the
corresponding bin is good, and the corresponding bin happens to be
the hypothesis you chose. So that is an abstraction of learning. That was easy enough. Now, because this is going to stay with
us, I am now going to introduce the notation that will survive with us
for the entire discussion of learning. So here is the notation. We realize that both mu, which
happens to be inside the bin, and nu, which happens to be
the sample frequency– in this case, the sample frequency of
error– both of them depend on h. So I’d like to give a notation
that makes that explicit. The first thing, I am going to call mu and nu
with a descriptive name. So nu, which is the frequency in the
sample you have, is in-sample. That is a standard definition for what
happens in the data that I give you. If you perform well in-sample, it means
that your error in the sample that I give you is small. And because it is called in-sample,
we are going to denote it by E_in. I think this is worth blowing up,
because it’s an important one. This is our standard notation for
the error that you have in-sample. Now, we go and get the other one,
which happens to be mu. And that is called out-of-sample. So if you are in this field, I guess
what matters is the out-of-sample performance. That’s the lesson. Out-of-sample means something
that you haven’t seen. And if you perform out-of-sample, on
something that you haven’t seen, then you must have really learned. That’s the standard for it,
and the name for it is E_out. With this in mind, we realize that we
don’t yet have the dependency on h which we need. So we are going to make the notation a little
bit more elaborate, by calling E_in and E_out– calling them E_in of h, and E_out of h. Why is that? Well, the in-sample performance– you
are trying to see the error of approximating the target function
by your hypothesis. That’s what E_in is. So obviously, it depends
on your hypothesis. So it’s E_in of h. Someone else picks another h, they will
get another E_in of their h. Similarly E_out, the corresponding
one is E_out of h. So now, what used to be
nu is now E_in of h. What used to be mu, inside
the bin, is E_out of h. Now, the Hoeffding Inequality,
which we know all too well by now, said that. So all I’m going to do is just
replace the notation. And now it looks a little bit
more crowded, but it’s exactly the same thing. The probability that your in-sample
performance deviates from your out-of- sample performance by more than your
prescribed tolerance is less than or equal to a number that
is hopefully small. And you can go back and forth. There’s nu and mu, or you can go here
and you get the new notation. So we’re settled on the notation now. Now, let’s go for the multiple
bins and use this notation. These are the multiple
bins as we left them. We have the hypotheses h_1 up to h_M,
and we have the mu_1 and mu_M. And if you see 1, 2, M, again,
this is a disappearing nu– the symbol that the app doesn’t like. But thank God we switched
notations, so that something will appear. Yeah! So right now, that’s what we have. Every bin has an out-of-sample
performance, and out-of- sample is: Out. Of. Sample. So this is a sample. What’s in it is in-sample. What is not in it is out-of-sample. And the out-of-sample depends on
h_1 here, h_2 here, and h_M here. And obviously, these quantities will be
different according to the sample, and these quantities will be different
according to the ultimate performance of your hypothesis. So we solved the problem. It’s not verification.
It’s not a single bin. It’s real learning. I’m going to scan these. So that’s pretty good. Are we done already? Not so fast. [LAUGHING] What’s wrong? Let me tell you what’s wrong. The Hoeffding Inequality, that we have
happily studied and declared important and all of that, doesn’t apply
to multiple bins. What? You told us mathematics, and you go
read the proof, and all of that. Are you just pulling tricks on us? What is the deal here? And you even can complain. We sat for 40 minutes now going from
a single bin, mapping it to the learning diagram, mapping it to
multiple bins, and now you tell us that the main tool we developed
doesn’t apply. Why doesn’t it apply, and what
can we do about it? Let me start by saying why it doesn’t
apply, and then we can go for what we can do about it. Now, everybody has a coin. I hope the online audience
have a coin ready. I’d like to ask you to take
the coin out and flip it, let’s say, five times. And record what happens. And when you at home flip the
coin five times, please, if you happen to get all five heads in
your experiment, then text us that you got all five heads. If you get anything else,
don’t bother text us. We just want to know if someone
will get five heads. Everybody is done flipping the coin. Because you have been so generous and
cooperative, you can keep the coin! [LAUGHTER] Now, did anybody get five heads? All five heads? Congratulations, sir. You have a biased coin, right? We just argued that in-sample
corresponds to out-of-sample, and we have this Hoeffding thing, and therefore
if you get five heads, it must be that this coin
gives you heads. We know better. So in the online audience,
what happened? MODERATOR: Yeah, in the online audience,
there’s also five heads. PROFESSOR: There are lots of
biased coins out there. Are they really biased coins? No. What is the deal here? Let’s look at it. Here, with the audience here, I didn’t
want to push my luck with 10 flips, because it’s a live broadcast. So I said five will work. For the analytical example,
let’s take 10 flips. Let’s say you have a fair coin,
which every coin is. You have a fair coin. And you toss it 10 times. What is the probability that you
will get all 10 heads? Pretty easy. One half, times one half,
10 times, and that will give you about 1 in 1000. No chance that you will get it– not no chance, but very little chance. Now, the second question is the one we
actually ran the experiment for. If you toss 1000 fair coins– it wasn’t
1000 here. It’s how many there. Maybe out there is 1000. What is the probability that some
coin will give you all 10 heads? Not difficult at all to compute. And when you get the answer, the answer
will be it’s actually more likely than not. So now it means that the 10 heads in
this case are no indication at all of the real probability. That is the game we are playing. Can I look at the sample and infer
something about the real probability? No. In this case, you will get 10
heads, and the coin is fair. Why did this happen? This happened because
you tried too hard. Eventually what will happen is– Hoeffding applies to any one of them. But there is a probability, let’s
say half a percent, that you will be off here. Another half a percent that
you will be off here. If you do it often enough, and you are
lucky enough that the half percents are disjoint, you will end up with
extremely high probability that something bad will happen, somewhere. That’s the key. So let’s translate this into
the learning situation. Here are your coins. And how do they correspond
to the bins? Well, it’s a binary experiment, whether
you are picking a red marble or a green marble, or you are flipping
a coin getting heads or tails. It’s a binary situation. So there’s a direct correspondence. Just get the probability of heads being
mu, which is the probability of a red marble, corresponding to them. So because the coins are fair, actually all the bins in this case
are half red, half green. That’s really bad news
for a hypothesis. The hypothesis is completely random. Half the time it agrees with
the target function. Half the time it disagrees. No information at all. Now you apply the learning paradigm we
mentioned, and you say: let me generate a sample from
the first hypothesis. I get this, I look at it,
and I don’t like that. It has some reds. I want really a clean hypothesis
that performs perfectly– all green. You move on. And, OK. This one– even, I don’t know. This is even worse. You go on and on and on. And eventually, lo and behold,
I have all greens. Bingo. I have the perfect hypothesis. I am going to report this to my
customer, and if my customer is in financial forecasting, we are going
to beat the stock market and make a lot of money. And you start thinking about the car you
are going to buy, and all of that. Well, is it bingo? No, it isn’t. And that is the problem. So now, we have to find something
that makes us deal with multiple bins properly. Hoeffding Inequality– if you have one
experiment, it has a guarantee. The guarantee gets terribly diluted as
you go, and we want to know exactly how the dilution goes. So here is a simple solution. This is a mathematical slide.
I’ll do it step-by-step. There is absolutely nothing
mysterious about it. This is the quantity we’ve
been talking about. This is the probability
of a bad event. But in this case, you realize
that I’m putting g. Remember, g was our final hypothesis. So this corresponds to a process where
you had a bunch of h’s, and you picked one according to a criterion, that
happens to be an in-sample criterion, minimizing the error there, and
then you report the g as the one that you chose. And you would like to make a statement
that the probability for the g you chose– the in-sample error– happens to
be close to the out-of-sample error. So you’d like the probability of the
deviation being bigger than your tolerance to be, again, small. All we need to do is find a Hoeffding
counterpart to this, because now this fellow is loaded. It’s not just a fixed hypothesis
and a fixed bin. It actually corresponds to a large
number of bins, and I am visiting the random samples in order to pick one. So clearly the assumptions of Hoeffding
don’t apply– that correspond to a single bin. This probability is less
than or equal to the probability of the following. I have M hypotheses– capital M hypotheses. h_1, h_2, h_3, h_M. That’s my entire learning model. That’s the hypothesis set that I have,
finite as I said I would assume. If you look at what is the probability
that the hypothesis you pick is bad? Well, this will be less than
or equal to the probability that the first hypothesis is bad, or the second
hypothesis is bad, or, or, or the last hypothesis is bad. That is obvious. g is one of them. If it’s bad, one of them is bad. So less than or equal to that. This is called the union
bound in probability. It’s a very loose bound, in general,
because it doesn’t consider the overlap. Remember when I told you that the half
a percent here, half a percent here, half a percent here– if you are very unlucky and these are
non-overlapping, they add up. The non-overlapping is the worst-case
assumption, and it is the assumption used by the union bound. So you get this. And the good news about this is that I
have a handle on each term of them. The union bound is coming up. So I put the OR’s. And then I use the union bound to say that this
is less than or equal to, and simply sum the individual probabilities. So the half a percent plus half a percent
plus half a percent– this will be an upper bound
on all of them. The probability that one of them goes
wrong, the probability that someone gets all heads, and I add the
probability for all of you, and that makes it a respectable probability. So this event here is implied. Therefore, I have the implication because
of the OR, and this one because of the union bound, where I have
the pessimistic assumption that I just need to add the probabilities. Now, all of this– again, we make
simplistic assumptions, which is really not simplistic as in trivially
restricting, but rather the opposite. We just don’t want to make any
assumptions that restrict the applicability of our result. So we took the worst case. It cannot get worse than that. If you look at this, now
I have good news for you. Because each term here is
a fixed hypothesis. I didn’t choose anything. Every one of them has a hypothesis
that was declared ahead of time. Every one of them is a bin. So if I look at a term by itself,
Hoeffding applies to this, exactly the same way it applied before. So this is a mathematical
statement now. I’m not looking at the
bigger experiment. I reduced the bigger experiment
to a bunch of quantities. Each of them corresponds to a simple
experiment that we already solved. So I can substitute for each of
these by the bound that the Hoeffding gives me. So what is the bound that
the Hoeffding gives me? That’s the one. For every one of them, each of
these guys was less than or equal to this quantity. One by one. All of them are obviously the same. So each of them is smaller
than this quantity. Each of them is smaller than this quantity. Now I can be confident that the
probability that I’m interested in, which is the probability that
the in-sample error being close to the out-of-sample error–
the closeness of them is bigger than my tolerance, the bad event. Under the genuine learning scenario– you
generate marbles from every bin, and you look deliberately for a sample
that happens to be all green or as green as possible, and
you pick this one. And you want an assurance that
whatever that might be, the corresponding bin will genuinely
be good out-of-sample. That is what is captured
by this probability. That is still bounded by something,
which also has that exponential in it, which is good. But it has an added factor that will be
a very bothersome factor, which is: I have M of them. Now, this is the bad event. I’d like the probability to be small. I don’t like to magnify the right-hand
side, because that is the probability of something bad happening. Now, with M, you realize that if you use 10 hypotheses, this
probability is probably tight. If you use a million hypotheses, we
probably are already in trouble. There is no guarantee, because now the
million gets multiplied by what used to be a respectable probability, which
is 1 in 100,000, and now you can make the statement that the probability
that something bad happens is less than 10. [LAUGHING] Yeah, thank you very much. We have to take a graduate
course to learn that! Now you see what the problem is. And the problem is extremely
intuitive. In that Q&A session after the last
lecture, we all got through the discussion the assertion that if you
have a more sophisticated model, the chances are you will memorize in-sample,
and you are not going to really generalize well out-of-sample,
because you have so many parameters to work with. There are so many ways to look at that
intuitively, and this is one of them. If you have a very sophisticated model–
M is huge, let alone infinite. That’s later to come. That’s what the theory of
generalization is about. But if you pick a very sophisticated
example with a large M, you lose the link between the in-sample
and the out-of-sample. So you look at here. [LAUGHING], I didn’t mean it this
way, but let me go back just to show you what it is. At least you know it’s
over, so that’s good. So this fellow is supposed
to track this fellow. The in-sample is supposed to
track the out-of-sample. The more sophisticated the model you
use, the looser that in-sample will track the out-of-sample. Because the probability of them
deviating becomes bigger and bigger and bigger. And that is exactly the
intuition we have. Now, surprise. The next one is for the Q&A. We will
take a short break, and then we will go to the questions and answers. We are now in the Q&A session. And if anybody wants to ask a question,
they can go to the microphone and ask, and we can start
with the online audience questions, if there are any. MODERATOR: The first question is what happens when
the Hoeffding Inequality gives you something trivial,
like less than 2? PROFESSOR: Well, it means that
either the resources of the examples you have, the amount of data you have,
is not sufficient to guarantee any generalization, or– which is somewhat equivalent– that your tolerance is too stringent. The situation is not
really mysterious. Let’s say that you’d like to take
a poll for the president. And let’s say that you ask
five people at random. How can you interpret the result? Nothing. You need a certain amount of respondents
in order for the right-hand side to start
becoming interesting. Other than that, it’s
completely trivial. It’s very likely that what you have seen
in-sample doesn’t correspond to anything out-of-sample. MODERATOR: So in the case
of the perceptron– the question is would each set
of w’s be considered a new m? PROFESSOR: The perceptron and, as a matter of fact, every
learning model of interest that we’re going to encounter, the
number of hypotheses, M, happens to be infinite. We were just talking about the
right-hand side not being meaningful because it’s bigger than 1. If you take
an infinite hypothesis set and verbatim apply what I said, then you
find that the probability is actually less than infinity. That’s very important. However, this is our first step. There will be another step, where we deal
with infinite hypothesis sets. And we are going to be able to describe
them with an abstract quantity that happens to be finite, and that
abstract quantity will be the one we are going to use in the counterpart
for the Hoeffding Inequality. That’s why there is mathematics
that needs to be done. Obviously, the perceptron has an infinite
number of hypotheses because you have real space, and here is your
hypothesis, and you can perturb this continuously as you want. Even just by doing this, you already
have an infinite number of hypotheses without even exploring further. MODERATOR: OK,
and this is a popular one. Could you go over again in slide 6, of
the implication of nu equals mu and vice versa. PROFESSOR: Six. It’s a subtle point, and it’s common
between machine learning and statistics. What do you do in statistics? What is the cause and effect for
a probability and a sample? The probability results in a sample. So if I know the probability, I can
tell you exactly what is the likelihood that you’ll get one
sample or another or another. Now, what you do in statistics
is the reverse of that. You already have the sample, and you are
trying to infer which probability gave rise to it. So you are using the effect to
decide the cause rather than the other way around. So the same situation here. The bin is the cause. The frequency in the sample
is the effect. I can definitely tell you what the
distribution is like in the sample, based on the bin. The utility, in terms of learning,
is that I look at the sample and infer the bin. So I infer the cause based
on the effect. There’s absolutely nothing
terrible about that. I just wanted to make the point clear,
that when we write the Hoeffding Inequality, which you can see here,
we are talking about this event. You should always remember that nu is
the thing that plays around and causes the probability to happen,
and mu is a constant. When we use it to predict that the
out-of-sample will be the same as the in- sample, we are really taking nu as
fixed, because this is the in- sample we’ve got. And then we are trying to interpret
what mu gave rise to it. And I’m just saying that, in this case,
since the statement is of the form that the difference between them,
which is symmetric, is greater than epsilon, then if you look at this as
saying mu is there and I know that nu will be approximately the same,
you can also flip that. And you can say, nu is here, and I
know that mu that gave rise to it must be the same. That’s the whole idea. It’s a logical thing rather
than a mathematical thing. MODERATOR: OK. Another conceptual question that is
arising is that a more complicated model corresponds to
a larger number of h’s. And some people are asking– they thought each h was a model. PROFESSOR: OK. Each h is a hypothesis. A particular function, one of them you
are going to pick, which is going to be equal to g, and this is the g that
you’re going to report as your best guess as an approximation for f. The model is the hypotheses that you’re
allowed to visit in order to choose one. So that’s the hypothesis
set, which is H. And again, but there is
an interesting point. I’m using the number of hypotheses as
a measure for the complexity in the intuitive argument that I gave you. It’s not clear at all that the pure number
corresponds to the complexity. It’s not clear that anything that has
to do with the size, really, is the complexity. Maybe the complexity has to do with
the structure of individual hypotheses. And that’s a very interesting point. And that will be discussed at some
point– the complexity of individual hypotheses versus the complexity of
the model that captures all the hypotheses. This will be a topic that we will
discuss much later in the course. MODERATOR: Some people are
getting ahead. So how do you pick g? PROFESSOR: OK. We have one way of picking g– that
already was established last time– which is the perceptron
learning algorithm. So your hypothesis set is H. Script H. It has a bunch of h’s, which are the
different lines in the plane. And you pick g by applying the PLA,
the perceptron learning algorithm, playing around with this boundary,
according to the update rule, until it classifies the inputs correctly,
assuming they are linearly separable, and the one you end up with
is what is declared g. So g is just a matter of notation,
a name for whichever one we settle on, the final hypothesis. How you pick g depends on what
algorithm you use, and what hypothesis set you use. So it depends on the learning model,
and obviously on the data. MODERATOR: OK. This is a popular question. So it says: how would you extend the
equation to support an output that is a valid range of responses
and not a binary response? PROFESSOR: It can be done. One of the things that I mentioned
here is that this fellow, the probability here, is uniform. Now, let’s say that you are not talking
about a binary experiment. Instead of taking the frequency of error
versus the probability of error, you take the expected value
of something versus the sample average of it. And they will be close to each other,
and some, obviously technical, modification is needed to be here. And basically, the set of laws of large
numbers, from which this is one member, has a bunch of members that actually
have to do with expected value and sample average, rather than just the
specific case of probability and sample average. If you take your function as being 1,
0, and you take the expected value, that will give you the sample as the
sample average, and the probability as the expected value. So it’s not a different animal. It’s just a special case that
is easier to handle. And in the other case, one of the things
that matters is the variance of your variable. So it will affect the bounds. Here, I’m choosing epsilon in general,
because the variance of this variable is very limited. Let’s say that the probability
is mu, so the variance is mu times 1 minus mu. It goes from a certain value
to a certain value. So it can be absorbed. It’s bounded above and below. And this is the reason why the
right-hand side here can be uniformly done. If you have something that has variance
that can be huge or small, then that will play a role in your
choice of epsilon, such that this will be valid. So the short answer is: it can be done. There is a technical modification, and
the main aspect of the technical modification, that needs to be taken into
consideration, is the variance of the variable I’m talking about. MODERATOR: OK. There’s also a common confusion. Why are there are multiple bins? PROFESSOR: OK. The bin was only our conceptual
tool to argue that learning is feasible in a probabilistic sense. When we used a single bin, we had
a correspondence with a hypothesis, and it looked like we actually captured
the essence of learning, until we looked closer and we realized that, if
you restrict yourself to one bin and apply the Hoeffding Inequality directly
to it, what you are really working with– if you want to put it in
terms of learning– is that my hypothesis set
has only one hypothesis. And that corresponds to the bin. So now I am picking it– which is my only choice. I don’t have everything else. And all I’m doing now is verifying that
its in-sample performance will correspond to the out-of-sample
performance, and that is guaranteed by the plain-vanilla Hoeffding. Now, if you have actual learning,
then you have more than one hypothesis. And we realize that the bin changes with
the hypothesis, because whether a marble is red or green depends on
whether the hypothesis agrees or disagrees with your target function. Different hypotheses will
lead to different colors. Therefore, you need multiple bins to
represent multiple hypotheses, which is the only situation that admits
learning as we know it– that I’m going to explore the hypotheses,
based on their performance in-sample, and pick the one that performs best,
perhaps, in-sample, and hope that it will generalize well out-of-sample. MODERATOR: OK. Another confusion. Can you resolve the relationship
between the probability and the big H? so I’m not clear exactly what– PROFESSOR: We applied the– there are a bunch of components
in the learning situation, so let me get the– It’s a big diagram, and it
has lots of components. So one big space or set is X, and
another one is H. So if you look at here. This is hypothesis set H. It’s a set. OK, fine. And also, if you look here, the target
function is defined from X to Y, and in this case, X is also a set. The only invocation of probability that
we needed to do, in order to get the benefit of the probabilistic
analysis in learning, was to put a probability distribution on X. H, which is down there, is left
as a fixed hypothesis set. There is no question of
a probability on it. When we talk about the Bayesian
approach, in the last lecture in fact, there will be a question of
putting a probability distribution here in order to make the whole
situation probabilistic. But that is not the approach that is
followed for the entire course, until we discuss that specific
approach at the end. Question. STUDENT: What do we do when there
are many possible hypotheses which will satisfy my criteria? Like, in perceptron, for example. I could have several hyperplanes which
could be separating the set. So how do I pick the best– PROFESSOR: Correct. Usually, with a pre-specified
algorithm, you’ll end up with something. So the algorithm will
choose it for you. But your remark now is that, given that there are many solutions
that happen to have zero in-sample error, there is really no distinction
between them in terms of the out-of- sample performance. I’m using the same hypothesis set,
so M is the same. And the in-sample error is the same. So my prediction for the out-of-sample
error would be the same, as there’s no distinction between them. The good news is that the learning
algorithm will solve this for you, because it will give you one specific,
the one it ended with. But even within the ones that achieve
zero error, there is a method, that we’ll talk about later on when we
talk about support vector machines, that prefers one particular solution
as having a better chance of generalization. Not clear at all given what I said
so far, but I’m just telling you, as an appetizer, there’s something
to be done in that regard. MODERATOR: OK. A question is does the inequality
hold for any g, even if g is not optimal? PROFESSOR: What about the g? MODERATOR: Does it hold for any
g, no matter how you pick g? PROFESSOR: Yeah. So the whole idea– once you write the symbol g, you
already are talking about any hypothesis. Because by definition, g is the final
hypothesis, and your algorithm is allowed to pick any h from the
hypothesis set and call it g. Therefore, when I say g, don’t
look at a fixed hypothesis. Look at the entire learning process that
went through the H, the set of hypotheses, according to the
data and according to the learning rule, and went through and ended up with
one that is declared the right one, and now we call this g. So the answer is patently
that g can be different. Patently yes, just by the notation
that I’m using. MODERATOR: Also, some confusion. With the perceptron algorithm
or any linear algorithm– there’s a confusion that, at each
step, there’s a hypothesis, but– PROFESSOR: Correct. But these are hidden processes for us. As far as analysis I mentioned,
you get the data, the algorithm does something magic, and
ends up with a final hypothesis. In the course of doing that, it will
obviously be visiting lots of hypotheses. So the abstraction of having just the
samples sitting there, and eyeballing them and picking the one that happens
to be green, is an abstraction. In reality, these guys happen in
a space, and you are moving from one hypothesis to another by
moving some parameters. And in the course of doing that,
including in the perceptron learning algorithm, you are moving from
one hypothesis to another. But I’m not accounting for that, because
I haven’t found my final hypothesis yet. When you find the final hypothesis,
you call it g. On the other hand, because I use the
union bound, I use the worst-case scenario, the generalization bound
applies to every single hypothesis you visited or you didn’t visit. Because what I did to get the bound, of
deviation between in-sample and out-of- sample, is that I consider that all the
hypotheses simultaneously behave from in-sample to out-of-sample, closely
according to your epsilon criterion. And that obviously guarantees that
whichever one you end up with will be fine. But obviously, it could be an overkill. And among the positive side effects
of that is that even the intermediate values have
good generalization– not that we look at it or consider it,
but just to answer the question. MODERATOR: A question
about the punchline. They say that they don’t understand
exactly how the Hoeffding works– shows that learning is feasible. PROFESSOR: OK. Hoeffding shows that verification
is feasible. The presidential poll makes sense. That, if you have a sample and you have
one question to ask, and you see how the question is answered in the
sample, then there is a reason to believe that the answer in the general
population, or in the big bin, will be close to the answer you got in-sample. So that’s the verification. In order to move from verification to
learning, you need to be able to make that statement, simultaneously on
a number of these guys, and that’s why you had the modified Hoeffding
Inequality at the end, which is this one that has the red M in it. This is no longer the plain-vanilla
Hoeffding Inequality. We’ll still call it Hoeffding. But it basically deals with a situation
where you have M of these guys simultaneously, and you want to
guarantee that all of them are behaving well. Under those conditions, this is the
probability that the guarantee can give, and the probability, obviously,
is looser than it used to be. So the probability that bad thing
happens when you have many possibilities is bigger than the
probability that bad things happen when you have one of them. And this is the case where you added up
as if they happen disjointly, as I mentioned before. MODERATOR: Can it be said that the
bin corresponds to the entire population in a– PROFESSOR: The bin corresponds
to the entire population before coloring. So remember the gray bin– I have it somewhere. We had a viewgraph where the
bin had gray marbles. So this is my way of saying this
is a generic input, and we call it X. And this is indeed the input space in
this case, or the general population. Now, we start coloring it according
to when you give me a hypothesis. So now there’s more in the process
than just the input space. But indeed, the bin can correspond to
the general population, and the sample will correspond to the people you polled
over the phone, in the case of the presidential thing. MODERATOR: Is there a relation between
the Hoeffding Inequality and the p-values in statistics? PROFESSOR: Yes. The area where we are trying to say that
if I have a sample and I get an estimate on the sample, the
estimate is reliable. The estimate is close to
the out-of-sample. The probability that you will deviate–
is a huge body of work. And the p-value in statistics
is one approach. And there are other laws of large
numbers that come with it. I don’t want to venture
too much into that. I basically picked from that jungle of
mathematics the single most useful formula that will get me home when
I talk about the theory of generalization. And I want to focus on it. I want to understand it– this specific
formula– perfectly, so when we keep modifying it until we get to the
VC dimension, things are clear. And, obviously, if you get curious about
the law of large numbers, and different manifestations of in-sample
being close to out-of-sample and probabilities of error, that is a very
fertile ground, and a very useful ground to study. But it is not a core subject
of the course. The subject is only borrowing
one piece as a utility to get what it wants. So that ends the questions here? Let’s call it a day, and
we will see you next week.


  1. Oliver Nicholls

    March 25, 2013 at 12:16 pm

    I'm getting confused towards the end. Once you have g, what use is it to compare it to every h in H? Surely, g is the best h in H, that's how it became g. Also, why does he add M to the inequality at the very end? Doesn't that just increase the value, the bigger H is? So with a large H and a subsequent large M, won't the comparison be totally redundant? I think he made that point at the end, but I don't see why he added it in, in the first place.

  2. Oliver Nicholls

    April 13, 2013 at 10:08 pm

    Ahhh, thank you. Very well explained.

  3. WhyMe432532

    April 28, 2013 at 9:18 pm

    Thanks for the excellent lecture. Really enjoying them. Very well explained.

  4. Sam Chien

    July 4, 2013 at 4:18 pm

    I think the original Hoeffding's equality applies when the hypothesis is specified BEFORE you see the data (e.g. a crazy hypothesis like if it is raining then approves the credit card otherwise not). However, in reality, we will learn a specific hypothesis using the data (e.g. using least squares to learn the regression coefficients), in that case, the learned hypothesis is g and you can considered it as chosen from a hypothesis space (H). If the hypothesis space is finite of size M, then

  5. Sam Chien

    July 4, 2013 at 4:26 pm

    cont. the formula with M on the RHS will be used, which is very conservative. Unfortunately, most hypothesis space is not finite, i.e., you have infinite number of hypothesis in it, so you can't use M to measure the model complexity. So in that case, we will resort to something called VC dimension and derive generalization bounds based on that.

  6. Esteban Afonso

    July 15, 2013 at 7:32 pm

    great teacher

  7. Sergio Torassa

    July 20, 2013 at 4:32 pm

    It would be great to see the Professor in some courses on Coursera. He is one of the best one I've ever heard. Thanks!

  8. Arun Kumar M S

    September 13, 2013 at 5:56 pm

    Something like feynman's lectures is being attempted.

  9. amyslo

    September 28, 2013 at 12:08 pm


    It is much better then this one.

  10. Sergio Torassa

    September 30, 2013 at 10:16 am

    Thanks for pointing it out! I don't know how could I miss looking for a ML course on Coursera. The problem will be the overlap with the Criptography one from Dan Boneh

  11. Sonia1978NYC

    October 1, 2013 at 5:13 am

    Probably Approximately Correct 😮 I have the book by Leslie Valiant

  12. Sonia1978NYC

    October 1, 2013 at 5:39 am

    5!= 5*4*3*2*1=120/12

  13. Sonia1978NYC

    October 1, 2013 at 5:40 am

    ^oops, 120/120…5!/5!1!=1

  14. Sonia1978NYC

    October 1, 2013 at 5:41 am


  15. Sonia1978NYC

    October 1, 2013 at 5:44 am

    2^10=1024…ten heads:1/1024

  16. Andrei Krishkevich

    October 6, 2013 at 11:01 pm

    The Coursera ML course assumes you're an idiot to start with, teaches you little and then proclaims you an "expert". This course assumes substantial background, teaches you things in-depth, and at the end is still humble about how much knowledge is gave you. Andrew Ng's YouTube lectures recorded at Stanford are quite good though.

  17. judgeomega

    October 14, 2013 at 7:49 pm

    Ng's lectures are hardly even within the realm of a true 'lecture'. His command of the english language is very limited and he rarely explains anything beyond the iteration of formulas and proofs. You would be better served by reading a book on the subject.
    Abu-mostafa is a TRUE teacher which walks you through the process. Of the dozens of lectures on this complex subject, he has the best compromise between content and approachability.

  18. Marco E. Benalcázar

    October 21, 2013 at 8:14 pm

    This is the right answer (1/1024) for the first question in the "coin analogy"

  19. Vishnu Teja

    November 6, 2013 at 5:55 pm

    I am doing the course on ML in Coursera, It is a very good start for someone who jwants to get started and know what machine learning is all about and do some exercises to get a feel for it. That said, simple derivations in calculus which you should know from high school are skipped and just the final formula is given which is a little disappointing. I don't see how anyone can do machine learning without knowing basic calculus. Too emphasis is placed on being nice.

  20. Murtuza

    November 14, 2013 at 9:43 pm

    I cannot thank you much for this lecture. You make machine learning math a piece of cake.

  21. Yi-Lin Su

    November 17, 2013 at 7:38 am

    thank you!!!

  22. delightfulsunny

    December 4, 2013 at 10:17 pm

    Remind myself that this is just foundation, and that it is dry and Zzzz…. but must ….keep…going..
    an hour later…really the summary is that the more your model cater towards a specific sample, your model is more prone to failure when it comes to unknown. it is like fourier series, fitting too well to the data can lead to not actually learning at all  >…< 

  23. adila papaya

    December 19, 2013 at 5:09 pm


    Now, between you and me, I prefer the original formula better. Without the 2.
    However, the formula with the 2s has the distinct advantage of being … true. So we have to settle for that.

    Best quote ever on the Hoeffding's Inequality. 🙂

  24. Rabab Hamed M ALy

    January 17, 2014 at 2:31 pm

    very nice

  25. Matam .Manjunath

    February 20, 2014 at 3:09 pm

    Thank you Caltech and Prof 

  26. Dan K

    February 28, 2014 at 8:50 pm

    I like how he says okaaaaaay !

  27. Olugbemi Eric

    August 20, 2014 at 1:09 pm

    thanks a lot for sharing. This will surely be of help to me in my m.sc

  28. Yair Elblinger

    February 4, 2015 at 8:03 pm

    This I don't understand: We have a probability measure P on X. If this P tends to choose x's for which decision is easy wouldn't it be wrong to use Ein(h) to estimate Eout(h)?
    In the credit card approval example, if we have h that simply decides that a person is reliable if the person has any positive amount of money even 1$. And assume P tends to choose people with large amounts of money. Wouldn't there be a bayes there? 

  29. Yair Elblinger

    February 4, 2015 at 8:18 pm

    Ok, kind of stupid but I think I can answer my question. Not just Ein but also Eout is affected by P. That is Eout(h)=sigma P(x)h(x). That answers it. 

  30. Yair Elblinger

    February 4, 2015 at 8:23 pm

    This means that we assume that the data we have was sampled in the same way real data occurs. Does not seems a trivial assumption to me but makes allot of sense to need such an assumption.
    Anyway, great lecture! 

  31. Alexander Virtser

    February 22, 2015 at 6:45 pm

    Thank you very much – a brilliant work  

  32. Ghor Havanovich

    February 24, 2015 at 1:34 pm

    I stopped following after 23min,, what to do?

  33. Dissonantia Cognitiva

    May 16, 2015 at 9:53 pm

    He sounds just like King Julian, the king of the lemurs, I keep on hoping he starts singing "I like to move it move it"

  34. Ragia Ibrahim

    September 27, 2015 at 7:26 am

    I couldn't understand this clearly 🙁

  35. AschMu

    October 18, 2015 at 8:18 pm

    I love this guy !

  36. User User

    November 3, 2015 at 2:46 pm

    Thanks to pofessor and Caltech, sending me back to my youth! I recall myself, excited by outstanding lecturers, I met then…. His speach and passion perfectly colourizes the topic and, I beleive, is of great help to fogeign students to understand.

  37. Sharmila Velamur

    November 23, 2015 at 3:45 am

    Professor, your lectures are so enjoyable that I look forward to "learning" :). Thank you!

  38. Potado Tomado

    December 23, 2015 at 5:52 am

    Prof. Abu-Mostafa is the man! Super cool guy.

  39. southshofosho

    December 31, 2015 at 11:54 pm

    He fields questions like a total G, I love this course

  40. Badra Sid

    January 17, 2016 at 8:34 am

    Brillant ya yaser.

  41. Badra Sid

    January 17, 2016 at 8:37 am

    The best way to have a grasp of all these lectures is to do all the homeworks and projects. If you are a "learning machine", then start to read conference papers on learning machine….so you will be ready for research.

  42. Shem Leong

    January 24, 2016 at 8:49 am

    Would recommend watching at 1.5x

  43. Naveen Kumar

    February 9, 2016 at 8:32 am

    Brilliant lecture. A different approach than that of Andrew ng's. Loved it !!

  44. reemasriganesh

    February 17, 2016 at 6:08 am

    Loved it! THANK YOU Professor Yaser, and thank you, Caltech!

  45. Santosh Paudel

    May 20, 2016 at 8:37 pm

    can anybody explain to me what epsilon exactly means there! Thank you

  46. Edu WHs

    May 22, 2016 at 1:01 am

    Umm… I'm probably missing sth 😉 What formula is he using to get to 63% probabilities? If each coin gets 10 straight heads once each 1,024 times (say we run it infinite times… Then the proportion should be 1 over 2 to the N, right? So 1 over 2 to the ten, so once each 1024 times roughly.) So because the probability of each coin is independent, doesn't it mean that the probability should be almost 100%? (1000/1024)

    Ah, Ok… So even if you had 100 trillion flips each with 99.9% chances of being heads, you still have 0.01×0.01… (100 trillion times) of chances to get all tails.

    For this example, you have 1/1024 chances of it being heads 10 consecutive times, so you have (1-(1/1024)) chances of it having at least one of the 10 flips being tails… That is, if you have 1 in 1024 chances of it being heads 10 straight times, you have 1023 in 1024 chances of it not being that. And if it is not that, it means that, at least, there is one tails somewhere (at least one) that would break the chain. So over 1000 repetitions, you have (1-(1/1024)) to the 1000, or (1023/1024) to the 1000, or 37% chances to get at least one tail on each set of 10. So 63% chances approx. to get 10 consecutive heads.

    That being said, I still believe if the chances are 1 in 1024 to get 10/10 heads, for each 1024 attempts when the number of attempts goes towards infinity, we should get at least one of those to be 10 straight heads? So maybe it has to do with distribution? Like sometimes you can get 2 or more sets of 10 straight heads in your lot of 1000, while other times you may get none. So the chances of you finding (in a lot of 1000 tries) at least 1 set of 10 straight heads is 63%? (because they can form clusters, and sometimes you will get a group with none)

    Or maybe it doesn't have to do with that? I mean, what are probabilities, really? Say you have 99.9% chances to get heads and 0.01% to get tails. You do it twice and the chances to get at least one time heads are really high, of course. But there is 1/10,000 chances of actually being tails and tails… So if you go towards infinity, you might think the distribution would be 99.9% of the time, no matter where or the order, you get heads, and 0.01% of the time you would get tails. But for N tries, there is 0.01 to the N chances of actually being all tails… So you can do it 100 trillion times, or go towards infinity, and there is still a very, very, very small, but real chance to, well, get all tails. So the chance is there, and now let's suppose it happens… Now, if that slim chance was the way events unfolded, then that option would happen forever, infinite times, and the 99.9% chances would mean nothing. You might say, well, but if we run the experiment again, now we will probably get 99.9% of the time heads. So the 99.9% vs 0.01% probability isn't wrong… But actually, this new set of samples can be concatenated to the last, as they go towards infinity and the premise is that this will happen (eventually) infinite times, and ALL times, as one single time not getting tails would break the chain…

    So now we might say it is unlikely, but now think of a person seeing it, witnessing the event… Wouldn't they say chances are 100% tails?

    So one important thing is that possibilities don't guarantee you will get heads and tails in a proportion of 1/1024 and 1023/1024. It really doesn't. A probability of 90% doesn't mean sth will happen 90% of the time, but that we believe it has '9 chances out of ten to be that'. But once the drawing is made, it can happen 70% of the time only, on 2% of the time, and stay like this forever… At least that's my understanding of it after giving it some thought!

  47. Nikolaj Ezhov

    June 28, 2016 at 6:28 pm

    Why he is calling "h" a hypothesis and not a function?

  48. Július Marko

    August 5, 2016 at 1:30 pm

    Hello. What should I do, if I misunderstood a lot of this talk? It can matter of language, because I'm not from English spoken countries, but other, just programming courses I understand. Specially things about machine learning, I don't. What to do? Study math or what…

  49. Korakot Chaovavanich

    October 7, 2016 at 4:08 am

    Such a marvelous lecture!
    The logical sequence, the explanation, the joke, the intuitutive powerpoint/animation, …
    I wish all my lectures are like this

  50. cesar leon

    October 7, 2016 at 4:16 am

    he is an amazing teacher, but something bothers me, for some reason when i hear him, it reminfs me of the synth voice of S. Hawkings.

  51. philipralph

    October 31, 2016 at 6:53 am

    @45:24: I got 5 heads!!! Actually a total of 7 consecutive heads before my first tails. You always think it happens to someone else…

  52. Theodore Galanos

    November 26, 2016 at 2:17 am

    Hello and thank you for the wonderful lectures!
    I'm new in this field and I am trying to combine it with Computational Geometry. As such, my problems are unique in the sense that the training sets can (usually) be constructed at the will of the modeller. The data are always (potentially) there, it is a matter of choosing which to produce. I was wondering if there is a theoretical or practical approach to an optimized selection of training samples from the whole? Does that relate to assigning a specifc (hopefully optimum in some way)l P(x), e.g. uniform distribution which takes samples uniformly from the whole? Or is it that random selection is still good enough in this case?
    Thank you in advance.

  53. omkar chavan

    November 30, 2016 at 3:59 pm


  54. bigblind

    December 6, 2016 at 10:32 pm

    I find him a great teacher, but I don't understand why they switch away from the slides sometimes. Seeing him talk doesn't give me any information, so just keep the slides on screen.

  55. Nilakash Das

    January 6, 2017 at 10:51 am

    I did not understand the union bound concept. My doubt is that shouldn't the upper bound (the probability of a hypothesis that is selected is bad) be (1/M) times (2M exp{-2e^2N}, assuming each hypthesis is equally likely to be selected. An analogy is consider this question
    "There are two bags containing white and black balls. In the first bag there are 8 white and 6 black balls and in the second bag there are 4 white and 7 black balls. One ball is drawn at random from any of these two bags. Find the probability of this ball being black."
    In the above question assume that selecting a bad ball signifies a bad event. Thus, P(bad event)=1/2*6/14+1/2*7/11=(1/2)*(6/14+7/11). In this example, M=2.

  56. Finley McDonell

    January 6, 2017 at 12:28 pm

    a colossal bore …

  57. Sarnath K

    March 18, 2017 at 2:01 am

    Awesome lecture.. The 10 times coin flip experiment was surprising and it is pretty interesting. Few observations:
    1) By choosing a sufficiently big M, then I can get inequality like P[Bad Event] < 2 : which actually is a no-brainer

    2) An absolute value of "bound" is meaningless unless I know the "mu" , the unknown quantity. But that said, practically, we can overcome this by some educated guesses and possibly central limit theorem too.

    3) I am surprised there is no talk of Central limit theorem… I was expecting that something will be proved based on it…Possibly Hoeffding has relation to it….

  58. Bernard van Tonder

    May 5, 2017 at 2:36 pm

    13:00 Isn't this 'problem' just Hume's original problem of induction?

  59. Jack Sparrow

    June 1, 2017 at 9:46 am

    this is just the 2nd lecture. yet it is painfully hard for me.

  60. Ning Li

    July 21, 2017 at 2:37 pm

    The professor teaches so good. Glad they made the video.

  61. pablo garcia

    July 29, 2017 at 6:25 pm

    fantastic lecture !!!! thanks a lot


    August 11, 2017 at 7:41 pm

    Really Brilliant lecture ..

  63. Saimon Thapa

    August 22, 2017 at 5:39 am

    I got 5 heads!! really

  64. granand

    September 3, 2017 at 10:19 am

    Thank you caltech and Professor. But please can you help me, it's decades I touched maths to catch up. Tell which links I must read and understand so I can get back here to follow like you smart guys.

  65. granand

    September 3, 2017 at 10:21 am

    Can some one list me all the formulas I must know for the entire course

  66. Neil Bryan Closa

    September 24, 2017 at 5:04 pm

    This is an amazing lecture. Looking forward for watching the next lecture videos.

  67. Prem Kumar Tiwari

    October 9, 2017 at 4:17 am

    If you toss 1000 fair coin 10 times , what is the probability that some coin will get 10 heads ? :

    Can someone explain , how answer is 63% ?

  68. Kieran Mace

    October 22, 2017 at 2:55 am

    Is an new hypothesis h_avg, defined as an average over a subset set of hypotheses in H, necessarily also in H? or does it depend on the functional form of H?

  69. Jon Snow

    November 23, 2017 at 8:01 am

    55:59 Overfitting!!!

  70. Jon Snow

    November 23, 2017 at 8:54 am

    What was the conclusion of the lecture? I mean how did we prove that learning is feasible?

  71. David Tung

    November 27, 2017 at 9:11 am

    Didn't state Hoeffding's inequality correctly. The value of nu must be bounded with a range 1 with the formula (at 20:30).

  72. Ming Li

    November 28, 2017 at 1:33 pm

    this is awesome man, tks.

  73. naga pradeep kumar

    December 3, 2017 at 9:11 pm

    In slide 23, why is the probablity equation of g dependent on all hypothesis while we pick only one out of the multiple hypothesis? Shouldnt it be equal to the probabilty of the hypothesis chosen?

  74. ajayram198

    January 10, 2018 at 1:22 am

    This is quite a difficult lecture, couldnt understand much of it!

  75. Soumik Dasgupta

    January 26, 2018 at 6:31 am

    What does it mean mathematically "A pattern must exist" in the data?

  76. wafa Mribah

    April 5, 2018 at 3:09 pm

    One thing I couldn't figured it out though is how the target function and the hypothesis would agree? how the comparison occurs?

  77. Sevendi Eldrige Rifki Poluan

    April 12, 2018 at 2:38 pm

    Great explanation sir. I like ur accent. Your accent is really like a Ratahan accent.

  78. Ahmed Elsayed

    April 23, 2018 at 5:57 pm

    I did not see that mentioned anywhere, Dr. Yasser has a book describing the course content in more details called "Learning From Data".

  79. Nissar Ali

    May 20, 2018 at 6:32 pm

    beautiful so far !!

  80. 丛亮

    June 4, 2018 at 3:08 pm

    Q&A session is great, which explains a lot questions in my mind. Especially for me without looking at other materials

  81. 丛亮

    June 4, 2018 at 4:19 pm

    How is the probability distribution over X considered into the learning process? The marbles (sample) from the bin (space) are subjected to the probability distribution. How does the probability affect learning? I only know that the multi-bins problem necessitates the modification of the plain-vanilla hoeffding's inequality. The multi-bins are brought about by the number of hypothesis in the hypothesis set, not by the probability distribution over the X space.

  82. Naman Vats

    July 9, 2018 at 6:50 pm

    How would you know what is the value of E(out) ??

  83. shakesbeer00

    August 11, 2018 at 7:27 pm

    Thanks for the excellent lecture. Here are a couple of questions:
    At 42:22 Just want to be more rigorous, would the P notation in this Hoeffding inequality depend on both X and y?
    At 50:02 How exactly is g defined here in order to have this inequality hold? The inequality seems to require that g minimize the | Ein – Eout|? But that is not intuitive. Instead, it is more intuitive to have g that minimizes Ein (or eventually Eout) based on the definition of Ein and Eout earlier.

  84. Logan Phillips

    September 19, 2018 at 4:14 pm


    How do we compare the hypothesis to the target function if we never know what the target function is?

  85. Yan Zhang

    September 24, 2018 at 1:04 pm

    So, is learning feasible?

  86. Aviral Janveja

    December 4, 2018 at 1:08 am

    Brilliant ! 😀

  87. angthuonimo

    December 6, 2018 at 1:58 pm

    There's a mistake in here that the union of probabilities was equated with their sum. The reducio in absurdum of P>1 is pointed out. This should be a union of independent probabilities because the samples (handfuls of marbles) are independent. Lectures 5+ would be moot without this mistake because M would be bounded to 1, let alone to a polynomial. Am I missing something?

  88. Gaurav Jain

    December 20, 2018 at 2:07 am

    Just awesome!!!!

  89. Michael Nguyen

    January 8, 2019 at 8:27 pm

    Does anyone else find this guy absolutely hilarious for some reason? Something about the look on his face makes you feel like he's constantly thinking "Yeah, I'm killing this lecture right now". When he whips out that smug half smile I can't help but laugh out loud. You can tell he loves teaching. Great set of lectures.

  90. Thang Bom

    January 15, 2019 at 3:37 pm

    excellent lecture. It looks like that the inequation in the final verdict is so loose that P[|E_in(g) – E_out(g)| > eps] <= sum which is larger than 1 is almost obvious for any selected function g if M is so large. This corresponds to the case there is no pattern in the population. Anyway, the idea of trying to explain the overfitting symptom by concrete problem and formula is outstanding. Bravo!!

  91. d13tr

    January 27, 2019 at 12:52 pm

    If this video is confusing to you, consider the following:

    The example at 9:34 is only to show that we know something about the entire set, based on a sample. Basically, it says the bigger the sample, the closer v relates to u (Hoeffding).

    At 28:00, forget the above example. We are not trying to make a hypothesis for that example. The new values have nothing to do with above example. From this point on, we have a random data set (X) with an unknown function f. We want to know if we could make a hypothesis h to predict results. In other words: is learning feasible?

    So in the new bin, the probability of how many points are green is a measurement for how correct a hypothesis is. We do not know how many are green, so we take a sample. In this sample, we get the relation between correct and incorrect results of the hypothesis and this says something about the entire bin (Hoeffding). So if the sample is sufficiently big and has a lot of positive predictions than yes, learning is feasible.

    Or not? -> 33:30

    okay? okay.

  92. Kezwik

    February 6, 2019 at 8:04 pm

    Hey, i just want to thank caltech sooo much for this course.
    I am currently studying informatiks in thw first year, with the goal of Machinelearning and AI. But most of the courses i have to make are out of my interests. I unterstand that a lot of those courses are juat basic for some of the stuff in higher semesters, but since i am studying because of interest and not because of a graduation, i only need the basics for the stuff i am interested in higher semester, like machine learning. All the rest is not really needed. And that is why this course really is what i want, because i can learn what my interests are in and don't have to deal.with all the other stuff that i in specific won't need.

    I even understand more in this course than i do in the lectures of my university. Even when i don't speak that much english. Your Professor is reeaally good with explaining and keep that style of PDFs. Because the PDFs are why i don't get anything in my university…there is a minimum of 10 formulas and 150 words per slice, so basically every slice is just a wall of text the professor would read of :/

    Thank you caltech

  93. plekkchand

    March 6, 2019 at 11:15 am

    Why the "okay?" hiccup every other sentence?

  94. blaoi

    March 22, 2019 at 4:44 pm

    I got five heads

  95. Romulo Amaral

    May 1, 2019 at 1:32 pm

    Thumbs up with you got five heads or five tails

  96. sagar rathi

    May 18, 2019 at 9:27 pm

    How much does this person knows, if he calls such a big concept just as simple tool. Respect.

  97. Darklink9110

    July 4, 2019 at 7:34 pm

    Im confused what the green and red marbles mean, if you pick random marbles from the bin they are random? what is learning in this context?

  98. Yue HAN

    July 20, 2019 at 6:16 am

    what is the P over X, didnt quite get it

  99. Minh Tâm Nguyễn

    July 28, 2019 at 2:56 am

    Love both his voice and jokes. A brilliant professor

  100. sreerag mundirikkal

    October 8, 2019 at 5:28 pm

    In a world of bestest MOCCs this playlist stands apart.

Leave a Reply