17. Learning: Boosting


PATRICK WINSTON: We’ve now
almost completed our journey. This will be it for
talking about several kinds of learning– the venerable kind, that’s
the nearest neighbors and identification tree
types of learning. Still useful, still the right
thing to do if there’s no reason not to do the
simple thing. Then we have the
biologically-inspired approaches. Neural nets. All kinds of problems with local
maxima and overfitting and oscillation, if you get
the rate constant too big. Genetic algorithms. Like neural nets, both are very
naive in their attempt to mimic nature. So maybe they work on
a class of problems. They surely do each have a class
of problems for which they’re good. But as a general purpose first
resort, I don’t recommend it. But now the theorists have come
out and done some things are very remarkable. And in the end, you have to
say, wow, these are such powerful ideas. I wonder if nature has
discovered them, too? Is there good engineering
in the brain, based on good science? Or given the nature of
evolution, is it just random junk that is the best ways
for doing anything? Who knows? But today, we’re going to talk
about an idea that I’ll bet is in there somewhere, because it’s
easy to implement, and it’s extremely powerful in what
it does, and it’s the essential item in anybody’s
repertoire of learning mechanisms. It’s also a mechanism which,
if you understand only by formula, you will never be able
to work the problems on the quiz, that’s for sure. Because on the surface, it
looks like it’d be very complicated to simulate
this approach. But once you understand how it
works and look at a little bit of the math and let it sing
songs to you, it turns out to be extremely easy. So it’s about letting multiple
methods work in your behalf. So far, we’ve been talking about
using just one method to do something. And what we’re going to do now
is we’re looking to see if a crowd can be smarter than the
individuals in the crowd. But before we get too far down
that abstract path, let me just say that the whole works
has to do with classification, and binary classification. Am I holding a piece of chalk in
my hand, or a hand grenade? Is that a cup of
coffee or tea? Those are binary classification
problems. And so we’re going to be talking
today strictly about binary classification. We’re not going to be talking
about finding the right letter in the alphabet that’s
written on the page. That’s a 26-way choice. We’re talking about
binary choices. So we assume that there’s
a set of classifiers that we can draw on. Here’s one– h. And it produces either a
minus 1 or a plus 1. So that’s how the classification
is done. If it’s coffee, plus 1. If it’s tea, minus 1. Is this chalk, plus one. If it’s a hand grenade,
minus 1. So that’s how the classification
works. Now, too bad for us, normally
the world doesn’t give us very good classifiers. So if we look at the error rate
of this classifier or any other classifier, that error
rate will range from 0 to 1 in terms of the fraction
of the cases got wrong on a sample set. So you’d like your error rate
to be way down here. You’re dead if it’s
over there. But what about in the middle? What if it’s, say,
right there. Just a little bit better
than flipping a coin. If it’s just a little bit better
than flipping a coin, that’s a weak classifier. And the question is, can you
make a classifier that’s way over here, like there, a
strong classifier, by combining several of these
weak classifiers, and letting them vote? So how would you do that? You might say, well, let us make
a big classifier capital H, that works on some sample x,
and has its output produces something that depends on the
sum of the outputs of the individual classifiers. So we have H1 working on x. We have H2 working on x. And we have H3 also
working on x. Let’s say three of them,
just to start us off. And now let’s add those
guys up, and take the sign of the output. So if two out of the three of
those guys agree, then we’ll get an either plus
1 or minus 1. If all three agree, we’ll
get plus 1 or minus 1. Because we’re just
taking the sign. We’re just taking the sign
of the sum of these guys. So this means that one guy can
be wrong, as long as the other two guys are right. But I think it’s easier to see
how this all works if you think of some space of samples,
you say, well, let’s let that area here be where H1
is wrong, and this area over here is where H2 is wrong. And then this area over here
is where H3 is wrong. So if the situation is like
that, then this formula always gives you the right answers
on the samples. I’m going to stop saying that
right now, because I want to be kind of a background thing
on the samples set. We’re talking about wrapping
this stuff over the sample set. Later on, we’ll ask, OK, given
that you trained this thing on a sample set, how well does it
do on some new examples? Because we want to
ask ourselves about overfitting questions. But for now, we just want to
look and see if we believe that this arrangement, where
each of these H’s is producing plus 1 or minus 1, we’re adding
them up and taking the sign, is that going to give us a
better result than the tests individually? And if they look like this when
draped over a sample set, then it’s clear that we’re going
to get the right answer every time, because there’s no
area here where any two of those tests are giving
us the wrong answer. So the two that are getting
the right answer, in this little circle here for H1, these
other two are getting the right answer. So they’ll outvote it, and
you’ll get the right answer every time. But it doesn’t have
to be that simple. It could look like this. There could be a situation
where this is H1, wrong answer. This is H2, wrong answer. And this is H3, wrong answer. And now the situation gets a
little bit more murky, because we have to ask ourselves whether
that area where three out of the three get it wrong
is sufficiently big so as to be worse than 1 of the
individual tests. So if you look at that Venn
diagram, and stare at it long enough, and try some things, you
can say, well, there is no case where this will give
a worse answer. Or, you might end up with the
conclusion that there are cases where we can arrange those
circles such that the voting scheme will give an
answer that’s worst than an individual test, but I’m not
going to tell you the answer, because I think we’ll make
that a quiz question. Good idea? OK. So we’ll make that
a quiz question. So that looks like
a good idea. And we can construct a little
algorithm that will help us pick the particular weak
classifiers to plug in here. We’ve got a whole bag
of classifiers. We’ve got H1, we’ve got
H2, we’ve got H55. We’ve got a lot of them
we can choose from. So what we’re going to do is
we’re going to use the data, undisturbed, to produce H1. We’re just going to try all the
tests on the data and see which one gives us the
smallest error rate. And that’s the good guy, so
we’re going to use that. Then we’re going to use
the data with an exaggeration of H1 errors. In other words– this is a critical idea. What we’re going to do is
we’re going to run this algorithm again, but instead of
just looking at the number of samples that are got wrong,
what we’re going to do is we’re going to look at a
distorted set of samples, where the ones we’re not doing
well on has exaggerated effect on the result. So we’re going to weight them
or multiply them, or do something so that we’re going
to pay more attention to the samples on which H1 produces an
error, and that’s going to give us H2. And then we’re going to do it
one more time, because we’ve got three things to go with here
in this particular little exploratory scheme. And this time, we’re
going to have an exaggeration of those samples– which samples are we going
to exaggerate now? We might as well look for the
ones where H1 gives us a different answer from H2,
because we want to be on the good guy’s side. So we can say we’re going to
exaggerate those samples four which H1 gives us a different
result from H2. And that’s going
to give us H3. All right. So we can think of this whole
works here as part one of a multi-part idea. So let’s see. I don’t know, what might
be step two? Well, this is a good idea. Then what we’ve got that we can
easily derive from that is a little tree looked
like this. And we can say that H of x
depends on H1, H2, and H3. But now, if that that’s a good
idea, and that gives a better answer than any of the
individual tests, maybe we can make this idea a little bit
recursive, and say, well, maybe H1 is actually
not an atomic test. But maybe it’s the vote
of three other tests. So you can make a
tree structure that looks like this. So this is H11, H12, H13,
and then 3 here. And then this will
be H31, H32, H33. And so that’s a sort of
get out the vote idea. We’re trying to get a whole
bunch of individual tests into the act. So I guess the reason this
wasn’t discovered until about ’10 years ago was because you’ve
got to get so many of these desks all lined up before
the idea gets through that long filter of ideas. So that’s the only idea number
two of quite a few. Well, next thing we might
think is, well, we keep talking about these
classifiers. What kind of classifiers
are we talking about? I’ve got– oh, shoot, I’ve spent
my last nickel. I don’t have a coin to flip. But that’s one classifier,
right? The trouble with that classifier
is it’s a weak classifier, because it
gives me a 50/50 chance of being right. I guess there are conditions
in which a coin flip is better than a– it is a weak classifier. If the two outcomes are not
equally probable, than a coin flip is a perfectly good
weak classifier. But what we’re going to do is
we’re going to think in terms of a different set
of classifiers. And we’re going to call
them decision tree. Now, you remember decision
trees, right? But we’re not going to
build decision trees. We’re going to use decision
tree stumps. So if we have a two-dimensional
space that looks like this, then a decision
tree stump is a single test. It’s not a complete tree that
will divide up the samples into homogeneous groups. It’s just what you can
do with one test. So each possible test
is a classifier. How many tests do we
get out of that? 12, right? Yeah. It doesn’t look like
12 to me, either. But here’s how you get to 12. One decision tree test you can
stick in there would be that test right there. And that would be a complete
decision tree stump. But, of course, you can
also put in this one. That would be another
decision tree stump. Now, for this one on the right,
I could say, everything on the right is a minus. Or, I could say, everything
on the right is a plus. It would happen to be wrong, but
it’s a valid test with a valid outcome. So that’s how we double the
number of test that we have lines for. And you know what? can even have a kind of test out
here that says everything is plus, or everything
is wrong. So for each dimension, the
number of decision tree stumps is the number of lines
I can put in times 2. And then I’ve got two dimensions
here, that’s how I got to twelve. So there are three lines. I can have the pluses
on either the left or the right side. So that’s six. And then I’ve got two
dimensions, so that gives me 12. So that’s the decision
tree stump idea. And here are the other decision
tree boundaries, obviously just like that. So that’s one way can generate
a batch of tests to try out with this idea of using
a lot of tests to help you get the job done. STUDENT: Couldn’t you also have
a decision tree on the right side? PATRICK WINSTON: The question
is, can you also have a test on the right side? See, this is just a stand-in for
saying, everything’s plus or everything’s minus. So it doesn’t matter where
you put the line. It can be on the right side,
or the left side, or the bottom, or the top. Or you don’t have to put
the line anywhere. It’s just an extra test, an
additional to the ones you put between the samples. So this whole idea
of boosting, the main idea of the day. Does it depend on using
decision tree stumps? The answer is no. Do not be confused. You can use boosting with
any kind of classifier. so why do I use decision
tree stumps today? Because it makes my life easy. We can look at it, we can
see what it’s doing. But we could put bunch of
neural nets in there. We could put a bunch of real
decision trees in there. We could put a bunch of nearest neighbor things in there. The boosting idea
doesn’t care. I just used these decision
tree stumps because I and everybody else use them
for illustration. All right. We’re making progress. Now, what’s the error rate
for any these tests and lines we drew? Well, I guess it’ll be the error
rate is equal to the sum of 1 over n– That’s the total number
of points, the number of samples– summed over the cases
where we are wrong. So gee, we’re going to work on
combining some of these ideas. And we’ve got this notion
of exaggeration. At some stage in what we’re
doing here, we’re going to want to be able to exaggerate
the effect of some errors relative to other errors. So one thing we can do is
we can assume, or we can stipulate, or we can assert that
each of these samples has a weight associated with it. That’s W1, this is W2,
and that’s W3. And in the beginning, there’s no
reason to suppose that any one of these is more
or less important than any of the other. So in the beginning, W sub i
at time [? stub ?] one is equal to 1 over n. So the error is just adding up
the number of samples that were got wrong. And that’ll be the fraction
of samples to that you didn’t get right. And that will be
the error rate. So what we want to do is we want
to say, instead of using this as the error rate for all
time, what we want to do is we want to move that over, and
say that the error rate is equal to the sum over the things
you got wrong in the current step, times the
weights of those that were got wrong. So in step one, everything’s
got the same weight, it doesn’t matter. But if we find a way to change
their weights going downstream– so as to, for example, highly
exaggerate that third sample, then W3 will go up relative
to W1 and W2. The one thing we want to be sure
of is there is no matter how we adjust the weights, that
the sum of the weights over the whole space
is equal to 1. So in other words, we want to
choose the weights so that they emphasize some of the
samples, but we also want to put a constraint on the weights
such that all of them added together is
summing to one. And we’ll say that that enforces
a distribution. A distribution is a set of
weights that sum to one. Well, that’s just a nice idea. So we’re make a little
progress. We’ve got this idea that we
can add some plus/minus 1 classifiers together, you
get a better classifier. We got some idea about
how to do that. It occurs to us that maybe
we want to get a lot of classifiers into the act
somehow or another. And maybe we want to think
about using decision tree stumps so as to ground out
thinking about all this stuff. So the next step is to say,
well, how actually should we combine this stuff? And you will find, in the
literature libraries, full of papers that do stuff
like that. And that was state of the art
for quite a few years. But then people began to say,
well, maybe we can build up this classifier, H of x, in
multiple steps and get a lot of classifiers into the act. So maybe we can say that the
classifier is the sign of H– that’s the one we
picked first. That’s the classifier
we picked first. That’s looking at samples. And then we’ve got H2. And then we’ve got H3. And then we’ve got how many
other classifiers we might want, or how many classifiers
we might need in order to correctly classify everything
in our sample set. So people began to think about
whether there might be an algorithm that would develop
a classifier that way, one step at a time. That’s why I put that step
number in the exponent, because we’re picking this one
at first, then we’re expanding it to have two, and then we’re
expanding it to have three, and so on. And each of those individual
classifiers are separately looking at the sample. But of course, it would be
natural to suppose that just adding things up wouldn’t
be enough. And it’s not. So it isn’t too hard to invent
the next idea, which is to modify this thing just a little
bit by doing what? It looks almost like a scoring
polynomial, doesn’t it? So what would we do to tart
this up a little bit? STUDENT: [INAUDIBLE]. PATRICK WINSTON: Come again? Do what? STUDENT: [INAUDIBLE]. PATRICK WINSTON: Somewhere out
there someone’s murmuring. STUDENT: Add– PATRICK WINSTON: Add weights! STUDENT: –weights. Yeah. PATRICK WINSTON: Excellent. Good idea. So what we’re going to do is
we’re going to have alphas associated with each of these
classifiers, and we’re going to determine if somebody
can build that kind formula to do the job. So maybe I ought to modify this
gold star idea before I get too far downstream. And we’re not going to treat
everybody in a crowd equally. We’re going to wait some of the
opinions more than others. And by the way, they’re all
going to make errors in different parts of the space. So maybe it’s not the wisdom of
even a weighted crowd, but a crowd of experts. Each of which is good at
different parts of the space. So anyhow, we’ve got this
formula, and there are a few things that one can
say turn out. But first, let’s write down the
an algorithm for what this ought to look like. Before I run out of space, I
think I’ll exploit the right hand board here, and put the
overall algorithm right here. So we’re going to start out by
letting of all the weights at time 1 be equal to 1 over n. That’s just saying that they’re
all equal in the beginning, and they’re
equal to 1 over n. And n is the number
of samples. And then, when I’ve got
that, I want to compute alpha, somehow. Let’s see. No, I don’t want to do that. I want to I want to pick a classifier the
minimizes the error rate. And then m, i, zes,
error at time t. And that’s going to
be at time t. And we’re going to come
back in here. That’s why we put a step
index in there. So once we’ve picked a
classifier that produces an error rate, then we can
use the error rate to determine the alpha. So I want the alpha over here. That’ll be sort of a byproduct
of picking that test. And with all that stuff in
hand, maybe that will be enough to calculate Wt plus 1. So we’re going to use that
classifier that we just picked to get some revised weights,
and then we’re going to go around that loop until this
classifier produces a perfect set of conclusions on
all the sample data. So that’s going to be our
overall strategy. Maybe we’ve got, if we’re going
to number these things, that’s the fourth big idea. And this arrangement here
is the fifth big idea. Then we’ve got the
sixth big idea. And the sixth big
idea says this. Suppose that the weight on it
ith sample at time t plus 1 is equal to the weight at time t
on that same sample, divided by some normalizing factor,
times e to the minus alpha at time t, times h at time t, times
some function y which is a function of x, But not
a function of time. Now you say, where did
this come from? And the answer is, it did not
spring from the heart of mathematician in the first
10 minutes that he looked at this problem. In fact, when I asked
[INAUDIBLE] how this worked, he said, well,
he was thinking about this on the couch every Saturday
for about a year, and his wife was getting pretty
sore, but he finally found it and saved their marriage. So where does stuff like
this come from? Really, it comes from knowing
a lot of mathematics, and seeing a lot of situations,
and knowing that something like this might be
mathematically convenient. Something like this might be
mathematically convenient. But we’ve got to back up a
little and let it sing to us. What’s y? We saw y last time. The support vector machines. That’s just a function. That’s plus 1 or minus 1,
depending on whether the output ought to be plus
1 or minus 1. So if this guy is giving the
correct answer, and the correct answer is plus, and then
this guy will be plus 1 too, because it always gives
you the correct answer. So in that case, where this
guy is giving the right answer, these will have the same
sign, so that will be a plus 1 combination. On the other hand, if that guy’s
giving the wrong answer, you’re going to get a minus
1 out of that combination. So it’s true even if the right
answer should be minus, right? So if the right answer should
be minus, and this is plus, then this will be minus 1, and
the whole combination well give you minus 1 again. In other words, the y just flips
the sign if you’ve got the wrong answer, no matter
whether the wrong answer is plus 1 or minus 1. These alphas– shoot, those are the same
alphas that are in this formula up here, somehow. And then that z, what’s
that for? Well, if you just look at the
previous weights, and its exponential function to produce
these W’s for the next generation, that’s not going to
be a distribution, because they won’t sum up to 1. So what this thing here, this
z is, that’s a sort of normalizer. And that makes that whole
combination of new weights add up to 1. So it’s whatever you got by
adding up all those guys, and then dividing by that number. Well, phew. I don’t know. Now there’s some
it-turns-out-thats. We’re going to imagine that
somebody’s done the same sort of thing we did to the support
vector machines. We’re going to find a way
to minimize the error. And the error we’re going to
minimize is the error produced by that whole thing
up there in 4. We’re going to minimize the
error of that entire expression as we go along. And what we discover when
we do the appropriate differentiations and stuff– you know, that’s what
we do in calculus– what we discover is that you
get minimum error for the whole thing if alpha is equal
to 1 minus the error rate at time t, divided by the
error rate at time t. Now let’s take the logarithm
of that, and multiply it by half. And that’s what [INAUDIBLE] was struggling to find. But we haven’t quite
got it right. And so let me add this in
separate chunks, so we don’t get confused about this. It’s a bound on that expression
up there. It’s a bound on the error rate
produced by that expression. So interestingly enough, this
means that the error rate can actually go up as you add
terms to this formula. all you know is that the error
rate is going to be bounded by an exponentially decaying
function. So it’s eventually guaranteed
to converge on zero. So it’s a minimal error bound. It turns out to be
exponential. Well, there it is. We’re done. Would you like to see
a demonstration? Yeah, OK. Because you look at that, and
you say, well, how could anything like that
possibly work? And the answer is, surprisingly
enough, here’s what happens. There’s a simple
little example. So that’s the first
test chosen. the greens are pluses and the
reds are minuses, so it’s still got an error. Still got an error– boom. There, in two steps. It now has– we can look in the upper
right hand corner– we see its used three
classifiers, and we see that one of those classifiers says
that everybody belongs to a particular class, three
different weights. And the error rate has
converged to 0. So let’s look at a couple
of other ones. Here is the one I use for
debugging this thing. We’ll let that run. See how fast it is? Boom. It converges to getting all the
samples right very fast. Here’s another one. This is one we gave on an
exam a few years back. First test. Oh, I let it run, so
it got everything instantaneously right. Let’s take that through
step at a time. There’s the first
one, second one. Still got a lot of errors. Ah, the error rate’s dropping. And then flattened, flattened,
and it goes to 0. Cool, don’t you think? But you say to me, bah, who
cares about that stuff? Let’s try something
more interesting. There’s one. That was pretty fast, too. Well, there’s not too
many samples here. So we can try this. So there’s an array of
pluses and minuses. Boom. You can see how that error
rate is bounded by an exponential? So in a bottom graph, you’ve got
the number of classifiers involved, and that goes up to
a total, eventually, of 10. You can see how positive
or negative each of the classifiers that’s added
is by looking at this particular tab. And this just shows how
they evolve over time. But the progress thing here
is the most interesting. And now you say to me, well, how
did the machine do that? And it’s all right here. We use an alpha that
looks like this. And that allows us to compute
the new weights. It says we’ve got a preliminary
calculation. We’ve got to find a z that
does the normalization. And we sure better bring our
calculator, because we’ve got, first of all, to calculate
the error rate. Then we’ve got to take its
logarithm, divide by 2, plug it into that formula, take the
exponent, and that gives us the new weight. And that’s how the
program works. And if you try that,
I guarantee you will flunk the exam. Now, I don’t care about
my computer. I really don’t. It’s a slave, and it can
calculate these logarithm and exponentials till it turns
blue, and I don’t care. Because I’ve got four cores or
something, and who cares. Might as well do this,
than sit around just burning up heat. But you don’t want to do that. So what you want to do is you
want to know how to do this sort of thing more
expeditiously. So we’re going to have to let
them the math sing to us a little bit, with a view towards
finding better ways of doing this sort of thing. So let’s do that. And we’re going to run out of
space here before long, so let me reclaim as much of
this board as I can. So what I’m going to do is I’m
going to say, well, now that we’ve got this formula for alpha
that relates alpha t to the error, then I can plug
that into this formula up here, number 6. And what I’ll get is that the
weight of t plus 1 is equal to the weight at t divided by
that normalizing factor, multiplied times something that
depends on whether it’s categorized correctly or not. That’s what that y’s in
their for, right? So we’ve got a logarithm here,
and we got a sign flipper up there in terms of that H
of x and y combination. So if the sign of that whole
thing at minus alpha and that y H combination turns out to be
negative, then we’re going to have to flip the numerator
and denominator here in this logarithm, right? And oh, by the way, since we’ve
got a half out here, that turns out to be the square
root of that term inside the logarithm. So when we carefully do that,
what we discover is that it depends on whether it’s the
right thing or not. But what it turns out to be is
something like a multiplier of the square root. Better be careful, here. The square root of what? STUDENT: [INAUDIBLE]. PATRICK WINSTON: Well,
let’s see. But we have to be careful. So let’s suppose that this is 4
things that we get correct. So if we get it correct, then
we’re going to get the same sign out of H of x and y. We’ve get a minus sign out
there, so we’re going to flip the numerator and denominator. So we’re going to get the square
root of e of t over 1 minus epsilon of t if
that’s correct. If it’s wrong, it’ll just
be the flip of that. So it’ll be the square root of
1 minus the error rate over the error rate. Everybody with me on that? I think that’s right. If it’s wrong, I’ll have to hang
myself and wear a paper bag over my head like
I did last year. But let’s see if we can make
this go correctly this time. So now, we’ve got this guy here,
we’ve got everything plugged in all right, and we
know that now this z ought to be selected so that it’s equal
to the sum of this guy multiplied by these things as
appropriate for whether it’s correct or not. Because we want, in the end,
for all of these w’s to add up to 1. So let’s see what they add up
to without the z there. So what we know is that it must
be the case that if we add over the correct ones, we
get the square root of the error rate over 1 minus the
rate of the Wt plus 1. Plus now we’ve got the sum of
1 minus the error rate over the error rate, times the sum of
the Wi at time t for wrong. So that’s what we get if
we added all these up without the z. So since everything has to add
up to 1, then z ought to be equal to this sum. That looks pretty horrible,
until we realize that if we add these guys up over the
weights that are wrong, that is the error rate. This is e. So therefore, z is equal the
square root of the error rate times 1 minus the error rate. That’s the contribution
of this term. Now, let’s see. What is the sum of the
weights over the ones that are correct? Well, that must be 1 minus
the error rate. Ah, so this thing gives you the
same result as this one. So z is equal to 2 times that. And that’s a good thing. Now we are getting somewhere. Because now, it becomes a little
bit easier to write some things down. Well, we’re way past this,
so let’s get rid of this. And now we can put some
things together. Let me point out what I’m
putting together. I’ve got an expression
for z right here. And I’ve got an expression
for the new w’s here. So let’s put those together and
say that w of t plus 1 is equal to w of t. I guess we’re going to
divide that by 2. And then we’ve got this square
root times that expression. So if we take that correct one,
and divide by that one, then the [INAUDIBLE] cancel out, and I get 1 over
1 minus the error rate. That’s it. That’s correct. And if it’s not correct,
then it’s Wt over 2– and working through the math– 1 over epsilon, if wrong. Do we feel like we’re
making any progress? No. Because we haven’t let it
sing to us enough yet. So I want to draw your attention
to what happens to amateur rock climbers
when they’re halfway up a difficult cliff. They’re usually [INAUDIBLE],
sometimes they’re not. If they’re not, they’re
scared to death. And every once in a while, as
they’re just about to fall, they find some little tiny hole
to stick a fingernail in, and that keeps them
from falling. That’s called a thank-god
hole. So what I’m about to introduce
is the analog of those little places where you can stick
your fingernail in. It’s the thank-god
hole for dealing with boosting problems. So what happens if I add
all these [? Wi ?] up for the ones that the
classifier where produces a correct answer on? Well, it’ll be 1 over 2, and 1
over 1 minus epsilon, times the sum of the Wt for which
the answer was correct. What’s this sum? Oh! My goddess. 1 minus epsilon. So what I’ve just discovered is
that if I sum new w’s over those samples for which I
got a correct answer, it’s equal to 1/2. And guess what? That means that if I sum them
over wrong, it’s equal to 1/2 half as well. So that means that I take all of
the weight for which I got the right answer with the
previous test, and those ways will add up to something. And to get the weights for the
next generation, all I have to do is scale them so that
they equal half. This was not noticed
by the people who developed this stuff. This was noticed by Luis
Ortiz, who was a 6.034 instructor a few years ago. The sum of those weights is
going to be a scaled version of what they were before. So you take all the weights
for which this new classifier– this one you selected to give
you the minimum weight on the re-weighted stuff– you take the ones that it gives
a correct answer for, and you take all of those
weights, and you just scale them so they add up to 1/2. So do you have to compute
any logarithms? No. Do you have to compute
any exponentials? No. Do you have to calculate z? No. Do you have to calculate alpha
to get the new weights? No. All you have to do
is scale them. And that’s a pretty good
thank-god hole. So that’s thank-god
hole number one. Now, for thank-god hole number
two, we need to go back and think about the fact that were
going to give you problems in probability that involve
decision tree stumps. And there are a lot of decision
tree stumps that you might have to pick from. So we need a thank-god
hole for deciding how to deal with that. Where can I find some room? How about right here. Suppose you’ve got a space
that looks like this. I’m just makings this
up at random. So how many– let’s see. 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11. How many tests do I have to
consider in that dimension? 11. It’s 1 plus the number
of samples. That would be horrible. I don’t know. Do I have actually calculate
this one? How could that possibly be
better than that one? It’s got one more thing wrong. So that one makes sense. The other one doesn’t
make sense. So in the end, no test that
lies between two correctly classified samples will
ever be any good. So that one’s a good guy, and
that one’s a good guy. And this one’s a bad guy. Bad guy, bad guy bad
guy, bad guy. Bad guy, bad guy, bad buy. So the actual number of tests
you’ve got is three. And likewise, in the
other dimension– well, I haven’t drawn it so well
here, but would this test be a good one? No. That one? No. Actually, I’d better look over
here on the right and see what I’ve got before I draw
too many conclusions. Let’s look over this, since I
don’t want to think too hard about what’s going on in
the other dimension. But the idea is that
very few of those tests actually matter. Now, you say to me, there’s
one last thing. What about overfitting? Because all this does is drape
a solution over the samples. And like support vector machines
overfit, neural maps overfit, identification
trees overfit. Guess what? This doesn’t seem to overfit. That’s an experimental
result for which the literature is confused. It goes back to providing
an explanation. So this stuff is tried on all
sorts of problems, like handwriting recognition,
understanding speech, all sorts of stuff uses boosting. And unlike other methods, for
some reason as yet imperfectly understood, it doesn’t
seem to overfit. But in the end, they leave no
stone unturned in 6.034. Every time we do this, we do
some additional experiments. So here’s a sample that
I’ll leave you with. Here’s a situation in which we
have a 10-dimensional space. We’ve made a fake distribution,
and then we put in that boxed outlier. That was just put into the space
at random, so it can be viewed as an error point. So now what we’re going to do
is we’re going to see what happens when we run that guy. And sure enough, in 17 steps,
it finds a solution. But maybe it’s overfit that
little guy who’s an error. But one thing you can do is
you can say, well, all of these classifiers are dividing
this space up into chunks, and we can compute the size of the
space occupied by any sample. So one thing we can do– alas, I’ll have to get up
a new demonstration. One thing we can do, now that
this guy’s over here, we can switch the volume tab and watch
how the volume occupied by that error point evolves
as we solve the problem. So look what happens. This is, of course, randomly
generated. I’m counting on this working. Never failed before. So it originally starts
out as occupying 26% of the total volume. It ends up occupying
1.4 times 10 to the minus 3rd% of the volume. So what tends to happen is
that these decision tree stumps tend to wrap themselves
so tightly around the error points, there’s no room for
overfitting, because nothing else will fit in that
same volume. So that’s why I think that this
thing tends to produce solutions which don’t overfit. So in conclusion,
this is magic. You always want to use it. It’ll work with any kind
of [? speed ?] of classifiers you want. And you should understand it
very thoroughly, because of anything is useful in the
subject in dimension learning, this is it.

Leave a Reply