Efficient and Scalable Deep Learning

>>Hi, everyone. I’m very happy to introduce Wei Wen
from Duke University to give us his research presentation on Efficient and
Scalable Deep Learning. Many of us are working
on deep learning or in areas related to deep
learning, and especially nowadays we love to train large models
like BERT and GPT. But here, Wei is going to share
a very unique perspective on deep learning, where he focuses on
the efficiency of deep learning. In particular, he focuses on model compression,
distributed training, and AutoML. Wei also has rich
industry experience. He interned with MSR here and also interned at Google Brain
and Facebook AI. Now let’s learn from Wei.>>Thank you [inaudible]. Hi everyone, I’m Wei from
Duke. Thanks for coming. I’m really glad to
be here since I did two internships here and I’m
trying to get myself upgraded. So the topic today is Efficient and Scalable
Deep Learning and Beyond. The general trend in
deep learning is that we can always get better performance
if we are able to train a larger model, given
we have a lot of data. Here is a figure which covers the majority of
the computer vision models; the x-axis is the
computation cost and the y-axis is the accuracy. In general, if we
build a larger model, we always get better performance. This is typically true within
one specific neural architecture. For example, for ResNet, we can get better performance if we build
deeper neural networks. It is similar for the Inception models. Similarly, in natural
language processing, we see a similar trend. Here is an example for
language modeling on WikiText. If we build a larger model, we get a better perplexity. We are still curious
about how low we can go if we keep building
larger models for NLP problems. So the question is, why don’t we always
build a larger model? There are obstacles. The first obstacle is
on the training side. Training a larger
model is very slow, so if we have a new model that we want to evaluate,
to check whether it is good or not, it takes a longer
training time to get the feedback. Usually we have to evaluate a lot of models before we find a good one, so if the model is too large, we extend our research
cycle and production cycle. The second obstacle is
on the inference side. After we build a model, eventually we want to use it. But if the model is too large, the inference time is very slow, so it is very
challenging to deploy those models to applications which have very limited
computing resources or memory resources, like
the Microsoft HoloLens. So my research in general
is trying to make training faster and
also inference faster, so that we can build larger
models and also make those models deliverable to applications in real
industry production. Here is the outline. I’ll first introduce my
previous research on making training faster, specifically in distributed training systems, and also my research on sparse neural networks
to make inference faster. Finally, I will introduce
my future research. So let’s go to the first part. How can we make the distributed
training system faster? I’ll focus on one work I did on ternary gradients to reduce the communication in
distributed deep learning. Here is a little bit of background about
distributed deep learning. In synchronized SGD, we have a central parameter
server. We first send the parameters to multiple machines and train those models in parallel,
but using different data. After the training is
done at each machine, we synchronize the gradients
to the central parameter server. This finishes one iteration, and it will
keep doing this again and again. This is good because
we have a lot of machines, which means
we can train faster. But we have a problem: the communication
bottleneck. In general, if we
have more machines, we can always reduce
our computation time. But more machines mean more
synchronization over the network, so the communication
time increases. So the total time will
saturate at some point, and you cannot go beyond that point. That limits the scalability of
distributed deep learning systems. So my research goal
is to reduce the communication time and make distributed
training more scalable. It’s simple. In a distributed training system, we can slightly change
the communication pattern so that we only communicate
gradients over the network. Each worker computes the
gradients and sends them to the central parameter server, which averages
the gradients and sends them back. So we only communicate
gradients over the network. Usually, gradients are in floating-point precision, with
32 bits per element.>>[inaudible] the notation.>>The notation?>>Notation. What’s G in
red and what’s G in blue?>>So for the notation here, W is the weights and
G is the gradient. Each worker first computes
the gradient on its own data. Then it sends it to the
parameter server, which averages all the gradients from
all the workers, gets the averaged
gradient, and sends it back. Then each worker is updated
by the averaged gradient.>>So you assume that
the model has n copies?>>Yes.>>Then all the models
are always the same. So they [inaudible] by
the same [inaudible]?>>Yes. The
initialization is the same and the update is also the same, so they will always be the same.>>The distributed training in
[inaudible]. The model is in the copy.>>The model is in the worker, and they use the same
initialization seed, so they will always be the same. This way we only need to
communicate the gradients, and that explains why it’s beneficial to only communicate
the gradients over the network. Usually, it’s 32 bits. In this work, we do quantization. We reduce the precision
to only three levels, so only three discrete values. We call this ternary gradients, TernGrad for short. If we can successfully do this, we get at least a 16 times
reduction in communication. But this is very
challenging, because we lose a lot of precision
in the training process.>>As [inaudible] research
I did as [inaudible].>>Yes. I’m aware of that. Yes. So the basic idea is simple. Before I go into the details, let’s go one step back. In supervised
learning in general, we want to minimize the average loss over all
the training samples, and the weights can be updated by the gradient over the whole dataset. But n is usually very large, so the computation is very costly. So in deep learning we usually
use a stochastic version. We randomly sample from
the dataset and use the sample gradient to estimate
the original batch gradient. It works well, and one reason is that the expectation of the sample gradient
is the original batch gradient, so it’s unbiased. If so, why don’t we do quantization in such a way that
after the quantization, we still keep the expectation? So the motivation is that
we ternarize the floating-point gradient while keeping the expectation
equal to the original gradient, so it’s still an unbiased gradient. That’s the motivation. How can we do it? It’s simple, actually. This one is the floating-point gradient. We first get the sign
of all the gradients. Then we have a scalar, which
is usually very small.>>[inaudible].>>It’s not learned. s_t is the maximum absolute
value of all the gradients. So we have those two parts,
and then we do element-wise multiplication with a binary code,
which is either one or zero. But this one is a random variable; it basically follows a
Bernoulli distribution. For each element, the probability of being one
is just the absolute value of the gradient over
the maximum scalar here.>>What is k?>>k is the index of the gradients. So we have k gradients.>>Coordinate.>>Yes, coordinate, so the index. g_t is all the gradients, and it is element-wise
multiplied by a vector. k is the index into that vector,
which is a binary code like here. So let’s go to the example. Let’s say our floating-point
gradient is like this, and we get the maximum
absolute value, which is 1.2. Then we get the sign
of all the gradients. Then we form our
Bernoulli distribution and its probability of being one. Let’s say for this element, the probability of being one is
just 0.3 over 1.2. So it’s simple. Then we draw from the Bernoulli
distribution with this probability. We get a sample of the binary
code and we multiply all of them together. If we do it this way, and do a little bit of math here, the expectation of the quantized gradient is just the
original one. So it’s unbiased; we
keep the expectation. We do increase the variance a
little bit, but I’ll go into more detail about
how we reduce the variance. Then we only need
about two bits for each element, plus one
single floating-point value. So it significantly reduces
the communication volume.>>Can you explain the [inaudible]? For example, why do we need b_t?>>Why we need b_t
is that we want to draw a sample. We want to draw from a
Bernoulli distribution. Eventually, the value is just one or
zero, so it’s binary. So there are only three possible values here. Then we can encode it in
very low precision.>>Assuming that is
zero, one, and minus one.>>Zero, minus s_t, and s_t. So there is a scalar,
a small scalar here.>>Excuse me. How did you
get from the last line? How did you get from the bottom, the final line, how did
you get from this to that?>>This here?>>Yes.>>So here, we have
the expectation over the two, and then we get the expectation over z and the expectation over b given z.>>Yeah, that’s it.>>This one has no variable
related to z, so it’s gone.>>Yes.>>Then this one is just b. So we get it there.>>But how does the expected value of b equal the expected value
of g? I don’t follow. I think I’m missing something.>>This one?>>Yes.>>Exactly. I just want to
see your simplification here. How do you simplify this?>>The probability. The
[inaudible] probability.>>Yes.>>Just there. The probability is
just g_tk over s_t.>>Yes.>>I didn’t see how
you just got g_t.>>Yes.>>The original g_tk back.>>Yes.>>It’s basically a variable. We won’t go into detail, but it’s just the product. This is just the data, so we only care about b_t. The expectation of b_t
is just the probability, which is this one, and when we multiply
this one with this one, it reduces to that one. We proved the
convergence of TernGrad. This one is the basic assumption used to prove
the convergence of standard SGD, so it’s standard, it’s not ours. To prove the
convergence of TernGrad, we do need a slightly
stronger assumption on the gradient bound here. In standard SGD
the L2 norm should be bounded, but now we require the product of the max norm
and the L1 norm to be bounded. This value is always
larger than this value, so TernGrad does need a
slightly stronger bound. But we propose some tricks to make those two bounds
closer to each other. One is layer-wise ternarization. We do it layer by layer because different layers
have different distributions. We also do gradient clipping to limit the range of the gradients; the details are in the paper. For evaluation, we evaluate
on ImageNet with AlexNet. On AlexNet there’s
no accuracy loss, even using three discrete
values in the gradients. In some cases, we even observe higher accuracy, because when the batch
size is very large, the variance is too
small to learn well, and the variance from the quantization
helps the exploration. Here is the convergence curve
compared with standard SGD; the convergence is also the same. We also evaluate on GoogLeNet. We observe some small loss, but on average
it’s two percent. One thing I want to emphasize is
we didn’t do any hyperparameter tuning. All the hyperparameters are
taken from standard SGD, so we just use them. We use the same learning rate, the same batch size, and the same total epochs, but we could get better
accuracy if we tuned them.>>So have you ever tried this? In the beginning you use
TernGrad to train, but at the end, for example, where standard training is
decreasing the learning rate? For example, after you decrease
the learning rate twice, for the smallest learning rate, you return to the full
precision of the gradient. Does it recover the
original accuracy, or does it still get better accuracy?>>We didn’t try exactly that, but we tried something like every 10 epochs: the first nine
epochs we used TernGrad, and the last epoch we used full precision. It is similar. It’s the same.>>So you mean its accuracy
is similar to TernGrad, it’s not similar to TernGrad?>>Right.>>Original SGD?>>All right. I didn’t try the one you said; probably that won’t help.>>Okay.>>So that’s all about the convergence, and we can
reduce the communication. In practice, how much
faster can it make training? Here is the speed over the number of GPUs in
the distributed system. The solid bars are
standard floating-point SGD, and the shaded bars are ours. We can always get a speedup. In general, TernGrad gives us a higher speedup if the communication time
occupies a higher ratio. That’s obvious, because we
reduce the communication. So it gives us a higher speedup if
we use more machines, since with more machines the communication ratio is higher; or if our communication
bandwidth is very low, so we get a higher
benefit if we use a low-end network like
Ethernet versus InfiniBand; or if we train a
neural network which has more parameters
relative to the computation. So personally I think it
will give us more benefit in NLP problems than in CNNs, because NLP models have more
parameters relative to computation. It can also give us
more benefit if you use a GPU-based distributed system, because GPUs compute faster and the communication is
more of a bottleneck. TernGrad is in production. It’s adopted by Facebook AI Infra and used to reduce the communication
bottleneck there, and it was evaluated on
the ads ranking model, which has zero tolerance
for accuracy loss. They cannot tolerate
any accuracy loss; otherwise, that means a lot of money. It is also available in PyTorch. So this finishes my first part,
on making training faster. I can take one
question if you have any.>>Assume that you
test the [inaudible].>>Yes. We tested
both momentum and Adam.>>What’s the result?>>It’s similar.>>Similar.>>Yes.>>So can you give
an intuition for why you can reduce the gradient precision so much and the final accuracy is
still roughly the same?>>Because of the variance. We have higher variance. You keep the expectation, but you increase the variance. Yeah.>>Okay.>>So let me move to the second part of our research
on making inference faster. So, inference acceleration. I did a little bit more
research along this line. I even tried to put
a neural network on a chip which only supports
spiking neural networks. I also tried to cluster the neurons
of sparse neural networks
to reduce the wiring congestion in the circuit
design and make it more efficient. But I won’t go into too much detail. I will only focus on two works here on sparse
deep neural networks. By the way, one of them was published when I
was an intern here. My mentor was [inaudible]. When people talk about
sparse neural networks, usually they’re referring to
a neural network in which a lot of connections are removed. This can significantly reduce the storage size of a
given neural network, and if we customize the hardware for the specific neural network, we can get a good speedup. However, when we measure the speed on a general platform like a CPU or GPU, the story is totally different. Here is the sparsity; we
get 95 percent sparsity. But when we measure
the speed on the CPU and the GPU, the speedup
is very limited. In many cases the speedup is one, which means the speed is the same. In some cases, it’s even worse. Why? Because the
sparsity is non-structured. It’s randomly distributed,
while the hardware is customized for
regular computation, and the non-structured
pattern just breaks the regularity needed for hardware
parallel [inaudible]. You also have very
poor data locality, so you get a very trivial speedup. It’s better on a CPU platform. This one is the speedup of random sparse neural
networks over sparsity. But it’s not as good as
the theoretical speedup. For example, here the
sparsity is 90 percent, but we get only a two times speedup, because of a similar issue to the one I just mentioned on
the previous slide. So to make it more efficient, we think we should use structured sparsity instead of
random non-structured sparsity. Here is an example to show the scalability of structured
sparsity versus the non-structured one. When I say structured sparsity, I mean a lot of rows and
columns are all zero. We can just remove those
zero rows and columns
and compress the matrix to just
a small dense weight matrix. Because it’s dense and small, it computes much faster. So what is a structurally
sparse deep neural network? It means connections or weights are removed group
by group, not just one by one. In terms of the neural architecture, it means we remove one dense
structure, like one neuron, or one layer, or one filter in
a convolutional layer, or one hidden state in
a recurrent neural network. From the perspective
of a weight matrix, a structurally sparse deep
neural network means we remove weights block by block. One group can be
one rectangular block, one row, one column, or even the whole matrix. So it’s pretty fun. How can we achieve it? How can we learn structured sparsity? It’s simple again. Group lasso
regularization is all you need. Group lasso regularization was proposed, and it’s very effective
for learning structured sparsity. So basically, how does it work? We first split the weights
into a lot of groups. Here is an example: we split them into two groups. Then we add group lasso
regularization on each group, which is basically the
vector length of the group. Here is the group lasso
on those two groups. Then we add up all those group lasso regularizations as
one single regularization, and we add that regularization to our data loss function,
like cross entropy. Then we just learn it end-to-end
using stochastic gradient descent.>>What’s your criterion for splitting
weights into several groups?>>Good question. It depends on
what structure you want to learn.>>So this is structure
dependent [inaudible].>>Yes. It depends on what
structure you want to learn. Let’s say you want to remove filters in a convolutional
neural network; then one group is all the
weights in one filter. If we want to remove one row in the weight matrix, then one group is all the
weights in one row. So it depends on what
structure you want to learn. We refer to our method as SSL. I’ll use this name a lot. There are more rigorous proofs of why group lasso regularization can learn structured sparsity, but let’s explain it intuitively. Here is a group
of weights, so it’s a vector. When we do gradient descent, it is updated by the regular
gradient, which comes from the data loss, plus one
additional gradient. This one is the additional gradient. Basically, it’s just
the unit vector pointing against the direction of the vector. So during training, it iteratively squeezes the
size of the vector, and eventually, if it can, all the weights in the group
go to zero, so it removes all of them. Many groups can
be pushed to zero, and then we learn our
structured sparsity pattern.>>Sorry, just a quick one: you said many groups
will be pushed to zero. Is the whole group pushed to zero?>>The whole group.>>The whole group.>>Many whole groups. Yes. So here is a
comparison on AlexNet. Don’t laugh at me, AlexNet. It was state of the art, and it’s fair for a comparison between non-structured sparsity and
structured sparsity. When I say non-structured sparsity, I mean random sparsity, and by structured sparsity here
I mean I remove rows and columns. There is a lot of information
here, but let’s break it down. The bars here are
the speedups on CPU and GPU, and the lines are the sparsity
across the five layers. The orange color corresponds
to non-structured sparsity, and the green one
corresponds to our approach. You can see the green bars
are taller than the orange ones, which means
we get a higher speedup.>>Can you explain why the
speedups of the layers are different?>>Because you get
different sparsity.>>Is this related to [inaudible]>>Yes. In general, the shallow layers are less compressible, because usually you
have fewer filters there, but you have more filters in the deeper layers, and the features in the
deeper layers are also more sparse. So you can achieve higher
sparsity in the deeper layers.>>Do you think this
sparsity can help you to read this algorithm?>>Yes. I would suggest
that we don’t need that many filters in the
deeper layers in this case.>>Excuse me, what’s the x-axis,
numbers of the parallel layers?>>This number?>>Yes.>>The index of the convolution.>>Okay.>>Convolution one, two.>>So what’s the meaning of
the column and the row lines?>>This one?>>No. In the figure you
have four plots. Which is which?>>Okay, this one is the sparsity,
how many weights are removed.>>Okay.>>This one is how
many rows are removed, and this one is how many columns.>>Do you lose any accuracy
by doing that?>>This one is two percent, but the
accuracy is the same for both. They both have two percent.>>Two percent both?>>Yes, both.>>So for the baseline, did you use the original paper that you [inaudible]
and then retrain?>>Not really. We just
used L1, but the sparsity is higher
than in the original paper. So this is AlexNet. As I said, what
structure we learn depends on how
we split the groups. Here, if you want
to remove layers, then one group of weights is
all the weights in one layer. If we remove one group,
then all the weights in it are gone, so one layer can be removed.>>Does this mean that you can remove one layer of
the whole network?>>Yes, that means we
can reduce the depth.>>Okay. Does it mean the features in both layers interact as
an important [inaudible]?>>Yes.>>Okay.>>Then the information can
go through the shortcuts. Here is an experiment
on [inaudible]. On ResNet-32 we can reduce
the number of layers by 14 and still get similar accuracy. There is a lot of redundancy
in deep neural networks. Again, we can generalize it to LSTMs. Again, what structure we learn depends on how
we split the weights. In this case, we can reduce the hidden size. All the white strips here are the structures that are
associated with one hidden state. If we want to remove
one hidden state, we have to remove all the structures
in the white strips here. Here is a sophisticated
formula, but let me cover it. If you want to remove
one hidden state, that means we have to
remove two rows and four columns in the weight
matrices of the LSTM. So we basically put all those weights into one group, and when
we remove many groups, we reduce the hidden size.>>Can I put the same
question [inaudible]. Well, different models use different definitions
of the groups.>>Yes.>>So are there any heuristics
for the different groups?>>It depends on what
structure you want to remove. In this case, we
have to predefine it. Say we want
to remove hidden states; then we find all the weights
associated with one hidden state.>>The most [inaudible] total which structure we
ought to [inaudible]>>In that case, I would just split the
weights into many small blocks, let’s say eight by
eight, and let it learn. Then it will learn which
structures are important.>>Well put. Have you ever
tried different heuristics? Say, okay, I group the parameters this way and that way
and make a comparison. For example, in ResNet, your original goal is
to reduce the layers. So you assume initially, at this time, you say okay, I want to
remove the hidden structures.>>I would say we can
put all the structures, all the regularization
for all the kinds of structure we want to learn,
into one single loss. Then we can let it learn, but that would have
more hyperparameters.>>So in this case,
in the beginning you do not know the structure. So you specify maybe the hidden size, the hidden dimension, to be one thousand. You use this [inaudible] to
learn the sparse structure. Then you find that, okay, I can remove 200. So finally it’s 800,
800 hidden neurons. Will you retrain the model with these 800 neurons again, or will
you just use the previous one?>>In CNNs, fine-tuning helps. But in RNNs, we find it
doesn’t help a lot. It’s quite similar. So we didn’t do that.>>Also, for the CNN just now, I missed one part. What do we mean by removing a
row and removing a column? Because it’s more like width
times height times channels. So I’m not really understanding
what’s the meaning of row, what’s the meaning of column.>>That depends on the
implementation at the lower level. In general, in Caffe, one row is basically one 3D filter. In that case, we are trying to
squeeze the size of the matrix. We do it from the
perspective of computation. So one row is a filter.>>So can I understand it as
one row is a channel?>>One row is a filter, and one
column is more sophisticated. But that’s from the, yeah.>>When you remove a row, it’s like removing one channel out of
this hidden state, right?>>Oh, so you’re going
back to the LSTM?>>Yeah, I’m talking about
the LSTM in this case, like the white row and
then the [inaudible]>>So all of them will be removed.>>Yes, I know. So when you
remove that row in that case, it looks like a column and a row in
each of these matrices. It’s like, instead of,
let’s say, having a 1,000
dimension for
the hidden h value, it would make it 999, right?>>Yes.>>That’s the case?>>Yes.>>So if you train an
LSTM with a size of 999, will it give the same result?>>I’ll get there.>>Or is there something
specific about that?>>I’ll get there. Yes. I’m sorry, this one is better,
but I’ll get there.>>No, I think I understand the idea. What you’re going to get to
is that you can’t
target a small
dimension right off the bat. Instead, you do this so the [inaudible] has the
ability to actually find the right structure or
the maximum sparsity. Okay. So is this a correct
understanding, if I say that you’re actually searching for the
right dimensions of the model? You have a method to find the
right size for the model.>>That’s one benefit.>>If you use the sparsity
involved in this case, I just want to go and find
the right hidden dimension. Is this the right way to do it?>>Yes, and
the speed is faster.>>I ask because in this case, this kind of sparsity
is very simple. You just say, okay, I first use a very
large hidden size. Once I solve that, then I find, okay, 980 is a good number. So I think the
question is whether to fine-tune it, or just pick
the resulting number, 980, and then train
from the beginning.>>Yes. I’ll get there. Let’s go
to here and answer your question. We start from this
baseline, which originally has 1,500 hidden
states as its hidden size. Using our approach, we keep the original perplexity
but reduce the size, so we get a significant speedup. To your question: can we train a smaller model and get the same performance? We did the experiment here. We train an
LSTM which has the same reduced hidden size,
but from scratch. That is, we train the
smaller model from scratch. But the perplexity is much
worse than pruning it down.>>Yeah. Okay. The reason I’m
bringing this up is that there is another method with which you
can reduce the size. It has nothing to do with sparsity, nothing to do with reducing rows or columns; instead they do singular value decomposition after
they train a big matrix. The idea is that if you train the small
model from the beginning, you will never find the same
result as if you had trained the full big model and then
done singular value decomposition.>>Exactly.>>Then do some fine-tuning
after that.>>Yes. Similar thing.>>So it’s along the same line.>>Yes. It’s similar
here. Basically, a larger model gives
you more exploration.>>Exactly.>>You find a better one.>>That’s exactly what
I wanted to get to.>>Yes. Agreed. Okay. So here is just the
structure we learned. It’s very regular. It means we remove a lot of
rows and columns. Finally, at inference we just use this small model.>>So after you find the
structured sparsity, can you pack it into a regular matrix?>>Yes.>>To pack it back?>>Because it’s regular, yes. That’s the benefit of
structured sparsity. We basically just create a smaller LSTM and use the
non-zero weights to initialize it, and then we get the performance. So in general, SSL can reduce
the size while maintaining the same perplexity, or we can
make a trade-off: we can reduce the sparsity
regularization and get a slightly larger model with lower perplexity, and it’s better than training a
smaller model from scratch.>>Why do you either reduce the parameters to
get them test perplexity?>>You mean why do we?>>No, a similar. The second one, create that fly, right? That background of the
last scene but still better reduce the parameters.>>Yes.>>I’m pretty sure if you
use these parameters, you can get better test perplexity.>>Yeah, this is test perplexity.>>Why does it get better?>>This is the benefit of SSL.>>Okay.>>Yeah. So basically we can get a smaller model which
has better performance.>>[inaudible] exactly, like
can you have a parameter?>>I think the single one
that is changed is the dropout ratio. Because we add additional regularization from the sparsity regularization,
we don’t need as much regularization from
dropout. That’s the only one.>>So you change the hyperparameter and you get better perplexity. Is this how that works? I mean the green line, how did you achieve that green line exactly?>>This one?>>Yeah.>>We use SSL and then we change the lambda
regularization from the->>So you change the lambda. Okay.>>Yeah.>>Is this on PTB?>>Yes.>>This is the Penn dataset.>>Yes.>>Because it all depends on
how you do the regularization. It can only be part
tested for [inaudible]>>I’m not sure about the conclusion.>>Okay. If you test the
training [inaudible]. Are they similar?>>Yes.>>Even for the last one, where
you train a smaller model?>>For this one?>>No, for the last.>>This one?>>Yeah.>>I can’t exactly
remember the numbers, so pardon me. We care more about test.>>So what’s the exact meaning of those two pictures
on the last slide?>>They show the structured
pattern we learned. The white regions are zeros
and the blue dots are non-zeros.>>So what’s the meaning of LSTM
1 and LSTM 2? Do you [inaudible]>>Those are two layers, layer one and layer two. So here, SSL gets a better trade-off between performance
and the model size. We also did experiments on recurrent highway network and
we start from this baseline, and then we use SSL
to reduce the size, and then we either reduce the perplexity or we under
use a smaller model. Or we can keep the perplexity, but again they’re much smaller model. So from this trend we can see, if we start from a large
model then pull it down, then we can get the benefits. We can make a better trade-off. So the implication is we should start from a redundant
model and sparsify turn. So this is how how does SSL
work for very large model? But we are also
curious about how it works for a very small, compact model. So we did experiments on the BiDAF model, which is a smaller model with only 2.7 million parameters and a hidden size of only 100. In that case, because the
model is originally very compact, we cannot keep the
performance while reducing the size. But we can still make a
good trade-off with SSL. For example, we can reduce
the size to less than one million parameters by dropping only two
percent of the F1 score.>>Which task is this?>>It’s question answering
on SQuAD.>>Okay.>>So this concludes
my previous research. So, in summary, we did research to make distributed training
more scalable using stochastic quantization over the
gradients, and it’s effective, as evaluated
in Facebook’s AI production. So we can reduce the
research cycle and production cycle because we can
make the model train faster. On sparse neural networks, we can enable more ubiquitous
AI on edge devices like our cell phones, self-driving cars, or any VR devices, because the
computation is very limited. We also observe a performance
gain if we start from a very large model and then
prune it down, sparsify it down.>>So before you continue
to the future directions. On the first one you said that you believe TernGrad works
because you preserve the variant, actually you had more variance.>>More variance.>>More variance. No bias
but more variance and that more variance helped
you in some situations. What is your understanding
of why it’s sparse? I mean, how would you explain why this method works better than like, for example, any other method
out there for exploring the correct sizes of the
model? Why does it work?>>Why does it work?
First, the deep model originally needs to be redundant so that we can prune it down. The second reason is that, compared
with the non-structured one, the structured pattern is efficient.>>I want you to specifically tell
me, if you have an understanding, why going and searching in a
bigger space and then pruning it down is better than just
going directly [inaudible]? What I’m trying to say
is that maybe we don’t have a specific tool to
just find the right model. Or you have an understanding that searching in a bigger space is much better and then bringing
down gives us no loss.>>Exactly.>>I just want to see
your understanding.>>So my understanding is if
you start from a larger model, you have more directions to explore. Then the pruning process will find which direction is
the right direction. So you have a larger space, which means you have more exploration in the parameter model space. Then sparsification is a process
of trying to find the good one.>>Does the same concept of high-variance gradients
apply here too? Like the fact that you don’t have
some of these rows or columns, does it mean the gradients are
zero and therefore you have higher variance in the
gradients, the great gradients.>>From the perspective
of exploration, they’re similar.>>That’s what I’m
trying to get. I mean, can I get the same message from the first one and
the second one too?>>Yes. The first one is
exploration in one model space.>>In the precision of the gradients, and the second one is the
variance in the inside gradient.>>The second one is in the
space of model space, yes. So let’s go to the future. How can we go beyond my research, and what do I want to do in the future? And how is my research
related to recent research? One related work is the lottery ticket hypothesis, which is actually a best
paper this year. The lottery ticket hypothesis says that, for a dense, randomly initialized
feed-forward neural network, there is a sub-network which
is a sparse neural network. These are also referred
to as winning tickets. If you train this sub-network
from scratch, you can reach similar accuracy with a similar
number of iterations. They only hypothesize its existence, but we don’t know how to find it. If you just pick
a random sparse neural network, the accuracy will be very bad. In our recent research, we found that SSL can possibly identify these sub-networks at the early stage of
the training process. In our experiments we find
that during the SSL learning process, SSL reaches very high sparsity in the very early stage
of the training process, and that those zeroed structures
never come back. If so, that means we can just
remove those useless structures. We gradually remove those structures one by one, and
then finally we converge to a smaller neural network, which is probably
the winning ticket. This gives us
about a 40 percent training time reduction
for ResNet on ImageNet. This paper is nominated as a Best Student Paper finalist at
Supercomputing this year.>>From my memory, there is a big difference between your SSL and this lottery
ticket hypothesis. Because in that hypothesis, if they find that sparse structure, and then they treat that sparsity as an oracle and then they
train that sparse network, then the accuracy will be recovered. But in your case, what you’re arguing is that if you are given that sparsity as an oracle and you
just train from the beginning, you cannot reach the
same level of accuracy, because you didn’t
use the exploration power of first
exploring a very large, dense space and then finally
gradually going to this sparse one.>>No.>>Actually. Sorry.>>Go ahead.>>The lottery ticket story is that you can recover good
accuracy in the smaller network if you start with the
same initialization as you did before your pruning. This is a way of preserving the residue of the initialization
after pruning as well.>>So does that mean that in
your summary slide, in that last point, you
just said that it’s better to start from
a dense, large network and then use SSL to find
the sparse network. That one gives you high accuracy. So you tried that: if you
are directly given the sparse structure and
train it from the beginning, you cannot reach this accuracy. But have you ever tried, just like the lottery ticket
hypothesis, to remember your initialization and then use the same random initialization
to train this sparse model? So exactly, we read through
the north ray [inaudible]?>>Yes. So that means we can use SSL to identify
the winning tickets. So that’s the relationship here. So in the paper, they just do the exploration on
non-structured sparsity pattern. So we’re thinking probably SSL
can identify the winning tickets.>>Okay. It also identify
structured winning tickets?>>Yes. Exactly.>>But have you ever run it, remember the random initialization of the [inaudible] with the
sparsity pattern again?>>Not yet, it’s an open question. But I think we will try;
that’s future work.>>[inaudible]. Because by no
way you can get around that. I think you let experiment
shows you get the same results.>>So I want to understand
what’s the connection between the SSL and the
lottery ticket hypothesis, whether they are the same
thing or they are different.>>Yes.>>That’s a very interesting
future research. So the second related work to
my previous research is AutoML. So automated machine learning uses a machine to
design machine learning models. As the community has made
a lot of progress on AutoML, they found that it’s much more
efficient to do it this way. They design a very
large model which has all the operation options enabled
in a single one-shot model. They coined the term one-shot
model because it has all the options enabled. The goal is to pick
the optimal path. The experiments show this method
is very efficient for AutoML. So the question is, can
we use SSL for AutoML? SSL is basically trying
to remove structures. So a second open question is, can we use SSL for AutoML? So another research direction inspired by my previous research
is on scaling NLP problems. Going back to my statement at the very beginning, the
trend of deep learning is that, if we can train a larger model, we can always get
better performance. This is even more true in NLP
for unsupervised learning. So here is the ablation
study for BERT, and if they can train a larger model
they get better performance. But personally, I think NLP is
in the registering the error and it’s very hard
to further scale up because of the computation
and the memory cost. So we need more scalable
training methods and more scalable base
models to scale up. So one direction we can go is to design more scalable
training methods. Basically it means, given
our architecture, let’s say, can we make the training faster? So we could use TernGrad, SSL, but what’s more?>>SSL does not make training faster.>>It can.>>We just mention one [inaudible]>>But you still have to have
the entire architecture to reduce it, unless you
reduce it in the training.>>We just save the checkpoint
and create a new smaller one.>>Okay. If you want to
truly go and the original model and you want to find
that you still have to keep the original size or oracle.>>But our experiments show zeroed
structures rarely come back. So we just remove them and
train a smaller one gradually.>>Okay.>>Yeah. So the second direction is to
design more scalable neural models. We should design a fast, accurate, and compact base model and scale it up. I also believe that AutoML will play a very important
role in designing compact, small NLP models. So this brings me to another future research direction I
would want to pursue, on AutoML. This figure also echoes
my previous statement. These are the computer vision models. In recent years people generally
design more scalable models: a smaller model with high accuracy, a more
scalable model, and then scale up. This model is designed by AutoML, not by a human. So I believe AutoML will
be very important in designing compact NLP
models in the future. But there are two problems in AutoML. The first problem is that AutoML is very sample-inefficient.
The second problem
is that AutoML relies on human-designed operations and
only searches the combinations. It cannot invent any
new architectures. That’s the problem of AutoML. So in the past half year, I did some research on the first problem, to make
the sampling more efficient. The x-axis is the
number of networks we have to train to reach
some test accuracy. This one is a random method. This one is the state-of-the-art regularized evolution, and my
recent research is getting here. So now we can find the best oracle model
within hundreds of samples; we only have to train
hundreds of networks to find the optimal one, and we will make
it public next month. With that I conclude my talk
and I’m open to questions.>>A question about TernGrad. So apparently you transmit
two bits for every gradient, but you only had to
use three [inaudible].>>Yes, we risk a lot.>>So have you found a better or more efficient way
to use four bits fully to make it more accurate
or anything like that?>>Yes. So if we use two
bits in real production, we definitely should use four levels. We don’t want to waste it. But in research, we are more interested in how aggressively we can
reduce the precision. So in that paper, we
use three levels. But in production,
as they mentioned, they used eight bits. So that’s the trade-off.>>Okay. I’ll ask one. So let’s say I want to
apply SSL to transformers, and I would then place
them to your suggestions. I understand that SSL has the flexibility where you
can design the groups, just that they didn’t
foresee as the LSTM. Let’s say, I want to design some
groups for transformers. My goal can be, one, I want to get some
interpretation. Sorry.>>Interpretability?>>Yes, and second one is I want to reduce the computation
cost to be done. I know that this could
be different goals. Can you say something that, how I can design groups so that
I can achieve these goals?>>Simultaneously or?>>No, not simultaneously. Simultaneously, it will be hard.>>So to reduce the size of a model, let’s say we can use it to reduce the hidden size, like in BERT, or we can reduce
the number of heads. So that’s basically how
we split the groups. For the first on the
interpretability, so usually, we say simpler
model is more interpretable. So if we use SSL, we reduce the size of the model then that will be more interpretable.>>For transformers, one interesting thing is
the attention mechanism. If I can apply it perfectly
to transformers, let’s say, if I can guess something
like for a given token, this token attends only a few part of the other tokens in a sentence. This could be something interesting
for people to interpret.>>Then, I think we can use
a mask parameter to mask the attention and then
regularize its sparsity. If the corresponding
weight in the mask is zero, that means that attention is useless.>>You suggest that we
can learn the mask?>>Yes.>>I think in the current
implementation the mask is predefined.>>I mean, not that mask. I mean the mask over
the whole attention, that’s the parameter we can learn.>>Thank you.>>I think this is as
if the boss who wrote a book with a lot cluster while
the protocol described models. Basically, it’s that we train you as a profession at the
90 proportion task. If effective adults incurs this, while some components
should be easier. Because if you will notice, well, sometimes it’s just some layers
is not important property. You want to compress your [inaudible] module into the
general module, [inaudible]. How do you compare using [inaudible]>>So for the unsupervised learning, people usually use a
language modeling.>>Yeah.>>So I have shown the
effectiveness of SSL on language modeling.
That’s one thing. Another thing is how we can separate this sparsification for
unsupervised learning and for a specific task. So given a specific task, we can sparsify the neural
network for that specific task.>>Go back to the
unsupervised learning model. Basically, you still
test on same tasks. You said after you compress the
learning model and in fact view this language model to factual
other task [inaudible].>>[inaudible].>>Okay.>>So a quick question. I’m not still familiar with
the AutoML literature. But for example, the biggest thing we’ve seen in terms of
shrinking somebody for example, where we had done is that you
end up with parameter sharing. How would I care? So first of all, I guess
there’s a couple of questions. One is, we probably don’t have the right architectures
in the first place. But related to that
is we may not have the right set of actions to
put into your framework. So for example, can I put
into an AutoML framework? Yeah. You should be searching over all possible sets of parameters that should be shared and
not actually duplicated. Or what are the kind of
limits here because we just don’t actually even know what the building blocks are that
we should be searching over.>>Yes. That’s the second
problem I mentioned in AutoML. It cannot invent any new architecture. So I think we could, but the search space
will be very large, so it will take a very long time. So for example, can we
learn a convolutional neural network from a
fully connected neural network? Theoretically, yes. But the search space
is just too large. So I think that would be
the trade off between.>>Did you have thoughts
on how we bridge that? So in the sense that like is there a way to use
AutoML to basically say, “Hey human, please help
me out a little bit.” I’m seeing some sparsity over here. Maybe it’s nothing. Maybe there’s something here. Then, you could go in and you could do this as a
more iterative fashion. I think the thing that bothers
me about the formulation both for the structure
sparsity and for AutoML is this notion that a priori we have already created the
right basic architectures, because I just don’t buy the
assumption in the first place. But I think all of our models
are probably very wrong. So it’s great that we
can make them better. But then is there a way to
extend the framework so that you have a more iterative process
of improving these things or?>>Yes. So one suggestion is we
can keep the AutoML framework, but we humans can propose
some new architecture, add it into the search space, and AutoML will automatically decide where to
put the new architecture. So in that sense, AutoML will help. But I think AutoML and
the human experts, if they work together, we’ll get much more benefits.>>More questions? If not, let’s thank our speaker again.>>Thank you.
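
The gradient quantization discussed in the Q&A (two bits per gradient, three levels, no bias but more variance) can be illustrated with a minimal sketch. It is not the speaker’s implementation. The function name `ternarize`, the use of NumPy, and the choice of the largest absolute gradient value as the per-tensor scale are assumptions made for illustration.

```python
import numpy as np

def ternarize(grad, rng):
    """Stochastically quantize a gradient tensor to three levels {-s, 0, +s}.

    Assumption for this sketch: the scale s is the largest absolute value
    in the tensor. Each entry is kept with probability |g| / s, so the
    quantized gradient equals the original one in expectation.
    """
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    keep_prob = np.abs(grad) / s                # values in [0, 1]
    keep = rng.random(grad.shape) < keep_prob   # Bernoulli draw per entry
    return s * np.sign(grad) * keep

# Empirical check of "no bias but more variance" on a toy gradient.
rng = np.random.default_rng(0)
grad = np.array([0.5, -0.25, 0.0, 1.0])
samples = np.stack([ternarize(grad, rng) for _ in range(20000)])
mean_estimate = samples.mean(axis=0)   # should be close to grad
variance = samples.var(axis=0)         # non-zero where 0 < |g| < s
```

Averaging many quantized draws recovers the original gradient, while each single draw takes only one of three values, which is why two bits per entry are enough to transmit it.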
