Convolutional Neural Networks – Ep. 8 (Deep Learning SIMPLIFIED)

If there’s one deep net that has completely
dominated the machine vision space in recent years, it’s certainly the convolutional
neural net, or CNN. These nets are so influential that they’ve made Deep Learning one of the
hottest topics in AI today. But they can be tricky to understand, so let’s take a closer
look and see how they work. CNNs were pioneered by Yann LeCun of New York
University, who also serves as the director of Facebook’s AI group. It is currently believed
that Facebook uses a CNN for its facial recognition software. A convolutional net has been the go-to solution
for machine vision projects in the last few years. Early in 2015, after a series of breakthroughs
by Microsoft, Google, and Baidu, a machine was able to beat a human at an object recognition
challenge for the first time in the history of AI. It’s hard to mention a CNN without touching
on the ImageNet challenge. ImageNet is a project that was inspired by the growing need for
high-quality data in the image processing space. Every year, the top Deep Learning teams
in the world compete with each other to create the best possible object recognition software.
Going back to 2012 when Geoff Hinton’s team took first place in the challenge, every single
winner has used a convolutional net as their model. This isn’t surprising, since the
error rate of image detection tasks has dropped significantly with CNNs, as seen in this image. Have you ever struggled while trying to learn
about CNNs? If so, please comment and share your experiences. We’ll keep our discussion of CNNs high level,
but if you’re inclined to learn about the math, be sure to check out Andrej Karpathy’s
amazing CS231n course notes on these nets. There are many component layers to a CNN,
and we will explain them one at a time. Let’s start with an analogy that will help describe
the first component, which is the “convolutional layer”. Imagine that we have a wall, which will represent
a digital image. Also imagine that we have a series of flashlights shining at the wall,
creating a group of overlapping circles. The purpose of these flashlights is to seek out
a certain pattern in the image, like an edge or a color contrast for example. Each flashlight
looks for the exact same pattern as all the others, but they all search in a different
section of the image, defined by the fixed region created by the circle of light. When
combined, the flashlights form what’s called a filter. A filter is able to determine
if the given pattern occurs in the image, and in what regions. What you see in this
example is an 8×6 grid of lights, which is all considered to be one filter. Now let’s take a look from the top. In practice,
flashlights from multiple different filters will all be shining at the same spots in parallel,
simultaneously detecting a wide array of patterns. In this example, we have four filters all
shining at the wall, all looking for a different pattern. So this particular convolutional
layer is an 8×6×4, three-dimensional grid of these flashlights. Now let’s connect the dots of our explanation:
– Why is it called a convolutional net? The net uses the technical operation of convolution
to search for a particular pattern. While the exact definition of convolution is beyond
the scope of this video, to keep things simple, just think of it as the process of filtering
through the image for a specific pattern. One important note, though, is that the weights
and biases of this layer affect how this operation is performed: tweaking these numbers impacts
the effectiveness of the filtering process. – Each flashlight represents a neuron in the
CNN. In most nets, neurons in a layer simply activate, or fire. In the convolutional
layer, by contrast, neurons perform this “convolution” operation. We’re going to draw a box around
one set of flashlights to make things look a bit more organized. – Unlike the nets we’ve seen thus far where
every neuron in a layer is connected to every neuron in the adjacent layers, a CNN has the
flashlight structure. Each neuron is only connected to the input neurons it “shines”
upon. The neurons in a given filter share the same
weight and bias parameters. This means that, anywhere on the filter, a given neuron is
connected to the same number of input neurons and has the same weights and biases. This
is what allows the filter to look for the same pattern in different sections of the
image. By arranging these neurons in the same structure as the flashlight grid, we ensure
that the entire image is scanned. The next two layers that follow are ReLU and
pooling, both of which help to build up the simple patterns discovered by the convolutional
layer. Each node in the convolutional layer is connected to a node that fires, like in
other nets. The activation used is called ReLU, or rectified linear unit, defined as
f(x) = max(0, x). CNNs are trained using backpropagation, so the vanishing gradient is
once again a potential issue. Because the derivative of ReLU is either 0 or 1, the gradient
is held more or less constant at every layer of the net. So the ReLU activation allows the net to be properly
trained, without harmful slowdowns in the crucial early layers. The pooling layer is used for dimensionality
reduction. CNNs tile multiple instances of convolutional and ReLU layers together
in a sequence, in order to build more and more complex patterns. The problem with this
is that the number of possible patterns becomes exceedingly large. By introducing pooling
layers, which downsample each filter’s output (for example, by keeping only the strongest
activation in each small region), we ensure that the net focuses on only the most relevant patterns discovered
by convolution and ReLU. This helps limit both the memory and processing requirements
for running a CNN. Together, these three layers can discover
a host of complex patterns, but the net will have no understanding of what these patterns
mean. So a fully connected layer is attached to the end of the net in order to equip the
net with the ability to classify data samples. Let’s recap the major components of a CNN.
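Before the recap, the convolution and weight-sharing ideas described above can be made concrete with a short NumPy sketch. This is a toy illustration, not the implementation behind any real CNN library: a single hand-picked 3×3 edge filter (playing the role of one grid of flashlights) slides over a small image with stride 1 and no padding, applying the same weights at every position.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide one shared kernel over every position of the image
    (stride 1, no padding) - each placement is one 'flashlight'."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # same weights everywhere
    return out

# A tiny image with a vertical edge: dark on the left, bright on the right.
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# A hand-picked vertical-edge filter (illustrative, not learned).
edge_filter = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

response = convolve2d(image, edge_filter)
print(response)
```

The filter responds strongly wherever the vertical edge appears and stays at zero in the uniform region: the same pattern detector, applied at every location of the image.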
A typical deep CNN has three sets of layers – convolutional, ReLU, and pooling
layers – all of which are repeated several times. These layers are followed by a few
fully connected layers in order to support classification. Since CNNs are such deep nets,
they most likely need to be trained using server resources with GPUs. Despite the power of CNNs, these nets have
one drawback. Since they are a supervised learning method, they require a large set
of labelled data for training, which can be challenging to obtain in a real-world application.
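To tie the recap together, here is a minimal NumPy forward pass through the layer sequence just described: convolution, ReLU, pooling, and then a fully connected layer with softmax. All weights here are random and untrained, and the image size, filter count, and class count are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)  # derivative is 0 or 1, so gradients don't shrink

def max_pool(x, size=2):
    """Keep only the strongest activation in each size x size region."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def convolve2d(image, kernel):
    """Stride-1, no-padding convolution with one shared kernel."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy forward pass: conv -> ReLU -> pool per filter, then fully connected.
image = rng.random((28, 28))              # stand-in for a 28x28 grayscale input
filters = rng.standard_normal((4, 3, 3))  # 4 random (untrained) 3x3 filters

features = []
for k in filters:
    fmap = max_pool(relu(convolve2d(image, k)))  # 26x26 -> 13x13
    features.append(fmap.ravel())
features = np.concatenate(features)              # 4 * 13 * 13 = 676 values

# Fully connected layer maps the pooled features to 10 class scores.
W = rng.standard_normal((10, features.size)) * 0.01
b = np.zeros(10)
probs = softmax(W @ features + b)
print(probs.argmax())
```

With trained weights, this same forward pass is what produces a classification; training would adjust both the filters and the fully connected weights by backpropagation.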
In the next video, we’ll shift our attention to another important deep learning model – the
Recurrent Net.


  1. DeepLearning.TV

    December 18, 2015 at 7:14 pm

    CNNs are really cool deep nets – they are mainly used in machine vision. Enjoy 🙂

  2. DeepLearning.TV

    December 18, 2015 at 7:14 pm

    Also the next clip is on Recurrent nets, and I'll publish that next Monday (Dec 21, 2015)

  3. Ivar Vasara

    December 18, 2015 at 11:20 pm

    So traditional CV doesn't perform as well as CNNs, but do people typically just drop all the traditional CV approach altogether and just use CNN or is there typically a hybrid approach ?

  4. kyuhyoung choi

    December 21, 2015 at 12:36 am

    Great series of videos. I can not wait for the next video. So far so good. For me, it will be better if it is little bit less simplified. Just little bit

  5. Kirill Kochubey

    December 29, 2015 at 1:03 am

    @DeepLearning.TV. On every video lady asks if you have experience then put your comment. I think this video originally oriented to beginners and on each video I would answer no and logically should not comment.

    I would suggest to redo videos and ask something like "Would you like to use this network on your project, let us know in comments?" or "How would you use this solution on your project, put comments". So answer should be more yes than no and comment that might help others. I am sure people with experience will put comments anyway.

  6. The Development Channel

    January 6, 2016 at 12:32 am

    Best introductory video on convnets so far!
As you stated deep convnets need a lot of training, could you also mention that one technique to make this training set larger is to do some synthetic image augmentation like (scale, flip horizontally, change contrast, translate, color change, etc…)

  7. mukul arora

    January 7, 2016 at 5:06 am

    I am trying to implementing CNN on Theano for binary classification of images…input images are 128×128 i use 4 conv-pool layers to finally project the images to 5×5 grid…using pool size of 2 and one fully connected layer with last layer as Softmax LAyer. But i am facing problem that…after training for 90 epochs….it results in all tests labels as belonging to one class….misclassified! Can u suggest what changes i shall make to increase the accuracy and sensitivity?

  8. Dhananjay Mehta

    February 4, 2016 at 11:16 pm

    Small question: If you search images on google for example – Laughing panda, crying panda, playing panda, eating panda. What learning algorithm must have been used in order to detect the emotions, actions and other such features in animals and other such abstract searches? How are these classifiers trained?

  9. enlighted Jedi

    February 12, 2016 at 6:52 pm

    Optimizing the filters in convolutional networks is probably great with genetic algorithms as genes and and filters are alike structures. Is that true?

  10. david

    March 20, 2016 at 1:44 am

    can't help to say that you have a beautiful voice

  11. Elia Karam

    April 3, 2016 at 9:49 am

    hy im actually having trouble with inputs what type of inputs should my neural net have

  12. digging deep

    April 30, 2016 at 4:57 am

    Thank you so much. I am trying to translate your great clips into Korean since many of Koreans want to understand what you're saying. Until now, I have finished ep.2. 3.4. 5. 6. 8. You can see one of them here. If possible or if you want, I would like to send my Korean subtitles to you and put this on original clips.

  13. Hany El-Ghaish

    May 5, 2016 at 10:18 pm

    Thanks for your videos. I have a question for you: How can I use CNN in action recognition(stream of frames).
    I searched for any code (using Theano and Keras) for this task but I don't find any direct link.
    I have another question: what are the possible Deep networks that can be used for action recognition?
    you can take UCF101 data set as an example.
    Really, I want your help..

  14. Hany El-Ghaish

    May 11, 2016 at 4:13 am

Hi, I want to use CNN in Theano in an action recognition problem. Can you give me a recommendation on how to use CNN in action recognition when the # of frames is not equal for each action? Here is the problem of variable length

    I know that Convolution3D in keras but the question: How to make the # of frames are equal for each action? I donot know how to prepare my data, and How to do sampling to make # of frames are equal for each action.Is CNN+RNN can be used for action recognition? if so, please can you guide me to the way that can implement this Keras or Theano?Thanks

  15. Airiel Salvatore

    May 12, 2016 at 5:31 am

    anyone else's siri go off when she said "certainly"?

  16. Ahmed Ramzy

    May 15, 2016 at 4:25 am

    There is a little question for me i can't understand in CNN ,
    Many examples of CNN show that at the first conv. layer it starts to learn some lines and blobs and after that it starts to get more complex shapes.
    My question is how i can get more complex shapes although i am applying some filters on lines and blobs ?

  17. Prisma Dynamics LLC

    May 16, 2016 at 12:43 am

is deep learning a clever way to use wavelets or is it something completely different?

  19. David L. Chang

    June 1, 2016 at 10:40 pm

    Thank you for this series. It's very helpful for newbies like me to briefly understand the deep learning.

  20. Chanchana Sornsoontorn

    June 14, 2016 at 7:21 pm

    I giggle every time you say "please leave comments and share your experiences."

  21. Amina Bh

    June 27, 2016 at 10:28 am

    Thanks alot for making such great videos ! as a PhD student working day and night on Deep Learning, I find it hard to select easy terms in order to describe what I'm doing for others who have no idea what a CNN is. Your videos has helped alot with that ! Thanks 🙂

  22. Shravan Shravan

    August 20, 2016 at 4:03 am

    simply awesome

  23. fungussa

    September 1, 2016 at 6:53 pm

A CNN kinda seems like a brute-force (computationally intensive, and somewhat inelegant) approach to detecting patterns

  24. Kathiravan Natarajan

    October 5, 2016 at 2:15 am

    awesome work 🙂 thanks for the videos

  25. bhavishya goyal

    October 5, 2016 at 5:43 am

    Really awesome . please let me know how we can use it for Trading

  26. Tee Jay

    October 5, 2016 at 9:16 pm

    Hi DeepLearning, Impressive! How did you learn all these? 🙂

  27. Carlo Alessi

    November 10, 2016 at 8:32 am

    is the Caffe framework proper to do this? also, is it a good idea for a bachelor thesis to train a standard CNN from the framework and compare the performance between our own model?

  28. 끝UyomBaby

    November 20, 2016 at 8:43 am

    is CNN works for dataset like microarray?

  29. Sagar Chand

    November 23, 2016 at 5:34 pm

    might start from 2:12

  30. Sebastian Förster

    December 12, 2016 at 10:04 pm

    that videos couldn't be watched on a smartphone… you should try to use the full screen… I don't care about the bubbles..

  31. s0mnath

    December 16, 2016 at 1:38 pm

    Hi, Can a DBN be used in conjunction with a CNN to reduce the need for a large sample sets? As in where the connected net is used for labelling or am I over simplifying things here.
    PS: The videos are very helpful in understanding the basics

  32. Vishakh Rameshan

    December 17, 2016 at 4:06 pm

    Yes i tried to learn CNN but found difficulty in understanding it. As i want to use it to classify images

  33. priyank pande

    December 19, 2016 at 6:27 am

    Hi, I am trying to work on data deduplication problem, what could be better approach for this ?

  34. Damal Islam

    December 24, 2016 at 2:09 am

    I have a question, I want to classify the medical images. The images are from human colon. Which one will be a good network to classify them DBN or CNN?

  35. dina saif

    January 7, 2017 at 10:47 am

    I want to run CNN on ARFF file or I can convert it to excel sheet , it contains binary features (0,1) also two classes binary label(0,1) can you help me how can I make classification for that by CNN

  36. Kazathul

    January 12, 2017 at 7:50 pm

    After watching the video for 4 minutes, I still had no idea about what and how CNNs work.
    Thumbed down.

  37. 곽성실

    February 1, 2017 at 1:05 pm

    Thank you for your kindness so kind..

  38. satish jasti

    February 8, 2017 at 2:32 pm

    Thanks a lot

  39. pasdavoine

    February 8, 2017 at 3:00 pm

    Thank you so much I think I have understood this important thing better !

  40. Ahmad Khalil

    February 14, 2017 at 10:27 pm

    fuckin' boring

  41. David Mata

    February 24, 2017 at 11:41 pm

    Hi, I am beginer on deep learning, but in you last video you said that backpropagation training methon has the desaventage of vanish gradient. Can we change the method training to use deep belief network in the fully connected layer?

  42. David Mata

    February 24, 2017 at 11:43 pm

    Are the kernels trained using backpropagation? (the kernels are weights right?) (What about the weights in the fully connected layer, are those weights trained using the backpropagation? and are the kernels weights and the fully connected layer weights trained at the same time?

  43. Bryan Lozano

    March 12, 2017 at 10:49 am

    Not sure the flashlight analogy was worth it

  44. Jennifer Mew

    March 14, 2017 at 10:30 pm

    I like your videos a lot, and I recommended to all of my friends interested in deep learning

  45. classical666

    March 27, 2017 at 9:01 am

    Super cool video! Now I have a good understanding of what a CNN is. Thank you so much!

  46. Aanshik Gupta

    April 10, 2017 at 8:37 pm

    In the flashlight on the wall example, I didn't understand why CNNs formed a 3D net, when the images we use as input are 2D. How can it find pattern in three dimensions?

  47. Daniel Shin

    April 16, 2017 at 4:00 pm

    Fuck! Too difficult!

  48. Vineeth Bhaskara

    May 19, 2017 at 10:01 am

    Great point on the ReLUs vs the vanishing gradient problem for backprop in CNNs.

  50. Shahmustafa Mujawar

    June 15, 2017 at 2:39 am

    1) I am trying to understand, how the features are extracted in CNN i.e, from low-level feature to high-level as layer increases or vice-versa??
    2) And how the image (that many pixel values) get tagged by only one value??

  51. Dark Side

    July 11, 2017 at 7:13 am

    FAKE MEDIA NEURAL NETWORKS lol , Only if you get it.

  52. AlooMinati

    July 13, 2017 at 7:39 am

    CNN from silicon valley to classify hotdog or not hotdog

  53. AlooMinati

    July 13, 2017 at 7:41 am

    Deep Convolution Net to classify the template of the given meme

  54. Bashar Kernel

    August 16, 2017 at 7:00 pm

Good talk (y). I think you can say that the convolution operation is just the operation of moving a scanning sensor over the image to scan it. The sensor will not see the full image at one step; it will see just one part of the image at a time, the part that the sensor is over. By moving the sensor until it reaches the end of the image, the sensor will scan the whole image and will have knowledge of the full image. The point that makes this sensor powerful is that we can use a small sensor, which is easier to train, and it will find any pattern wherever it is on the image because it is moving.

  55. M T

    September 3, 2017 at 8:07 pm

    could i have your email i need to contact you directly i am working on CNN on my research thank you

  56. Rafael Gauna Trindade

    September 6, 2017 at 12:28 am

    Hi, i read that ReLU is an activation function for convolutional layers, where f(x) = max(0, x).
    But here in the video you display ReLU like a layer. This is really a correct approach?

  57. Rakesh Mallick

    November 13, 2017 at 8:18 pm

    Its about time to add a video of capsule networks as well.

  58. ri3it483qthirf

    November 23, 2017 at 12:28 am

    what's the advantage of relu activation over logistic activation? could you use both? I'm training a network that only ends up predicting one class for all samples, even if I test it on the training set. Any ideas?

  59. Dan Houston

    December 25, 2017 at 5:31 am

    I love your videos and i love your voice. Please don't let these turd trolls convince you to change a thing. It's all so nice and appreciated.

  60. Averros Apollo

    March 26, 2018 at 10:26 pm

    I am almost 100% sure the narrator doesn't understand the subject !

  61. Blue

    May 26, 2018 at 11:58 am

    i still don't get it… the explanation is just meeh

  62. 김유민

    June 11, 2018 at 7:01 am

    A nice explanation. Thanks!

  63. ankitharish

    July 20, 2018 at 3:09 pm

    excellent video..
you can't understand deep neural networks easier than this….

  64. Anthony Sarkis

    November 14, 2018 at 9:13 pm

    Hey are you looking to build your own visual intelligence? Check out diffgram

  65. WechselWissen

    March 20, 2019 at 8:59 pm

    Great video!

  66. Edwin Sng

    October 16, 2019 at 12:32 pm

    Hi I am a physician interested in AI in radiology. What is texture analysis and how is it related to deep learning?
