How to use Q Learning in Video Games Easily

Hello world, it's Siraj! Today we're going to train a bot to get from point A to point B using a special type of reinforcement learning called Q-learning. Reinforcement learning means learning by interacting with an environment through positive feedback, or reinforcement. It's similar to how you give a dog a treat, but only if it rolls over, and it has evolved over the past few decades.

In the late 1950s, an American mathematician named Richard Bellman was trying to solve what he called the optimal control problem: the problem of designing an agent to minimize some measure of a system's behavior over time. Eventually he and his colleagues discovered a possible solution, which was later called the Bellman equation. It describes the value of a decision problem at a certain point in time in terms of the payoff from the initial decisions, plus the value of the remaining decision problems that result from those decisions, expressed through the system's various states. In this way it breaks the problem down into simpler subproblems. The Bellman equation is now used in many, many fields: it helps minimize flight time for airplanes, maximize profits for hedge funds, and minimize the beef that Soulja Boy seems to have with everyone.
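In modern notation, that recursive structure is usually written like this (a standard textbook form, not something shown in the video):

```latex
% V(s): value of state s; R(s, a): immediate payoff of decision a;
% \gamma: discount factor; P(s' \mid s, a): chance of ending up in state s'
V(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]
```

The value of the current state is the payoff of the best immediate decision plus the discounted value of the smaller decision problem it leads to.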
Bellman's work was making waves throughout the math community, but meanwhile a psychologist
named Edward Thorndike was trying to understand how learning works by studying the animal kingdom. He came up with what he called the law of effect, which states that responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation. Thanks, Captain Obvious. It's actually a pretty important discovery, though. In one of his experiments, he put a cat in a wooden box and observed it while it tried a bunch of different ways of getting out, until it finally hit the lever that opened the box. When he put the cat back in the box, it immediately knew to hit the lever to get out, and it was able to do that through the process of trial and error, which is what Thorndike was essentially describing in the law of effect.

Decades later, a British
computer scientist named Chris Watkins thought that perhaps these two ideas could be combined to create a new type of learning algorithm: an agent that minimizes some measure of a system's behavior over time, like the Bellman equation, and does so through the process of trial and error, like the law of effect. And so he invented a novel reinforcement learning technique he called Q-learning.

So what does this look like? Let's say we have five rooms in a building connected by doors, and we'll think of everything outside the building as one big room (all of space-time is room 5). We can think of this system as a graph: each room is a node and each door is a link, so if room 1 has doors to both rooms 5 and 3, those nodes are connected. Our goal is to put an agent in any room and have it learn how to get to room 5 through trial and error; room 5 is the goal.
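To make this concrete, here is a minimal sketch of the room example in Python. The full door layout below is an assumption (the video only names a couple of connections), and the update rule is the simple deterministic form Q(s, a) = R(s, a) + gamma * max Q(next state), with a reward of 100 for doors leading straight into room 5 and 0 for every other door:

```python
import random

random.seed(1)

GOAL = 5
# doors: which rooms each room connects to (room 5 is "outside"; layout assumed)
doors = {0: [4], 1: [3, 5], 2: [3], 3: [1, 2, 4], 4: [0, 3, 5], 5: [1, 4, 5]}

# R matrix: instant reward for taking the door from room s into room a
R = {s: {a: (100 if a == GOAL else 0) for a in nbrs} for s, nbrs in doors.items()}

# Q matrix: rows are states (rooms), columns are actions (doors out of them)
Q = {s: {a: 0.0 for a in nbrs} for s, nbrs in doors.items()}
gamma = 0.8  # discount factor for future rewards

for _ in range(1000):                  # episodes of trial and error
    s = random.randrange(6)            # drop the agent in a random room
    while s != GOAL:
        a = random.choice(list(Q[s]))  # try a random door
        # immediate reward plus discounted value of the best door out of room a
        Q[s][a] = R[s][a] + gamma * max(Q[a].values())
        s = a
```

After training, following the largest Q value in each room walks the agent from any starting room to room 5.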
We can associate a reward value with each door (the link between nodes): doors that lead directly to the goal room get an instant reward of 100, and doors not directly connected to the goal room get a reward of 0. In Q-learning, the goal is to reach the state with the highest reward through a sequence of actions. So each room is a state, each action is represented by an arrow, and the mapping of states to actions is the agent's policy. The agent uses the reward value as a signal to improve its policy over time, and it stores what it learns through experience in what's called the Q matrix: the rows represent the possible states, and the columns are the possible actions leading to the next state. The agent updates the Q matrix over time as it learns the best actions to maximize the reward.

Seems pretty useful, right? Q-learning has got to be used everywhere in
video games, right? Consumer video-game bots need to be good, but not so good that a human couldn't beat them. The bots that have used Q-learning to master games like chess, checkers, and most recently Atari games become insanely good at whatever they play. Academic AI learns, while consumer game AI generally just makes educated guesses; it doesn't really learn, and its actions are all scripted. But as different as the two fields are, they're converging as we discover more about machine learning. For example, in Forza Motorsport you can create a Drivatar for yourself: an AI that learns how you drive by observing you and can then imitate your driving style. Adaptive behavior like this will make games more interesting, and there's a lot of potential for more of it. So let's write out a ten-line, high-level Python script
that uses Q-learning to train a bot to get from point A to point B. This game is a 5-by-5 grid. Our agent is a yellow square, and the goal is to find its way to the green square or the red square to end the game. Each cell represents a state the agent can be in, and there are four actions: up, down, left, and right. Moving a step gives a reward of -0.04, the red cell gives -1, and the green cell gives +1, so ideally we want to get to the green cell every time. The game world is already built for us, so we'll just start off by importing it at the top. Then, in our main function, we can go ahead and create a while statement set to True, because we want our agent to run indefinitely. Next we'll initialize our bot's position from the World class and assign it to the agent variable. So now we want our bot to pick the right action to take in the game world, and the question is: how do we decide that?

    I got a grid of squares measured five by five
    and I'm gonna get to green in one piece alive
    I'm gonna make a Q matrix, initialize
    keeping every single reward obtained archived
    then I'ma pick an action straight out the Q
    then I'm gonna do it, yeah, it's just fresh and new
    update Q with reward, and once I got that
    I'll go ahead and repeat, yeah
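The steps in that verse (initialize a Q matrix, pick an action from Q, take it, update Q with the reward, repeat) are the standard tabular Q-learning update, which in textbook notation reads:

```latex
% \alpha: learning rate; \gamma: discount factor;
% r: reward observed after taking action a in state s; s': resulting state
Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]
```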
We'll use our bot's position as a parameter for the max_Q function, which will choose an action from our Q matrix as well as a potential reward. Then we can perform that action by passing it as a parameter to the do_action method, which returns our bot, the action we took, the reward we received, and the bot's updated position after taking the action. Now we're ready to update our Q matrix, so we'll use the updated bot position as the parameter, and we'll print both parameters to the terminal so we can observe the results.
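The loop just described can be sketched as follows. This is not the video's actual learner.py (which imports a Tkinter-based World module from the accompanying repo); the grid stand-in, the terminal-cell positions, and the function signatures here are assumptions made so the sketch runs on its own:

```python
import random

random.seed(0)

# --- a tiny stand-in for the video's World module (assumed, not the real one) ---
SIZE = 5                                   # the 5x5 grid from the video
GREEN, RED = (4, 4), (3, 2)                # terminal cells; positions are made up
ACTIONS = ('up', 'down', 'left', 'right')
MOVES = {'up': (0, -1), 'down': (0, 1), 'left': (-1, 0), 'right': (1, 0)}
WALK_REWARD = -0.04                        # small cost for every ordinary step

# Q matrix: rows are states (cells), columns are actions
Q = {(x, y): {a: 0.0 for a in ACTIONS} for x in range(SIZE) for y in range(SIZE)}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def max_Q(state):
    """Return the best-known action for a state and its Q value."""
    best = max(Q[state], key=Q[state].get)
    return best, Q[state][best]

def do_action(state, action):
    """Apply an action (clipped at the walls); return (new_state, reward, done)."""
    dx, dy = MOVES[action]
    new_state = (min(max(state[0] + dx, 0), SIZE - 1),
                 min(max(state[1] + dy, 0), SIZE - 1))
    if new_state == GREEN:
        return new_state, 1.0, True
    if new_state == RED:
        return new_state, -1.0, True
    return new_state, WALK_REWARD, False

# --- the learning loop from the walkthrough ---
for episode in range(3000):
    state, done = (0, 0), False
    while not done:
        # mostly exploit the Q matrix, occasionally explore at random
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action, _ = max_Q(state)
        new_state, reward, done = do_action(state, action)
        # nudge Q(s, a) toward reward + discounted best value of the next state
        Q[state][action] += alpha * (reward + gamma * max_Q(new_state)[1]
                                     - Q[state][action])
        state = new_state
```

After training, greedily following max_Q from the start cell traces a short path to the green square while steering around the red one.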
We'll run the script by typing python learner.py into the terminal, and it'll pop up as a GUI. The bot will immediately start trying out possible paths to get to the green square, and we can watch the score in the terminal improve over time. This particular bot gets really good really fast: in about ten seconds it has already found the ideal path, and it just keeps following it.

So, to break it down:
reinforcement learning is the process of learning by interacting with an environment through positive feedback. Q-learning is a type of reinforcement learning that minimizes some measure of a system's behavior over time through trial and error, and it does this by updating its policy, which is a mapping of states to actions, based on a reward. The coding challenge for this video is to modify the code so that the game world is bigger and has more obstacles; let's make it harder for our Q-learning bot to find the optimal strategy. Details are in the README at the GitHub link in the comments, and I'll announce the winner in the next video. For now, I've got to optimize my life, so thanks for watching!


  1. Joseph K J

    March 10, 2017 at 2:33 pm

    Hey Siraj, whats the editor that you are using?

  2. Mohamed Mnete

    March 17, 2017 at 5:26 am

    Thanks bro, really helpful.

  3. Mohamed Mnete

    March 17, 2017 at 5:26 am

    By the way I was thinking on creating a tic tac toe game using keras to do this q learning.

  4. Hitesh Vaidya

    March 19, 2017 at 1:05 pm

    Traceback (most recent call last):
    File "", line 23, in <module>
    Q[(i, j)][action] = w
    KeyError: (4, 1)

    This is what I get after running the program shared by siraj via github link. Can anyone help?

  5. Pranav Gaur

    March 29, 2017 at 7:41 am

    Hi @Siraj Rawal,

    Great tutorial and it surely helped me a lot. I was wondering how this approach can be used for a continuous state space(if that's the right term to use)? For eg. A race car finding its way around a track.
    If possible, can you do a tutorial on that too? 😛

  6. Sam Arnold

    March 31, 2017 at 5:04 am

    Dude Siraj! you are the man! thank you for your videos!

  7. Ajinkya Gawali

    April 6, 2017 at 8:09 am

    Your raps make me cringe so bad, It makes your videos very difficult to watch.

  8. Matthew Felgate

    April 6, 2017 at 9:50 am

    Nice video. Is it correct to call it Reward? Isn't it both Rewarded and Punished. Like both a Carrot AND Stick method of learning?

  9. Matthew Felgate

    April 6, 2017 at 10:09 am

    Whats the reasoning behind making the value of walk_reward as -0.04? I mean why was this value chosen? Could you write something to optimise this number?

  10. Anal Khan

    April 8, 2017 at 7:02 am

    hey can u plz make a video for sarsa algorithm.

  11. qiman wang

    April 10, 2017 at 4:26 pm

    Markov chains

  12. qiman wang

    April 10, 2017 at 4:26 pm

    Markov chains

  13. Lloyd Sloan

    April 13, 2017 at 5:44 am

    What IDE do you use for python?

  14. danny

    April 17, 2017 at 5:16 pm

    Siraj, your eyebrows are amazing. And I think learning rate selection can be really important depending on your application of the decision process.

  15. BK

    April 23, 2017 at 8:09 pm

    At first I was annoyed by the level of excitement, but then I realized it made it way easier to pay attention to than my lectures. Thanks for that extra effort!

  16. 张航

    May 3, 2017 at 3:43 am

    Rather good. Thank you for sharing!

  17. Kumar Developer

    May 7, 2017 at 7:41 am

    Started AI and ML Recently. Thank God for finding you guys soon. I do have some queries to you. How could I contact you ? Any Forums ? How could I start with Q Learning ? What is the basic environmental Steps needed ?

  18. swag gamer

    May 10, 2017 at 4:53 pm

    can you make an appnana bot

  19. Adrián Rubio

    May 10, 2017 at 11:29 pm

    Man, this is just amazing. Every single topic of every single subject must be like this… fun, interactive, fascinating.

  20. Aidan Gomez

    May 15, 2017 at 5:14 am

    this seems like a markov chain

  21. Abhinav a.k.a Demkeys

    May 29, 2017 at 9:22 am

    You're a fucking legend!!! Awesome video!

  22. C C

    June 4, 2017 at 7:56 pm

    Seem wrong Image at 3:15, the image is R matrix no a Q matrix

  23. Frenchpet

    June 15, 2017 at 5:57 pm

    Can you do one for Atari games?

  24. WildLachii

    June 16, 2017 at 2:05 am

    put speed to 0.75 and thank me later

  25. Azhar Khan

    June 19, 2017 at 5:13 pm

    Machine learning with the rap!
    Now I have seen everything 🙂
    Very innovative Siraj. I enjoy your videos.

  26. Gautham Venkataraman

    June 20, 2017 at 6:20 am

    this guy is underrated asf

  27. hamtsammich

    June 25, 2017 at 7:43 am

    Hey Siraj, how would you go about recreating the image generation used in ?

  28. dLoLe

    July 12, 2017 at 12:13 pm

    Entertaining and fun. Not really useful if you actually want to understand it and learn something

  29. Amber 180

    July 12, 2017 at 10:05 pm

    You legit have a cool science show! Wait, did you just rap??

  30. Arul Kajan

    July 13, 2017 at 2:45 pm

    Dear Siraj
    can i have your contact detail

  31. Daniel Astillero

    July 13, 2017 at 7:39 pm

    can't really argue w/ that sick rappin'

  32. APB

    July 14, 2017 at 11:02 pm

    is that possible with windows 10?

  33. James Ballari

    July 19, 2017 at 4:46 am

    Hi raj, great video as always 🙂 I wrote my self with the help of the internet a python code to implement Tic Tac Toe game with Reinforcement Learning ( Q- Learning), it learns to play the game by playing with itself and learning from the outcomes and gets better at it.
    Here's the github link for the same:

  34. dLoLe

    July 31, 2017 at 11:12 pm

    There are a lot of videos on youtube that explain this topic really well but this is not one of them. It doesn't really help you understand the idea behind it but only briefly outlines it with a few buzzwords and makes the thing seem more complicated than it really is

    I don't want to spread negativity but I sincerely believe that videos are just ckickbait with some buzzwords mixed with some forced jokes without any useful information that you could use to really learn something.

  35. A Achyut

    August 1, 2017 at 2:57 pm

    There's Physics(graduate level) on the blackboard not math

  36. edantonio505

    August 8, 2017 at 1:18 pm

    Great video! thanks!

  37. Dong Wang

    August 11, 2017 at 10:11 am

    Thanks Siraj for this series of awesome tutorial, the rap is great, I don't know you have the talent of rap before, go for American Talents!

  38. Harshit Agarwal

    August 17, 2017 at 1:20 pm

    at 3.16, he should be showing Q matrix but he's showing R matrix

  39. oshaho rana

    August 23, 2017 at 5:35 pm

    plz speak slowly, that non native english can get ur word

  40. Oussama Bezzad

    August 29, 2017 at 6:58 pm

    I'm too late but here's my challenge's code:

  41. Fast Solution

    September 3, 2017 at 6:30 pm

    where to learn A.I from scratch for my unity games

  42. No'ah

    September 20, 2017 at 6:21 pm

    you your so
    so awesome

  43. Eduardo Martín

    September 22, 2017 at 3:16 pm

    holy moly man, please…you can teach a bit slower… xD good job anyways

  44. micsodaki

    September 25, 2017 at 3:23 pm

    I still don't understand how can we set the Q matrix in case of e.g. games where we have no idea whether an action is good or not.

  45. MonmonstarTV

    September 26, 2017 at 5:08 pm

    This dude is probably the most cringy tutor I have seen, but he explains everything very well

  46. Amairani Coba

    September 27, 2017 at 7:53 pm

    Hi siraj, I have a question. This helps to get from point A to B. Now let's say I want to cover each space in the grid as my goal, is qlearning a way to do it? Or what do u Suggest? Great videos. 😊😊

  47. Jordan Shackelford

    October 2, 2017 at 5:49 am

    Teach me how to optimize my life pleasee

  48. William John

    October 7, 2017 at 1:29 pm

    print "create siraj! you have points to assign to strength, health, wisdom, and dexterity."
    name=raw_input("Siraj Hair ")
    attributes=("python", "tensorflow", "neural network", "iq")
    Tensorflow =0
    neural network=0
    while True:
    print "you have", points, "points left."
    1-add points
    2-take points
    3-see points per attribute
    choice=raw_input("bla bla bla ")
    if choice=="1":
    attribute=raw_input("Siraj is bored ")
    if attribute in attributes:
    add=int(raw_input("how many points? "))
    if add<=points and add>0:
    if Siraj=="Python":
    print name, "now has", strength, "Blabla points."
    elif attribute=="health":
    print name, "now has", health, "health points."
    elif attribute=="wisdom":
    print name, "now has", wisdom, "wisdom points."
    elif attribute=="dexterity":
    print name, "now has", dexterity, "dexterity points."
    print "invalid number of points."
    print "invalid attribute."
    elif choice=="2":
    attribute=raw_input("which attribute? strength, health, wisdom, or dexterity? ")
    if attribute in attributes:
    take=int(raw_input("how many points? "))
    if attribute=="strength" and take<=strength and take>0:
    print name, "now has", strength, "strength points."
    elif attribute=="health" and take<=health and take>0:
    print name, "now has", health, "health points."
    elif attribute=="wisdom" and take<=wisdom and take>0:
    print name, "now has", wisdom, "wisdom points."
    elif attribute=="dexterity" and take<=dexterity and take>0:
    print name, "now has", dexterity, "dexterity points."
    print "invalid number of points."
    print "invalid attribute."
    elif choice=="3":
    print "strength -", strength
    print "health -", health
    print "wisdom -", wisdom
    print "dexterity -", dexterity
    elif choice=="4":
    if points==0:
    print "use all your points!"
    print "invalid choice."
    print "congrats! you're done designing "+name+'.'
    print name, "has", strength, "strength points,", health, "health points,", wisdom, "wisdom points, and", dexterity, "dexterity points."

  49. Bob dabiuld

    October 12, 2017 at 8:04 pm

    The pictures (and some content) are literally copied from

  50. VibeX MaTxeMa

    October 22, 2017 at 2:38 pm

    thanks . it worked.. great help!!!!

  51. Kinos141

    October 22, 2017 at 11:49 pm

    Q learning looks like GOAP AI. I could be wrong.

  52. Harsha Kada

    October 27, 2017 at 3:38 am

    hii Siraj,

    I have recently working on a project called raspberry Cluster computer,
    I have built a small 6 node raspberry, can i use that cluster for q learning to make it learn Mario game.

    thank you.

  53. Lolpop HD

    November 3, 2017 at 10:28 am

    I cant do the import world pls help

  54. Mekaveli.

    November 3, 2017 at 3:44 pm

    I think you should stop using words like 'ez' because it's not , especially for beginners.

  55. gsrini

    November 4, 2017 at 4:45 pm

    how did you chose the reward as -0.1

  56. Shad Colebroke

    November 6, 2017 at 6:24 pm

    the corn is strong in this one

  57. Winston Mcgee

    November 21, 2017 at 3:56 pm

    will you stop adding unnecessary things you think you're funny? just get to the point you fucking indian

  58. David M. Johnston

    December 11, 2017 at 8:57 pm

    Can I give you an opinion?
    First, your videos are very interesting and helpful.
    But to me it seems like there's a conflict in the audience you target. On the one hand, you're speaking to developers who like to think in a structured and very logical way, but on the other hand, your quick paced style of editing, intercut with funny clips, seems made to attract passive viewers who only stay here because it moves a lot but don't really listen.
    This is fine, but at the same time it can be distracting for the focused viewer who really tries to deeply understand the core concepts. I'm just pointing this out because sometimes it's hard to pinpoint what is good and what is bad from the inside, and an outsider's opinion can help.

  59. Jack Currie

    December 13, 2017 at 2:23 am

    Siraj, your videos are so excellent that they have driven me to make my first ever comment on Youtube. Just wanted to tell you to keep it up! You strike the perfect balance between between helpfulness and entertainment. And don't listen to the haters, keep rapping 🙂

  60. xXSillyBilly79Xx

    December 24, 2017 at 3:35 pm

    Cool information but the platform makes me cringe.

  61. Paul Ahrenholtz

    December 31, 2017 at 11:47 pm

    i ain't ran from nothing but the loss funct…
    never running from nothing but the loss funct..


  62. Luciano Martinez

    January 3, 2018 at 7:47 pm

    He's so excited omg

  63. Bram Dhont

    January 17, 2018 at 10:46 am

    I thought everything was funny and cool but than comes the rapping

  64. Ashutosh pandey

    January 18, 2018 at 5:47 pm

    Bro you are simply awsome with lots of stuffs you came up with,presentation style and everything is incredible!!!!! ….

  65. Dane Lee

    March 10, 2018 at 10:22 am

    hhhhh I'm speechless at this video, awesome!

  66. Tristan W

    March 14, 2018 at 4:28 pm

    I've never seen someone gesticulate so much.

  67. Vinil Vasani

    March 24, 2018 at 6:49 pm

    I'm a novice in the field of machine learning and have some experience with different frameworks. I have a simple project in mind that I want to executed. I want to use machine learning to set signal light timers (especially green light duration) based on the number of cars (for a 4-way junction ). Basically a dynamic traffic light which will set a dynamic timer for each red light based on the number of cars at that red light. What kind of model should I use for this and a link to relevant resources will be very useful as well.

  68. Zylar Gray

    March 28, 2018 at 1:47 am

    You…. missed the perfect opportunity to *insert Christopher Walken meme here*…

  69. Savior NT

    March 29, 2018 at 9:21 am

    When the cat was in the box, and it was not being observed, was the cat dead, alive, or did it exist in both states simultaneously?

  70. GameCasters

    April 1, 2018 at 1:01 pm

    are there any q learning videos for 3d video games in android? that's what I'm interested in tbh

  71. GameCasters

    April 1, 2018 at 1:02 pm

    and i havent found anything on youtube talking about q learning for android games

  72. Prasad Madhale

    April 6, 2018 at 10:24 pm
    This might help!

  73. Paulo Jose Castro

    April 7, 2018 at 9:38 am

    meh why is markov didn't mention?
    also made a blog for different approach of qlearn:

  74. 1Rennae2

    April 10, 2018 at 7:25 pm

    Is this what is going on in the chans…as in 4chan & 8chan? Sure sounds like it to me. 🙂

  75. Saeed Ahmed

    April 13, 2018 at 7:38 am

    As a Beginner, I found it useful… I will continue to learn from here…

  76. 16hbm

    April 20, 2018 at 10:13 pm

    Could u please do a video on Luna Lander implementing q -learning??

  77. Rami M

    April 28, 2018 at 1:14 pm

    how can anybody understand this!!

  78. Сергей Лютько

    May 17, 2018 at 11:36 pm

    Just awesome!

  79. Vu Coc

    May 20, 2018 at 2:19 pm

    It's great! I am just a student first years and I am trying to studing IT. Your video make my problem easier. I hope you to make a lot of video about IT, learning! (Y) Better luck in the future

  80. 陈宙斯

    May 21, 2018 at 8:13 am

This video is addictive 😂 If YouTube were open in China too, this creator would blow up ~ RL remix

  81. GameSect

    August 5, 2018 at 1:44 pm

    Hey there your a rockstar get your game on

  82. BlindfoldChess

    August 23, 2018 at 11:00 am

    Yo ! RL in golang.

  83. Thiago Araujo

    August 28, 2018 at 1:53 pm

    This is just awesome hahaha

  84. Adam Louis

    October 6, 2018 at 1:39 am

    hey can anyone tell me where can i find the whole package of the demo? If it is on GitHub please send me the link bros ! thanks

  85. paul antony

    October 16, 2018 at 11:46 am

    How do we apply q learning for a turn based game(games like tic tac toe,chess etc) does max(s',a') be max q value in opponents turn

  86. Sukanya Murali

    November 26, 2018 at 11:16 am

    I installed the "World"… But still I m getting the error as ModuleNotFoundError: no module named 'World'.

  87. Ahm Zahid Hossain

    December 27, 2018 at 4:08 pm

    I am using anaconda 3 python env, was not able to install TKinter. Tried all different ways. If anyone can help me …

  88. lemon274

    February 21, 2019 at 1:35 pm

    Im sad I can't seem to import the World library after pip installing it :

  89. Jan michael Besinga

    March 14, 2019 at 4:15 am

    I set the playback speed to 0.75 since he started rapping.

  90. adhil Parammel

    March 29, 2019 at 6:07 am

    How to make, ai learn pubg😋😋

  91. Eshaan Barkataki

    March 30, 2019 at 11:40 pm

    Thx, Great Video. First youtube programmer I saw that uses meme pics. Appreciate that. 🙂

    (Ps: thx for putting the source code too. Helped a lot)

  92. Nuno Soares

    April 16, 2019 at 9:54 pm

    Cat: It might be this lever (translated from meow)
    Person: puts back on the box
    Cat: I HAVE NO CLUE (translated from meow)

  93. kenivia

    May 19, 2019 at 8:20 am


  94. Ricardo Lerch

    May 26, 2019 at 4:16 pm

    why does it always choose the same path, being others with the same reward?

  95. Roland Lawrence

    May 27, 2019 at 9:25 am

    watching this video series has changed my life 🙂 amazing stuff! coming from an old SQL programmer…

  96. John Nesbit

    May 27, 2019 at 6:30 pm

    the code is so high level it made things more confusing until I read a lower level script plase stp wit da high levl

  97. N0ught

    June 29, 2019 at 9:14 pm

    okay… what?

  98. aryan chauhan

    July 24, 2019 at 3:23 am

    This is so cool whenever some visual action is there I am in

  99. Tarvis

    August 19, 2019 at 3:08 am

    why are there only 5 states in the final code if there is 25 possible states?
