Learning to See [Part 4: Machine Learning]
Last time, we left off wondering how to improve on our knowledge engineering approach to identifying fingers in images. Fortunately for us, there is a better approach, and its roots go all the way back to the beginning of AI.

In the late 1940s, IBM engineer Arthur Samuel took on an interesting side project: writing a computer program to play checkers. Samuel developed his algorithm in his spare time on IBM 701 computers that were often still on the production line. After a few years of work, it became apparent that Samuel had achieved something remarkable. His program could not only beat its creator; it could defeat all but the very best checkers players. And what's really remarkable here is how Samuel accomplished this. Samuel didn't write a program to play checkers; he wrote a program that learned to play checkers. The difference is powerful. Rather than spending his time coding up optimal checkers-playing rules, Samuel allowed his program to find the best strategies by testing their performance on real games. Playing game after game against itself, his program made incremental improvements until it could compete with the very best players.

Samuel's approach is called machine learning, and it's going to help us find a better solution to our finger-counting problem. Thanks to our labeling back in part 2, we have lots of examples of fingers and non-fingers. Machine learning says that instead of writing our own rules, we should write a program that learns rules from our examples. One way to do this is simply to let our examples be the rules. To see what I mean, let's take a closer look at our examples. Remember that we're sampling a nine-by-nine grid around each pixel and using that information alone to determine whether the pixel belongs to a finger or to something else. Our three images contain a total of 7,867 pixels, and of these, 495 correspond to fingers. This doesn't mean we have 7,867 distinct examples, however, because quite a few are redundant.
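The sampling step described above can be sketched in a few lines of NumPy. The helper below, `extract_patches`, is a hypothetical reconstruction (the series' actual code isn't shown); it zero-pads the image so that pixels near the border still get a full nine-by-nine grid.

```python
import numpy as np

def extract_patches(image, labels, size=9):
    """Sample a size-by-size grid around each pixel, paired with that
    pixel's label. `image` is a 2D array; `labels` marks finger pixels
    with 1. Hypothetical helper, not the series' actual code."""
    half = size // 2
    padded = np.pad(image, half)  # zero-pad so edge pixels get full grids
    patches, targets = [], []
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patches.append(padded[r:r + size, c:c + size].flatten())
            targets.append(labels[r, c])
    return np.array(patches), np.array(targets)

# Tiny synthetic example: a 4x4 image with one "finger" pixel.
img = np.zeros((4, 4), dtype=int)
lbl = np.zeros((4, 4), dtype=int)
img[1, 2] = 1
lbl[1, 2] = 1
X, y = extract_patches(img, lbl)
print(X.shape)  # (16, 81): one flattened 9x9 patch per pixel
```

Deduplicating rows of `X` (for instance with `np.unique(X, axis=0)`) is what takes us from one example per pixel down to the unique patterns discussed next.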
The most redundant example is a completely empty grid, which makes sense because our images are mostly empty. We can look at the other most common patterns to get a better feel for our data. After removing the redundancies, we have 3,090 unique examples, and 413 of these show fingers. Those 413 finger examples should be pretty useful to us, because we know they correspond to actual fingers in actual images. If a new example matches one of our 413 finger examples, we'll call it a finger; otherwise, we won't. This is our first machine learning algorithm: it learns by remembering a bunch of examples of fingers, directly comparing new data to those examples, and looking for matches.

So, how will our first machine learning algorithm perform? Hopefully, it will outperform our knowledge engineering approach from part 2. But before we dive into comparing our two approaches, let's make sure the comparison will be fair. Back in part 2, we talked about confusion matrices. These give us a good idea of how a single algorithm is performing but are difficult to compare across algorithms. We need a way of scoring our confusion matrices: some kind of performance metric. A little investigation turns up a huge number of performance metrics we could compute from a confusion matrix. The simplest to interpret is accuracy: the number of correct classifications divided by the total number of examples. The results from our knowledge engineering approach last time give us an impressive accuracy of 94.2%. Unfortunately, accuracy doesn't always tell the whole story. Although our knowledge engineering–based approach *is* 94.2% accurate, we can clearly see that we're not correctly classifying most finger pixels. So how does such a crappy classifier achieve such a high accuracy? It's all about baselines. Notice that 93.7% of our pixels don't correspond to fingers.
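The learn-by-remembering scheme can be sketched as a tiny classifier. This is a minimal illustration, not the series' actual code: it stores the unique finger examples seen during training and declares a new example a finger only on an exact match.

```python
import numpy as np

class MemorizeClassifier:
    """Learn by remembering: store every unique finger example, then
    call a new example a finger only if it exactly matches one."""

    def fit(self, X, y):
        # Keep the unique examples labeled as fingers (y == 1).
        self.finger_examples = {tuple(row) for row in X[y == 1]}
        return self

    def predict(self, X):
        return np.array([1 if tuple(row) in self.finger_examples else 0
                         for row in X])

# Toy data: three training examples, two of them fingers.
X_train = np.array([[0, 0], [1, 0], [1, 1]])
y_train = np.array([0, 1, 1])
clf = MemorizeClassifier().fit(X_train, y_train)
print(clf.predict(np.array([[1, 0], [0, 1]])))  # [1 0]
```

Note that there is no generalization here at all: anything not seen verbatim during training is classified as not-a-finger, which is exactly what makes the evaluation question in the next part so important.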
We can exploit this imbalance in our data by constructing the world's simplest classification algorithm: predict that every example is negative. By classifying all pixels as not-fingers, we of course miss all the fingers, but who cares? They only account for 6.3% of the data, so we're 93.7% accurate. Clearly, our choice of performance metric matters, and what we need here are some better metrics. We'll use a popular pair that gives a far more nuanced picture of how an algorithm is doing: recall and precision. In our case, recall is the proportion of all finger pixels we correctly identify, and precision is the proportion of our finger predictions that are correct. These metrics take a little more time to wrap your head around, but they're critical if we're going to make meaningful comparisons between algorithms. Computing precision and recall for our approaches so far gives a much better picture of how we're doing. To keep things organized, we'll track every approach we try in one big table.

After all that work on performance metrics, we're finally ready to see how our machine learning approach compares to our baseline and our knowledge engineering approach. Remember that our machine learning approach consists of matching new data to our existing examples of fingers: if the data matches, it's a finger; otherwise, it's not. Let's try it out on our three examples. As we can see, the performance is incredible. In fact, our recall is 100%, and our precision is 97%. We've correctly identified all finger pixels and made almost no mistakes in the process. Absolutely incredible performance. Now, if these numbers seem suspiciously high to you, they should. We've fallen into the most alluring trap in machine learning. Next time, we'll see why we shouldn't trust these numbers. But first, let's quickly catch up with Arthur Samuel and his checkers algorithm.
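Here is a minimal sketch of accuracy, precision, and recall computed from confusion-matrix counts. The counts in the usage example are illustrative (a hypothetical 1,000-pixel dataset with the same 93.7/6.3 imbalance), not our actual data; they show how the all-negative baseline earns high accuracy while scoring zero on both better metrics.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from the four
    confusion-matrix counts (true/false positives and negatives)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct finger calls
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # finger pixels found
    return accuracy, precision, recall

# The all-negative baseline on imbalanced data: true negatives dominate,
# so accuracy looks great while precision and recall are both zero.
acc, prec, rec = metrics(tp=0, fp=0, fn=63, tn=937)
print(f"accuracy={acc:.1%}, precision={prec:.1%}, recall={rec:.1%}")
# accuracy=93.7%, precision=0.0%, recall=0.0%
```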
Ironically, the success of Samuel's algorithm led to its demise a few years later. The program attracted so much press attention that IBM's president, Thomas J. Watson (perhaps even more ironically, the very Watson that IBM's Watson system is named for), shut it down: he believed it was a waste of money, and IBM's marketing department felt that consumers were threatened by the idea of intelligent computers. Thanks for watching.