ml5.js: Transfer Learning with Feature Extractor
https://www.youtube.com/watch?v=kRpZ5OqUY6Y
Transcript:
(00:00) [BELL RINGS] Hello. Welcome to another ml5.js video. Now I am really very excited about this one. At the time of this recording, I have made three videos so far: an intro to ml5 and machine learning, a video about doing image classification, and a video about doing image classification with real-time images coming in from a webcam.
(00:20) And both of those image classification projects use a pre-trained model called MobileNet. So just to pick up where we last left off, here it is. This is the image classification with MobileNet example. You can see that I am, today, a-- ah, come on. I want to be a snorkel. I really feel like a snorkel.
(00:39) I am kind of like an oboe. But we could see here, I think, ah, if I get this ukulele and put it in here, it's going to see it's an acoustic guitar pretty quickly. Let me put this down over here. So here's the thing. We have determined that this model is not good at recognizing certain things. Like it cannot recognize a train whistle.
(00:58) It thinks it's a syringe or an oboe. It cannot recognize my purple water bottle, sponsor of The Coding Train water. It thinks it's a power drill or a microphone. [SINGING] Hello. OK. So what if I forgot what I was doing or why I'm here? What if I want to have this example recognize things that I have here in this room that it doesn't recognize? Could I train it with my own data? So this brings up so many questions, and there's so many different paths we could go down.
(01:34) But the path I want to go down in this video today is something called Transfer Learning. So with the concept of machine learning, I could have a massive database of images, label all those images, train a model based on that data, and then show it new images, and it'll tell me what's inside those new images based on what it learned.
(02:01) But I am a person with no massive database of images. So something that I could do is use somebody else's model that already was trained with a massive database of images and just kind of say, eh, I'm going to use these images on top of it. It's not a perfect solution, but it is a quite powerful one that allows you to do certain kinds of things very quickly.
(02:19) In fact, I just came over here apparently to write Transfer Learning. There is a project that I want to show you-- let me just hit Back here-- called Teachable Machine. So Teachable Machine is a project made by various-- a collaboration of many different researchers at Google, led by the Google Creative Lab.
(02:36) And I'm going to run through this in a second. And basically, what I'm going to do, I'm going to introduce the idea of-- show you Teachable Machine, introduce the idea of transfer learning, talk about how it works in ml5 with MobileNet, we're going to take a break, and then I'm going to come back and actually make the code example.
(02:52) I'm going to make a teachable machine. So let's first just run this and see. So I'm going to skip the tutorial, and I'm going to just open this up. I'm going to zoom in a little bit here. So I should say that Teachable Machine is using a slightly different algorithm behind the scenes than what I'm ultimately going to implement.
(03:10) And I'll kind of talk about maybe the differences between those after I kind of get done with all this. But conceptually, it's exactly the same thing. So right now, I can see there are three categories-- green, purple, orange. So now I'm going to attempt to train the teachable machine with my own images.
(03:29) So I'm going to step out of the frame just for the time being. I'm going to say, Train Green. I'm holding this down, so I'm kind of-- I'm giving it lots of examples of a ukulele at different kind of angles. Of course, it has my arm in it too. That's a part of what it's learning.
(03:41) Then I'm going to stop. I'm going to put the ukulele down awkwardly. I'm going to grab this train whistle, and I'm going to hit Train Purple. I'm going to give it lots of examples of the train whistle, and then I'm going to let go. And then let's do one more. Oh, I really should have made this one purple.
(04:03) I'm going to train it orange, because it is purple. This is my water bottle. So I'm going to train it with the water bottle a bunch of times. I'm sort of in the frame. I'm going to get out of the frame, and I'm going to let go. All right, so I finished training this teachable machine. If I come into the picture, it actually thinks I'm the water bottle.
(04:20) I'm orange. But look at this. Let me now show it the ukulele. Confidence-- 99%. Let me now show it the train whistle. Confidence purple-- 99%. And now let me show it the water bottle. Confidence orange, so you can see this works quite well. If I stand in there, it's kind of confused now. You see, I was standing in a lot of the training images with the water bottle, so it really thinks my train whistle is the water bottle.
(04:51) If I'm not in the picture, it knows it's a train whistle. So this is very important. You have to remember the machine learning system is not learning anything about these particular objects. It's learning about the exact sets of pixels you're showing it. So if I'm standing in the background with the ukulele every time, and then I'm not in the background anymore, it's going to be confused.
(05:12) We see this in some machine learning models that if you hold something up, no matter what you hold up, it always just says Cell Phone, because there's so many training images of people holding a cell phone, it thinks, well, whatever you're holding, it's got to be a cell phone. So this is the idea, so the wonderful thing about this is it's really fun.
(05:31) There's a really nice interface. It's designed really well. I can have it show different GIFs, play different sounds, but I'm limited to these three categories-- green, purple, orange. And I want to be able to do something like this in my own interactive experiments. OK, so let's talk about how does it even work.
(05:49) How does this even work? Before we get to the code, let's talk about how it works. All right, so let's talk about how image classification works. We said, or I said, I think, there is something called MobileNet. This is the pre-trained model that I've talked about. It has 1,000 image classes, and it was trained on images from a database called ImageNet, which has something like 14 million images.
(06:17) When we use it, we send our own image in. We send an image. Maybe it's from a webcam. Maybe it's a PNG. We send it into MobileNet, and then MobileNet gives us back a label and a probability. So maybe it says something like, cat, 90%, bird, 5%, clock, 5%. This is what it gives us back. So how are we going to retrain this model? Well, here's the thing.
(06:52) There's a lot of stuff going on in here, and in order to retrain it, we need to kind of like peel it open a little bit. And the thing that we're going to use to peel it open is something that's built into the ml5 library called a feature extractor. OK, well, internally, the MobileNet model is running a neural network.
(07:18) You might have heard that term before, and at some point in this video series, we might get more into that. But a neural network is something that has layers. It has multiple layers. Maybe it has Layer 1, Layer 2, Layer 3, and that data is actually being passed into Layer 1. It's being processed, and then sent into Layer 2.
(07:41) It's being processed again. It's being sent to Layer 3. It's processed again. Those processes-- there are different kinds of processes. For example, most likely, since we're sending it image data, it's using something called-- you might have heard this term-- a convolution. A convolution is actually an image process, which is the same thing that happens like in Photoshop or any kind of image processing utility, where you open up an image and say, hey, let's make it brighter.
(08:05) So a neural network is doing that to the image. It's doing all of these processes over and over and over again to try to reduce the image. This thing's got a lot of pixels in it. Let's process it down to something smaller and down to something smaller, and let's do that many, many, many, many times over multiple layers to eventually get to something, which we can call features.
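A minimal sketch of the "process it down to something smaller" idea, in plain JavaScript with no ml5 or TensorFlow.js involved. The function names, the 6 by 6 image, and the edge-detecting kernel are all made up for illustration; MobileNet's real layers are learned and far larger. As in most deep learning libraries, the kernel is slid over the image without being flipped.

```javascript
// Slide a small square kernel over a grayscale image (an array of rows), no padding.
// Each output value is the weighted sum of the pixels under the kernel.
function convolve2d(image, kernel) {
  const k = kernel.length;
  const out = [];
  for (let y = 0; y + k <= image.length; y++) {
    const row = [];
    for (let x = 0; x + k <= image[0].length; x++) {
      let sum = 0;
      for (let ky = 0; ky < k; ky++) {
        for (let kx = 0; kx < k; kx++) {
          sum += image[y + ky][x + kx] * kernel[ky][kx];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// Keep only the largest value in each 2x2 block, halving width and height.
function maxPool2x2(image) {
  const out = [];
  for (let y = 0; y + 1 < image.length; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < image[0].length; x += 2) {
      row.push(Math.max(image[y][x], image[y][x + 1], image[y + 1][x], image[y + 1][x + 1]));
    }
    out.push(row);
  }
  return out;
}

// Hypothetical 6x6 image: dark on the left, bright on the right.
const tinyImage = Array.from({ length: 6 }, () => [0, 0, 0, 1, 1, 1]);

// Hand-written kernel that responds to dark-to-bright vertical edges.
const verticalEdgeKernel = [
  [-1, 0, 1],
  [-1, 0, 1],
  [-1, 0, 1],
];

const filtered = convolve2d(tinyImage, verticalEdgeKernel); // 4x4
const pooled = maxPool2x2(filtered);                        // 2x2

console.log(filtered); // each row is [0, 3, 3, 0]
console.log(pooled);   // [[3, 3], [3, 3]]
```

Here 36 pixel values shrink to 4 numbers that say "there is a vertical edge here". A real network stacks many such steps, with learned kernels, before it arrives at its features.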
(08:25) So let's say that the last layer, after it does all these processes, produces something called features. Those features, which exist here, are then converted-- sorry-- through another layer into labels and probabilities. So there's this whole process that's happening inside of MobileNet.
(08:46) The image comes in. It's processed through a convolutional layer, maybe through another convolutional layer, then through some other kind of layer, blah, blah, blah, ends with all these features, which are really just numbers, a whole lot of numbers. And then those numbers are processed one more time to get probabilities.
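The last step described above, features going through one more layer to become labels and probabilities, can be sketched roughly like this. It is plain JavaScript, not ml5's internals. The four feature values, the weights, and the `label`/`confidence` field names are invented for illustration, and the sketch assumes that the final layer is a weighted sum per label followed by a softmax.

```javascript
// Turn raw scores into probabilities that are positive and sum to 1.
function softmax(scores) {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// One "dense" layer: a weighted sum of the features for each label, then softmax.
function classifyFeatures(features, weights, biases, labels) {
  const scores = weights.map((row, i) =>
    row.reduce((sum, w, j) => sum + w * features[j], biases[i])
  );
  const probabilities = softmax(scores);
  return labels
    .map((label, i) => ({ label, confidence: probabilities[i] }))
    .sort((a, b) => b.confidence - a.confidence); // most likely label first
}

// Made-up numbers: a real feature vector and weight matrix are much larger.
const classLabels = ["cat", "bird", "clock"];
const imageFeatures = [0.9, 0.1, 0.4, 0.7];
const layerWeights = [
  [2.0, -1.0, 0.0, 1.5],  // cat
  [-1.0, 2.0, 0.5, -0.5], // bird
  [0.0, 0.5, -1.0, 0.5],  // clock
];
const layerBiases = [0, 0, 0];

const results = classifyFeatures(imageFeatures, layerWeights, layerBiases, classLabels);
console.log(results); // "cat" comes first, with the highest confidence
```

Only `layerWeights` and `layerBiases` tie the features to particular labels, which is why this final step is the natural part to swap out.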
(09:02) Well, what transfer learning is, is it's, hey, let's just delete this part. Let's go into MobileNet and stop right here. Feature extractor-- let's make a version of the MobileNet model, where we stop right here at the feature extractor. And then we take those features and-- we put our own training images in.
(09:24) We say, take this training image of the ukulele. Send it all the way through. Don't bother to get the label. Don't get acoustic guitar. Just stop here, and say, hey, you know what? These features, these features are 100% ukulele. So we're going to retrain the model to map-- basically map the features to our own labels instead of the labels that previously existed in MobileNet.
(09:51) All right, so what are the features? So here's the thing. In theory, we could-- I mean I think in some crazy sort of theoretical sense, we could eliminate all of these layers and just teach a machine to learn that this set of pixels are a cat, and this set of pixels are a bird. And now look at this whole other set of pixels.
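A toy sketch of that retraining idea, in plain JavaScript. `extractFeatures` is a hypothetical stand-in for MobileNet's frozen layers (here it just computes two numbers from four pixel values) and is never changed during training. Only the small new layer, called the head below, learns to map features to our own labels. The training loop is ordinary softmax regression with gradient descent, which is an assumption for illustration and not necessarily what ml5 does internally.

```javascript
// Frozen stand-in for the pre-trained layers: pixels in, features out. Never updated.
function extractFeatures(pixels) {
  const average = (values) => values.reduce((a, b) => a + b, 0) / values.length;
  const half = pixels.length / 2;
  return [average(pixels), average(pixels.slice(0, half)) - average(pixels.slice(half))];
}

function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Weighted sum of the features for each label.
function scoreAll(head, features) {
  return head.weights.map((row, i) =>
    row.reduce((sum, w, j) => sum + w * features[j], head.biases[i])
  );
}

// Train only the new head; the feature extractor stays exactly as it was.
function trainHead(examples, labels, epochs = 200, learningRate = 0.5) {
  const featureCount = extractFeatures(examples[0].pixels).length;
  const head = {
    weights: labels.map(() => new Array(featureCount).fill(0)),
    biases: labels.map(() => 0),
  };
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const example of examples) {
      const features = extractFeatures(example.pixels);
      const probabilities = softmax(scoreAll(head, features));
      labels.forEach((label, i) => {
        // Error is the predicted probability minus the target (1 for the true label).
        const error = probabilities[i] - (label === example.label ? 1 : 0);
        features.forEach((value, j) => {
          head.weights[i][j] -= learningRate * error * value;
        });
        head.biases[i] -= learningRate * error;
      });
    }
  }
  return head;
}

function predict(head, labels, pixels) {
  const probabilities = softmax(scoreAll(head, extractFeatures(pixels)));
  const best = probabilities.indexOf(Math.max(...probabilities));
  return { label: labels[best], confidence: probabilities[best] };
}

// Made-up four-pixel "images": bright on the left for one object, bright on the right for the other.
const myLabels = ["ukulele", "train whistle"];
const trainingImages = [
  { pixels: [0.9, 0.8, 0.1, 0.2], label: "ukulele" },
  { pixels: [0.8, 0.9, 0.2, 0.1], label: "ukulele" },
  { pixels: [0.1, 0.2, 0.9, 0.8], label: "train whistle" },
  { pixels: [0.2, 0.1, 0.8, 0.9], label: "train whistle" },
];

const head = trainHead(trainingImages, myLabels);
console.log(predict(head, myLabels, [0.7, 0.9, 0.3, 0.1])); // label: "ukulele"
console.log(predict(head, myLabels, [0.0, 0.3, 0.7, 1.0])); // label: "train whistle"
```

Four examples are enough here because the frozen features already separate the two objects. That is the bet transfer learning makes about MobileNet's features and your own images.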
(10:12) Which set of pixels does it resemble? That's kind of what we're doing, but what you have to remember is that images have so many pixels. And so when you have to compare images pixel by pixel by pixel by pixel, there's so much data. And so actually, this idea of features-- the idea of features is boiling the essence of an image down to a smaller set of manageable numbers.
(10:38) So in essence, this image, which might have started as like a 512 by 512-- you do the math-- 512 squared number of pixels maybe ends up as 100 features, just 100 numbers. I don't actually remember what it is in MobileNet. We should look it up. And then those numbers are typically just numbers between 0 and 1.
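To "do the math" from the passage above, here is the arithmetic written out. The 100-feature figure is the illustrative number from the talk, not MobileNet's actual output size, and the three-values-per-pixel line assumes an RGB image.

```javascript
const imageWidth = 512;
const imageHeight = 512;

const pixelCount = imageWidth * imageHeight; // 262,144 pixels
const rgbValueCount = pixelCount * 3;        // 786,432 numbers, assuming red, green, blue per pixel
const featureCount = 100;                    // illustrative figure from the talk

console.log(pixelCount);                            // 262144
console.log(rgbValueCount);                         // 786432
console.log(Math.round(pixelCount / featureCount)); // 2621 pixels per feature
```

Comparing two images feature by feature means comparing 100 numbers instead of several hundred thousand.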
(11:00) This is also often referred to as a vector, meaning a list of numbers. It's the essence-- it's the numeric essence of that image that you just passed in. And it's learned these features over lots and lots of time of being trained with millions and millions of images. So what do we need to do in ml5? ml5 is a higher-level library.
(11:21) We don't have to do all this manipulation ourselves. We basically just say, hey, instead of making an image classifier with MobileNet, we're going to make a feature extractor with MobileNet, and then turn that feature extractor into a classifier, and train it with our own images. So this is what I'm going to do in the next video.
(11:40) I'm actually going to write the code to do exactly this, and I'm going to train it with a few sets of images in here. And what's interesting is you'll see you can actually get it to work with things like different facial expressions or different gestures. There's a lot of wonderful possibilities there.
(11:55) [MUSIC PLAYING]