How do I select features for Machine Learning?
Hi there! This is Kevin, from Data School. You're about to watch an excerpt from a private Q&A webcast that I hold every single month. This question is about how to select the best features for your machine learning model. Stick around till the end to find out how YOU can join these webcasts in the future. Thanks!

So, the first question I've got for today is from Davis Vickers, and he asks: "Hey Kevin! I discovered your YouTube channel as an aspiring data scientist. Your content has helped me so much! Question on sklearn: in logistic regression and other classifier algorithms, could you demonstrate a way to extract the best features, with their coefficients and/or the score that is used in the model? I'm working with a logistic regression model that has over 400 features, and I'm trying to determine the best way to use feature importance/selection."

OK! So, that was a long question, but let me boil it down to this: Davis is asking how to perform feature selection. OK? He's got 400 features for his classification model, though the process would be roughly the same for a regression model, and he wants to reduce the number of features.

Now, why do you want to perform feature selection in the first place? Because removing irrelevant features results in a better-performing model, an easier-to-understand model, and a model that runs faster. So those are just three reasons that feature selection is useful.

So, I was trying to think of the best way to answer this question, because unlike some questions, this one doesn't have a simple answer. There are books about feature selection, I assume; it's certainly a book-length topic. But I think the best way to answer this is... two years ago, I was at PyData DC, in Washington DC, and I attended a presentation called "A Practical Guide to Dimensionality Reduction." It was an awesome presentation, and it's a great video.
But what I'm gonna do is pull up his slides on screen, briefly go through a couple of his ideas for how you do feature selection, and then talk very briefly about how you do it at a practical level, not just at a high level. So, it will take me just a moment. I need to share one of my tabs... and let's do that. Alright! You should see the slides on screen.

So it's called "A Practical Guide to Dimensionality Reduction," but it's about feature selection, because dimensionality refers to the dimensions of your training data: the number of rows and the number of columns. When we're talking about features, we're talking about columns, and when we're talking about dimensionality reduction, that means reducing the number of features. OK?

So let me go ahead and scroll down. He has this big list of 12 techniques that he uses for feature selection, and I'm just gonna highlight a few of these and talk briefly about them, OK?

So, the first one he talks about is based on percent missing values. Again, Davis has 400 features and he's asking, "How do I remove some?" The first idea is to remove features that have a high percentage of missing values. Now, why is that useful? Because a machine learning model learns from your data, and when most of a feature's values are missing, it's hard to learn from. Now, that being said, you can turn missingness into a feature, because it might actually be a useful feature: a binary feature like "is missing" or "is not missing." So even if you're dropping the feature itself, you might encode the missingness as a feature, OK? So, that's one idea.

Next idea: amount of variation. The basic idea is, if a feature is mostly all the same value, then the model is not going to learn anything from it, so you should drop it. OK? So that's the next one. The next idea is pairwise correlation.
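The first two ideas above (dropping mostly-missing features while keeping a missingness indicator, and dropping near-constant features) might be sketched in pandas like this. This is a toy illustration on made-up data, not code from the presentation, and the 60% cutoff is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature matrix standing in for a real dataset.
df = pd.DataFrame({
    "mostly_missing": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    "constant":       [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],
    "informative":    [0.1, 2.3, 1.7, 0.9, 3.2, 1.1],
})

# 1) Flag features where more than 60% of the values are missing...
missing_frac = df.isna().mean()
to_drop = missing_frac[missing_frac > 0.6].index.tolist()

# ...but first encode the missingness itself as a binary feature,
# since the fact that a value is missing can be predictive.
for col in to_drop:
    df[col + "_is_missing"] = df[col].isna().astype(int)
df = df.drop(columns=to_drop)

# 2) Drop features with (almost) no variation: a constant column
#    carries nothing the model can learn from.
low_variation = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=low_variation)

print(sorted(df.columns))  # → ['informative', 'mostly_missing_is_missing']
```

On a real problem you would pick the missing-fraction threshold by inspecting your data, and scikit-learn's `VarianceThreshold` does a version of step 2 for you.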
So, if two of your features are highly correlated, you can drop one, because they're redundant. If you drop one, you won't actually be losing that much information in terms of what your model can learn from. OK? So that's another idea.

And as I'm going through this, you're probably thinking, "Which one of these should I do?" I will answer that at the end. I will say that I'm gonna provide a lot of ideas, but I'm not gonna give you, like, "Here's the one thing you need to do," because there is no one thing. If it was easy, everyone would do it the same way, everyone would just say, "Here's what you do!", and everyone would do it, and it would work. But there is no one easy way. There are lots of ideas for how to do feature selection.

The next one I was going to talk about is correlation with the target. If a variable, aka feature, has a very low correlation with the target, then you can probably drop it. Now, this technique, like any of these techniques, can miss a useful feature, because there might be a feature interaction such that variable A does not correlate with the target, and variable B does not correlate with the target, but A and B together, if you turn them into a combined feature, do correlate. But there's only so much you can do. You can't, generally speaking, try every possible combination of features, especially when you have 400 features, so you have to use some sort of technique like this.

OK! Numbers 8, 9 and 10 talk about forward, backward and stepwise selection.
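The two correlation-based ideas above can be sketched like this, on synthetic data. The 0.95 and 0.15 thresholds are arbitrary choices for the toy example, not recommendations from the presentation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: x2 is nearly a copy of x1, x3 is pure noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # redundant with x1
x3 = rng.normal(size=n)                    # unrelated to the target
y = 3 * x1 + rng.normal(scale=0.1, size=n)

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise correlation: look only at the upper triangle of the
# correlation matrix, and flag one member of each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]

# Correlation with the target: flag features that barely correlate with y.
# (As noted above, this can miss features that only matter in combination.)
target_corr = X.corrwith(pd.Series(y)).abs()
irrelevant = target_corr[target_corr < 0.15].index.tolist()

keep = [c for c in X.columns if c not in set(redundant) | set(irrelevant)]
print(keep)  # → ['x1']
```

Plain Pearson correlation only captures linear relationships, so on a real problem you'd treat this as a rough first pass rather than a final answer.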
And the way forward selection works (and then I'll talk about the other two) is, you start with the one feature that you believe to be the best feature. You could actually write a loop, even: try one feature in your model, loop through all 400, do cross-validation with the relevant evaluation metric, and figure out, "OK, here is my one best feature." You add that to your model, and then you try adding a second feature. Which do you add? Well, the best one, as determined by some criterion. So you keep doing that until some threshold is met: a certain number of features you've defined, some performance metric, etc.

Backward selection is essentially the reverse: you start with all of the features, you subtract the least important one, and you keep subtracting again and again until you meet some sort of stopping criterion. And then stepwise is kind of a combination of the two. OK?

I'll do just a couple more, and then I'll summarize what I've talked about here and provide some other tips. So, two more of his recommendations. LASSO: LASSO is actually an algorithm for creating a regularised linear model, OK? You may have heard of LASSO regression and ridge regression; those are two types of regularised regression. Well, a nice property of LASSO is that you can change its regularisation parameter. When the value is either very large or very small (I don't remember which), there's no regularisation and you just have a plain linear model. Then, if you increase or decrease that regularisation parameter slightly, it does regularisation, which with LASSO actually drops coefficients all the way to zero, and a coefficient of zero means the feature has been dropped. So it essentially does feature selection for you. Now, I just said LASSO is for a regularised linear model; LASSO itself is for regression, but there's regularised logistic regression, for example, that can work the same way.
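The forward-selection loop Kevin describes, and the L1 (LASSO-style) penalty driving coefficients to zero, might both look something like this in scikit-learn. This is a toy sketch on synthetic data, assuming accuracy as the evaluation metric and an arbitrary penalty strength `C=0.1`, not code from the presentation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for a real dataset: 10 features, only 3 informative.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)

# Forward selection: greedily add whichever feature most improves
# cross-validated accuracy, and stop once nothing helps.
model = LogisticRegression()
selected, remaining, best_score = [], list(range(X.shape[1])), -np.inf
while remaining:
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:
        break  # stopping criterion: no remaining feature improves the score
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

# L1-regularised logistic regression: a strong enough penalty drives
# some coefficients exactly to zero, effectively dropping those features.
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_dropped = int(np.sum(lasso_like.coef_.ravel() == 0))
print(f"forward selection kept {len(selected)} features; "
      f"L1 zeroed {n_dropped} of {X.shape[1]} coefficients")
```

With 400 features this greedy loop would mean tens of thousands of model fits, which is exactly why the stopping criterion and the choice of metric matter in practice.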
I think I've said enough about that. The final idea in this presentation is tree-based models. You may know that ensembles of decision trees, such as random forests and other similar models, automatically compute something called feature importances. You could set a threshold and say, "If my model says a given feature's importance is below a certain threshold, then remove it from the model." So that's another idea. Now, these last two ideas are only useful if that is the model you're using, or you could theoretically use a tree-based model to look at feature importances and then not actually use a tree-based model for the model you're building.

OK! So, I know that was a lot of ideas thrown at you quickly. I'll just wrap up with some advice and then talk about how to implement this. In terms of implementation, scikit-learn supports some of these. Just search for "scikit-learn feature selection" (or I'll have a link in the webcast notes), and they have a page in the user guide about the feature selection techniques they support. Many of these are included in some form or fashion. The sexiest one is forward and backward selection, because it feels like it's doing a ton of work for you, and it is! That's not currently available in scikit-learn, but it is available in a package called mlxtend, and I will link to that in the webcast notes. Though that actually might get merged into scikit-learn at some point; I think someone is working on that right now, I saw it on a scikit-learn mailing list.

So, what is my general advice? I've given a lot of ideas. My general advice is to try simple techniques, because the more complicated you get, the easier it is to sink a bunch of time into it and to make mistakes. My next piece of advice is: always check whether what you're doing is actually helping or hurting. Don't assume that any given technique is useful. You need to set up your model evaluation procedure first.
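As an aside, the tree-based idea from above can be sketched with a random forest's `feature_importances_` and scikit-learn's `SelectFromModel`. This is a toy example on synthetic data, and the "mean importance" threshold is just one possible choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical data: 20 features, only 4 of them informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# A random forest computes an importance score for every feature...
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# ...and SelectFromModel drops every feature whose importance falls
# below the chosen threshold (here, the mean importance).
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print(X.shape[1], "->", X_reduced.shape[1])
```

As the advice above says: whether the reduced feature set actually helps is something you check with your evaluation procedure, not something to assume.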
And then, try these things and see if they're actually helping, because if they're not, abandon them. You're never gonna know a priori whether something's gonna work on a given dataset or a given problem, so you have to try it.

And my final piece of advice is, generally, to focus on things built into scikit-learn, because when you're writing custom code, it's easy to make a mistake, and it's a lot of work. Scikit-learn code is GOOD. Even mlxtend is good code; I respect the person who wrote the package. But that's not his focus, and so, over time, it will probably get out of date. And if there are bugs, they probably won't get fixed, because you don't have a bunch of contributors focused on it like you do with scikit-learn.

So: do simple things, check whether they're working, and generally stick to things available in scikit-learn, because it will make your life easier and you're less prone to making mistakes.

Hope this video was helpful to you. If you'd like to join my monthly webcasts and ask your own question, sign up for my membership program at the 5 dollar level by going to https://www.patreon.com/dataschool. There's a link in the description below, or you can click the box on your screen. Thank you so much for watching, and I'll see you again soon.