How do I select features for Machine Learning?

Hi there! This is Kevin, from Data School. You’re about to watch an excerpt from a private Q&A webcast that I hold every single month. This question is about how to select the best features for your machine learning model. Stick around till the end to find out how YOU can join these webcasts in the future. Thanks!

So, the first question I’ve got for today is from Davis Vickers, and he asks: “Hey Kevin! I discovered your YouTube channel as an aspiring data scientist. Your content has helped me so much! Question on sklearn: in logistic regression and other classifier algorithms, could you demonstrate a way to extract the best features, with their coefficients and/or the score that is used in the model? I’m working with a logistic regression model that has over 400 features, and I’m trying to determine the best way to use feature importance/selection.”

OK! That was a long question, but let me boil it down to this: Davis is asking how to perform feature selection. He’s got 400 features for his classification model, though the process would be roughly the same for a regression model, and he wants to reduce that number.

Now, why do you want to perform feature selection in the first place? Because removing irrelevant features results in a better-performing model, an easier-to-understand model, and a model that runs faster. So those are three reasons that feature selection is useful.

I was trying to think of the best way to answer this question, because unlike some questions, this one doesn’t have a simple answer. I assume there are books about feature selection; it’s certainly a book-length topic. But I think the best way to answer this is: two years ago, I was at PyData DC, in Washington DC, and I attended a presentation called “A Practical Guide to Dimensionality Reduction”. It was an awesome presentation, and it’s a great video.
What I’m gonna do is pull up his slides on screen, briefly go through a few of his ideas for how you do feature selection, and then talk very briefly about how you do it at a practical level, not just at a high level. It will take me just a moment to share one of my tabs… Alright! You should see the slides on screen. The talk is called “A Practical Guide to Dimensionality Reduction”, but it’s about feature selection, because dimensionality refers to the dimensions of your training data: the number of rows and the number of columns. When we’re talking about features, we’re talking about columns, and dimensionality reduction means reducing the number of features.

Let me scroll down. He has a big list of 12 techniques that he uses for feature selection, and I’m just gonna highlight a few of these and talk briefly about them, OK?

The first one he talks about is based on percent missing values. Again, Davis has 400 features and is asking “How do I remove some?”, and the first idea is to remove features that have a high percentage of missing values. Why is that useful? Because features that are mostly missing values are hard to learn from: a machine learning model learns from your data, and when most of the values are missing, there’s little to learn from. That being said, you can turn missingness itself into a feature, because it might actually be useful: a binary feature of “is missing” or “is not missing”. So even if you’re dropping the feature itself, you might encode the missingness as a feature.

Next idea: amount of variation. The basic idea is, if a feature is mostly all the same value, then the model is not going to learn anything from it, so you should drop it.

Next idea: pairwise correlation.
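Before moving on, the first two ideas above (percent missing values and amount of variation) can be sketched with pandas and scikit-learn. This is a minimal illustration on made-up toy data; the column names and the 50% missingness cutoff are arbitrary choices for the example, not recommendations from the talk:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data: one mostly-missing column, one constant column, two useful ones
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "mostly_missing": [np.nan] * 90 + list(rng.normal(size=10)),  # 90% missing
    "constant": [1.0] * 100,                                      # no variation
    "useful_1": rng.normal(size=100),
    "useful_2": rng.normal(size=100),
})

# Idea 1: drop features with a high fraction of missing values...
missing_frac = X.isna().mean()
X_reduced = X.loc[:, missing_frac < 0.5].copy()

# ...but optionally keep the "missingness" itself as a binary feature
X_reduced["mostly_missing_isna"] = X["mostly_missing"].isna().astype(int)

# Idea 2: drop features with (almost) no variation;
# the default threshold of 0.0 removes zero-variance features
selector = VarianceThreshold(threshold=0.0)
mask = selector.fit(X_reduced).get_support()
X_reduced = X_reduced.loc[:, mask]
```

After both steps, `X_reduced` keeps the two useful columns plus the missingness indicator, and has dropped the mostly-missing and constant columns.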
So, if two of your features are highly correlated, you can drop one, because they’re redundant: dropping one won’t actually lose much information in terms of what your model can learn from.

Now, as I’m going through this, you’re probably thinking: “Which one of these should I do?” And I will answer that at the end. I will say that I’m gonna provide a lot of ideas, but I’m not gonna say “Here’s the one thing you need to do”, because there is no one thing. If it were easy, everyone would do it the same way, and it would just work. There is no one easy way; there are lots of ideas for how to do feature selection.

The next one I was going to talk about is correlation with the target. If a variable, a.k.a. feature, has a very low correlation with the target, then you can probably drop it. Now, this technique, like any of these techniques, can miss a useful feature, because there might be a feature interaction such that variable A does not correlate with the target, variable B does not correlate with the target, but A and B together, turned into a combined feature, do. But there’s only so much you can do: generally speaking, you can’t try every possible combination of features, especially when you have 400 of them, so you have to use some sort of technique like this.

OK! Numbers 8, 9 and 10 talk about forward, backward and stepwise selection.
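The two correlation-based ideas just described (pairwise correlation and correlation with the target) might look like this in pandas. The toy data and the 0.95 and 0.15 cutoffs are arbitrary values for illustration only:

```python
import numpy as np
import pandas as pd

# Toy data: "b" nearly duplicates "a"; the target depends only on "a"
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({"a": rng.normal(size=n), "c": rng.normal(size=n)})
X["b"] = X["a"] + rng.normal(scale=0.1, size=n)
y = 2 * X["a"] + rng.normal(scale=0.5, size=n)

# Idea 3: pairwise correlation -- for each highly correlated pair, drop one member.
# Scanning only the upper triangle avoids flagging both members of a pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

# Idea 4: correlation with the target -- flag features that barely correlate with y
target_corr = X.corrwith(y).abs()
weak = target_corr[target_corr < 0.15].index.tolist()
```

Here `to_drop` ends up as `["b"]` (redundant with `"a"`) and `weak` as `["c"]` (uncorrelated with the target). As noted above, the target-correlation filter can miss features that only matter in combination.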
The way forward selection works (and then I’ll talk about the other two) is, you start with the one feature that you believe to be the best. You could actually write a loop: try one feature at a time in your model, loop through all 400, do cross-validation with the relevant evaluation metric, and figure out “OK, here is my single best feature”. You add that to your model, then you try adding a second feature. Which one? The best one, as determined by the same criterion. You keep doing that until some threshold is met: a certain number of features you’ve defined, some performance metric, et cetera. Backward selection is essentially the reverse: you start with all the features, subtract the least important one, and keep subtracting until you meet some sort of stopping criterion. Stepwise selection is a combination of the two.

I’ll cover just a couple more, and then I’ll summarize what I’ve talked about here and provide some other tips. The next one of his recommendations is LASSO. LASSO is actually an algorithm for creating a regularized linear model; you may have heard of LASSO regression and ridge regression, the two main types of regularized regression. A nice property of LASSO is how it responds to its regularization parameter (called alpha in scikit-learn): when the parameter is zero, there’s no regularization and you just have a plain linear model, and as you increase it, the regularization gets stronger. With LASSO, that regularization actually drops coefficients all the way to zero, and a coefficient of zero means the feature has been dropped. So it essentially does feature selection for you. Now, I just said LASSO is for a regularized linear model, meaning regression, but there’s regularized logistic regression, for example, that can work the same way.
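The forward-selection loop described above, plus LASSO’s coefficient-zeroing behavior, could be sketched like this. The toy dataset, the “stop when no improvement” rule, and the alpha grid are all illustrative choices, not the presenter’s exact method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data: 10 features, only 3 carry signal
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# --- Forward selection: greedily add the feature that most improves CV score ---
model = LogisticRegression(max_iter=1000)
selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:  # stopping criterion: no improvement
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

# --- LASSO: a stronger alpha drives more coefficients exactly to zero ---
n_nonzero = [int(np.sum(Lasso(alpha=a).fit(X, y).coef_ != 0))
             for a in (0.001, 0.1, 1.0)]
```

Note the cost of the loop: each round refits the model once per remaining feature, which is why forward selection gets expensive with 400 features unless the model itself is cheap to fit.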
I think I’ve said enough about that. The final idea in this presentation is tree-based models. You may know that ensembles of decision trees, such as random forests and other similar models, automatically compute something called feature importances. You could set a threshold and say: “If my model says a given feature’s importance is below a certain threshold, remove it from the model.” So that’s another idea. These last two ideas are mainly useful if that is the type of model you’re using, although you could theoretically use a tree-based model just to look at feature importances, and then not actually use a tree-based model for the model you’re building.

OK! I know that was a lot of ideas thrown at you quickly, so I’ll wrap up with some advice and then talk about how to implement this. In terms of implementation, scikit-learn does support some of these. Just search for “scikit-learn feature selection”, or I’ll have a link in the webcast notes; there’s a page in the user guide about the feature selection techniques it supports, and many of these are included in some form or fashion. The sexiest one is forward and backward selection, because it feels like it’s doing a ton of work for you, and it is! That’s not currently available in scikit-learn, but it is available in a package called mlxtend, and I will link to that in the webcast notes. It actually might get merged into scikit-learn at some point; I think someone is working on that right now, I saw it on a scikit-learn mailing list.

So, what is my general advice? I’ve given a lot of ideas. My general advice is to try simple techniques first, because the more complicated you get, the easier it is to sink a bunch of time into it and to make mistakes. My next piece of advice is: always check whether what you’re doing is actually helping or hurting. Don’t assume that any given technique is useful. You need to set up your model evaluation procedure first.
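As a concrete sketch of the tree-based idea using scikit-learn’s built-in support: `SelectFromModel` keeps only the features whose importance exceeds a threshold. The toy dataset and the choice of the mean importance as the threshold are arbitrary here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy data: 20 features, only 4 carry signal
X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# A random forest computes feature_importances_ as a side effect of fitting
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Keep features whose importance exceeds the mean importance
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
```

As mentioned above, you could then feed `X_reduced` into a different model entirely; the forest is only being used here to score the features.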
And then try these things and see if they’re actually helping, because if they’re not, you should abandon them. You’re never gonna know a priori whether something’s gonna work on a given dataset or a given problem, so you have to try it. And my final piece of advice is, generally, to focus on things built into scikit-learn, because when you’re writing custom code, it’s easy to make a mistake, and it’s a lot of work. Scikit-learn code is GOOD. Even mlxtend is good code; I respect the person who wrote the package. But it’s not his focus, and so, over time, it will probably get out of date, and if there are bugs, they probably won’t get fixed, because you don’t have a bunch of contributors focused on it like you do with scikit-learn. So: do simple things, check whether they’re working, and generally stick to things available in scikit-learn, because it will make your life easier, and you’ll be less prone to mistakes.

Hope this video was helpful to you. If you’d like to join my monthly webcasts and ask your own question, sign up for my membership program at the 5 dollar level. There’s a link in the description below, or you can click the box on your screen. Thank you so much for watching, and I’ll see you again soon.


  1. Назар Тропанец

    November 13, 2018 at 5:00 pm


  2. TheAlderFalder

    November 13, 2018 at 5:02 pm

    I’m the first. That’s why I’m gonna become rich prior to all of you!!! Except Kev maybe.

  3. Dean Prickle

    November 13, 2018 at 5:20 pm

    Hi Kevin… What's the difference between PCA and LDA?

  4. Gabriel Joshua Miguel

    November 13, 2018 at 5:47 pm

    XGboost model automatically calculates feature importance

  5. Evan Chugh

    November 13, 2018 at 9:00 pm

    Do you have any tips on how to handle datasets where there is a strong class imbalance? (ie. 95% of class A, 5% of class B?) Thanks, these videos are extremely helpful!

  6. DataScience DS79

    November 13, 2018 at 9:17 pm

    I did Recursive Feature Elimination with Cross Validation and Variance Inflation Factor for dimensionality reduction 🙂

  7. Davis Vickers

    November 14, 2018 at 3:35 am

    Thank you so much Kevin! Your response was very succinct and clear! I actually showed your video to my colleagues during our machine learning Friday sessions at work and we all loved it. It was a timely topic for us since we’re all fairly new to building ML models.

  8. Sara Gorzin

    November 14, 2018 at 9:40 am

    Thank you for your great and helpful videos

  9. Kshitij Bhargava

    November 14, 2018 at 4:33 pm

    could you please share this ppt with us

  10. Dilip Gawade

    November 18, 2018 at 12:20 pm

    Hey Kevin, Thanks for your videos. They are extremely helpful. I have some knowledge on Python and Tableau and would like to switch my career to machine learning. I have been watching many videos on machine learning but confused from where to start. Please guide me how should I learn it stepwise. Thanks

  11. Electronics Inside

    November 27, 2018 at 12:58 pm

    How to work with Plotly and Cufflinks in visual studio code ??

  12. Kiran Achanta

    November 28, 2018 at 3:25 pm

    Hello Kevin, Can you make a video on finding multicollinearity with VIF using sklearn library or may be with some other library.

  13. Lone Wolf

    November 30, 2018 at 6:35 am

    This video was by far the best video on feature selection

  14. Monu Vishwakarma

    December 9, 2018 at 5:16 am

    Sir, can you make a video on data visualization using all the distributions of statistics?

  15. Marcel Augusto Borssato Cortapasso

    December 10, 2018 at 4:22 pm

    Great video, again. Thanks so much for sharing these valuable tips.

  16. Lydia Aidyl

    December 12, 2018 at 7:56 pm

    I am trying to learn machine learning on my own so I can't quite understand the steps you take. So based on what you said about choosing features, if one wants to eliminate features using forward selection should they know beforehand which algorithm they are going to use and try to do forward selection on the specific algorithm? Or should one do forward selection using logistic/linear regression and then having found the significant variables choose an algorithm (e.g Decision trees, kNN,..)? Thanks in advance.

  17. karthikeyan mg

    December 13, 2018 at 8:13 pm

    I'm working with a 2000 dimension data, Is it ok to use pca to reduce them to 50 and then use forward feature selection to further reduce to 20 or is it ok go from 2000 to 20 using pca itself??
    Is it ok to use 2000 to 20 pca reduction method?

  18. fdaflkj

    January 10, 2019 at 9:40 am

    Great channel

  19. 李奎慶

    March 5, 2019 at 11:39 pm

    Awesome lesson! This topic is quite important in text classification while the number of words and phrases extracted from text are somehow overwhelmed.

  20. Balajee Muggalla

    March 7, 2019 at 9:23 am

    Hey..thanks for the video. Can you make a video on how to identify multicollinearity, correlation etc from the dataset?

  21. Amr Del

    March 8, 2019 at 9:45 am

    I am a PhD student from Algeria and I'd like to thank you for your helpful videos and the effort you put into making them. Can I ask you please to show us an example of how to build, train and test an AdaBoost classifier in scikit-learn like you did with kNN? And can you tell us whether we can use SVM as a weak learner for AdaBoost, and how to make that weak learner loop in the classifier and compute those parameters (error, alpha of the weak learner, and the weight update)? Thanks in advance, sir.

  22. Nikhil Kenvetil

    March 15, 2019 at 2:49 pm

    So does that mean we may do this on every dataset, or is it imperative that we do all of this in all datasets?

  23. 3rdTeen

    March 16, 2019 at 2:59 am

    Hi. Thanks for your nice video. I am from India. I need help.
    If I want to filter a data frame based on one column with a specific value (like: football) where the number of times its own column value is max, how do I write that? Please help.

  24. mersha nigus

    March 16, 2019 at 5:04 pm

    Thank you for your nice video and good presentation. I have a question: I have a dataset, but the data is not labeled, and I want to do feature selection for classification. How can I select features for unlabeled data?

  25. Anand Deshmukh

    April 3, 2019 at 9:12 am

    the way of Superior teaching!

  26. Khang Tran

    April 6, 2019 at 6:04 pm

    That speech clarity

  27. bharadwaj chivukula

    April 17, 2019 at 4:42 pm

    Can you please explain in detail about Onehot encoding various features in detail because it would be helpful for many , Thank you

  28. Jason Tarimo

    April 22, 2019 at 6:20 pm

    Great one Kevin. When are you going to do one on time series?

  29. Chetan Rane

    April 23, 2019 at 5:58 pm

    Awesome explaination of concept

  30. djamila meghraoui

    April 30, 2019 at 1:54 pm

    easy to understand your explanation thank you !

  31. Ayya Samy

    May 2, 2019 at 9:30 am

    Good one !!

  32. RCP ARG

    May 4, 2019 at 10:38 am

    Hi you are a great teacher, very clear! I´m starting with DS and I want to ask you if you have the video of the presentation to share and deepen the topic of dimensionality reduction, thanks in advance, Kika

  33. Ignasius Harvey

    May 5, 2019 at 12:23 pm

    Hey, I don't quite get this part
    "Tree based feature selection is only useful if that is your model that you're using or you could theoretically use a tree based model to look at feature importance, and then not actually use a tree based model for your model that you're building."
    Why is it? I think that because of those features are important (using tree based) then we can build a great model using tree based algorithm. Or maybe I am missing something here?

  34. Kartikey Riyal

    May 23, 2019 at 7:16 pm

    Best school to learn. I am learning it by myself as I don't have enough money to pay the fee. I have learned complete pandas from you, thanks a lot, fantastic work and bless you.

  35. hadya asghar

    May 27, 2019 at 7:02 am

    Hey, Kevin, your content is great. I did a whole project by taking help solely from your content 😊

  36. Muhammad Kiru

    May 29, 2019 at 2:42 pm

    Hi Kavin, it was nice going through your videos. They are amazing. my question is please which software do you use for making your video?

  37. SK N

    May 31, 2019 at 1:48 am

    another way would be the automated backward elimination with a loop

  38. Mustafa Bohra

    July 25, 2019 at 1:53 pm

    Even google can't provide so exact answer to the feature selection as you have comprehended in 10mins!!!!

    Thank you so much!!!

  39. Mohammad Meraj

    August 18, 2019 at 9:06 am

    wonderfully explained!!

  40. Surat Asvapoositkul

    August 23, 2019 at 2:03 pm

    Hi Kevin! Thanks for a very clear explanation. This video is very useful as I'm very new in machine learning.

    I have one question related to the feature selection. I started learning ML by implementing the decision tree. Most of the online tutorials just put all the features into the decision tree and let the DT select the features by itself. However, what if you have tons of features (let's say 100,000 variables), is it better to perform some feature selection before building the DT model? or it doesn't matter since DT can use Gini to automatically select the potential attribute to the model.

  41. Khawja Farhan

    September 3, 2019 at 2:12 pm

    Really good tips for feature selection.

  42. Jazmín Sutcliff

    September 3, 2019 at 5:38 pm

    Thanks dear!

  43. Sagar Solanki

    October 2, 2019 at 6:37 pm

    Great video. I learned so much in just one short video that would need a huge number of articles. One question, can you use ensemble models like decision trees and random forest to look at the feature importance and then use it to train another machine learning model (Say logistic regression)? Aren't the feature_importance given by an ensemble technique specific to themselves?
