Session 54 - Feature Selection Part 1 | Filter Methods | Variance Threshold | Chi-Square | DSMP 2023
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Feature selection is presented as a practical, project-critical step in machine learning pipelines: it trims hundreds of input columns down to a smaller, more useful subset without wrecking predictive accuracy—and it can make models faster and easier to interpret. The session frames the core problem as “dimensionality debt”: adding too many features can degrade performance, distort distance-based intuition in high dimensions, and increase computational cost during training and inference. The takeaway is that feature selection isn’t just an interview buzzword; it’s an everyday engineering task that supports accuracy, speed, and interpretability.
The lecture begins by defining features as input columns used to predict a target (output) variable, then zooms in on why selecting features matters even when a model can technically ingest all columns. With large datasets (the session later uses the Human Activity Recognition Using Smartphones dataset, which has hundreds of sensor-derived columns), keeping everything can slow training and inference and can hurt generalization. Feature selection is positioned as part of feature engineering, distinct from feature extraction: selection keeps existing features and chooses a subset, while extraction transforms raw inputs into new features.
Three filter-based techniques are introduced in sequence—Variance Threshold, Correlation-based filtering, and ANOVA (with Chi-square introduced later as a categorical counterpart). Filter methods are described as statistics-driven: they evaluate each feature (or feature pair) using measures like variance or association strength, then “filter out” features that fail a criterion. The session emphasizes that filter methods are fast and model-agnostic, making them good preprocessing steps before more expensive methods.
Variance Threshold is taught first. It removes features with low variance—especially constant features that never change—because such columns provide little signal for prediction. A practical guideline is given for choosing a threshold after normalization (values between 0 and 1), and the method is demonstrated on the smartphone dataset. The workflow includes an important hygiene step: dropping duplicate features before applying selection. The session shows how duplicates can inflate dimensionality (e.g., reducing from 561 columns down to 540 after removing duplicates), then applies VarianceThreshold to further cut columns (down to 349 in the example).
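To make the workflow concrete, here is a minimal sketch of the duplicate-removal and Variance Threshold steps with scikit-learn. The toy DataFrame, its column names, and the 0.01 cutoff are illustrative stand-ins for the smartphone data, not the session's exact code:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

# Toy stand-in for the sensor data: one duplicate pair and one constant column.
X = pd.DataFrame({
    "acc_x":      [0.1, 0.4, 0.9, 0.3],
    "acc_x_copy": [0.1, 0.4, 0.9, 0.3],  # exact duplicate of acc_x
    "gyro_bias":  [1.0, 1.0, 1.0, 1.0],  # constant -> zero variance
    "gyro_y":     [0.2, 0.8, 0.5, 0.7],
})

# Hygiene step: drop duplicate columns before any selection
# (transpose so drop_duplicates compares columns instead of rows).
X = X.T.drop_duplicates().T

# Normalize to [0, 1] so one variance threshold is comparable across columns.
X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

selector = VarianceThreshold(threshold=0.01)  # illustrative cutoff
selector.fit(X_norm)
X_selected = X.loc[:, selector.get_support()]
print(list(X_selected.columns))  # the constant column is filtered out
```

Note that the selector is fitted on the normalized copy but the mask is applied to the original columns, so downstream steps keep the unscaled values.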
Next comes correlation filtering. The logic is to reduce multicollinearity by removing highly linearly related features. The lecture explains correlation coefficients (near +1 strong positive linear relationship, near -1 strong negative, near 0 no linear relationship) and demonstrates building a correlation matrix, then selecting features to drop when absolute correlation exceeds a cutoff (example cutoff: 0.95). The method uses a set to avoid dropping the same feature repeatedly and results in a further reduction (down to 152 columns in the demonstrated run). Caveats are highlighted: correlation-based removal assumes linear relationships and can miss interactions among multiple features.
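A sketch of that correlation filter follows; the synthetic data is illustrative, while the 0.95 cutoff matches the demonstrated run:

```python
import numpy as np
import pandas as pd

# Synthetic frame with one almost perfectly linear pair (illustrative only).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "feat_a": a,
    "feat_b": 2.0 * a + rng.normal(scale=0.01, size=200),  # ~1.0 correlation with feat_a
    "feat_c": rng.normal(size=200),
})

corr = X.corr().abs()

# Keep only the upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A set avoids marking the same column for removal more than once.
cutoff = 0.95
to_drop = {col for col in upper.columns if (upper[col] > cutoff).any()}

X_reduced = X.drop(columns=list(to_drop))
print(sorted(to_drop), X_reduced.shape)  # feat_b is dropped, feat_a is kept
```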
ANOVA is then introduced for numerical features with categorical targets. It uses an F-test to compare between-group variance versus within-group variance, producing p-values to decide whether a feature's mean differs across target classes. The session demonstrates using SelectKBest with an ANOVA F-test to keep the top K features (example: reducing to 100 columns). It also lists the key assumptions behind ANOVA (normality within groups, homogeneity of variance, independence of observations) along with its caveats: sensitivity to outliers and blindness to interactions among features.
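A minimal SelectKBest sketch with the ANOVA F-test (f_classif); the synthetic dataset and k=10 are placeholders for the session's real data and its k=100:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 20 numeric features, a categorical target, 5 informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=42)

# Keep the K features whose class means differ most (largest F-scores).
selector = SelectKBest(score_func=f_classif, k=10)
X_top = selector.fit_transform(X, y)

print(X_top.shape)            # (300, 10)
print(selector.pvalues_[:5])  # small p-value -> means likely differ across classes
```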
Finally, Chi-square is described as the categorical-feature counterpart. It compares observed versus expected counts in contingency tables and uses p-values to rank features by association with the target. The session notes practical requirements like sufficient sample size per cell and warns that these tests don’t capture feature interactions.
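For completeness, here is a hedged sketch of Chi-square selection via scikit-learn's chi2 scorer, which expects non-negative (count-like) feature values; the Poisson toy data is purely illustrative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)  # binary categorical target

# Non-negative count-style features, as chi-square compares observed counts.
informative = rng.poisson(lam=2 + 3 * y)   # rate depends on the target
noise = rng.poisson(lam=3.0, size=(n, 4))  # independent of the target
X = np.column_stack([informative, noise])

selector = SelectKBest(score_func=chi2, k=2)
X_top = selector.fit_transform(X, y)

print(selector.pvalues_)  # the target-dependent column should get the smallest p-value
```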
The session closes by stressing evaluation: after each filtering stage, the model (logistic regression in the demo) should be retrained on the reduced feature set, and accuracy should be checked against the baseline using all features. If accuracy stays close, the feature selection strategy is validated; if it drops sharply, the filtering criteria likely removed useful signal.
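A sketch of that evaluation loop with logistic regression; the synthetic data and k=10 are assumptions, and the selector is fitted on the training split only so the test accuracy stays honest:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: train on all features.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
base_acc = accuracy_score(y_te, base.predict(X_te))

# Reduced: fit the selector on training data only, transform both splits.
sel = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
red = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
red_acc = accuracy_score(y_te, red.predict(sel.transform(X_te)))

print(f"all features: {base_acc:.3f}  selected: {red_acc:.3f}")
```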
Cornell Notes
Feature selection is framed as essential preprocessing that reduces dimensionality debt, improves computational efficiency, and can preserve or even enhance model accuracy. The session focuses on filter methods that use statistics to score features and remove unhelpful ones without training a full model for every subset. It demonstrates a pipeline on a Human Activity Recognition dataset: first remove duplicate columns, then apply Variance Threshold to drop constant/low-variance features, then correlation filtering to reduce highly linearly redundant features, and finally ANOVA (SelectKBest with an F-test) to keep the top K numerical features. A categorical counterpart, Chi-square, is introduced later for ranking features by association with categorical targets using contingency tables and p-values. Accuracy is treated as the final judge after retraining with the selected features.
- Why does “dimensionality debt” make feature selection necessary even when a model can ingest all columns?
- What does Variance Threshold remove, and how should the threshold be chosen?
- How does correlation-based filtering decide which features to drop?
- What is the key idea behind ANOVA feature selection, and what does the p-value mean here?
- When should Chi-square be used instead of ANOVA?
- Why must feature selection be evaluated by retraining and checking accuracy?
Review Questions
- In what ways can low-variance features still be useful, and why does Variance Threshold risk removing them?
- How do correlation filtering and ANOVA differ in what relationships they can detect (linear redundancy vs. group mean differences)?
- What assumptions must hold for ANOVA to be reliable, and what kinds of data issues (like outliers or time-series dependence) can break those assumptions?
Key Points
1. Feature selection is a core part of feature engineering that reduces dimensionality debt, lowers compute cost, and can preserve predictive accuracy.
2. Filter methods score features using statistics (variance, correlation, ANOVA F-test, Chi-square) and remove features without training a model for every subset.
3. Always remove duplicate features early because they inflate dimensionality and can distort subsequent selection steps.
4. Variance Threshold removes constant/low-variance columns; choose thresholds after normalization so variance scales are comparable.
5. Correlation filtering targets multicollinearity by dropping features with high absolute linear correlation (using a cutoff like 0.95), but it can miss non-linear relationships and multi-feature interactions.
6. ANOVA (SelectKBest with F-test) ranks numerical features by whether their means differ across categorical target classes; it relies on assumptions like normality, equal variances, and independence.
7. After any feature selection step, retrain the model and compare accuracy to the baseline to confirm that removed features weren’t actually useful.