
Session 54 - Feature Selection Part 1 | Filter Methods | Variance Threshold | Chi-Square | DSMP 2023

CampusX

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Feature selection is a core part of feature engineering that reduces dimensionality debt, lowers compute cost, and can preserve predictive accuracy.

Briefing

Feature selection is presented as a practical, project-critical step in machine learning pipelines: it trims hundreds of input columns down to a smaller, more useful subset without wrecking predictive accuracy—and it can make models faster and easier to interpret. The session frames the core problem as “dimensionality debt”: adding too many features can degrade performance, distort distance-based intuition in high dimensions, and increase computational cost during training and inference. The takeaway is that feature selection isn’t just an interview buzzword; it’s an everyday engineering task that supports accuracy, speed, and interpretability.

The lecture begins by defining features as input columns used to predict a target (output) variable, then zooms in on why selecting features matters even when a model can technically ingest all columns. With large datasets (the session later uses a Human Activity Recognition smartphone dataset with hundreds of sensor-derived columns), keeping everything can slow training and inference and can hurt generalization. Feature selection is positioned as part of feature engineering, distinct from feature extraction: selection keeps existing features and chooses a subset, while extraction transforms raw inputs into new features.

Three filter-based techniques are introduced in sequence—Variance Threshold, Correlation-based filtering, and ANOVA (with Chi-square introduced later as a categorical counterpart). Filter methods are described as statistics-driven: they evaluate each feature (or feature pair) using measures like variance or association strength, then “filter out” features that fail a criterion. The session emphasizes that filter methods are fast and model-agnostic, making them good preprocessing steps before more expensive methods.

Variance Threshold is taught first. It removes features with low variance—especially constant features that never change—because such columns provide little signal for prediction. A practical guideline is given for choosing a threshold after normalization (values between 0 and 1), and the method is demonstrated on the smartphone dataset. The workflow includes an important hygiene step: dropping duplicate features before applying selection. The session shows how duplicates can inflate dimensionality (e.g., reducing from 561 columns down to 540 after removing duplicates), then applies VarianceThreshold to further cut columns (down to 349 in the example).
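A minimal sketch of this step using scikit-learn's `VarianceThreshold` on synthetic data standing in for the sensor columns (the column layout and threshold are illustrative, not taken from the demo):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    np.ones(100),                                    # constant column: variance 0
    rng.uniform(0, 1, 100),                          # informative column, healthy variance
    np.full(100, 0.5) + rng.normal(0, 0.001, 100),   # near-constant column
])

# Small cutoff, appropriate when features are normalized to [0, 1].
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)        # only the middle column survives
print(selector.get_support()) # boolean mask of kept columns
```

`get_support()` is handy in practice for mapping the reduced matrix back to the original column names.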

Next comes correlation filtering. The logic is to reduce multicollinearity by removing highly linearly related features. The lecture explains correlation coefficients (near +1 strong positive linear relationship, near -1 strong negative, near 0 no linear relationship) and demonstrates building a correlation matrix, then selecting features to drop when absolute correlation exceeds a cutoff (example cutoff: 0.95). The method uses a set to avoid dropping the same feature repeatedly and results in a further reduction (down to 152 columns in the demonstrated run). Caveats are highlighted: correlation-based removal assumes linear relationships and can miss interactions among multiple features.
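The loop-plus-set pattern described above can be sketched as follows; the feature names and the near-duplicate construction are made up for illustration, but the 0.95 cutoff matches the session:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({
    "f1": a,
    "f2": a + rng.normal(scale=0.01, size=200),  # near-duplicate of f1
    "f3": rng.normal(size=200),                  # independent feature
})

corr = df.corr().abs()
to_drop = set()  # a set avoids marking the same feature twice
cols = corr.columns
for i in range(len(cols)):
    for j in range(i):  # lower triangle only, so each pair is checked once
        if corr.iloc[i, j] > 0.95:
            to_drop.add(cols[i])  # keep the earlier column, drop the later one

df_reduced = df.drop(columns=list(to_drop))
print(sorted(to_drop))  # ['f2']
```

Scanning only the lower triangle of the correlation matrix keeps one member of each highly correlated pair rather than discarding both.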

ANOVA is then introduced for numerical features with categorical targets. It uses an F-test to compare between-group variance versus within-group variance, producing p-values to decide whether a feature’s mean differs across target classes. The session demonstrates using SelectKBest with an ANOVA F-test to keep the top K features (example: reducing to 100 columns). It also lists key assumptions for ANOVA: normality within groups, homogeneity of variance, independence of observations, sensitivity to outliers, and limitations around feature interactions.
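A compact version of the `SelectKBest` step on synthetic classification data (the dataset and `k=10` are stand-ins; the session uses k=100 on the full smartphone dataset):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, only 5 carry class signal; the rest are noise.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# f_classif computes an ANOVA F-test per feature against the class labels.
selector = SelectKBest(score_func=f_classif, k=10)
X_top = selector.fit_transform(X, y)

print(X_top.shape)  # (300, 10)
```

`selector.scores_` and `selector.pvalues_` expose the per-feature F-statistics and p-values if you want to inspect the ranking rather than just keep the top K.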

Finally, Chi-square is described as the categorical-feature counterpart. It compares observed versus expected counts in contingency tables and uses p-values to rank features by association with the target. The session notes practical requirements like sufficient sample size per cell and warns that these tests don’t capture feature interactions.
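The observed-versus-expected comparison can be sketched with a single contingency table via SciPy; the feature/target names and counts here are invented to keep the example self-contained:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical feature vs. categorical target.
df = pd.DataFrame({
    "device_pocket": ["front"] * 60 + ["back"] * 40,
    "activity": ["walk"] * 50 + ["sit"] * 10 + ["walk"] * 10 + ["sit"] * 30,
})

# Contingency table of observed counts.
table = pd.crosstab(df["device_pocket"], df["activity"])

# chi2_contingency computes expected counts under independence and the p-value.
chi2, p, dof, expected = chi2_contingency(table)
print(p < 0.05)  # a small p-value suggests the feature is associated with the target
```

Running this per categorical feature and ranking by p-value reproduces the selection logic; `sklearn.feature_selection.chi2` wraps the same idea for non-negative feature matrices.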

The session closes by stressing evaluation: after each filtering stage, the model (logistic regression in the demo) should be retrained on the reduced feature set, and accuracy should be checked against the baseline using all features. If accuracy stays close, the feature selection strategy is validated; if it drops sharply, the filtering criteria likely removed useful signal.
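The evaluate-after-filtering loop might look like the sketch below, on synthetic data rather than the session's dataset; note the selector is fit on the training split only, to avoid leaking test information into the selection:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=40,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: logistic regression on all 40 features.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_all = accuracy_score(y_te, base.predict(X_te))

# Reduced: keep the top 10 features by ANOVA F-score, then retrain.
sel = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
red = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
acc_sel = accuracy_score(y_te, red.predict(sel.transform(X_te)))

print(f"all features: {acc_all:.3f}  selected: {acc_sel:.3f}")
```

If `acc_sel` stays close to `acc_all`, the filtering is validated; a sharp drop signals that the criteria removed useful signal.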

Cornell Notes

Feature selection is framed as essential preprocessing that reduces dimensionality debt, improves computational efficiency, and can preserve or even enhance model accuracy. The session focuses on filter methods that use statistics to score features and remove unhelpful ones without training a full model for every subset. It demonstrates a pipeline on a Human Activity Recognition dataset: first remove duplicate columns, then apply Variance Threshold to drop constant/low-variance features, then correlation filtering to reduce highly linearly redundant features, and finally ANOVA (SelectKBest with an F-test) to keep the top K numerical features. A categorical counterpart, Chi-square, is introduced later for ranking features by association with categorical targets using contingency tables and p-values. Accuracy is treated as the final judge after retraining with the selected features.

Why does “dimensionality debt” make feature selection necessary even when a model can ingest all columns?

High-dimensional inputs can degrade performance and increase compute. The session links this to (1) worse generalization when too many features add noise, (2) distorted distance intuition in very high dimensions (distance becomes less informative), and (3) higher training/inference cost because algorithms must process more columns. Feature selection reduces space/time complexity and can also improve interpretability by shrinking the feature set.

What does Variance Threshold remove, and how should the threshold be chosen?

Variance Threshold drops features whose variance falls below a chosen cutoff. Constant features (variance 0) are removed because they provide no predictive signal. The session recommends normalizing features first so variance values fall into a predictable range (between 0 and 1), then using a small threshold such as 0.05 or 0.1. In the demo, applying VarianceThreshold reduced columns from 540 to 349 after duplicate removal.
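The duplicate-removal hygiene step that precedes thresholding can be done in pandas by transposing so that `duplicated()` compares columns; the column names here are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "accel_x": [0.1, 0.2, 0.3],
    "accel_x_copy": [0.1, 0.2, 0.3],  # exact duplicate column
    "gyro_z": [1.0, 0.5, 0.2],
})

# Transposing makes each column a row, so duplicated() flags repeated columns;
# the first occurrence of each column is kept.
df_clean = df.loc[:, ~df.T.duplicated()]
print(list(df_clean.columns))  # ['accel_x', 'gyro_z']
```

Transposing a wide numeric frame is cheap enough for a one-off cleanup pass like the 561-to-540 reduction described in the session.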

How does correlation-based filtering decide which features to drop?

It computes pairwise correlation coefficients between features and removes features that are highly linearly related (multicollinearity). The session uses a cutoff (example: absolute correlation > 0.95). It builds a correlation matrix, loops through features, collects those involved in high-correlation pairs, converts the list to a set to avoid repeated drops, and then removes the selected redundant columns. The demo reduced columns further (down to 152).

What is the key idea behind ANOVA feature selection, and what does the p-value mean here?

ANOVA uses an F-test to compare between-group variance (differences in feature means across target classes) against within-group variance (spread inside each class). If the p-value is below a threshold (commonly 0.05), the feature’s distribution differs significantly across classes, so it’s kept. The session demonstrates SelectKBest with ANOVA F-test to keep the top K features (example: reducing to 100). It also emphasizes assumptions like normality within groups, homogeneity of variance, independence of observations, and sensitivity to outliers.

When should Chi-square be used instead of ANOVA?

Chi-square is used when the target is categorical and the feature is categorical (or discretized into categories). It builds a contingency table, compares observed counts to expected counts under independence, and ranks features by p-value. The session warns that Chi-square needs sufficient sample size per cell (a common rule of thumb is at least ~5 expected observations per cell) and still doesn't capture interactions between multiple features.

Why must feature selection be evaluated by retraining and checking accuracy?

Filter methods score features using statistics, not model performance. The session stresses that selected features might not contribute equally to accuracy for the eventual model. Therefore, after each reduction step (duplicates removal, VarianceThreshold, correlation filtering, ANOVA/Chi-square), the model should be retrained on the reduced feature set and accuracy compared to the baseline using all features. If accuracy stays close, the selection is validated.

Review Questions

  1. In what ways can low-variance features still be useful, and why does Variance Threshold risk removing them?
  2. How do correlation filtering and ANOVA differ in what relationships they can detect (linear redundancy vs. group mean differences)?
  3. What assumptions must hold for ANOVA to be reliable, and what kinds of data issues (like outliers or time-series dependence) can break those assumptions?

Key Points

  1. Feature selection is a core part of feature engineering that reduces dimensionality debt, lowers compute cost, and can preserve predictive accuracy.
  2. Filter methods score features using statistics (variance, correlation, ANOVA F-test, Chi-square) and remove features without training a model for every subset.
  3. Always remove duplicate features early because they inflate dimensionality and can distort subsequent selection steps.
  4. Variance Threshold removes constant/low-variance columns; choose thresholds after normalization so variance scales are comparable.
  5. Correlation filtering targets multicollinearity by dropping features with high absolute linear correlation (using a cutoff like 0.95), but it can miss non-linear relationships and multi-feature interactions.
  6. ANOVA (SelectKBest with F-test) ranks numerical features by whether their means differ across categorical target classes; it relies on assumptions like normality, equal variances, and independence.
  7. After any feature selection step, retrain the model and compare accuracy to the baseline to confirm that removed features weren't actually useful.

Highlights

Dimensionality debt can hurt both performance and efficiency: more columns can slow training/inference and degrade generalization.
Variance Threshold is a simple, fast filter that drops constant/near-constant features; normalization helps make thresholding practical.
Correlation filtering reduces redundancy by removing features that are highly linearly related, but it won’t reliably capture interactions or non-linear effects.
ANOVA feature selection uses an F-test and p-values to keep features whose class-wise means differ significantly; it comes with strict statistical assumptions.
Chi-square is the categorical-feature counterpart, using contingency tables and p-values, but it requires enough samples per cell and also ignores feature interactions.

Topics

  • Feature Selection
  • Filter Methods
  • Variance Threshold
  • Correlation Filtering
  • ANOVA F-Test
  • Chi-Square Test

Mentioned

  • ANOVA
  • F-test
  • p-value