
# Introduction
As a machine learning practitioner, you know that feature selection is an important but time-consuming task. You need to identify which features actually contribute to model performance, remove redundant variables, detect multicollinearity, filter out noisy features, and find the best feature subset. For each selection method, you test different thresholds, compare results, and track what works.
This becomes more difficult as your feature space increases. With hundreds of engineered features, you’ll need to take a systematic approach to evaluating feature importance, removing redundancies, and selecting the best subset.
This article covers five Python scripts designed to automate the most effective feature selection techniques.
You can find the scripts on GitHub.
# 1. Filtering constant features with variation threshold
// Pain point
Features with low or zero variance provide little information for prediction. A feature that is constant or nearly constant across samples cannot help distinguish between different target classes. Identifying these features manually means calculating the variance for each column, setting appropriate thresholds, and handling edge cases such as binary features or features with different scales.
// What does the script do?
Identifies and removes low-variance features based on configurable thresholds. Handles both continuous and binary features properly, normalizes variance calculations for fair comparison across different scales, and provides detailed reports of which features were removed and why.
// How it works
The script calculates the variance of each feature, applying different strategies based on the feature type:
- For continuous features, it computes the standard deviation and can optionally normalize it by the feature's range so that features on different scales can be compared fairly.
- For binary features, it calculates the minority class proportion, because the variance of a binary feature is determined by its class balance.
Features that fall below the threshold are flagged for removal. The script keeps a mapping of removed features and their variance scores for transparency.
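The article doesn't include the script itself, but the logic above can be sketched in a few lines. The function name and default thresholds below are illustrative, not the actual script's API:

```python
import numpy as np
import pandas as pd

def low_variance_features(df, threshold=0.01, binary_min_share=0.02):
    """Flag features whose (normalized) variance falls below a threshold.

    Continuous columns: standard deviation normalized by the column range.
    Binary columns: minority-class proportion (class balance drives variance).
    """
    flagged = {}
    for col in df.columns:
        values = df[col].dropna()
        unique = values.nunique()
        if unique <= 1:                      # constant feature: no information
            flagged[col] = 0.0
        elif unique == 2:                    # binary: use minority-class share
            share = values.value_counts(normalize=True).min()
            if share < binary_min_share:
                flagged[col] = share
        else:                                # continuous: range-normalized std
            rng = values.max() - values.min()
            score = values.std() / rng if rng > 0 else 0.0
            if score < threshold:
                flagged[col] = score
    return flagged

df = pd.DataFrame({
    "constant": [1.0] * 100,
    "near_constant_binary": [0] * 99 + [1],
    "informative": np.random.default_rng(0).normal(size=100),
})
print(low_variance_features(df))
```

Range-normalizing the standard deviation is what lets one threshold work across columns measured on very different scales.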
⏩ Get the variance-threshold feature selector script.
# 2. Eliminating redundant features through correlation analysis
// Pain point
Highly correlated features are redundant and can cause multicollinearity problems in linear models. When two features are highly correlated, keeping both adds dimensionality without adding information. But with hundreds of features, identifying all the correlated pairs, deciding which feature of each pair to keep, and making sure you retain the features most aligned with the target requires systematic analysis.
// What does the script do?
Identifies highly correlated feature pairs using Pearson correlation for numerical features and Cramér's V for categorical features. For each correlated pair, it automatically selects which feature to keep based on correlation with the target variable, removing the redundant feature while preserving predictive power. Generates correlation heatmaps and detailed reports of removed features.
// How it works
The script computes the correlation matrix for all features. For each pair exceeding the correlation threshold, it compares the two features' correlations with the target variable; the feature with the lower target correlation is marked for removal. This process continues iteratively to handle chains of correlated features. The script handles missing values and mixed data types, and provides visualizations showing correlation clusters and the selection decision for each pair.
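A rough sketch of the numerical half of this logic (Pearson only; the Cramér's V path for categorical features is omitted, and the function name is illustrative):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, target, threshold=0.9):
    """For each feature pair with |r| >= threshold, drop the feature
    that is less correlated with the target."""
    corr = df.corr().abs()
    target_corr = df.corrwith(target).abs()
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if a in to_drop or b in to_drop:   # handle chains of correlated features
                continue
            if corr.iloc[i, j] >= threshold:
                # keep the feature more aligned with the target
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return df.drop(columns=sorted(to_drop)), sorted(to_drop)

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.01, size=200),   # near-duplicate of x1
    "x3": rng.normal(size=200),                    # independent feature
})
y = pd.Series(2 * x1 + rng.normal(scale=0.1, size=200))
reduced, dropped = drop_correlated(df, y)
```

Here one of the near-duplicate pair `x1`/`x2` is dropped while the independent `x3` survives.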
⏩ Get a correlation-based feature selector script.
# 3. Identifying significant features using statistical tests
// Pain point
Not all features have a statistically significant relationship with the target variable. Features with no meaningful association with the target add noise and often increase the risk of overfitting. Testing each feature requires choosing an appropriate statistical test, computing p-values, correcting for multiple testing, and interpreting the results correctly.
// What does the script do?
The script automatically selects and applies the appropriate statistical test based on the feature and target variable types. It uses an analysis of variance (ANOVA) F-test for numerical features with a categorical target, a chi-square test for categorical features, mutual information scoring to capture non-linear relationships, and a regression F-test when the target is continuous. It then applies either a Bonferroni or false discovery rate (FDR) correction to account for multiple testing, and returns all features ranked by statistical significance, along with their p-values and test statistics.
// How it works
The script first determines the feature type and target type, then routes each feature to the correct test. For classification tasks with numerical features, ANOVA tests whether feature means differ significantly across target classes. For categorical features, a chi-square test checks for statistical independence between the feature and the target. Mutual information scores are computed alongside these tests to reveal non-linear relationships that the standard tests may miss. When the target is continuous, a regression F-test is used instead.
After running all tests, p-values are adjusted using either the Bonferroni correction, where each p-value is multiplied by the total number of features, or the false discovery rate method for a less conservative correction. Features with adjusted p-values below the default significance threshold of 0.05 are flagged as statistically significant and prioritized for inclusion.
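For the classification-with-numerical-features path, the ANOVA step plus a Benjamini-Hochberg correction can be sketched as follows. The BH procedure is written out by hand so the logic is visible; in practice `statsmodels.stats.multitest.multipletests` does the same job:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

def fdr_bh(pvals, alpha=0.05):
    """Benjamini-Hochberg: reject all hypotheses up to the largest rank k
    whose sorted p-value satisfies p_(k) <= alpha * k / m."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()      # largest rank satisfying the bound
        reject[order[:k + 1]] = True
    return reject

# Synthetic classification data: 4 informative features out of 10
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)
f_stats, pvals = f_classif(X, y)          # one ANOVA F-test per feature
significant = np.flatnonzero(fdr_bh(pvals))
print(significant)
```

Swapping `fdr_bh` for a Bonferroni rule (`pvals * len(pvals) < alpha`) gives the more conservative variant described above.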
⏩ Get the statistical-test-based feature selector script.
If you are interested in a more rigorous statistical approach to feature selection, consider refining this script as described below.
// What you can explore and improve
Use non-parametric alternatives where assumptions are broken. ANOVA assumes approximate normality and equal variance across groups. For heavily skewed or outlier-prone features, switch to a Kruskal-Wallis test, a more robust choice that makes no distributional assumptions.
Handle sparse categorical features with care. The chi-square test requires expected cell frequencies of at least 5. When this condition is not met, which is common with high-cardinality or infrequent categories, Fisher's exact test is a safer and more accurate alternative.
Consider mutual information scores separately from p-values. Because mutual information scores are not p-values, they do not fit naturally into Bonferroni or FDR correction frameworks. A cleaner approach is to rank features by mutual information independently and use that ranking as a complementary signal, rather than merging it into a single significance pipeline.
Prefer false discovery rate correction in high-dimensional settings. Bonferroni is conservative by design, which is appropriate when false positives are very costly, but with many features it can discard genuinely useful ones. Benjamini-Hochberg FDR correction offers more statistical power on large feature sets and is generally preferred in machine learning feature selection workflows.
Report effect sizes alongside p-values. Statistical significance alone does not tell you how meaningful a feature is in practice. Pairing p-values with effect size measures gives a more complete picture of which features are worth keeping.
Add a permutation-based significance test. For complex or mixed-type datasets, permutation testing offers a model-agnostic way to estimate significance without relying on any distributional assumptions. It works by repeatedly shuffling the target variable and measuring how often a feature scores as well by chance alone.
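A permutation test along those lines can be sketched like this; `abs_corr` is just one possible scoring function, and the names are illustrative:

```python
import numpy as np

def permutation_pvalue(feature, target, score_fn, n_permutations=1000, seed=0):
    """Model-agnostic significance: how often does a shuffled target
    score at least as well as the real one?"""
    rng = np.random.default_rng(seed)
    observed = score_fn(feature, target)
    null_scores = np.array([
        score_fn(feature, rng.permutation(target))
        for _ in range(n_permutations)
    ])
    # +1 in numerator and denominator avoids a p-value of exactly 0
    return (1 + np.sum(null_scores >= observed)) / (1 + n_permutations)

def abs_corr(f, t):
    return abs(np.corrcoef(f, t)[0, 1])

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y_related = 3 * x + rng.normal(size=200)   # strong real relationship
y_noise = rng.normal(size=200)             # no relationship

print(permutation_pvalue(x, y_related, abs_corr))   # near the 1/1001 floor
print(permutation_pvalue(x, y_noise, abs_corr))     # typically large
```

Because the null distribution is built from the data itself, the same routine works unchanged with any scoring function, including mutual information.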
# 4. Ranking features with model-based importance scores
// Pain point
Model-based feature importance provides direct insight into which features contribute to prediction accuracy, but different models assign different importance scores. Running multiple models, extracting importance scores, and combining the results into a coherent ranking is complex.
// What does the script do?
Trains multiple model types and extracts importance scores from each. Normalizes the scores across models for fair comparison. Computes ensemble importance by averaging scores or ranks across models. Includes permutation importance as a model-agnostic alternative. Returns features ranked by importance score from each model, along with suggested feature subsets.
// How it works
The script trains each model type on the full feature set and extracts native importance scores, such as impurity-based importance for random forests and coefficient magnitudes for linear models. For permutation importance, it randomly shuffles each feature and measures the resulting drop in model performance. Importance scores are normalized to sum to 1 within each model.
The ensemble score is computed as the average rank or mean normalized importance across all models. Features are ranked by ensemble score, and the top N features, or those exceeding an importance threshold, are selected.
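A stripped-down version of this ensemble scoring, using a random forest's impurity importance and standardized logistic regression coefficient magnitudes as the two models (a sketch, not the actual script):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Model 1: impurity-based importance (already sums to 1)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
tree_imp = forest.feature_importances_

# Model 2: |coefficients| of a logistic regression on standardized features,
# normalized to sum to 1 so the two score vectors are comparable
logreg = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
coef_imp = np.abs(logreg.coef_[0])
coef_imp = coef_imp / coef_imp.sum()

# Ensemble score: mean normalized importance across models
ensemble = (tree_imp + coef_imp) / 2
top_n = np.argsort(ensemble)[::-1][:3]
print(top_n)
```

Averaging ranks instead of scores (`scipy.stats.rankdata`) is a more robust variant when the per-model score distributions differ wildly.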
⏩ Get a model-based selector script.
# 5. Optimizing the feature subset through iterative elimination
// Pain point
The best feature subset is not always the top N individually most important features. Feature interactions matter too: a feature may look weak on its own but be valuable in combination with others. Recursive feature elimination evaluates subsets by repeatedly removing the weakest features and retraining the model, but this requires many model training iterations and tracking performance across different subset sizes.
// What does the script do?
Systematically removes features in an iterative process, retraining the model and evaluating performance at each step. Starts with all features and removes the least important feature at each iteration. Tracks model performance across all subset sizes. Identifies the optimal feature subset, either the one that maximizes performance or the smallest one that reaches a target performance. Supports cross-validation for robust performance estimation.
// How it works
The script starts with the full feature set and trains a model. It ranks the features by importance and removes the lowest-ranked one. This process repeats, training a new model on one fewer feature at each iteration. Performance metrics such as accuracy, F1, and AUC are recorded for each subset size.
The script applies cross-validation to obtain stable performance estimates at each step. The final output includes performance curves showing how each metric changes with feature count, along with the optimal feature subset. This lets you spot either peak performance or an elbow point where additional features bring diminishing returns.
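scikit-learn's `RFECV` implements essentially this loop, cross-validating every subset size; a minimal example on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)

# Drop the weakest feature each round (step=1), scoring every
# subset size with 5-fold cross-validated F1
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring="f1")
selector.fit(X, y)

print(selector.n_features_)   # size of the best-scoring subset
print(selector.support_)      # boolean mask of the kept features
```

Swap in `roc_auc` or `accuracy` for `scoring` to trace the other metrics mentioned above; any estimator exposing `coef_` or `feature_importances_` works as the base model.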
⏩ Get the recursive feature elimination script.
# Wrapping up
These five scripts address the fundamental challenges of feature selection that determine model performance and training efficiency. Here’s a quick overview:
| Script | Description |
|---|---|
| Variance threshold selector | Removes non-informative constant or near-constant features. |
| Correlation-based selector | Eliminates redundant features while preserving predictive power. |
| Statistical test selector | Identifies features with significant relationships to the target. |
| Model-based selector | Ranks features using ensemble importance from multiple models. |
| Recursive feature elimination | Finds the best feature subset through iterative testing. |
Each script can be used independently for specific selection tasks or combined into a complete pipeline. Happy feature selection!
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.