

# Introduction
As a machine learning practitioner, you know that feature engineering is laborious, manual work. You need to generate interaction terms between features, encode categorical variables, extract temporal patterns from dates, create aggregates, and transform distributions. For each candidate feature, you test whether it improves the model’s performance, iterate over variations, and keep track of what you have tried.
This becomes more difficult when your dataset grows. With dozens of features, you’ll need systematic methods to generate candidate features, evaluate their utility, and make the best selections. Without automation, you’ll likely miss out on valuable feature combinations that could significantly boost your model’s performance.
This article covers five Python scripts designed to automate the most time-consuming feature engineering tasks. These scripts help you systematically generate high-quality features, evaluate them objectively, and build optimized feature sets that maximize model performance.
You can find the code on GitHub.
# 1. Encoding categorical features
// The pain point
Categorical variables are ubiquitous in real-world data. You need to encode these categories before modeling, and choosing the right encoding method matters:
- One-hot encoding works for low-cardinality features but creates dimensionality problems with high-cardinality categories.
- Label encoding is memory efficient but implies ordinality
- Target encoding is powerful but vulnerable to data leakage
Implementing these encodings correctly, handling unseen categories in the test data, and maintaining consistency across train, validation, and test splits requires careful code and is easy to get wrong.
// What does the script do?
The script automatically selects and applies appropriate encoding strategies based on each feature’s properties: cardinality, correlation with the target, and data type.
It applies one-hot encoding for low-cardinality features, target encoding for features correlated with the target, frequency encoding for high-cardinality features, and label encoding for ordinal variables. It also automatically groups rare categories, gracefully handles unseen categories in test data, and maintains encoding consistency across all data splits.
// How it works
The script analyzes each categorical feature to determine its cardinality and its relationship with the target variable.
- For features with fewer than 10 unique values, it applies one-hot encoding.
- For high cardinality features with more than 50 unique values, it uses frequency encoding to avoid dimensional explosion.
- For features showing correlation with the target, it applies target encoding with smoothing to prevent overfitting.
- Rare categories appearing in less than 1% of rows are grouped into an “other” category.
All encoding mappings are stored and can be applied consistently to new data, with unseen categories handled by a default encoding or the global mean.
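To make the selection logic concrete, here is a minimal sketch of how such a cardinality-based encoder might look in pandas. It is an illustration rather than the actual script: the thresholds follow the rules above, but the smoothing strength and function name are assumptions, and the target-encoding branch is chosen purely by cardinality here rather than by measured target correlation.

```python
import pandas as pd

def encode_categorical(train: pd.DataFrame, test: pd.DataFrame, col: str, target: str):
    """Choose an encoding for one categorical column based on its cardinality."""
    # Group rare categories (< 1% of training rows) into "other"
    freqs = train[col].value_counts(normalize=True)
    rare = freqs[freqs < 0.01].index
    train_col = train[col].where(~train[col].isin(rare), "other")
    test_col = test[col].where(~test[col].isin(rare), "other")

    n_unique = train_col.nunique()
    if n_unique < 10:
        # Low cardinality: one-hot encoding
        train_enc = pd.get_dummies(train_col, prefix=col)
        # Align test columns so unseen categories do not break the schema
        test_enc = pd.get_dummies(test_col, prefix=col).reindex(
            columns=train_enc.columns, fill_value=0
        )
    elif n_unique > 50:
        # High cardinality: frequency encoding
        mapping = train_col.value_counts(normalize=True)
        train_enc = train_col.map(mapping).to_frame(f"{col}_freq")
        test_enc = test_col.map(mapping).fillna(0).to_frame(f"{col}_freq")
    else:
        # Medium cardinality: smoothed target encoding
        global_mean = train[target].mean()
        stats = train.groupby(train_col)[target].agg(["mean", "count"])
        smoothing = 10  # assumed smoothing strength
        mapping = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        train_enc = train_col.map(mapping).to_frame(f"{col}_target")
        # Unseen categories in the test split fall back to the global mean
        test_enc = test_col.map(mapping).fillna(global_mean).to_frame(f"{col}_target")
    return train_enc, test_enc
```

Storing the fitted mappings (the dummy columns, the frequency table, or the target means) is what lets the same encoding be replayed consistently on validation and test splits.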
⏩ Get the categorical feature encoder script
# 2. Transforming numerical features
// The pain point
Raw numerical features often require transformation before modeling. Skewed distributions need to be normalized, outliers must be handled, features on different scales require standardization, and non-linear relationships may call for polynomial or logarithmic transformations. Manually testing different transformation strategies for each numerical feature is tedious, and the process has to be repeated for every numerical column to confirm that you are actually improving model performance.
// What does the script do?
The script automatically tests several transformation strategies for numerical features: log transforms, Box-Cox, square root, cube root, standardization, normalization, robust scaling, and power transforms.
It evaluates the effect of each transformation on the normality of the distribution and on model performance, selects the best transformation for each feature, and consistently applies the transformations to training and test data. It also handles zero and negative values properly, avoiding transformation errors.
// How it works
For each numeric feature, the script examines multiple transformations and evaluates them using normality tests (Shapiro-Wilk and Anderson-Darling) and distribution metrics such as skewness and kurtosis. For features with skewness greater than 1, it prefers log and Box-Cox transformations.
For features with outliers, it applies robust scaling. The script stores the transformation parameters fitted on the training data and applies them consistently to the validation and test sets. Features with negative values or zeros are handled with shifted transformations or Yeo-Johnson transformations, which work with any real values.
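As a rough illustration, a selection loop along these lines can be built with scipy and scikit-learn. This is a sketch, not the actual script: it tries a smaller set of transformations, scores each one only by skewness and the Shapiro-Wilk statistic, and leaves out the model-performance check.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

def best_transform(x: np.ndarray):
    """Try several transformations and keep the one closest to a normal distribution."""
    candidates = {"identity": x.astype(float)}
    if (x > 0).all():
        # These transformations require strictly positive values
        candidates["log"] = np.log(x)
        candidates["sqrt"] = np.sqrt(x)
        candidates["box-cox"] = stats.boxcox(x)[0]
    # Yeo-Johnson handles zeros and negative values as well
    candidates["yeo-johnson"] = (
        PowerTransformer(method="yeo-johnson").fit_transform(x.reshape(-1, 1)).ravel()
    )

    def score(values):
        # Lower |skewness| and a higher Shapiro-Wilk statistic mean a more normal shape
        sample = values if len(values) <= 5000 else np.random.default_rng(0).choice(values, 5000, replace=False)
        w, _ = stats.shapiro(sample)
        return abs(stats.skew(values)) - w

    name = min(candidates, key=lambda k: score(candidates[k]))
    return name, candidates[name]

# Example: a heavily right-skewed feature ends up log- or power-transformed
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
name, transformed = best_transform(x)
print(name, round(stats.skew(x), 2), round(stats.skew(transformed), 2))
```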
⏩ Get the Numerical Feature Transformer script
# 3. Creating feature interactions
// The pain point
Interactions between features often contain valuable signals that are missed by individual features. Revenue may behave differently across customer segments, advertising spend may have seasonal effects, and the combination of product price and category may be more predictive than either feature alone. But with dozens of features, testing every possible pair of interactions means evaluating thousands of candidates.
// What does the script do?
This script creates feature interactions using arithmetic combinations, polynomial features, ratio features, and categorical crosses. It evaluates the predictive power of each candidate interaction using mutual information or a model-based importance score. It retains only the most valuable interactions, capturing the most impactful combinations while avoiding feature explosion. It also supports custom interaction functions for domain-specific feature engineering.
// How it works
The script generates candidate interactions between each feature pair:
- For numerical features, it creates products, ratios, sums, and differences
- For categorical features, it creates combined category encodings
Each candidate is scored using mutual information with the target or feature importance from a random forest. Only interactions exceeding the significance threshold or ranking in the top N are retained. The script handles edge cases such as division by zero, infinite values, and high correlation between generated features and the original features. Results include clear feature names showing which original features were combined and how.
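A stripped-down version of the numeric part of this loop might look like the following sketch, assuming scikit-learn’s mutual information scorer is used. The operation set, naming scheme, and top-N cutoff are illustrative.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.feature_selection import mutual_info_regression

def generate_interactions(X: pd.DataFrame, y: pd.Series, top_n: int = 10) -> pd.DataFrame:
    """Create pairwise products, ratios, sums, and differences, then keep the top-scoring ones."""
    candidates = {}
    for a, b in combinations(X.select_dtypes("number").columns, 2):
        candidates[f"{a}_x_{b}"] = X[a] * X[b]
        candidates[f"{a}_plus_{b}"] = X[a] + X[b]
        candidates[f"{a}_minus_{b}"] = X[a] - X[b]
        # Guard against division by zero, then clean up infinities
        ratio = X[a] / X[b].replace(0, np.nan)
        candidates[f"{a}_div_{b}"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

    cand_df = pd.DataFrame(candidates)
    # Score each candidate by mutual information with the target
    scores = mutual_info_regression(cand_df.fillna(0), y, random_state=0)
    ranking = pd.Series(scores, index=cand_df.columns).sort_values(ascending=False)
    return cand_df[ranking.head(top_n).index]

# Usage: the product of price and qty should rank near the top
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["price", "qty", "discount", "weight"])
y = X["price"] * X["qty"] + rng.normal(scale=0.1, size=500)
print(generate_interactions(X, y, top_n=3).columns.tolist())
```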
⏩ Get the feature interaction generator script
# 4. Extracting datetime features
// The pain point
Datetime columns contain useful temporal information, but using them effectively requires extensive manual feature engineering. You need to do the following:
- Extract components such as year, month, day, and hour
- Create derived attributes such as day of week, quarter, and weekend flags
- Calculate time differences from reference dates and between events
- Handle cyclical patterns so that, for example, hour 23 and hour 0 stay close together
Writing this extraction code for each datetime column is repetitive and time-consuming, and practitioners often forget valuable temporal features that can improve their models.
// What does the script do?
The script automatically extracts a comprehensive set of datetime features from timestamp columns, including basic components, calendar features, Boolean flags, cyclic encodings using sine and cosine transformations, season indicators, and time differences from reference dates. It also detects and flags holidays, handles multiple datetime columns, and computes time differences between datetime pairs.
// How it works
The script takes a datetime column and systematically extracts all associated temporal patterns.
For cyclic properties such as month or hour, it produces a sine and cosine transformation:
\( \text{month\_sin} = \sin\!\left(\frac{2\pi \times \text{month}}{12}\right) \)
This ensures that December and January end up close together in feature space. It also calculates time deltas from reference points (for example, days since a starting event or days until a particular date) to capture trends.
For datasets with multiple datetime columns (e.g. order_date and ship_date), it computes the differences between them to derive features such as processing_time. Boolean flags are created for holidays, weekends, and time-of-day ranges. All features use clear naming conventions that indicate their source and meaning.
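Here is a condensed sketch of that extraction logic in pandas. The column names (order_date, ship_date) follow the example above; holiday detection and season flags are omitted to keep it short.

```python
from typing import Optional

import numpy as np
import pandas as pd

def extract_datetime_features(df: pd.DataFrame, col: str, ref_col: Optional[str] = None) -> pd.DataFrame:
    """Extract component, cyclic, and flag features from one datetime column."""
    dt = pd.to_datetime(df[col])
    out = pd.DataFrame(index=df.index)
    # Basic components
    out[f"{col}_year"] = dt.dt.year
    out[f"{col}_month"] = dt.dt.month
    out[f"{col}_day"] = dt.dt.day
    out[f"{col}_hour"] = dt.dt.hour
    out[f"{col}_dayofweek"] = dt.dt.dayofweek
    out[f"{col}_quarter"] = dt.dt.quarter
    # Boolean flags
    out[f"{col}_is_weekend"] = (dt.dt.dayofweek >= 5).astype(int)
    # Cyclic encodings: December and January end up close in feature space
    out[f"{col}_month_sin"] = np.sin(2 * np.pi * dt.dt.month / 12)
    out[f"{col}_month_cos"] = np.cos(2 * np.pi * dt.dt.month / 12)
    out[f"{col}_hour_sin"] = np.sin(2 * np.pi * dt.dt.hour / 24)
    out[f"{col}_hour_cos"] = np.cos(2 * np.pi * dt.dt.hour / 24)
    # Time difference between two datetime columns, e.g. order_date -> ship_date
    if ref_col is not None:
        ref = pd.to_datetime(df[ref_col])
        out[f"{ref_col}_minus_{col}_days"] = (ref - dt).dt.total_seconds() / 86400
    return out

# Usage
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-12-30 14:00", "2025-01-02 08:30"]),
    "ship_date": pd.to_datetime(["2025-01-02 10:00", "2025-01-04 16:00"]),
})
print(extract_datetime_features(df, "order_date", ref_col="ship_date"))
```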
⏩ Get the DateTime Feature Extractor script
# 5. Selecting features automatically
// The pain point
After feature engineering, you usually have far more features than you need, many of which are redundant, irrelevant, or likely to cause overfitting. You need to identify which features actually help your model and which ones should be removed. Manual feature selection means repeatedly training models with different feature subsets, tracking the results in a spreadsheet, and trying to understand complex feature importance scores. The process is slow and subjective, and you never know whether you have found the optimal feature set or just got lucky with your trials.
// What does the script do?
The script automatically selects the most valuable features using several selection methods:
- Variance-based filtering removes constant or near-constant features
- Correlation-based filtering removes redundant features
- Statistical tests such as analysis of variance (ANOVA), chi-square, and mutual information
- Tree-based feature importance
- L1 regularization
- Recursive feature elimination
It then combines the results of multiple methods into a composite score, ranks all features by importance, and identifies the optimal feature subset that maximizes model performance while minimizing dimensionality.
// How it works
The script implements a multi-stage selection pipeline. The stages are:
- Remove features with zero or near-zero variance because they provide no information
- From highly correlated feature pairs, remove the feature with the weaker correlation to the target
- Calculate feature importance using a variety of methods, such as random forest importance, mutual information scores, statistical tests, and L1 regularization coefficients
- Normalize and combine the scores from the different methods into a composite score
- Use recursive feature elimination with cross-validation to determine the optimal number of features
The result is a ranked list of features and a recommended subset for model training, as well as detailed importance scores from each method.
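A simplified version of this pipeline can be put together from scikit-learn pieces: a variance filter, a correlation filter that keeps the feature more correlated with the target, and a composite score built from random forest importance and mutual information. The equal weighting is an assumption, and the recursive-elimination stage is omitted here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

def rank_features(X: pd.DataFrame, y: pd.Series, corr_threshold: float = 0.95) -> pd.Series:
    """Filter redundant features, then rank the rest with a composite importance score."""
    # Stage 1: drop near-constant features
    X = X.loc[:, X.var() > 1e-8]
    # Stage 2: from each highly correlated pair, drop the feature
    # with the weaker correlation to the target
    corr = X.corr().abs()
    target_corr = X.apply(lambda c: abs(np.corrcoef(c, y)[0, 1]))
    to_drop = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > corr_threshold:
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    X = X.drop(columns=list(to_drop))
    # Stage 3: composite score from random forest importance and mutual information
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    rf_score = pd.Series(rf.feature_importances_, index=X.columns)
    mi_score = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
    composite = rf_score / rf_score.max() + mi_score / mi_score.max()
    return composite.sort_values(ascending=False)

# Usage
Xa, ya = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X = pd.DataFrame(Xa, columns=[f"f{i}" for i in range(10)])
print(rank_features(X, pd.Series(ya)).head())
```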
⏩ Get an automated feature selector script
# Wrapping up
These five scripts address the core feature engineering challenges that consume most of the time in machine learning projects. Here’s a quick recap:
- The categorical encoder intelligently handles encoding based on cardinality and target correlation.
- The numerical transformer automatically finds the optimal transformation for each numerical feature
- The interaction generator systematically discovers valuable feature combinations
- The datetime extractor captures temporal patterns and cyclical features
- The feature selector identifies the most predictive features using ensemble methods
Each script can be used independently for specific feature engineering tasks or combined into a complete pipeline. Start with encoders and transformers to develop your base features, use interaction generators to discover complex patterns, extract temporal features from datetime columns, and finish with feature selection to refine your feature set.
Happy feature engineering!
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.