What is feature engineering and can it be automated? 

February 14, 2022
Last updated February 14, 2022

Introduction

According to a 2020 survey by Anaconda, data scientists spend 45% of their time on data preparation. It was 80% in 2016, according to a report by Forbes.

Automation tools appear to have improved the situation, but data preparation still constitutes a large part of data science work. This is because getting the best possible results from a machine learning model depends on data quality, and creating better "features" helps provide better-quality data.

What is feature engineering?

Feature engineering is the process of using domain knowledge, mathematics, and statistics to extract features (also called explanatory variables or predictors) from raw data via data mining techniques. Often, data is spread across multiple tables and must be gathered into a single table with rows containing the observations and features in the columns.

The features in your data directly influence the predictive models you can use and the results you can achieve: the better the features you prepare and choose, the better the results you will get. Feature engineering can be considered applied machine learning (ML) in its own right.

Is feature engineering still relevant? Yes. If we provide the wrong hypotheses as input, an ML model cannot make accurate predictions, so the quality of the hypotheses encoded in the features is vital to the model's success. Feature quality is critically important for both accuracy and interpretability.

What are feature engineering processes?

Feature engineering can involve:

  • Feature construction: Constructing new features from the raw data. This requires good knowledge of the data and of the underlying problem to be solved

  • Feature selection: Selecting a subset of available features that are most relevant to the problem for model training

  • Feature extraction: Creating new, more useful features by combining and reducing the number of existing features. Principal component analysis (PCA) and embeddings are common feature extraction methods (a brief sketch follows this list)
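
To make the distinction between selection and extraction concrete, below is a minimal scikit-learn sketch. The bundled breast-cancer dataset stands in for your own feature table, and the choice of 10 selected features and 5 principal components is an illustrative assumption, not a recommendation.

    # Feature selection vs. feature extraction with scikit-learn (illustrative sketch).
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)  # 569 rows, 30 numeric features

    # Feature selection: keep the 10 features most associated with the target.
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)

    # Feature extraction: compress all 30 features into 5 principal components (PCA).
    pca = PCA(n_components=5)
    X_extracted = pca.fit_transform(X)

    print(X.shape, X_selected.shape, X_extracted.shape)  # (569, 30) (569, 10) (569, 5)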

What are some feature engineering techniques?

  • One-hot encoding - One-hot encoding converts each categorical value into its own new column and assigns a binary value of 1 or 0 to indicate whether that value is present (a combined code sketch follows this list).

  • Log transformation - Log transformation replaces each value in a column with its logarithm and is a useful way to handle skewed data.

  • Outlier handling - Outliers are observations that are distant from other observations. They can be due to errors or be genuine observations. How to handle them depends on the dataset: you can either remove those observations if they are likely erroneous, or replace the outlier values with the mean or median of the attribute.

  • Binning - Binning groups observations into 'bins'. How you bin depends on what you are trying to obtain from the data.

  • Handling missing values - To handle missing values, you can:

    • Fill missing observations with mean/median of the attribute if it is numerical.

    • Fill with the most frequent category if the attribute is categorical.

    • Use ML algorithms to capture the structure of data and fill the missing values accordingly.

    • Predict the missing values if you have domain knowledge about the data.

    • Drop the missing observations.
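
As mentioned above, these techniques can be combined; here is a minimal pandas/NumPy sketch that applies them to a small made-up DataFrame. The column names, fill rules, and thresholds are illustrative assumptions rather than recommendations.

    # Illustrative sketch: common feature engineering steps on a toy DataFrame.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "city": ["NYC", "LA", None, "NYC"],
        "income": [52_000, 61_000, np.nan, 1_250_000],  # skewed, with a missing value and an outlier
        "age": [25, 41, 37, 19],
    })

    # Handling missing values: most frequent category for categoricals, median for numerics.
    df["city"] = df["city"].fillna(df["city"].mode()[0])
    df["income"] = df["income"].fillna(df["income"].median())

    # Outlier handling: cap income at the 99th percentile instead of dropping the row.
    df["income"] = df["income"].clip(upper=df["income"].quantile(0.99))

    # Log transformation: compress the skewed income distribution.
    df["log_income"] = np.log1p(df["income"])

    # Binning: group ages into labeled ranges.
    df["age_bin"] = pd.cut(df["age"], bins=[0, 25, 40, 120], labels=["young", "mid", "senior"])

    # One-hot encoding: turn each categorical value into its own 0/1 column.
    df = pd.get_dummies(df, columns=["city", "age_bin"])

    print(df.head())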

Why does feature engineering need automation?

The traditional approach to feature engineering is to build features one at a time using domain and technical expertise drawn from across disciplines, a resource-intensive, time-consuming, and error-prone process known as manual feature engineering. The code written for manual feature engineering is problem-dependent and must be rewritten for each new dataset.

The data science team builds features by working with domain experts, testing hypotheses, building and evaluating ML models, and repeating the process until the results are acceptable to the business. Because in-depth domain knowledge is required to generate high-quality features, feature engineering is widely considered a "black art" of experts and impossible to automate, even though a team often spends around 80% of its effort developing a high-quality feature table from raw business data.

Because feature engineering depends on the problem, the dataset, and the model, no single method solves every feature engineering problem. However, there are ways to automate parts of the process:

  • Open-source Python libraries for automated feature engineering, such as Featuretools. Featuretools uses an algorithm called deep feature synthesis (DFS) to generate feature sets for structured, relational datasets (a hedged sketch follows this list).

  • AutoML solutions that include automated feature engineering as part of their pipelines.
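
As a rough illustration of deep feature synthesis, the sketch below follows the Featuretools 1.x-style API as I understand it. The tables, names, and chosen primitives are assumptions, and exact call signatures vary between library versions, so treat this as a sketch rather than canonical usage.

    # Hedged sketch of deep feature synthesis with Featuretools (1.x-style API).
    import featuretools as ft
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 2],
        "join_date": pd.to_datetime(["2020-01-01", "2020-06-15"]),
    })
    transactions = pd.DataFrame({
        "transaction_id": [10, 11, 12],
        "customer_id": [1, 1, 2],
        "amount": [25.0, 40.0, 10.0],
        "transaction_time": pd.to_datetime(["2021-01-05", "2021-02-01", "2021-03-10"]),
    })

    # Register the tables and their relationship in an EntitySet.
    es = ft.EntitySet(id="retail")
    es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
    es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                          index="transaction_id", time_index="transaction_time")
    es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

    # Deep feature synthesis: automatically build per-customer aggregates such as
    # SUM(transactions.amount), MEAN(transactions.amount), and COUNT(transactions).
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name="customers",
        agg_primitives=["sum", "mean", "count"],
    )
    print(feature_matrix.head())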

However, it should be noted that automated feature engineering tools rely on algorithms alone and may not be able to incorporate valuable domain knowledge that a data scientist would bring.

