Transforming raw data into tidy data (Raw and Tidy Data), i.e., into a feature vector. A fundamental part of ML Engineering.
Algorithm-specific formatting of feature vectors, e.g., transforming categorical attributes into numerical features with certain properties.
Feature engineering is the process of transforming a raw example into a Feature Vector, first conceptually and then programmatically. It consists of conceptualizing a feature and then writing the code that transforms the entire raw example, potentially with the help of some indirect data, into that feature.
For text, you can use techniques like One-Hot Encoding and bag of words.
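A minimal sketch of both encodings in plain Python (the documents and vocabulary are made up for illustration):

```python
# Bag of words: map each document to a vector of token counts over a fixed vocabulary.
docs = ["red apple", "green apple", "red red wine"]

# Build the vocabulary; sort it so every record uses the same feature order.
vocab = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc: str) -> list[int]:
    counts = {token: 0 for token in vocab}
    for token in doc.split():
        counts[token] += 1
    return [counts[token] for token in vocab]

def one_hot(value: str, categories: list[str]) -> list[int]:
    # One-hot: a single 1 at the position of the category, 0 elsewhere.
    return [1 if value == c else 0 for c in categories]

print(vocab)                               # ['apple', 'green', 'red', 'wine']
print(bag_of_words("red red wine"))        # [0, 0, 2, 1]
print(one_hot("green", ["red", "green"]))  # [0, 1]
```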
There are other encoding techniques as well.
#Feature Concatenation
Records/examples can have multiple parts, each transformed by an encoder into a (sub)feature.
We can concatenate all the (sub)features into a single feature vector. The particular order is arbitrary, but we must choose the same order for each record.
Note: useful because each (sub)feature can have a different dimensionality.
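A minimal sketch with NumPy (the record's parts and dimensionalities are illustrative):

```python
import numpy as np

# One record with two parts of different dimensionality:
# a categorical attribute (one-hot, 3 dims) and two numeric attributes (2 dims).
color_onehot = np.array([0, 1, 0])    # 'green' out of ['red', 'green', 'blue']
measurements = np.array([4.2, 0.37])  # e.g., weight, diameter

# Concatenate the (sub)features. The particular order is arbitrary,
# but it must be the same for every record so that each position
# keeps a consistent meaning across the dataset.
feature_vector = np.concatenate([color_onehot, measurements])
print(feature_vector)  # [0.   1.   0.   4.2  0.37]
```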
#Best Practices
- Start simple: try features that require little coding time and few computing resources.
- Reuse old non-ML algorithms: use their output as a feature of the new ML one!
- Reduce cardinality: a categorical feature with many values helps a model learn different modes (e.g., country, zip code). If we do not need those ‘modes’:
    - Feature hashing: Slide 21, Lecture 4 (see the sketch after this list).
    - Group similar values: e.g., group states into regions if individual states are not needed to solve the problem.
    - Group the long tail: e.g., group infrequent values under ‘other’ (see the sketch after this list).
    - Remove the feature: when (almost) all values are unique.
    - Careful! Reducing cardinality is tricky: e.g., we might inadvertently destroy the information that would allow the model to distinguish one “Springfield” from another!
- Use counts with caution: counts tend to change over time, outdating the model. E.g., the number of calls per mobile customer increases with subscription time.
- Select features only when necessary; reasons to do so:
    - We need an explainable model; thus, we keep only the most significant predictors.
    - We do not have enough computing resources, e.g., RAM, drive space, etc.
    - We do not have time to experiment and/or rebuild the model in production.
    - We expect a significant distribution shift between two model trainings.
- Test the code carefully:
    - Unit tests for all automated feature extractors (see the sketch after this list).
    - Test each feature for speed and memory consumption in the deployment environment.
    - Test for external dependencies, e.g., DBs, remote APIs, etc.
    - Rerun all the tests in the production environment.
    - Don’t fail silently.
- Keep code, model, and data in sync: their versions must match.
- Isolate the feature extraction code so it can be debugged and tested independently.
- Log the production feature values: useful for development, debugging, and testing.
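A minimal sketch of feature hashing (the technique referenced above from Slide 21, Lecture 4), assuming a fixed number of buckets; `hashlib.md5` is used so the mapping is stable across runs (Python's built-in `hash` is salted per process):

```python
import hashlib

def hash_feature(value: str, n_buckets: int = 8) -> list[int]:
    # Map an arbitrary categorical value to one of n_buckets indicator slots.
    # Collisions are possible and accepted: cardinality is capped at n_buckets.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    vec = [0] * n_buckets
    vec[bucket] = 1
    return vec

print(hash_feature("Springfield, IL"))  # e.g., [0, 0, 0, 1, 0, 0, 0, 0]
print(hash_feature("Springfield, MA"))  # may or may not collide with the above
```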
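A minimal sketch of grouping the long tail with pandas (the column name and frequency threshold are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Paris", "London", "Oslo", "Reykjavik"]})

# Keep values seen at least min_count times; fold the rest into 'other'.
min_count = 2
counts = df["city"].value_counts()
frequent = set(counts[counts >= min_count].index)
df["city_grouped"] = df["city"].where(df["city"].isin(frequent), "other")
print(df["city_grouped"].tolist())  # ['Paris', 'Paris', 'other', 'other', 'other']
```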
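And a minimal sketch of a unit test for a feature extractor (the `one_hot` extractor here is a hypothetical stand-in); note how the last test enforces "don't fail silently":

```python
import unittest

def one_hot(value, categories):
    # Extractor under test (hypothetical): one-hot encode a categorical value.
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

class TestOneHot(unittest.TestCase):
    def test_known_value(self):
        self.assertEqual(one_hot("green", ["red", "green"]), [0, 1])

    def test_dimensionality_is_stable(self):
        self.assertEqual(len(one_hot("red", ["red", "green"])), 2)

    def test_unknown_value_fails_loudly(self):
        # An unseen category should raise, not silently return all zeros.
        with self.assertRaises(ValueError):
            one_hot("blue", ["red", "green"])

if __name__ == "__main__":
    unittest.main()
```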