Feature Engineering

Transforming raw data into tidy data (Raw and Tidy Data), i.e., into a feature vector. A fundamental part of ML Engineering.

Algorithm-specific formatting of feature vectors, e.g., transforming categorical attributes into numerical features with certain properties.

Feature engineering is the process of first conceptually, and then programmatically, transforming a raw example into a Feature Vector. It consists of conceptualizing a feature and then writing the code that transforms the entire raw example, potentially with the help of some indirect data, into that feature.

For text, you can use techniques like One-Hot Encoding and bag of words, as in the sketch below.
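A minimal sketch of both techniques, assuming scikit-learn (the color values and corpus are made-up illustrations):

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# One-hot encoding: each categorical value becomes its own binary feature.
colors = [["red"], ["green"], ["blue"], ["green"]]
encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2
print(encoder.fit_transform(colors))
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]

# Bag of words: each document becomes a vector of token counts.
corpus = ["the cat sat", "the cat sat on the mat"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1]
#  [1 1 1 1 2]]
```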

There are other encoding techniques as well:

#Feature Concatenation

Records/examples can have multiple parts, each transformed by an encoder into a (sub)feature.

We can concatenate all the (sub)features into a single feature vector. The particular order does not matter, as long as we choose the same order for every record.

Note: useful because each (sub)feature can have a different dimensionality.
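A minimal sketch under these assumptions; the encoders and record fields below are hypothetical stand-ins for, e.g., a bag-of-words encoder and a one-hot encoder:

```python
import numpy as np

# Hypothetical sub-feature encoders with different output dimensionalities.
def encode_text(title):  # -> shape (4,)
    vocab = ["cheap", "new", "sale", "used"]
    return np.array([title.count(w) for w in vocab], dtype=float)

def encode_country(country):  # -> shape (3,)
    countries = ["DE", "NL", "US"]
    return np.array([c == country for c in countries], dtype=float)

def make_feature_vector(record):
    # Always concatenate in the same order: text, country, numeric.
    return np.concatenate([
        encode_text(record["title"]),
        encode_country(record["country"]),
        np.array([record["price"], record["age"]], dtype=float),
    ])

vec = make_feature_vector(
    {"title": "new phone on sale", "country": "NL", "price": 199.0, "age": 1.0}
)
print(vec.shape)  # (9,) = 4 + 3 + 2
```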

#Best Practices

  • Start simple: try features that require little coding time and few computing resources.
  • Reuse old non-ML algorithms: use their output as a feature of the new ML one!
  • Reduce cardinality: a categorical feature with many values (e.g., country, zip code) helps train different ‘modes’ for a model. If we do not need ‘modes’:
      • Feature hashing: Slide 21, Lecture 4 (see the sketch after this list).
      • Group similar values: e.g., group states into regions if individual states are not needed to solve the problem.
      • Group the long tail: e.g., group infrequent values under ‘other’.
      • Remove the feature: when (almost) all values are unique.
      • Careful! Reducing cardinality is tricky: e.g., we might inadvertently destroy the information that would allow the model to distinguish one “Springfield” from another!
  • Use counts with caution: counts tend to change over time, outdating the model, e.g., the number of calls per mobile customer increasing with subscription time.
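A minimal sketch of feature hashing, assuming scikit-learn's FeatureHasher (the zip codes and bucket count are illustrative):

```python
from sklearn.feature_extraction import FeatureHasher

# Cap cardinality at 2**10 = 1024 buckets, no matter how many distinct
# zip codes appear; identical values always hash to the same bucket,
# but unrelated values may collide.
hasher = FeatureHasher(n_features=2**10, input_type="string")
zipcodes = [["90210"], ["10001"], ["90210"]]
X = hasher.transform(zipcodes)
print(X.shape)  # (3, 1024)
```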

  • Select features only when necessary. Reasons to do it:
      • We need an explainable model; thus, we keep only the most significant predictors.
      • We do not have enough computing resources, e.g., RAM, drive space, etc.
      • We do not have time to experiment and/or rebuild the model in production.
      • We expect a significant distribution shift between two model trainings.
  • Test the code carefully:
      • Unit tests for all automated feature extractors (see the sketch after this list).
      • Test each feature for speed and memory consumption in the deployment environment.
      • Test for external dependencies, i.e., DB, remote APIs, etc.
      • Rerun all the tests in the production environment.
      • Don’t fail silently.
  • Keep code, model, and data in sync: versions must be the same.
  • Isolate feature extraction code: so it can be debugged and tested.
  • Log the production feature values: useful for development, debugging, and testing.
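A minimal sketch of an isolated, unit-tested feature extractor, assuming pytest; the extractor and its expected values are hypothetical:

```python
import pytest

def extract_call_rate(total_calls, months_subscribed):
    """Normalize a raw call count by subscription time, so the feature
    does not drift upward as customers age (see 'counts with caution')."""
    if months_subscribed <= 0:
        # Don't fail silently: reject bad input loudly.
        raise ValueError("months_subscribed must be positive")
    return total_calls / months_subscribed

def test_extract_call_rate():
    assert extract_call_rate(120, 12) == 10.0

def test_extract_call_rate_rejects_bad_input():
    with pytest.raises(ValueError):
        extract_call_rate(120, 0)
```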