Feature Hashing - Yousef's Notes

Feature hashing, also called the hashing trick, converts text data or categorical attributes with many distinct values into a feature vector of a fixed dimensionality that we choose in advance.

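One-hot encoding and bag of words produce feature vectors with one dimension per distinct value or token, so for large vocabularies the vectors are very high-dimensional, sparse, and computationally expensive to work with.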

To keep the data manageable, we can use the hashing trick, which works as follows.

First, we decide on the desired dimensionality of our feature vectors.
Then, using a hash function, we convert each value of the categorical attribute (or each token in the collection of documents) into a number, and
then we convert this number to an index of our feature vector, typically by taking it modulo the chosen dimensionality.
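
The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names `hash_index` and `hash_features` are made up for these notes, MD5 is used simply because it ships with the standard library, and mapping the number to an index with a modulo is an assumed (though common) choice.

```python
import hashlib

def hash_index(token: str, dim: int) -> int:
    """Map a token to an index in [0, dim). Hypothetical helper for illustration."""
    # Step 1: convert the token into a number with a hash function (MD5 here).
    digest = hashlib.md5(token.encode("utf-8")).digest()
    number = int.from_bytes(digest[:8], "big")
    # Step 2: convert the number into an index of the feature vector (assumed: modulo).
    return number % dim

def hash_features(tokens, dim: int):
    """Build a count vector of length dim from a list of tokens."""
    vec = [0] * dim
    for token in tokens:
        vec[hash_index(token, dim)] += 1
    return vec

# Usage: the output length depends only on dim, not on the vocabulary size.
features = hash_features(["the", "cat", "sat", "on", "the", "mat"], dim=8)
```

Unlike one-hot encoding, no vocabulary has to be built or stored: a token never seen before still maps to some index in the vector.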

The problem with this technique is that collisions cause a single feature to represent multiple values: two different values can hash to the same index, and the model can no longer tell them apart. This is the tradeoff between speed and quality of learning.
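
To make collisions concrete, here is a small sketch that groups distinct tokens by the index they hash to and reports every index shared by more than one token. The helper names `bucket_of` and `find_collisions` and the toy vocabulary are assumptions made for illustration, as is the use of MD5 with a modulo.

```python
import hashlib
from collections import defaultdict

def bucket_of(token: str, dim: int) -> int:
    """Hash a token to an index in [0, dim). Hypothetical helper for illustration."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dim

def find_collisions(tokens, dim: int):
    """Return {index: [tokens]} for every index shared by two or more distinct tokens."""
    buckets = defaultdict(set)
    for token in set(tokens):
        buckets[bucket_of(token, dim)].add(token)
    return {i: sorted(ts) for i, ts in buckets.items() if len(ts) > 1}

# Toy vocabulary of 20 distinct tokens (made up for this example).
vocab = [f"token{i}" for i in range(20)]

# With only 4 dimensions, collisions are unavoidable (20 tokens, 4 indices).
small = find_collisions(vocab, dim=4)

# A much larger vector makes collisions far less likely, at the cost of more memory.
large = find_collisions(vocab, dim=2**20)
```

Choosing the dimensionality is therefore the main knob: a smaller vector is cheaper but merges more values into the same feature.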

Commonly used hash functions include MurmurHash3, Jenkins, CityHash, and MD5.