Feature hashing, or hashing trick, converts text data, or categorical attributes with many values, into a feature vector of arbitrary dimensionality.
One-Hot Encoding and bag of words create feature vectors of many dimensions that are sparse and computationally expensive.
To keep the data manageable, we can use the hashing trick that works as follows.
- First, we decide on the desired dimensionality of our feature vectors.
- Then, using a hash function, we first convert all values of the categorical attribute (or all tokens in the collection of documents) into a number, and
- then we convert this number to an index of our feature vector.
Commonly used hash functions are: MurmurHash3, Jenkins, CityHash, and MD5.