Feature engineering means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.
Mapping numeric values to features is easy. Integer and floating-point data don't need a special encoding because they can be multiplied by a numeric weight. As suggested in Figure 2, converting the raw integer value 6 to the feature value 6.0 is trivial:
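For instance, a minimal sketch of this identity-style conversion (the variable names below are illustrative, not part of any particular library):

```python
# A minimal sketch: the raw integer 6 becomes the feature value 6.0
# with no special encoding.
raw_value = 6                      # raw integer from the data set
feature_value = float(raw_value)   # value the model multiplies by a weight
print(feature_value)               # 6.0
```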
Categorical features have a discrete set of possible values. For example, there might be a feature called street_name with options that include:
{'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}
Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.
One quick approach is to map our street names to numbers:
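A minimal sketch of such a mapping (the dictionary name and index order are illustrative):

```python
# A hypothetical ordinal mapping of street names to integers.
street_to_index = {
    'Charleston Road': 0,
    'North Shoreline Boulevard': 1,
    'Shorebird Way': 2,
    'Rengstorff Avenue': 3,
}
feature_value = float(street_to_index['Shorebird Way'])  # -> 2.0
```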
However, this approach has two disadvantages. First, the model learns a single weight for street_name and applies it to the mapped integer, so the arbitrary numbering of streets gets folded into the prediction even though it says nothing about house prices. Second, a single integer cannot represent cases where street_name takes multiple values, such as a house at the corner of two streets.
So, to overcome these limitations, we can instead create a binary vector for each categorical feature in our model that represents values as follows:
For values that apply to the example, set the corresponding vector elements to 1.
Set all other elements to 0.

The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.
This approach effectively creates a Boolean variable for every feature value (e.g., every street name). Here, if a house is on Shorebird Way, then the binary value is 1 only for Shorebird Way. Thus, the model uses only the weight for Shorebird Way.
Similarly, if a house is at the corner of two streets, then two binary values are set to 1, and the model uses both of their respective weights.

One-hot encoding extends to numeric data that you do not want to directly multiply by a weight, such as a postal code.
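To make the encoding concrete, here is a minimal sketch for street_name; the vocabulary order and the encode_streets helper are illustrative assumptions, not part of any particular library:

```python
# A sketch of one-hot / multi-hot encoding for street_name.
vocabulary = ['Charleston Road', 'North Shoreline Boulevard',
              'Shorebird Way', 'Rengstorff Avenue']

def encode_streets(streets):
    """Return a binary vector with a 1.0 for every street the house is on."""
    return [1.0 if name in streets else 0.0 for name in vocabulary]

one_hot = encode_streets({'Shorebird Way'})
# -> [0.0, 0.0, 1.0, 0.0]  (one-hot: a single element is 1)

multi_hot = encode_streets({'Shorebird Way', 'Rengstorff Avenue'})
# -> [0.0, 0.0, 1.0, 1.0]  (multi-hot: a house at the corner of two streets)
```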
Suppose that you had 1,000,000 different street names in your data set that you wanted to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements where only 1 or 2 elements are true is a very inefficient representation in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a sparse representation in which only nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.
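As a rough illustration (the variable and function names here are hypothetical), a sparse representation might store only the indices of the nonzero elements rather than a dense vector of 1,000,000 entries:

```python
# A sketch of a sparse representation: keep only the nonzero entries.
sparse_feature = {2: 1.0, 77: 1.0}   # e.g. a house at the corner of the
                                     # streets with vocabulary indices 2 and 77

def sparse_dot(sparse_vec, weights):
    """Dot product that touches only the stored nonzero entries."""
    return sum(value * weights[index] for index, value in sparse_vec.items())
```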