In this section, we explore what kinds of values make good features within those feature vectors.
Each feature should have a clear and obvious meaning to anyone on the project. For example, the good feature below is clearly named, and its value makes sense with respect to the name. The bad feature, by contrast, holds a value whose meaning and unit can't be read from the name:
Bad Feature: house_age: 851472000
Good Feature: house_age_years: 27
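As a minimal sketch of how the bad value could be turned into the good one, the snippet below assumes that 851472000 is an age expressed in seconds and that a year has 365 days. Neither assumption is stated in the text, and the helper name seconds_to_years is hypothetical.

```python
# Assumption: the raw value is an age in seconds, and a year is 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def seconds_to_years(age_seconds: int) -> float:
    """Convert an age in seconds to an age in (365-day) years."""
    return age_seconds / SECONDS_PER_YEAR


house_age_years = seconds_to_years(851472000)
print(house_age_years)  # 27.0
```

Putting the unit in the feature name (house_age_years) means a reader never has to guess at a conversion like this one.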
Good feature values should appear more than 5 or so times in a data set. This repetition enables a model to learn how the feature value relates to the label. That is, having many examples with the same discrete value gives the model a chance to see the feature in different settings and, in turn, to determine when it's a good predictor for the label. For example, a house_type feature would likely contain many examples in which its value was victorian, whereas a unique_house_id value appears in only one example:
Bad Feature: unique_house_id: 8SK982ZZ1242Z
Good Feature: house_type: victorian
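One way to check this property is to count how often each value occurs and flag the rare ones. The sketch below is illustrative only: the function name rare_values, the threshold default, and the sample data are assumptions, and the cutoff of 5 follows the rough "5 or so" guideline rather than any hard rule.

```python
from collections import Counter


def rare_values(values, threshold=5):
    """Return the set of values that appear `threshold` times or fewer."""
    counts = Counter(values)
    return {value for value, count in counts.items() if count <= threshold}


# Hypothetical sample data for illustration.
house_types = ["victorian"] * 8 + ["ranch"] * 6 + ["yurt"]
house_ids = ["8SK982ZZ1242Z", "7QL551AA9000B", "3MN410CC7781D"]

print(rare_values(house_types))          # {'yurt'}
print(len(rare_values(house_ids)))       # 3 (every ID is unique)
```

A feature in which nearly every value is flagged, such as unique_house_id here, gives the model almost nothing to generalize from.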
Good floating-point features don't contain peculiar out-of-range discontinuities or "magic" values. For example, suppose a feature holds a floating-point value between 0 and 1. So, values like the following are fine:
quality_rating: 0.82
quality_rating: 0.37
However, if a user didn't enter a quality_rating, perhaps the data set represented its absence with a magic value like the following:
Bad Feature: quality_rating: -1
To explicitly mark magic values, create a Boolean feature that indicates whether or not a quality_rating was supplied. Give this Boolean feature a name like is_quality_rating_defined.
In the original feature, replace the magic values as follows:
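The following is a minimal sketch of the split described above, not a definitive recipe. It derives is_quality_rating_defined from the magic value -1 and then overwrites the magic value in the original feature. The replacement used here, the mean of the supplied ratings, is an assumption chosen for illustration, and the helper name split_magic_value is hypothetical.

```python
MAGIC_MISSING = -1.0  # the magic value used in the example above


def split_magic_value(ratings, magic=MAGIC_MISSING):
    """Return (cleaned_ratings, is_defined_flags) for a list of ratings."""
    is_defined = [rating != magic for rating in ratings]
    supplied = [rating for rating in ratings if rating != magic]
    # Assumption: fill missing entries with the mean of the supplied ratings.
    fill = sum(supplied) / len(supplied) if supplied else 0.0
    cleaned = [
        rating if defined else fill
        for rating, defined in zip(ratings, is_defined)
    ]
    return cleaned, is_defined


quality_rating = [0.82, -1.0, 0.37]
quality_rating, is_quality_rating_defined = split_magic_value(quality_rating)
print(is_quality_rating_defined)  # [True, False, True]
```

After the split, every value in quality_rating lies in the expected 0 to 1 range, and the model can still tell from the Boolean feature which ratings were actually supplied.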
The definition of a feature shouldn't change over time. For example, the following value is useful because the city name probably won't change. (Note that we'll still need to convert a string like "br/sao_paulo" to a one-hot vector.)
city_id: "br/sao_paulo"
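The one-hot conversion mentioned in the note can be sketched as follows. The vocabulary below is hypothetical, as is the helper name one_hot. A real project would build the vocabulary from the city_id values seen in its own data.

```python
def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if entry == value else 0 for entry in vocabulary]


# Hypothetical vocabulary of city IDs, in a fixed order.
CITY_VOCABULARY = ["br/rio_de_janeiro", "br/sao_paulo", "us/new_york"]

print(one_hot("br/sao_paulo", CITY_VOCABULARY))  # [0, 1, 0]
```

Because the string "br/sao_paulo" keeps the same meaning over time, the position it maps to stays valid for as long as the vocabulary order is kept fixed.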
But gathering a value inferred by another model carries additional costs. Perhaps the value "219" currently represents Sao Paulo, but that representation could easily change on a future run of the other model:
Bad Feature: inferred_city_cluster: "219"