
Cleansing data in Machine Learning

Understand your training datasets: go through the features and eliminate any bad data or outliers. Follow these principles:

Keep in mind what you think your data should look like.
Verify that the data meets these expectations (or that you can explain why it doesn’t).
Double-check that the training data agrees with other sources (for example, dashboards).

In this tutorial, you will learn the following cleansing techniques:

Scrub your datasets
Handling extreme outliers in your datasets
Binning

Scrub your datasets

Until now, we've assumed that all the data used for training and testing was trustworthy. In real life, many examples in datasets are unreliable for one or more of the following reasons:

Omitted values. For instance, a person forgot to enter a value for a house's age.
Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
Bad labels. For instance, a person mislabeled a picture of an oak tree as a maple.
Bad feature values. For example, someone typed in an extra digit, or a thermometer was left out in the sun.
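Two of these problems, omitted values and duplicate examples, can be caught mechanically. The following is a minimal sketch, assuming each example is a Python dict and that None marks an omitted value. The field names and the scrub helper are hypothetical, chosen only for illustration.

```python
# Hypothetical housing examples; None marks an omitted value.
examples = [
    {"house_age": 12, "rooms": 5},
    {"house_age": None, "rooms": 3},   # omitted value
    {"house_age": 30, "rooms": 4},
    {"house_age": 12, "rooms": 5},     # duplicate of the first example
]


def scrub(rows):
    """Drop rows with omitted values and exact duplicates (first copy is kept)."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip examples with an omitted value
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned = scrub(examples)
print(cleaned)
```

Bad labels and bad feature values are harder to catch this way, because they usually need domain knowledge or a second data source to confirm.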

In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate. In addition, getting statistics like the following can help:

Maximum and minimum
Mean and median
Standard deviation
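As a rough illustration, these statistics can be computed with Python's standard statistics module. The values below are made up, and the single large entry is an assumed stand-in for a suspicious example. Note that statistics.stdev returns the sample standard deviation.

```python
import statistics

# Hypothetical feature values; 41.0 stands in for a suspicious entry.
rooms_per_person = [1.2, 1.8, 2.1, 1.5, 1.9, 2.0, 41.0]

summary = {
    "min": min(rooms_per_person),
    "max": max(rooms_per_person),
    "mean": statistics.mean(rooms_per_person),
    "median": statistics.median(rooms_per_person),
    "stdev": statistics.stdev(rooms_per_person),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean that sits far above the median, as it does here, is one hint that a few extreme values are pulling the aggregate.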

Handling extreme outliers in your datasets

The following plot represents a feature called roomsPerPerson from the California Housing data set. The value of roomsPerPerson was calculated by dividing the total number of rooms for an area by the population for that area. The plot shows that the vast majority of areas in California have one or two rooms per person. But look along the x-axis: a few areas have values far beyond that range, and these are the extreme outliers.

What if we simply "cap" or "clip" the maximum value of roomsPerPerson at an arbitrary value, say 4.0?

Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the clipped feature is now more useful than the original data.
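Clipping can be sketched in a few lines. The sample values and the clip helper below are hypothetical. Only the 4.0 cap comes from the text above.

```python
def clip(values, upper=4.0):
    """Replace every value above `upper` with `upper`; leave the rest unchanged."""
    return [min(value, upper) for value in values]


# Hypothetical roomsPerPerson values, including two extreme outliers.
rooms_per_person = [0.8, 1.5, 2.2, 3.9, 4.0, 7.5, 55.0]

clipped = clip(rooms_per_person)
print(clipped)  # [0.8, 1.5, 2.2, 3.9, 4.0, 4.0, 4.0]
```

No example is dropped: the list keeps its length, and the outliers pile up at the cap, which is what produces the hill at 4.0.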

Binning

The following plot shows the relative prevalence of houses at different latitudes in California. Notice the clustering: Los Angeles is at about latitude 34 and San Francisco is at roughly latitude 38.

In the data set, latitude is a floating-point value. However, it doesn't make sense to represent latitude as a floating-point feature in our model. That's because no linear relationship exists between latitude and housing values. For example, houses at latitude 35 are not 35/34 times as expensive (or as cheap) as houses at latitude 34. And yet, individual latitudes probably are a pretty good predictor of house values.

To make latitude a helpful predictor, let's divide latitudes into "bins" as suggested by the following figure:

Instead of having one floating-point feature, we now have 11 distinct boolean features (LatitudeBin1, LatitudeBin2, ..., LatitudeBin11). Having 11 separate features is somewhat inelegant, so let's unite them into a single 11-element vector. Doing so will enable us to represent latitude 37.4 as follows:

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

Thanks to binning, our model can now learn a completely different weight for each latitude bin.
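The binning step can be sketched as follows. The bin boundaries are an assumption: one bin per whole degree from latitude 32 up to (but not including) 43, which yields 11 bins and reproduces the vector shown for 37.4. The function name latitude_to_one_hot is hypothetical.

```python
import math


def latitude_to_one_hot(latitude, low=32, high=43):
    """Map a latitude to a one-hot vector with one bin per whole degree.

    Assumed bins: [32, 33), [33, 34), ..., [42, 43), giving 11 bins.
    """
    if not low <= latitude < high:
        raise ValueError(f"latitude {latitude} is outside [{low}, {high})")
    num_bins = high - low
    index = math.floor(latitude) - low  # 37.4 -> floor 37 -> index 5
    vector = [0] * num_bins
    vector[index] = 1
    return vector


print(latitude_to_one_hot(37.4))  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```

Each position in the vector gets its own weight in a linear model, so a bin around San Francisco and a bin around Los Angeles can be weighted independently.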

