#Informative
Good data contains enough information to be used for modeling.
e.g. if you want to train a model that predicts whether a customer will buy a specific product, you need both the properties of the product in question and the properties of the products that customer purchased in the past.
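Below is a minimal sketch of what combining both kinds of properties into one feature table can look like. The library (pandas), the column names, and the cross join used to build candidate (customer, product) pairs are illustrative assumptions, not part of the original example.

```python
# A minimal sketch (pandas assumed): join properties of the candidate product
# with features derived from each customer's purchase history.
# Column names (customer_id, product_id, price, category) are hypothetical.
import pandas as pd

products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "price": [9.99, 24.50, 5.00],
    "category": ["books", "electronics", "books"],
})

purchases = pd.DataFrame({
    "customer_id": [10, 10, 11],
    "product_id": [1, 3, 2],
})

# Per-customer summary of past purchases.
history = (
    purchases.merge(products, on="product_id")
    .groupby("customer_id")
    .agg(total_spent=("price", "sum"), n_purchases=("product_id", "count"))
    .reset_index()
)

# Candidate (customer, product) pairs to score; a cross join is used here
# purely for illustration.
candidates = purchases[["customer_id"]].drop_duplicates().merge(products, how="cross")

# Final feature table: product properties plus purchase-history properties.
features = candidates.merge(history, on="customer_id", how="left")
print(features.head())
```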
#Good Coverage
Good data has good coverage of what you want to do with the model.
e.g. if you want to use a model to classify web pages by topic and you have a thousand topics of interest, then your data must contain documents on each of those thousand topics, in quantities sufficient for the algorithm to learn the differences between topics.
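A simple coverage check is to count labeled examples per topic and flag topics that fall below a minimum. The threshold and the toy label list in the sketch below are placeholders, not values from the text.

```python
# A minimal sketch: flag topics with too few labeled documents.
from collections import Counter

labels = ["sports", "politics", "sports", "tech", "tech", "tech"]  # toy labels

counts = Counter(labels)
MIN_EXAMPLES_PER_TOPIC = 100  # placeholder threshold

under_covered = {topic: n for topic, n in counts.items() if n < MIN_EXAMPLES_PER_TOPIC}
for topic, n in sorted(under_covered.items(), key=lambda kv: kv[1]):
    print(f"topic '{topic}' has only {n} examples; collect more before training")
```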
#Reflects Real Inputs
Good data reflects real inputs that the model will see in production.
e.g. if you train your model on photos of cars taken in daylight, it will make more mistakes in production when people use it on photos of cars taken at night.
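One way to catch such a mismatch before deployment is to compare a summary statistic of the inputs between the training set and a sample of real production inputs. The brightness statistic, the synthetic arrays, and the warning threshold below are illustrative assumptions.

```python
# A minimal sketch: compare mean brightness (a stand-in for "daylight vs. night")
# between training images and a sample of production inputs.
import numpy as np

rng = np.random.default_rng(0)
train_brightness = rng.normal(loc=0.7, scale=0.1, size=1000)  # mostly daylight photos
prod_brightness = rng.normal(loc=0.3, scale=0.1, size=1000)   # mostly night photos

print(f"train mean brightness: {train_brightness.mean():.2f}")
print(f"prod  mean brightness: {prod_brightness.mean():.2f}")

# A large gap suggests the training data does not reflect production inputs,
# so offline error estimates will understate errors in production.
if abs(train_brightness.mean() - prod_brightness.mean()) > 0.2:
    print("warning: training data may not reflect real production inputs")
```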
#Unbiased
Good data is as unbiased as possible.
#Not a Result of a Feedback Loop
Good data is not a result of the model itself. This echoes the problem of the feedback loop.
e.g. you can’t train a model that predicts a person’s gender from their name and then use its predictions to label new training examples.
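One way to guard against this, assuming you track where each label came from, is to drop model-labeled examples when assembling new training data. The `label_source` field is hypothetical bookkeeping, not something the text prescribes.

```python
# A minimal sketch: exclude examples whose label came from the model's own
# prediction so the model is not trained on its own output.
new_examples = [
    {"name": "Alex",  "label": "male",   "label_source": "human_annotator"},
    {"name": "Maria", "label": "female", "label_source": "model_prediction"},
    {"name": "Sam",   "label": "male",   "label_source": "human_annotator"},
]

training_examples = [ex for ex in new_examples if ex["label_source"] != "model_prediction"]
print(training_examples)
```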
#Consistent Labels
Good data has consistent labels. Inconsistency can come from several sources:
- Different people do labeling according to different criteria (or interpretations).
- The definition of some classes evolved over time, so two similar feature vectors end up receiving two different labels (one way to flag such cases is shown in the sketch after this list).
- Misinterpretation of the user’s motives. e.g. a user ignores a news article because they have already seen it, not because they are not interested in it.
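A sketch of the consistency check referenced above, assuming pandas: it flags exact-duplicate feature vectors that received conflicting labels; near-duplicates would additionally need a similarity measure. The toy columns and labels are placeholders.

```python
# A minimal sketch: group rows by their feature vector and flag groups where
# more than one distinct label appears.
import pandas as pd

data = pd.DataFrame({
    "feature_a": [1, 1, 2, 2],
    "feature_b": ["x", "x", "y", "y"],
    "label":     ["spam", "not_spam", "spam", "spam"],
})

feature_cols = ["feature_a", "feature_b"]

# More than one distinct label for the same feature vector means inconsistency.
label_counts = data.groupby(feature_cols)["label"].nunique()
inconsistent = label_counts[label_counts > 1]
print(inconsistent)
```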
#Big Enough
Good data is big enough to allow generalization.
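One common way to judge whether the data is big enough is a learning curve: train on increasing fractions of the data and watch the validation score. If the score is still climbing at the largest training size, the dataset is probably too small; if it has plateaued, more of the same data is unlikely to improve generalization much. The sketch below uses scikit-learn and its toy digits dataset purely for illustration.

```python
# A minimal sketch: validation score as a function of training-set size.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> mean validation accuracy {score:.3f}")
```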
#Summary
- it contains enough information that can be used for modeling
- it has good coverage of what you want to do with the model
- it reflects real inputs that the model will see in production
- it is as unbiased as possible
- it is not a result of the model itself
- it has consistent labels
- it is big enough to allow generalization