Bias in data is an inconsistency between the data and the phenomenon that the data represents.
#Types of Bias
#Selection Bias
Selection bias is the tendency to skew your choice of data sources to those that are easily available, convenient, and/or cost-effective.
#Self-Selection Bias
Self-selection bias is a form of selection bias where you get the data from sources that “volunteered” to provide it. Most poll data has this type of bias.
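
A small simulation can make the effect of volunteering concrete. This is only a sketch: the population size, the 50% true support, and the response rates (supporters answer three times as often as non-supporters) are made-up assumptions, and `simulate_poll` is a hypothetical helper name.

```python
import random

def simulate_poll(population_size=100_000, true_support=0.5, seed=0):
    """Compare the true support rate with the rate seen among volunteers."""
    rng = random.Random(seed)
    # Assumed response rates: supporters volunteer three times as often.
    response_rate = {True: 0.30, False: 0.10}

    # True means the person supports the proposal being polled.
    population = [rng.random() < true_support for _ in range(population_size)]
    # Only people who choose to answer end up in the dataset.
    respondents = [p for p in population if rng.random() < response_rate[p]]

    actual = sum(population) / len(population)
    observed = sum(respondents) / len(respondents)
    return actual, observed

actual, observed = simulate_poll()
print(f"true support: {actual:.3f}, support among volunteers: {observed:.3f}")
```

Under these assumed rates the expected share of supporters among respondents is 0.15 / (0.15 + 0.05) = 0.75, even though only about half of the population supports the proposal.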
#Omitted Variable Bias
Omitted variable bias happens when your featurized data lacks a feature necessary for accurate prediction.
For example, a churn prediction model has no feature indicating that a competitor has started to offer the same service for a lower price.
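
The churn example can be sketched with synthetic data. Everything here is an assumption made for illustration: the feature names, the true effect sizes (1.0 for our own price increase, 2.0 for the competitor's discount), and the premise that the competitor's discount is correlated with our price increase. The `slope` helper is a plain single-feature least-squares fit.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs (single feature)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

rng = random.Random(0)
n = 20_000

# Feature the model sees.
price_increase = [rng.gauss(0, 1) for _ in range(n)]
# Feature the model does NOT see; assumed to be correlated with price_increase.
competitor_discount = [0.5 * p + rng.gauss(0, 1) for p in price_increase]
# Assumed true relationship: both features drive churn risk.
churn_risk = [
    1.0 * p + 2.0 * c + rng.gauss(0, 0.5)
    for p, c in zip(price_increase, competitor_discount)
]

# Fitting on price_increase alone attributes the competitor's effect to it.
naive_slope = slope(price_increase, churn_risk)
print(f"true effect of price_increase: 1.0, fitted without competitor data: {naive_slope:.2f}")
```

With these numbers the fitted slope lands near 1.0 + 2.0 × 0.5 = 2.0, so the model overstates the effect of the feature it has because it is standing in for the one it lacks.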
#Sponsorship/Funding Bias
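Sponsorship (or funding) bias occurs when the data comes from a source that is financially tied to the subject it reports on. For example, a news agency sponsored by a video-game company is unlikely to say bad things about it. If you train a model on that agency’s articles, the model will inherit the skewed coverage.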
#Sampling Bias
Sampling bias (also known as distribution shift) occurs when the distribution of examples used for training doesn’t reflect the distribution of the inputs the model will receive in production.
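
One way to check for this kind of mismatch is to compare a feature's values in the training set against values logged in production. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs), which is my choice of technique and not something prescribed above. The "age" feature and its distributions are invented for the example.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two numeric samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(0)
# Hypothetical "age" feature: training users vs. two production scenarios.
train_ages = [rng.gauss(35, 8) for _ in range(2_000)]
prod_same = [rng.gauss(35, 8) for _ in range(2_000)]      # same population
prod_shifted = [rng.gauss(45, 8) for _ in range(2_000)]   # older population

ks_same = ks_statistic(train_ages, prod_same)
ks_shifted = ks_statistic(train_ages, prod_shifted)
print(f"same population: {ks_same:.3f}, shifted population: {ks_shifted:.3f}")
```

A statistic close to 0 suggests the two samples follow similar distributions, while a value far from 0 is a signal to investigate how the training examples were collected. What counts as "far" depends on the sample sizes and the application.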
#Prejudice / Stereotype Bias
Prejudice or stereotype bias is often observed in data obtained from historical sources, such as books or photo archives, or from online activity, such as social media, online forums, and comments on online publications.
#Systematic Value Distortion
Systematic value distortion is bias that usually originates in the device making the measurements or observations. A model trained on the distorted values makes suboptimal predictions when deployed in the production environment.
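
As a rough illustration, suppose the training data was recorded by a temperature sensor that reads 2 degrees too high, while production uses an accurate sensor. The limit, the offset, and the simple threshold "model" below are all hypothetical and chosen only to show the mechanism.

```python
import random

rng = random.Random(0)
TRUE_LIMIT = 30.0       # assumed: overheating when the true temperature > 30
TRAINING_OFFSET = 2.0   # assumed: the training sensor reads 2 degrees high

true_temps = [rng.uniform(20, 40) for _ in range(10_000)]
labels = [t > TRUE_LIMIT for t in true_temps]

# A threshold rule fit to the distorted training readings ends up here.
learned_threshold = TRUE_LIMIT + TRAINING_OFFSET

def accuracy(readings):
    """Share of examples where the threshold rule matches the true label."""
    predictions = [r > learned_threshold for r in readings]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Same distortion as in training: the rule looks perfect.
training_accuracy = accuracy([t + TRAINING_OFFSET for t in true_temps])
# Accurate sensor in production: readings between 30 and 32 are now missed.
production_accuracy = accuracy(true_temps)
print(f"training: {training_accuracy:.3f}, production: {production_accuracy:.3f}")
```

The rule absorbs the sensor's offset during training, so its errors only become visible once the readings come from a device without that distortion.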
#Experimenter Bias
Experimenter bias is the tendency to search for, interpret, favor, or recall information in a way that affirms one’s prior beliefs or hypotheses. In ML, it often arises when each example in the dataset comes from the answers one person gave to a survey, one example per person.
#Labeling Bias
Labeling bias happens when labels are assigned to unlabeled examples by a biased process or person.
#Ways to Avoid Bias
Question everything.