The suitability of a model for its intended purpose depends on choices that the modeler makes. There are three fundamental choices:
- The data
- The response variable
- The explanatory variables
#The Data
How were the data collected? Are they a random sample from a relevant sampling frame? Are they part of an experiment in which one or more variables were intentionally manipulated by the experimenter, or are they observational data? Are the relevant variables being measured? (This includes those that may not be directly of interest but which have a strong influence on the response.) Are the variables being measured in a meaningful way? Start thinking about your models and the variables you will want to include while you are still planning your data collection.
When your data are not well suited to the question at hand, you need to be honest and realistic about the limitations of the conclusions you can draw.
#The Response Variable
The appropriate choice of a response variable for a model is often obvious. The response variable should be the thing that you want to predict, or the thing whose variability you want to understand. Often, it is something that you think is the effect produced by some other cause.
For example, in examining the relationship between gas usage and outdoor temperature, it seems clear that gas usage should be the response: temperature is a major determinant of gas usage. But suppose that the modeler wanted to be able to estimate outdoor temperature from the amount of gas used. Would it then make sense to take temperature as the response variable? What do you think?
Similarly, wages make sense as a response variable when you are interested in how wages vary from person to person depending on traits such as age, experience, and so on.
But suppose that a sociologist were interested in assessing the influence of income on personal choices such as marriage. Then marital status might be a suitable response variable, and wage would be an explanatory variable.
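To make the gas usage example concrete, here is a minimal sketch of fitting a straight-line model in each direction. The numbers are made up for illustration, and `fit_line` is a hypothetical helper written for this sketch rather than part of any particular package.

```python
# Synthetic, made-up numbers for illustration only (not real measurements).
temperature = [20, 30, 40, 50, 60, 70]    # outdoor temperature
gas_usage = [220, 140, 170, 80, 90, 20]   # gas used in the same periods


def fit_line(x, y):
    """Least-squares intercept and slope for the model y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Response = gas usage, explanatory = temperature
gas_intercept, gas_slope = fit_line(temperature, gas_usage)

# Response = temperature, explanatory = gas usage
temp_intercept, temp_slope = fit_line(gas_usage, temperature)

print(f"gas ~ temperature: slope = {gas_slope:.3f}")
print(f"temperature ~ gas: slope = {temp_slope:.4f}")
print(f"algebraic inverse of the first slope = {1 / gas_slope:.4f}")
```

The second fitted slope is not simply the first one turned around algebraically. Each fit minimizes the prediction error in its own response variable, so the product of the two slopes equals the squared correlation, which is below 1 whenever the points do not fall exactly on a line. Choosing the response variable therefore changes the model you get, not just the way you write it down.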
#Explanatory Variables
Much of the thought in modeling goes into the selection of explanatory variables, and we will see several ways to decide whether an explanatory variable ought to be included in a model. Of course, some of the things that shape the choice of explanatory variables are obvious. Do you want to study gender-related differences in wage? Then gender had better be an explanatory variable. Is temperature a major determinant of the usage of natural gas? Then it makes sense to include it as an explanatory variable.
You will see situations where including an explanatory variable hurts the model, so it is important to be careful. A much more common mistake, however, is to leave out explanatory variables. Unfortunately, few people learn the techniques for handling multiple explanatory variables, so your task will often go beyond modeling to include explaining how such models work.
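The effect of leaving out an explanatory variable can be sketched with a small, noise-free example. Everything here is an assumption made for illustration: the data are invented, the variables `education` and `experience` are hypothetical, and `fit_linear` is a helper written for this sketch that solves the least-squares normal equations directly.

```python
# Invented data: wage is constructed as exactly 5 + 1.0*education + 0.5*experience.
# In this made-up sample, people with more education have less experience.
education = [10, 12, 12, 14, 16, 16, 18, 20]
experience = [20, 18, 14, 12, 10, 6, 4, 2]
wage = [5 + 1.0 * e + 0.5 * x for e, x in zip(education, experience)]


def fit_linear(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Model with both explanatory variables
full = fit_linear([education, experience], wage)
# Model that leaves out experience
short = fit_linear([education], wage)

print("wage ~ education + experience:", [round(b, 3) for b in full])
print("wage ~ education:             ", [round(b, 3) for b in short])
```

With both variables in the model, the fit recovers the coefficients used to construct the data. With `experience` left out, the coefficient on `education` shrinks toward zero, because in this sample higher education goes along with lower experience and the omitted variable's influence gets folded into the one that remains.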
When designing a model, you should think hard about what the potential explanatory variables are and be prepared to include them in the model along with the variables that are of direct interest.