With our logistic regression model we can:
- Model the probability of an event occurring starting from a set of variables.
- Estimate, for a given observation, the ratio between the probability that the event occurs and the probability that it does not (the odds).
- Predict the effect of a set of variables on a binary categorical variable.
- Classify observations by estimating the probability of falling into a given category.
#Maximum Likelihood Estimation
Just as in linear regression, we need a way to estimate the model coefficients (the betas). In linear regression we use the least squares method; in logistic regression we use the maximum likelihood method.
- Through this method we can estimate the coefficients of the model and thus obtain the expected log(odds) of an observation.
- Having used the maximum likelihood method to estimate the log(odds), we can calculate the effect of changing a variable on the odds of the event (the odds ratio).
- With the estimated coefficients we can calculate the probability p of the event occurring for an observation and, with the result obtained → CLASSIFY THE OBSERVATION (see the sketch after this list).
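Below is a minimal sketch of that last step. The coefficient values and the predictor value are hypothetical (made up for illustration): it just shows how estimated betas turn into a log(odds), odds, a probability, and finally a class label.

```python
# Minimal sketch: from estimated coefficients to a classification.
# The coefficient values below are hypothetical, for illustration only.
import numpy as np

b0, b1 = -2.0, 0.05      # assumed estimates of beta_0 and beta_1
x = 60                   # value of the predictor for one observation

log_odds = b0 + b1 * x               # expected log(odds)
odds = np.exp(log_odds)              # odds of the event
p = odds / (1 + odds)                # probability of the event
label = int(p >= 0.5)                # classify using a 0.5 threshold

print(log_odds, odds, p, label)
```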
#Least Squares and Maximum Likelihood
Both the least squares method and the maximum likelihood method are techniques used to fit a regression model to a data set. Both methods aim to find the values of the model parameters that best fit the observed data.
The main difference between the two methods is how “best fit” is defined. The least squares method seeks to minimize the sum of the squared errors between the observed values and the values predicted by the model. This means that points that deviate more from the model will have a greater impact on the fit.
On the other hand, the maximum likelihood method seeks to find the values of the model parameters that make the observed data most likely to have been generated by the model. This means that the maximum likelihood method takes into account the probability distribution of the data and how it relates to the model.
In summary, the main similarity between the two methods is that they are used to fit a regression model to a data set and find the best fitting parameter values.
The main difference lies in how “best fit” is defined, with the least squares method minimizing the sum of squared errors and the maximum likelihood method maximizing the likelihood of the observed data under the model.
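To make the contrast concrete, these are the two criteria written out for a model with parameters $\beta$: least squares minimizes the sum of squared errors between the observations $y_i$ and the fitted values $\hat{y}_i$, while maximum likelihood (for a binary outcome with predicted probability $p_i$) maximizes the likelihood of the observed responses:

$$
\hat{\beta}_{LS} = \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i(\beta)\bigr)^2
\qquad
\hat{\beta}_{ML} = \arg\max_{\beta} \prod_{i=1}^{n} p_i(\beta)^{\,y_i}\,\bigl(1 - p_i(\beta)\bigr)^{1 - y_i}
$$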
In general, maximizing the likelihood means finding the parameter values at which the likelihood function reaches its maximum. This is typically done using optimization techniques, such as the Newton-Raphson method or gradient descent (out of the scope of this course).
In summary, the maximum likelihood method finds the parameter values that make the observed data most “plausible” under the model, by maximizing the likelihood function (or, equivalently, minimizing the negative log-likelihood).
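As an illustration of what such an optimizer is doing, here is a minimal sketch that fits a one-predictor logistic regression by minimizing the negative log-likelihood with scipy.optimize.minimize. The data and the "true" coefficient values are simulated, purely for the example:

```python
# Sketch: maximum likelihood estimation for logistic regression by
# minimizing the negative log-likelihood (simulated data, illustrative only).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.normal(size=500)
beta_true = np.array([-0.5, 1.2])                       # values used to simulate the data
p_true = 1 / (1 + np.exp(-(beta_true[0] + beta_true[1] * x)))
y = rng.binomial(1, p_true)

def neg_log_likelihood(beta):
    p = 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))      # predicted probabilities
    p = np.clip(p, 1e-12, 1 - 1e-12)                    # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(2))   # numerical optimizer (BFGS by default)
print(result.x)                                         # estimates should be close to beta_true
```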
#Logistic Regression in Python, Logit() Function
To estimate the beta parameters we will therefore use the maximum likelihood method. To do that in Python, we can use the logit() function from statsmodels instead of ols().
Data Description
This dataset compiles the statistics achieved by NBA players during their first year. The dependent variable is whether the player had a career of 5 years or more (0-No/ 1-Yes).
As we already know, there are some previous steps that we have to take before building the model.
- Data cleaning
- Exploratory data analysis
- Cross validation

In this example we are going to skip them, but we still have to look at the structure of our dataset and the names of its columns.
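For reference, the snippets in this lesson assume a setup along these lines (the file name nba.csv is a placeholder; use the path of your own copy of the dataset):

```python
# Assumed setup for the snippets below (the file name is a placeholder).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

nba = pd.read_csv("nba.csv")   # NBA rookie statistics dataset
```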
nba.head()
Now it’s time to build the model.
logit_model = smf.logit(formula="target_5yrs ~ gp", data=nba).fit()
print(logit_model.summary())
To begin with, we will fit the model with only one independent variable: gp, which is the number of games played by each of the players in the dataset. Do this in your script. What do you get?
In order to properly interpret the coefficients, we have to obtain the exponential of the coefficients. Why?
np.exp(logit_model.params)
#Exponential of the Estimators $\hat{\beta_0}$ and $\hat{\beta_1}$
The logistic regression coefficient $\beta$ associated with a predictor $X$ is the expected change in the log odds of having the outcome per unit change in $X$. So increasing the predictor by 1 unit (or going from one level to the next) multiplies the odds of having the outcome by $e^{\beta}$.
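To see where this multiplicative effect comes from, write the model on the log(odds) scale and compare the odds when the predictor goes from $X$ to $X + 1$:

$$
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X
\quad\Longrightarrow\quad
\frac{\text{odds}(X+1)}{\text{odds}(X)} = \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}
$$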
Let’s see an example of how to interpret the coefficients before doing it in our nba example. Suppose we want to study the effect of smoking on the 10-year risk of heart disease, and we fit a logistic regression that models the presence of heart disease using smoking as a predictor. The fitted model gives a coefficient of $\beta_1 = 0.38$ for smoking and an intercept of $\beta_0 = -1.93$.
#Interpreting the Estimator $\hat{\beta_1}$
The question is: how do we interpret the coefficient of smoking, $\beta_1 = 0.38$? First notice that this coefficient is statistically significant (associated with a p-value < 0.05), so our model suggests that smoking does in fact influence the 10-year risk of heart disease. And because it is a positive number, we can say that smoking increases the risk of having heart disease.
But by how much?
- If smoking is a binary variable (0: non-smoker, 1: smoker), then $e^{\beta_1} = e^{0.38} = 1.46$ is the odds ratio that associates smoking with the risk of heart disease. This means that → the smoking group has 1.46 times the odds of having heart disease compared with the non-smoking group. Alternatively we can say that → the smoking group has 46% (1.46 – 1 = 0.46) higher odds of having heart disease than the non-smoking group.
- If smoking is a numerical variable (lifetime usage of tobacco in kilograms), then → $e^{\beta_1} = e^{0.38} = 1.46$ tells us how much the odds of the outcome (heart disease) change for each 1-unit change in the predictor (smoking). Therefore → an increase of 1 kg in lifetime tobacco usage multiplies the odds of heart disease by 1.46. Or equally → an increase of 1 kg in lifetime tobacco usage is associated with a 46% increase in the odds of heart disease.
- Note for negative coefficients: if $\beta_1 = -0.38$, then $e^{\beta_1} = 0.68$ and the interpretation becomes: smoking is associated with a 32% (1 – 0.68 = 0.32) reduction in the odds of heart disease.
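The odds ratios quoted above are simply the exponentials of the coefficients, which is quick to verify in Python:

```python
import numpy as np

print(np.exp(0.38))    # ≈ 1.46: positive coefficient, odds multiplied by 1.46
print(np.exp(-0.38))   # ≈ 0.68: negative coefficient, odds reduced by about 32%
```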
#Interpreting the Estimator $\hat{\beta_0}$
- How to interpret the intercept when smoking is a binary variable (0: non-smoker, 1: smoker)?
The intercept is -1.93 and it should be interpreted assuming a value of 0 for all the predictors in the model. The intercept has an easy interpretation in terms of probability (instead of odds) if we calculate the inverse logit using the following formula:
- $\frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{-1.93}}{1 + e^{-1.93}} = 0.13$, so: the probability that a non-smoker will have heart disease in the next 10 years is 0.13.
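The same inverse logit calculation in Python:

```python
import numpy as np

intercept = -1.93
p_nonsmoker = np.exp(intercept) / (1 + np.exp(intercept))   # inverse logit of the intercept
print(p_nonsmoker)                                          # ≈ 0.13
```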
We will now interpret the exponentials of the coefficients of your model on nba data. Write the interpretations in your Notebook.
Then → $e^{\beta_1} = e^{0.05} = 1.05$. The odds ratio for the independent variable is 1.05 and it tells us how much the odds of the outcome (career length) change for each 1-unit change in the predictor (gp).
Therefore → an increase of 1 in games played (gp) multiplies the odds of having a career of 5 years or more by 1.05.
Or equally → an increase of 1 in games played (gp) is associated with a 5% increase in the odds of having a career of 5 years or more.
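As a minimal sketch of how the fitted model can then be used to classify players: predict() returns the estimated probabilities, and a threshold (0.5 is a common default, not something the model imposes) turns them into class labels.

```python
# Estimated probability of a career of 5+ years for each player in the dataset,
# then a hard classification using a 0.5 threshold.
probs = logit_model.predict(nba)            # P(target_5yrs = 1 | gp)
pred_class = (probs >= 0.5).astype(int)     # 1 if predicted probability >= 0.5
print(pred_class.head())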
IMPORTANT NOTE: the interpretation of the coefficients and metrics of the model is the same in multiple logistic regression.
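For instance, here is a hedged sketch of a multiple logistic regression. The extra column name pts (points per game) is an assumption about the dataset; replace it with columns that actually exist in your copy.

```python
# Hypothetical multiple logistic regression; pts is an assumed column name.
multi_model = smf.logit(formula="target_5yrs ~ gp + pts", data=nba).fit()
print(multi_model.summary())
print(np.exp(multi_model.params))   # one odds ratio per predictor, interpreted as before
```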