
The OLS model encountered in Chapter is built on a series of assumptions that we will now examine. We will also look at some of the tools at our disposal when one or more of these assumptions do not hold. As usual, I start by loading in the data I will be using. Additionally, there are a couple of new libraries introduced in this chapter; if they are not installed on your computer, you will need to use the install.packages() function on them first. Finally, I will no longer be using the convention of attaching datasets. My general feeling is that it is not best practice anyhow, but more practically, there are a few datasets in this chapter that have clashing variable names, which would make the analysis messy.
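For reference, here is a minimal setup sketch. I am assuming the datasets below come from the wooldridge package (CPS1985 from AER), with the tidyverse supplying the pipes and plots and stargazer producing the regression tables; adjust to match your own setup.

# Packages used in this chapter (run install.packages() on any that are missing)
library(wooldridge)   # wage1, ceosal1, nyse, smoke, vote1, hprice3, infmrt
library(AER)          # CPS1985
library(tidyverse)    # %>%, mutate(), ggplot2
library(stargazer)    # text regression tables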

data(wage1)
data(CPS1985)
data(ceosal1)
data(nyse)
data(smoke)
data(vote1)
data(hprice3)
data(infmrt)

We will work through a series of assumptions upon which the OLS model is built, and what one might do if these assumptions do not hold.

The basic model is:

\[\begin{equation} Y_{i} = \alpha + \beta X_{i} + \epsilon_i \end{equation}\]

This is the equation for a straight line. But what if the data doesn’t really look like a straight line? Let’s look at some data from the ceosal1 dataset in the wooldridge package. Here, we graph CEO salary on the Y axis and the company sales on the X axis.

ceosal1 %>% ggplot(aes(y = salary, x = sales)) + 
    geom_point() +
    theme_classic() +
    geom_smooth(method=lm)

(Figure: scatter plot of CEO salary against company sales, with a fitted regression line.)

So this doesn’t say much, and I’d guess there is not much of a relationship when we estimate the regression. But, part of what is going on might be due to the fact that both the CEO salary and sales data look skewed, so maybe we just don’t have a linear relationship. Let’s see what the regression looks like:

reg1a <- lm(salary ~ sales, data = ceosal1)
stargazer(reg1a, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               salary           
## -----------------------------------------------
## sales                         0.015*           
##                               (0.009)          
##                                                
## Constant                   1,174.005***        
##                              (112.813)         
##                                                
## -----------------------------------------------
## Observations                    209            
## R2                             0.014           
## Adjusted R2                    0.010           
## Residual Std. Error    1,365.737 (df = 207)    
## F Statistic            3.018* (df = 1; 207)    
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The result is significant at the 10% level, \(R^2 = .01\) is tiny, but we might be able to do better if we do something about the non-linearity.

All OLS requires is that the model is linear in parameters. We can take what we learned in Chapter about data transformation to do some mathematical transformations of the data to create a linear function. Here, let’s calculate the natural log of both salary and sales and plot them.

tempdata <- ceosal1 %>% 
    mutate(lnsalary = log(salary)) %>% 
    mutate(lnsales = log(sales))

Now, let’s take a look at the plot between lnsales and lnsalary:

tempdata %>% ggplot(aes(x = lnsales, y = lnsalary)) +
    geom_point() +
    theme_classic() +
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

(Figure: scatter plot of log salary against log sales, with a fitted regression line.)

This is quite a change from the previous graph! In fact, this data looks like it might actually have a linear relationship now.

reg1b <- lm(lnsalary ~ lnsales, data = tempdata)
stargazer(reg1b, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                              lnsalary          
## -----------------------------------------------
## lnsales                      0.257***          
##                               (0.035)          
##                                                
## Constant                     4.822***          
##                               (0.288)          
##                                                
## -----------------------------------------------
## Observations                    209            
## R2                             0.211           
## Adjusted R2                    0.207           
## Residual Std. Error      0.504 (df = 207)      
## F Statistic           55.297*** (df = 1; 207)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The \(R^2\) is considerably higher now, the \(\beta\) is significant at the 1% level, and all in all this is a much more compelling model.

The log transformation is probably the most common one we see in econometrics. Not only because it is useful in making skewed data more amenable to linear approaches, but because there is a very useful interpretation of the results. Let’s look at the regressions again, side by side:

stargazer(reg1a, reg1b, type = "text")


The regression on the left was the linear-linear model. Interpreting this coefficient demands that we are aware of the units of measure in the data: sales are measured in millions of dollars, salary in thousands of dollars. So \(\beta=0.015\) literally says that if sales goes up by 1, salary goes up by 0.015. But we interpret this as saying that, for every additional $1,000,000 in sales, CEO salary is expected to go up by $15. The model on the right is the log-log model. By log transforming a variable before estimating the regression, you change the interpretation from level increases into percentage increases. This model states that a 1% increase in sales on average leads to CEO pay going up by 0.257%.

Let’s look at the linear-log and log-linear models too.

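As a sketch of how the two remaining specifications might be estimated and lined up with the first two (the object names reg1c and reg1d are my own labels, not the author’s; the column order matches the discussion below: linear-linear, linear-log, log-linear, log-log):

reg1c <- lm(salary ~ lnsales, data = tempdata)    # linear-log
reg1d <- lm(lnsalary ~ sales, data = tempdata)    # log-linear
stargazer(reg1a, reg1c, reg1d, reg1b, type = "text")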

Here all 4 specifications are side-by-side. We’ve already interpreted columns 1 and 4. How would we interpret columns 2 and 3?

  • Column 2 is the linear-log model (\(salary\) is linear, \(sales\) is logged). A 1% increase in sales is associated with an increase in CEO salary of $2,629.01 (there’s some calculus involved here, but the shortcut is just to move the decimal place 2 spaces to the left).

  • Column 3 is the log-linear model. A $1,000,000 increase in sales is associated with a .1% higher salary (again, there’s some calculus involved here, but the shortcut here is to move the decimal place 2 spaces to the right).

The model in column 4 seems to be the best model of the bunch.
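For reference, the usual interpretation shortcuts behind the calculus mentioned above (these are the standard approximations, not numbers from this data):

\[\begin{equation} \begin{aligned} \text{linear-linear: } & \Delta Y \approx \beta \, \Delta X \\ \text{linear-log: } & \Delta Y \approx (\beta/100) \cdot \%\Delta X \\ \text{log-linear: } & \%\Delta Y \approx (100\beta) \cdot \Delta X \\ \text{log-log: } & \%\Delta Y \approx \beta \cdot \%\Delta X \end{aligned} \end{equation}\]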

Another common non-linear transformation is the quadratic transformation; this is particularly useful in cases where you think a relationship may be decreasing up to a point and then start increasing after that point (or vice versa). To see this in action, let’s look at a graph of the relationship between the age and selling prices of homes in the hprice3 data from the wooldridge package.

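A sketch of that scatter plot, assuming the hprice3 variable names price and age:

hprice3 %>% ggplot(aes(x = age, y = price)) +
    geom_point() +
    theme_classic()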

(Figure: scatter plot of house selling price against house age from the hprice3 data.)

This relationship looks somewhat U-shaped; moving left to right, it seems that the value of houses falls as they get older, but at a certain point, the relationship reverses course and older becomes more valuable!

This is a great place to estimate a quadratic regression, which is just a fancy term for including both \(age\) and \(age^2\) in our regression.

\[\begin{equation} Price_{i} = \alpha + \beta_1 age_{i} + \beta_2 age_i^2 + \epsilon_i \end{equation}\]

The best method here is to include the squared term directly in the regression formula; alternatively, we can manually create a squared variable and put it in our regression. The hprice3 data already has a squared term in it called agesq, so let’s verify that both methods get us to the same place:

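A sketch of the three regressions being compared (the object names are my own; columns 2 and 3 should be identical):

reg2a <- lm(price ~ age, data = hprice3)              # column 1: linear in age
reg2b <- lm(price ~ age + I(age^2), data = hprice3)   # column 2: squared term built in the formula
reg2c <- lm(price ~ age + agesq, data = hprice3)      # column 3: using the pre-computed agesq variable
stargazer(reg2a, reg2b, reg2c, type = "text")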

Both columns 2 and 3 are the same, as expected. Our regression model, then, looks like:

\[\begin{equation} Price_{i} = \$113,762.10 - \$1691.90 age_{i} + \$9.26 age_i^2 + \epsilon_i \end{equation}\]

We can look at these two models graphically as well: the green line is the linear model (column 1 above), the red line is the quadratic model (column 2/3 above). The red line is clearly a better fit.

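One way to draw the two fits on the same scatter, as a sketch (the colors follow the description above):

hprice3 %>% ggplot(aes(x = age, y = price)) +
    geom_point() +
    theme_classic() +
    geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "green") +
    geom_smooth(method = lm, formula = y ~ x + I(x^2), se = FALSE, color = "red")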

(Figure: house price against age, with the linear fit in green and the quadratic fit in red.)

We can also, with a little bit of calculus, figure out the age at which the relationship stops decreasing and starts increasing. You simply need to take the derivative of the regression equation with respect to age, set it equal to zero, and solve for age!

\[\begin{equation} Price_{i} = \$113,762.10 - \$1691.90 \, age_{i} + \$9.26 \, age_i^2 + \epsilon_i \end{equation}\] \[\begin{equation} \frac{\partial Price_i}{\partial age_i} = -\$1691.90 + 2 \cdot \$9.26 \, age_i = 0 \: at \: age^\star \end{equation}\] \[\begin{equation} age^\star = \frac{\$1691.90}{\$18.52} = 91.4 \end{equation}\]

As houses in this dataset age, they lose value until they hit 91.4 years of age, at which point they start appreciating in value!

You may have wondered why we bother with having a constant term \(\alpha\) in our regressions if nobody really cares about it. It turns out that the constant term is what guarantees that the residuals have a mean of zero. For example, let’s look back at our log-log model from above:

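A sketch of what is being examined here: the summary() output of the log-log model, whose Residuals panel is discussed next (it is my assumption that summary() is the display used):

summary(reg1b)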

The Residuals panel looks at the distribution of the error term. Each residual from the regression is stored in the regression object; let’s put them in our tempdata dataset and take a look at the first few rows.

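A sketch of one way to store and inspect the residuals (the column name is my own choice):

tempdata <- tempdata %>% 
    mutate(residual = reg1b$residuals)   # one residual per observation
head(tempdata)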

Is the mean of our residuals = 0?

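A quick check, as a sketch:

mean(reg1b$residuals)   # should be numerically indistinguishable from zero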

I mean…that’s about as close to zero as you can get.


As long as you always have \(\alpha\) in your regression, this assumption isn’t something to worry about. There are occasionally cases where you might want to run a regression without a constant, but they are rare.

We assume that the error term is homoskedastic, which means that the variance of the error term is constant: it does not change with the values of the explanatory variables (or, equivalently, with the fitted values). If the variance of the error term does vary in this way, the data is said to be heteroskedastic. We can look for heteroskedasticity by plotting the residuals against the fitted values.


Let’s take a look at a regression with homoskedasticity first. In Chapter we looked at the voting share data from vote1; here, we estimate the regression and plot the fitted values on the X-axis and the residuals on the Y-axis. For ease of reading, I am adding a horizontal line at 0:

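A sketch of that regression and residual plot; I am assuming a regression of voteA on shareA from the vote1 data, and the object name is my own:

regvote <- lm(voteA ~ shareA, data = vote1)   # assumed specification
vote1 %>% 
    mutate(fitted = fitted(regvote), residual = resid(regvote)) %>% 
    ggplot(aes(x = fitted, y = residual)) +
    geom_point() +
    theme_classic() +
    geom_hline(yintercept = 0)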

(Figure: residuals plotted against fitted values for the voting regression.)

Note that the variation around the horizontal line is roughly the same for all of the possible fitted values. In other words, that 0 line that I added may very well be the line of best fit if I ran a regression!

Now, let’s take a look at a regression using the smoke data in the wooldridge package. We estimate the effect of \(income\) on the number of daily cigarettes smoked, \(cigs\). The estimated coefficients are not significant, but that’s not important for what we are trying to show here.

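A sketch of the cigarette regression and its residual plot (variable names cigs and income from the smoke data; the object name is mine):

regsmoke <- lm(cigs ~ income, data = smoke)
smoke %>% 
    mutate(fitted = fitted(regsmoke), residual = resid(regsmoke)) %>% 
    ggplot(aes(x = fitted, y = residual)) +
    geom_point() +
    theme_classic() +
    geom_hline(yintercept = 0)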

(Figure: residuals plotted against fitted values for the cigarette regression.)

See how the shape of the residual plot looks a bit like a cone, with less spread on the left and a lot more spread on the right? This is evidence of heteroskedasticity. The residual plot doesn’t have to be strictly cone shaped for there to be heteroskedasticity, though that is the most common pattern. Something that looks like a bow tie, or points scattered around a curved or sloped line, would be considered heteroskedastic as well. Basically, any shape that isn’t a lot like that nice, neat rectangle from the vote1 data above exhibits heteroskedasticity.

The bad news is that, for the most part, academic economists simply assume that heteroskedasticity is always present. The good news is that there is a fairly simple and straightforward fix: calculating robust standard errors. In fact, nearly every regression in every academic journal reports robust standard errors as a matter of course. In R, we can get them from the new libraries introduced at the start of this chapter; if you haven’t already, install and load them, then re-estimate the model requesting heteroskedasticity-robust standard errors.
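One common way to do this, shown here as a sketch: the lmtest and sandwich packages with coeftest() and an HC1 covariance estimate (these are my choices for illustration; the chapter’s own libraries and function may differ). The regsmoke object is the one assumed in the sketch above.

library(lmtest)
library(sandwich)

# Re-test the cigarette regression with heteroskedasticity-robust (HC1) standard
# errors; the coefficient estimates are identical, only the standard errors change.
coeftest(regsmoke, vcov = vcovHC(regsmoke, type = "HC1"))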


We can compare this result side-by-side with the original regression:

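One way to put the ordinary and robust standard errors side by side in a stargazer table, as a sketch (again assuming the regsmoke object and the lmtest/sandwich approach above):

# Robust standard errors as a vector, passed to stargazer's se argument;
# column 1 uses the default OLS standard errors, column 2 the robust ones.
robust_se <- sqrt(diag(vcovHC(regsmoke, type = "HC1")))
stargazer(regsmoke, regsmoke, se = list(NULL, robust_se), type = "text")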

It’s a bit tough to see here, but the coefficients didn’t change at all, and they shouldn’t. The only change is in the standard errors (the numbers in parentheses) underneath the coefficients. Because the standard errors change (they could go up or down!), the significance of the coefficients may change as well.

For example, let’s consider these regressions using the infmrt infant mortality data in the wooldridge package. This next bit of code estimates infant mortality, \(infmort\), as a function of the share of families receiving Aid to Families with Dependent Children (AFDC) and the number of physicians per capita. I then show those results side-by-side with the same regression corrected for heteroskedasticity.


Again, compare the columns. The coefficients do not change, but the standard errors do. And, as stated above, this can have the effect of changing the significance of one or more of your coefficients; in this case, physicians per capita went from having a significant (at the 1% level!) relationship with infant mortality to having an insignificant relationship.

If a model is not heteroskedastic and doesn’t have autocorrelation, it is said to have spherical errors and the error terms are IID (Independent and Identically Distributed).

This assumption, which rules out explanatory variables that are exact linear functions of one another, is important and will probably create quite a few headaches for you when we get to regression with categorical independent variables in Chapter . Let’s introduce the concept quickly here, though.

As discussed in a previous notebook, ordinary least squares works by attributing the variation in the dependent variable (Y) to the variation in your independent variables (the Xs). If you have more than one independent variable, OLS needs to figure out which independent variable to attribute the variation to. If you have two identical independent variables, R cannot distinguish one variable from the other when trying to apportion variation. If you attempt to estimate a model that contains independent variables that are perfectly correlated, R will attempt to thwart you. Typically, the way to proceed is to simply remove one of the offending variables.

So, let’s see what happens if I try to run a regression with the same variable twice:

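A sketch of the attempt, assuming the voteA-on-shareA specification used above:

regtwice <- lm(voteA ~ shareA + shareA, data = vote1)
stargazer(regtwice, type = "text")   # shareA appears only once: R collapses the duplicated term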

Thwarted! R doesn’t even let me run this; note that shareA is only included in the table once. So let’s trick it into running a regression with two identical variables. I will display the results both with stargazer and with R’s default summary output, because stargazer will do its best to disguise my ineptitude here:

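A sketch of the trick: copy shareA into a new variable (shareAclone, the name used below) and include both in the regression (the other object names are mine):

tempvote <- vote1 %>% 
    mutate(shareAclone = shareA)                # an exact copy of shareA
regclone <- lm(voteA ~ shareA + shareAclone, data = tempvote)
stargazer(regclone, type = "text")              # the table shows only one of the two variables
summary(regclone)                               # flags a coefficient "not defined because of singularities"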

Now R is quite displeased with us. R simply dropped the shareAclone variable because it is impossible to run a regression with both shareA and shareAclone.

Let’s dig a little deeper into the linear function idea. Assume that you are running a regression with 3 independent variables, \(X_1\), \(X_2\), and \(X_3\).

\[\begin{equation} Y = \alpha + \beta_1 X_{1} + \beta_2 X_{2} +\beta_3 X_{3} +\epsilon_i \end{equation}\]

This assumption basically states that:

  • \(X_1\), \(X_2\), and \(X_3\) are all different variables.
  • \(X_1\) is not simply a rescaled version of \(X_2\) or \(X_3\). For example, if \(X_1\) is height in inches, \(X_2\) can’t be height in centimeters, because then \(X_2 = 2.54X_1\).
  • \(X_1\) cannot be reached with a linear combination of \(X_2\) and \(X_3\). So, if \(X_1\) is income, \(X_2\) is consumption, and \(X_3\) is savings, and thus \(X_1 = X_2 + X_3\), you can’t include all 3 variables in your equation. This is true of more complicated linear combinations as well; if \(X_1 = 23.1 + .2X_2 - 12.4X_3\), you couldn’t run that either.

This probably doesn’t seem like it would be an issue. However, this assumption trips up a lot of people who are new to regression, because they are not usually aware that there is another variable hidden in the regression, \(X_0\), which carries a value of 1 for every observation. This is technically what the \(\alpha\) is multiplied by. So in actuality, your regression model is

\[\begin{equation} Y = \alpha X_{0}+ \beta_1 X_{1} + \beta_2 X_{2} +\beta_3 X_{3} +\epsilon_i \end{equation}\]

Since \(X_{0}\) is 1, we don’t bother writing it out every time, but it is there. And so this means that \(X_1\), \(X_2\), and \(X_3\) cannot be constants, because otherwise you will violate this assumption.

Let’s see what happens when we include another constant in the voting model:

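A sketch of that experiment, using the variable name six from the discussion below (everything else is my own labeling):

tempvote2 <- vote1 %>% 
    mutate(six = 6)                             # a constant for every observation
regsix <- lm(voteA ~ shareA + six, data = tempvote2)
stargazer(regsix, type = "text")                # R drops six from the model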

R didn’t like the constant in the regression and just chucked it out. Why? Because the variable I called \(six\) is literally \(X_{0}+5\), which makes it a linear function of the intercept term!

Remember this lesson for when we start talking about dummy variable regressions in Chapter ; it’s going to be important!

A related issue you might run into is multicollinearity, which arises when your independent variables are not perfectly correlated but are very, very close to it. If these correlations are high enough, they generally cause problems. Let’s see what happens when we run a regression with multicollinear independent variables. Here, I will use the voting data and create a new variable called shareArand, which is the value of shareA plus a random number between -1 and 1.

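A sketch of that construction (the seed and the correlation check are my additions):

set.seed(1)
tempvote <- vote1 %>% 
    mutate(shareArand = shareA + runif(n(), min = -1, max = 1))
cor(tempvote$shareA, tempvote$shareArand)       # very close to, but not exactly, 1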

You can see that shareA and shareArand are very highly correlated. But they are not exactly the same, so we haven’t violated our assumption. What happens when I run this regression? Weird stuff:

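A sketch of the three regressions being compared (object names are mine; column 1 uses shareA alone, column 2 uses shareArand alone, column 3 includes both):

regm1 <- lm(voteA ~ shareA, data = tempvote)
regm2 <- lm(voteA ~ shareArand, data = tempvote)
regm3 <- lm(voteA ~ shareA + shareArand, data = tempvote)
stargazer(regm1, regm2, regm3, type = "text")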

Columns 1 and 2 have the original data and the slightly adulterated data, respectively. Note that the results are very similar, though not identical. Now, compare the model with the multicollinearity (column 3, on the right) with the original model (column 1, on the left). The coefficients in column 3 are huge in absolute value compared to column 1, and one of them has the wrong sign.

Note that if you add the two \(\beta\)s in column 3 together, you get a number very close to the coefficient on shareA in column 1. This result is typical of models with multicollinearity.

Multicollinearity is a problem, but it is a very easy problem to fix. Just drop one of the collinear variables and the problem is solved.


The last assumption of the regression model is that your error terms are normally distributed. Violating this assumption is not terrible, but it is often a sign that your model might be heavily influenced by outliers. An easy way to look for this is the Q-Q plot. Let’s look at a Q-Q plot of the voting regression:

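A sketch of a base R Q-Q plot of those residuals (the regression specification is the one assumed earlier):

regvote <- lm(voteA ~ shareA, data = vote1)   # assumed specification, as above
qqnorm(resid(regvote))
qqline(resid(regvote))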

(Figure: Q-Q plot of the voting regression residuals.)

This Q-Q plot is pretty close to the line, indicating the residuals have pretty close to a normal distribution.

Let’s look at the Q-Q plot from the CEO salary regressions from up above:

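A sketch, assuming the log-log CEO salary model (reg1b) is the one being checked:

qqnorm(resid(reg1b))
qqline(resid(reg1b))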

(Figure: Q-Q plot of the CEO salary regression residuals.)

Not quite as close to the line, but still not bad.

The Q-Q plot (and the residual plot from assumption 5) can be obtained another way; if you plot a regression object, you get 4 diagnostic plots, two of which are the ones we’ve looked at.
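For example, a sketch using the log-log model:

# Plotting an lm object produces four diagnostic plots: residuals vs. fitted,
# normal Q-Q, scale-location, and residuals vs. leverage.
par(mfrow = c(2, 2))
plot(reg1b)
par(mfrow = c(1, 1))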

(Figures: the four base R diagnostic plots for the regression: residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage.)

What do you do if your models are heavily influenced by outliers? Sometimes the right answer is to do nothing, especially if there are very few outliers relative to the size of the dataset. We’ve already discussed another approach in this chapter: using a non-linear transformation. Beyond that, there are some very sophisticated approaches one can try, like median regression (also known as Least Absolute Deviations), but these are well beyond the scope of this text.

This book is focused on learning the basic tools of econometrics; in line with that goal, I am totally aware that I did a considerable amount of handwaving (or straight up ignoring) with respect to some serious econometric issues. It is hard to draw the line in the sand between introductory econometrics and intermediate/advanced material, but that’s what I’m attempting to do here.

For those interested in pursuing careers in econometrics or business/data analytics, digging more deeply into these issues is essential. In the final chapter, Chapter , I list some suggested resources for those who wish to dig deeper into R or econometrics; many of the econometric suggestions in that section take much more comprehensive approaches into some of the issues presented here and would be excellent next steps for a reader interested in attaining a deeper understanding of econometrics.

Next, we will turn to expanding the power of multiple regression modeling to include categorical independent variables and interaction effects.
