CHAPTER 4: REGRESSION CORRECTIONS AND RESULTS
I. Introduction
In the previous chapter I presented the theoretical regression models and the specified regression equations for teenage birthrates and high school dropout rates. In this chapter I use regression analysis techniques to estimate the regression equations and test the hypothesis that specific cultural, economic, educational, social, and public input factors affect teenage birthrates in California counties. The regression results cannot "prove" the hypothesis that a causal relationship exists. Proving that such a relationship exists between the explanatory variables and the dependent variable requires sound theory and common sense, not just regression results. What the regression results can tell us is whether a statistically significant relationship exists and whether the relationship is positive or negative. This information can then be used to reject or not reject a hypothesis regarding a theoretical causal relationship between two variables. The regression results also provide measures of how well the regression model explains the dependent variable, the magnitude of the effect of each explanatory variable on the dependent variable, and the level of confidence about the explanatory variables' statistical significance.
In this chapter I describe some tools for evaluating regression results, present the results of the regressions for the teenage birthrate and high school dropout rate models in table format, and interpret the regression results. In Section II, I briefly describe a process for evaluating regression results and define some commonly used regression analysis terms. I also discuss how the regression results are presented in the tables. In Sections III to VI, I report the regression results for three different regressions, discuss possible problems in the regression models, and explain why and how the models were corrected for these problems. The regression results of the regression models presented in Chapter 3, referred to as the uncorrected models, are reported in Section III. Because it is likely that some problems exist in the uncorrected models, these results are presented only as a basis for comparison and are not interpreted. One of the suspected problems with the regression models is endogeneity. In Section IV, I report the results of the first step in the regression technique to correct the problem of endogeneity. This technique, called Two-Stage Least Squares regression, is explained in Section IV. The results are reported in order to evaluate the regression in terms of its potential to correct the endogeneity in the regression models. In Section V, I discuss two other possible problems in the regression models: multicollinearity and heteroskedasticity. The problems are defined, as are the methods to determine if they are indeed a problem in the regression model. If applicable, I explain how to correct the problem. The results of the regression of the corrected models for teenage birthrates and high school dropout rates are reported in Section VI. These results are compared to the predictions I made in Chapter 3. The corrected models are also evaluated for their reliability for policymakers. 
In Section VII, after all the corrections to the regression equations have been made, I report and discuss the elasticities of the significant explanatory variables in the teenage birthrate model.
II. Evaluating Regression Results
Reliable regression results can be of value to policymakers. They draw attention to important causal factors and help prioritize efforts to solve policy problems. The first step in evaluating a regression model and its regression results is to consider whether the theory behind the model and its broad causal factors and specific explanatory variables make sense. Judging the reliability of the regression also requires knowledge of such terms as R-squared, estimated regression coefficients, significance, levels of confidence, and F-statistic.
The R-squared of an estimated regression equation is a measure of the overall fit of the regression model. An R-squared of 0.70 means that the model explains 70 percent of the variation in the dependent variable around its mean. The values of R-squared range between zero and one. The higher the value of R-squared, the better the fit. For cross-sectional data, which consist of observations for the same time period and are the type of data I use in this study, an R-squared of 0.50 could be considered a good fit (Studenmund, 1997, p. 52). The R-squared of a regression model is reported by most regression analysis software when the regression equation is estimated.
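As a minimal sketch of this definition, R-squared can be computed from the observed and fitted values of the dependent variable. The function name and the five data points are hypothetical and are not drawn from the data used in this study.

```python
def r_squared(observed, fitted):
    """Share of the variation in `observed` around its mean that `fitted` explains."""
    mean_y = sum(observed) / len(observed)
    total_ss = sum((y - mean_y) ** 2 for y in observed)
    residual_ss = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1.0 - residual_ss / total_ss


# Hypothetical observed and fitted values for five observations.
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
fitted = [12.0, 18.0, 30.0, 42.0, 48.0]

print(round(r_squared(observed, fitted), 3))  # prints 0.984
```

Here the total variation around the mean is 1,000 and the unexplained variation is 16, so the hypothetical model explains 98.4 percent of the variation.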
The estimated regression coefficient is a measure of the effect of a change in an explanatory variable on the dependent variable, holding all other explanatory variables constant. Specifically, it measures the effect of a one-unit change in the explanatory variable on the dependent variable. The coefficients can have either negative or positive values. If the relationship between the explanatory variable and the dependent variable is negative, meaning that the explanatory variable and the dependent variable move in opposite directions, the coefficient will have a negative sign. The coefficient has a positive sign if the explanatory and dependent variables move in the same direction. The values of the coefficients cannot be directly compared because the explanatory variables do not have the same unit of measurement. To compare the relative impact of the explanatory variables, the elasticities for the coefficients must be calculated. Elasticities are the percent of change in the dependent variable as a result of a one percent change in an explanatory variable. (See Section VII for a fuller explanation of elasticities.)
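One common way to turn a coefficient into an elasticity is to evaluate it at the sample means of the explanatory and dependent variables. The sketch below assumes that approach, which may differ from the method used in Section VII. The function name and the input values are hypothetical.

```python
def elasticity_at_means(coefficient, mean_x, mean_y):
    """Percent change in y for a one percent change in x, evaluated at the means."""
    return coefficient * mean_x / mean_y


# Hypothetical inputs: a coefficient of 2.0, a mean of 15.0 for the explanatory
# variable, and a mean of 60.0 for the dependent variable.
print(elasticity_at_means(2.0, 15.0, 60.0))  # prints 0.5
```

Under these assumed values, a one percent increase in the explanatory variable is associated with a 0.5 percent increase in the dependent variable. Because elasticities are unit-free, they can be compared across explanatory variables.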
The terms significance and levels of confidence are related. An explanatory variable is statistically significant at a specific level of confidence. If a coefficient is significant at the 90 percent level of confidence, we can be 90 percent confident that the explanatory variable has a non-zero effect on the dependent variable. In a regression model, the explanatory variables that are significant are the variables that have an impact on the dependent variable. These are the variables that warrant the attention of policymakers. However, significance is not a measure of the strength of the relationship between the explanatory and the dependent variables. An explanatory variable that is significant at a high level of confidence, say 99 percent confidence, is not necessarily the explanatory variable with the strongest impact on the dependent variable.
Generally, significance at the 90 percent or higher level of confidence is desirable. For policy purposes, a level of 80 percent or higher is often acceptable. If a policymaker can be 80 percent sure that implementing a policy that changes a causal factor (explanatory variable) will have an impact on the social problem represented by the dependent variable, it seems reasonable to implement the change. In order to estimate the impact of a change in an explanatory variable on the dependent variable, the coefficient of the variable is transformed into an elasticity. The elasticity is a measure of the magnitude of the impact of the explanatory variable on the dependent variable.
The F-test is used to test the overall significance of the regression equation, and the results of the F-test are expressed in terms of the F-statistic (Studenmund, 1997, pp. 157-159). The F-statistic is reported by most regression analysis software. SPSS, the regression software used in this study, reports the F-statistic and its level of confidence. The higher the level of confidence, the more sure we can be that all the explanatory variables together have a significant non-zero relationship to the dependent variable. The F-test is also a test of the significance of the R-squared. A regression result with an R-squared of 0.70 and an F-statistic that is significant at the 99 percent level of confidence means that we can be 99 percent certain that the regression equation explains 70 percent of the variation in the dependent variable.
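The link between the F-statistic and the R-squared can be illustrated with the standard formula F = (R-squared / k) / ((1 - R-squared) / (n - k - 1)), where n is the number of observations and k is the number of explanatory variables. The sketch below is illustrative only. The function name is hypothetical, and the inputs are the rounded R-squared (0.907), the 18 explanatory variables, and the 57 observations of the uncorrected teenage birthrate model in Table 4.1. Because the R-squared is rounded, the result only approximates the F-statistic reported by SPSS.

```python
def f_statistic(r_squared, n_obs, n_explanatory):
    """Overall F-statistic implied by R-squared, sample size, and number of regressors."""
    explained_per_regressor = r_squared / n_explanatory
    unexplained_per_df = (1.0 - r_squared) / (n_obs - n_explanatory - 1)
    return explained_per_regressor / unexplained_per_df


# Rounded values from the uncorrected teenage birthrate model (Table 4.1).
f_birthrate = f_statistic(0.907, 57, 18)
print(round(f_birthrate, 2))  # prints 20.59
```

The computed value is close to the reported F-statistic for that model. The small gap is what one would expect from using an R-squared rounded to three decimal places.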
How regression results are presented
The results of the regressions for three versions of the teenage birthrate and the high school dropout rate model are presented in tables. The tables consist of a list of the variables in both of the regression models, their estimated coefficients, and their standard errors. The standard errors are reported in parentheses. For coefficients that are significant, their level of confidence is denoted by a specific number of asterisks. This notation, which is consistent for all models, is defined immediately below each table. The level of confidence for all variables was calculated using the more stringent two-sided or two-tailed t-test of significance. The R-squared and F-statistic for the regression model are also reported, as is the number of observations in each estimation. The number of observations, 57, is the number of California counties for which data were gathered and input. All counties except Alpine County are included in the regression.
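The two-sided t-test behind the asterisks can be sketched as follows: the t-ratio is the coefficient divided by its standard error, and its absolute value is compared with a critical value. The critical values below are approximate figures from a standard t-table for 38 degrees of freedom (57 observations minus 18 explanatory variables minus the constant). They are an assumption of this sketch rather than output from SPSS, and the function names are hypothetical. The three bands (80, 90, and 99 percent) follow the way confidence levels are described in this chapter.

```python
# Approximate two-sided critical values for 38 degrees of freedom (assumed, from a t-table).
CRITICAL_38_DF = {80: 1.304, 90: 1.686, 99: 2.712}


def t_ratio(coefficient, standard_error):
    """t-ratio for the hypothesis that the true coefficient is zero."""
    return coefficient / standard_error


def confidence_band(t, critical_values):
    """Highest confidence level (in percent) at which |t| meets the critical value, else None."""
    level = None
    for confidence, critical in sorted(critical_values.items()):
        if abs(t) >= critical:
            level = confidence
    return level


# Coefficients and standard errors from the teenage birthrate model in Table 4.1.
hispanic = confidence_band(t_ratio(3.657, 0.877), CRITICAL_38_DF)
black = confidence_band(t_ratio(-0.928, 0.614), CRITICAL_38_DF)
asian = confidence_band(t_ratio(-0.458, 0.441), CRITICAL_38_DF)
print(hispanic, black, asian)  # prints 99 80 None
```

These bands agree with the three, one, and zero asterisks shown for Hispanic Population, Black Population, and Asian Population in Table 4.1.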
III. Regression Results: Uncorrected Models
The results of the regressions for the uncorrected teenage birthrate and the high school dropout rate models are presented in Table 4.1. These regression results are reported only for purposes of comparison. Both regression models are assumed to have problems that need to be corrected. Based on the literature reviewed in Chapter 2, one problem with both models is endogeneity. Other problems may exist as well. If these problems with the regression model are not corrected, the results of the regressions cannot be trusted. The coefficients of the explanatory variables will be biased. The results of the regressions of the uncorrected models were obtained using Ordinary Least Squares regression, the most commonly used regression technique. The variables in this regression are the variables listed in the theoretical model discussed in Chapter 3. Where the term "not used" is printed in the variable column, that particular variable was not included in the regression equation being estimated.
Uncorrected model for teenage birthrate
The regression results show that the R-squared for the teenage birthrate model is 0.91, meaning that the model explains 91 percent of the variation in birthrates in California counties. An R-squared of 0.91 is very high, especially for cross-sectional data, and is one indication that the regression model is theoretically sound. The F-statistic of 20.58 is significant at the 99 percent or higher level of confidence, which is evidence that the R-squared is significant and reliable. My predictions about the direction of the relationship between the explanatory variables and the dependent variable are correct with two exceptions: Spanish Speaking and Teens in Poverty are both negatively related to Teenage Birthrate rather than positively as expected. In addition, 11 of the 18 variables are significant at an 80 percent or higher level of confidence. However, several variables are not significant that were significant in other studies and that I thought would also be significant in this model: Female SAT Scores, Adults with Bachelor's Degrees, and the Public Input variables, Community Clinics and Physicians.
The lack of significance of some of the theoretically important variables and the unexpected signs of two other variables indicate that there may be some errors in the specification of the regression model for teenage birthrate. Endogeneity is a suspected problem with the model. As a number of the studies reviewed in Chapter 2 attest, the relationship between birthrates and dropout rates may be endogenous. If the relationship is endogenous, the teenage birthrate in a county affects the high school dropout rate in the county and vice versa. The regression result for the teenage birthrate model, in which high school dropout rate is a significant explanatory variable, is evidence that the relationship is endogenous. To obtain the most reliable regression results, the model must be corrected to account for the endogeneity.
Table 4.1: Regression Results for Teenage Birthrates and High School Dropout Rates Uncorrected Models, Ordinary Least Squares Regression
| Variable | Teenage Birthrate Model: Estimated Coefficient | Teenage Birthrate Model: Standard Error | Dropout Rate Model: Estimated Coefficient | Dropout Rate Model: Standard Error |
| --- | --- | --- | --- | --- |
| Constant | 238.415 | (87.544) | 5.139 | (18.277) |
| Teenage Birthrate | (not used) | | 4.053E-02* | (0.026) |
| Hispanic Population | 3.657*** | (0.877) | 0.178 | (0.175) |
| Black Population | -0.928* | (0.614) | -7.394E-02 | (0.108) |
| Asian Population | -0.458 | (0.441) | 0.103* | (0.071) |
| Spanish Speaking | -4.301*** | (0.992) | -0.174 | (0.197) |
| Families in Poverty | 3.641*** | (1.254) | -9.418E-02 | (0.230) |
| Teens in Poverty | -1.484*** | (0.470) | 0.131* | (0.088) |
| Median Family Income | -6.032E-04** | (0.000) | -1.911E-05 | (0.000) |
| High School Dropout Rate | 1.559* | (0.927) | (not used) | |
| Female SAT Scores | -0.531 | (0.455) | 0.130* | (0.089) |
| Adults with Bachelor's Degrees | -0.373 | (0.574) | -9.472E-02 | (0.086) |
| Public School Attendance | -1.005 | (0.810) | -0.160* | (0.114) |
| Female-Headed Households | 2.199 | (1.785) | 0.327 | (0.306) |
| Working Mothers | -0.802** | (0.463) | 9.951E-02* | (0.075) |
| Rural Population | -0.477*** | (0.137) | 6.820E-02*** | (0.024) |
| Suburban County | -18.054*** | (5.469) | 2.217** | (0.936) |
| Urban County | -12.115* | (7.752) | 3.515*** | (1.226) |
| Agricultural Employment | (not used) | | -0.116* | (0.085) |
| Manufacturing Employment | (not used) | | -3.228E-02 | (0.058) |
| Community Clinics | -5377.053 | (6220.574) | (not used) | |
| Physicians | -2.151E-02 | (0.033) | (not used) | |
| Per Pupil Expenditure | (not used) | | -4.044E-04 | (0.001) |
| Pupil to Teacher Ratio | (not used) | | 0.412* | (0.264) |
| Average Class Size | (not used) | | -0.396 | (0.323) |
| R-squared | .907 | | .599 | |
| F-statistic | 20.547*** | | 2.485*** | |
| Observations | 57 | | 57 | |
Uncorrected model for high school dropout rate
The regression results for the uncorrected high school dropout rate model are reported only for purposes of comparison. Because the model is flawed, the results of the regression are biased. The results of the regression for the high school dropout rate model also provide evidence that the relationship between birthrates and dropout rates is endogenous and that endogeneity is a problem in both models: in the dropout rate model, the teenage birthrate variable is significant. There are other indicators of problems with the high school dropout rate model. The R-squared of the dropout rate model is 0.60. While this is an acceptable R-squared for cross-sectional data, it is much lower than the R-squared for the teenage birthrate model. Approximately 40 percent of the variation in high school dropout rates is not explained by the model. The F-statistic of 2.49 is significant at the 99 percent or higher level of confidence, suggesting that the R-squared is significant.
Of the 21 explanatory variables in the high school dropout rate model, 11 are significant. However, for seven of the explanatory variables the direction of the relationship with the dependent variable is not as I had predicted: Black Population, Spanish Speaking, Families in Poverty, Female SAT Scores, Agricultural Employment, Manufacturing Employment, and Average Class Size. Of particular cause for concern are the negative relationship for Families in Poverty and the positive relationship for Female SAT Scores. Common sense calls into question these regression results, which indicate that as the number of families in poverty increases, the high school dropout rate decreases, and that as the number of students achieving high SAT scores increases, the dropout rate increases. Given the endogeneity in the dropout rate model and the questionable direction of the relationship between some of the explanatory variables and the dependent variable, some correction of the dropout rate model is necessary.
Because both theory and the regression results for the teenage birthrate and high school dropout rate models indicate that endogeneity is a problem, a correction is necessary to control for the endogeneity. When endogeneity exists in a regression model, the coefficients estimated by the regression are biased. Biased coefficients are inconsistent, and they are not reliable predictors of the impact of the explanatory variables on the dependent variable. If the coefficients are not reliable predictors, the regression is of little value to policymakers or researchers. In the next section, the technique for correcting endogeneity is discussed.
IV. Correcting for Endogeneity
The problem of endogeneity in a regression model is fairly easy to correct. Instead of using the Ordinary Least Squares regression technique, the regression equation is estimated using a Two-Stage Least Squares technique (Studenmund, 1997, pp. 529-557). As its name suggests, this regression is done in two stages. In the first stage an instrumental variable is created to replace the endogenous variable in the regression model. The instrumental variable is also referred to as the fitted value of the endogenous variable. The instrumental variable is created by running a regression with the endogenous variable as the dependent variable. The explanatory variables in this equation are all the variables in both regression models except the other endogenous variable. In the second stage of the Two-Stage Least Squares regression, the fitted value of the endogenous variable is used in place of the original (non-fitted) endogenous variable in the regression equation.
For example, to create an instrumental variable to replace the endogenous explanatory variable High School Dropout Rate in the birthrate regression model, a regression is run with the High School Dropout Rate as the dependent variable. The Teenage Birthrate variable is excluded as an explanatory variable in this regression. The result of the regression can be used to produce an instrumental variable, the fitted value of High School Dropout Rate. The Fitted High School Dropout Rate is then used in the teenage birthrate regression model in the place of the High School Dropout Rate variable.
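The two stages can be sketched in a few lines of code. This is a hand-rolled illustration under stated assumptions, not the SPSS procedure used in this study: the helper names (`ols`, `predict`, `two_stage_least_squares`) are hypothetical, and the data are synthetic with no error term, so the example shows only the mechanics and not the removal of bias. When the second stage is run by hand in this way, the standard errors it produces generally need adjustment, which dedicated Two-Stage Least Squares routines handle automatically.

```python
def ols(rows, y):
    """OLS coefficients [constant, b1, b2, ...] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination with pivoting.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, rows):
    """Fitted values for each row, given [constant, b1, b2, ...]."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]


def two_stage_least_squares(y, endogenous, included, excluded):
    """included: exogenous variables in the equation of interest.
    excluded: exogenous variables that appear only in the other equation."""
    # Stage 1: regress the endogenous variable on all exogenous variables.
    all_exogenous = [list(i) + list(e) for i, e in zip(included, excluded)]
    fitted = predict(ols(all_exogenous, endogenous), all_exogenous)
    # Stage 2: use the fitted value in place of the endogenous variable.
    second_stage = [[f] + list(i) for f, i in zip(fitted, included)]
    return ols(second_stage, y)


# Synthetic data (assumed): w is in the equation of interest, z only in the other equation.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
x = [1.0 + 2.0 * zi + 0.5 * wi for zi, wi in zip(z, w)]   # endogenous variable
y = [3.0 + 1.5 * xi + 0.7 * wi for xi, wi in zip(x, w)]   # dependent variable

coefs = two_stage_least_squares(y, x, [[wi] for wi in w], [[zi] for zi in z])
print([round(c, 4) for c in coefs])  # constant, fitted endogenous variable, w
```

In this sketch the coefficient on the fitted endogenous variable plays the role that the coefficient on Fitted High School Dropout Rate plays in the teenage birthrate equation.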
In order for this technique to work, each of the two regression models must have some explanatory variables that are unique to that model. This condition, called identification, is satisfied in the regression models for teenage birthrate and high school dropout rates. Each of these regression equations has explanatory variables that the other does not have. Specifically, the Public Input variables are different in each model; and Agricultural Employment and Manufacturing Employment are unique to the high school dropout rate model.
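The identification condition can be checked mechanically by comparing the variable lists of the two models. The sketch below takes the variable names from Table 4.1. The set names are hypothetical, and the check covers only whether each equation has exogenous variables of its own, which is the condition described above.

```python
# Exogenous variables that appear in both models (from Table 4.1).
COMMON = {
    "Hispanic Population", "Black Population", "Asian Population",
    "Spanish Speaking", "Families in Poverty", "Teens in Poverty",
    "Median Family Income", "Female SAT Scores",
    "Adults with Bachelor's Degrees", "Public School Attendance",
    "Female-Headed Households", "Working Mothers", "Rural Population",
    "Suburban County", "Urban County",
}
ENDOGENOUS = {"Teenage Birthrate", "High School Dropout Rate"}

birthrate_model = COMMON | {"High School Dropout Rate", "Community Clinics", "Physicians"}
dropout_model = COMMON | {
    "Teenage Birthrate", "Agricultural Employment", "Manufacturing Employment",
    "Per Pupil Expenditure", "Pupil to Teacher Ratio", "Average Class Size",
}

# Exogenous variables unique to each model.
only_in_birthrate = birthrate_model - dropout_model - ENDOGENOUS
only_in_dropout = dropout_model - birthrate_model - ENDOGENOUS

print(sorted(only_in_birthrate))
print(sorted(only_in_dropout))
print(bool(only_in_birthrate) and bool(only_in_dropout))  # prints True
```

Both sets are non-empty, which is consistent with the statement that each regression equation has explanatory variables that the other does not have.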
Table 4.2 reports the results of the first stage of the Two-Stage Least Squares regression. The R-squared and the F-statistic are the important measures in these regression results. If the R-squared is too low and is not significant at a high enough level of confidence (i.e., if the F-statistic is not significant), the regression equation is not a good fit and the regression result is unreliable. The lower the R-squared in the first stage of the Two-Stage Least Squares regression, the more likely there will be bias in the second stage regression. In addition, a low R-squared decreases the likelihood that the endogenous variable will be significant in the second-stage regression. The results of this stage of the Two-Stage Least Squares regression are not appropriate for interpretation. They are just the first step in a regression technique that controls for endogeneity. The results are reported for the purpose of evaluating the regression models for their potential to address the problem of endogeneity.
Results of the first stage regression to correct endogeneity in the teenage birthrate model
In the first stage of the Two-Stage Least Squares regression to correct the endogeneity in the teenage birthrate model, a regression is run with the endogenous variable, High School Dropout Rate, as the dependent variable. In this regression, Teenage Birthrate is excluded as an explanatory variable, but all the other variables in both the dropout rate and the birthrate models are included. A new variable, an instrumental variable, can be produced by this regression. This instrumental variable, referred to as the fitted value of High School Dropout Rate, replaces the High School Dropout Rate variable in the teenage birthrate regression equation. Before using the instrumental variable in the teenage birthrate equation, the results of the first-stage regression should be evaluated. The R-squared and the F-statistic are the two important results for purposes of evaluation. The R-squared is 0.59 and the F-statistic is significant at the 90 to 99 percent level of confidence. With these regression results, it is appropriate to use the fitted value of High School Dropout Rate in the teenage birthrate equation to correct the problem of endogeneity. The new variable, Fitted High School Dropout Rate, is used in all future regressions.
Results of the first stage regression to correct endogeneity in the high school dropout rate model
The first-stage regression to correct the endogeneity problem in the high school dropout rate model is a regression with Teenage Birthrate as the dependent variable. The High School Dropout Rate is excluded in this first-stage regression, but all other variables in both the teenage birthrate and the high school dropout rate models are included. The new variable produced by this regression is the fitted value of Teenage Birthrate. Before replacing the variable Teenage Birthrate with the new instrumental variable, Fitted Teenage Birthrate, in the high school dropout rate model, it is important to consider the first-stage regression results. Again, the R-squared and the F-statistic are the important results. The R-squared of this regression is 0.91 and the F-statistic is significant at the 99 percent or higher level of confidence. With such a high R-squared and confidence level, using Fitted Teenage Birthrate in the high school dropout rate model is an appropriate correction for the endogenous relationship between the birthrate and dropout rate. The variable Fitted Teenage Birthrate is used in all subsequent regressions.
Table 4.2: First Stage Regression Results Two-Stage Least Squares Regression to Correct for Endogeneity
| Variable | Teenage Birthrate Model: Estimated Coefficient | Teenage Birthrate Model: Standard Error | Dropout Rate Model: Estimated Coefficient | Dropout Rate Model: Standard Error |
| --- | --- | --- | --- | --- |
| Constant | 119.908 | (142.613) | 22.384 | (22.601) |
| Teenage Birthrate | (not used) | | (not used) | |
| Hispanic Population | 3.657*** | (0.955) | 0.340** | (0.151) |
| Black Population | -1.227* | (0.806) | -0.198* | (0.128) |
| Asian Population | -0.227 | (0.487) | 0.123* | (0.077) |
| Spanish Speaking | -4.606*** | (1.033) | -0.382** | (0.164) |
| Families in Poverty | 3.976*** | (1.376) | 0.118 | (0.218) |
| Teens in Poverty | -1.468*** | (0.531) | 5.492E-02 | (0.084) |
| Median Family Income | -6.494E-04** | (0.000) | -5.829E-05 | (0.000) |
| High School Dropout Rate | (not used) | | (not used) | |
| Female SAT Scores | -0.812* | (0.590) | 6.642E-02 | (0.094) |
| Adults with Bachelor's Degrees | -0.336 | (1.626) | -5.587E-02 | (0.099) |
| Public School Attendance | -1.407* | (1.043) | -0.344** | (0.165) |
| Female-Headed Households | 2.448 | (1.977) | 0.489* | (0.313) |
| Working Mothers | -0.576 | (0.495) | 8.614E-02 | (0.078) |
| Rural Population | -0.369** | (0.144) | 5.079E-02** | (0.023) |
| Suburban County | -14.725** | (5.579) | 1.557** | (0.884) |
| Urban County | -8.605 | (7.897) | 3.366** | (1.252) |
| Agricultural Employment | 0.382 | (0.573) | -0.133* | (0.091) |
| Manufacturing Employment | -0.219 | (0.400) | -6.174E-02 | (0.063) |
| Community Clinics | -3725.430 | (8539.439) | 582.011 | (1353.335) |
| Physicians | -3.075E-02 | (0.042) | -8.192E-03 | (0.007) |
| Per Pupil Expenditure | 1.382E-02* | (0.008) | 1.420E-04 | (0.001) |
| Pupil to Teacher Ratio | 1.438 | (1.790) | 0.567** | (0.284) |
| Average Class Size | 2.294 | (2.143) | -0.393 | (0.340) |
| R-squared | .910 | | .590 | |
| F-statistic | 15.660*** | | 2.224** | |
| Observations | 57 | | 57 | |
Using the Two-Stage Least Squares regression technique corrects the problem of the endogenous relationship between teenage birthrates and high school dropout rates and the biased estimates that result from endogeneity. However, other problems are likely to exist. Two indicators of other problems with the regression models are the lack of significance of some of the explanatory variables and the direction of a number of the relationships between explanatory variables and the dependent variables that are counter to common sense and/or my predictions. In the next section, I address the problems that might be causing these results and explain how to correct them.
V. Other Corrections to the Regression Models
When explanatory variables that are expected to be significant are not, the reason is often that multicollinearity exists in the regression model (Studenmund, 1997, pp. 259-292). The simple correlation coefficient, which is the measure of the strength and direction of the relationship between two explanatory variables, is the measure used to determine if multicollinearity is a problem. The simple correlation coefficient is denoted as the lower case letter "r." If the simple correlation coefficient (See Table 3.3) is high, around 0.80 or higher, the regression software may not be able to distinguish the separate effects of the two highly correlated explanatory variables. The solution to the problem of multicollinearity is to remove one of the highly correlated variables from the model. Eliminating an explanatory variable is only appropriate if there is another explanatory variable in the same category of broad causal factors.
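The screening step described here can be sketched as a pairwise check of simple correlation coefficients against the 0.80 threshold. The function names and the three short data series are hypothetical and stand in for the county-level variables. The actual correlations used in this study are those in Table 3.3.

```python
from itertools import combinations
from math import sqrt


def simple_correlation(x, y):
    """Simple (Pearson) correlation coefficient r between two variables."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(data, threshold=0.80):
    """Pairs of explanatory variables whose |r| is at or above the threshold."""
    pairs = []
    for name_a, name_b in combinations(data, 2):
        r = simple_correlation(data[name_a], data[name_b])
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 2)))
    return pairs


# Hypothetical explanatory variables for five observations.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 11.0],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = highly_correlated_pairs(data)
print(flagged)  # only the pair (A, B) is flagged
```

A flagged pair is only a candidate for correction. As noted above, a variable is dropped only when another explanatory variable remains in the same category of broad causal factors.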
Corrections for multicollinearity in the teenage birthrate model
In the teenage birthrate regression model several pairs of explanatory variables are highly correlated, meaning they have an "r" of 0.80 or more. The simple correlation coefficient for Hispanic Population and Spanish Speaking is 0.99. However, both of these highly correlated variables are significant, so both remain in the regression equation. The "r" for Physicians and Adults with Bachelor's Degrees is 0.80. Neither of these highly correlated explanatory variables is significant and there are other explanatory variables within the same category of broad causal factors of each. Under these conditions, it is appropriate to eliminate one of these variables from the equation. I chose to eliminate Adults with Bachelor's Degrees because there were four explanatory variables in the broad causal factor category of Educational Status and only two in the Public Inputs category. Interestingly, there is a strong relationship between Adults with Bachelor's Degrees and Female SAT Scores (r = 0.78) and between Adults with Bachelor's Degrees and Public School Attendance (r = 0.79). Public School Attendance was also eliminated from the regression equation because it is highly correlated with Physicians (r = 0.86). With these changes, two explanatory variables remain in the regression equation for Educational Status. The results of estimating the regression equation after correcting for multicollinearity, as well as endogeneity, are reported in Table 4.3.
Corrections for multicollinearity in the high school dropout rate model
Three explanatory variables were eliminated from the high school dropout rate regression equation to correct for multicollinearity: Spanish Speaking, Adults with Bachelor's Degrees, and Pupil to Teacher Ratio. Hispanic Population and Spanish Speaking are highly correlated (r = 0.99) and neither is significant. Eliminating one of these variables from the regression equation is appropriate under these conditions. I chose to keep Hispanic Population instead of Spanish Speaking because other racial/ethnic populations are included in the regression equation, but no other language variable is included. Any conclusions that could be drawn from the regression results would be more meaningful with more directly comparable variables. Adults with Bachelor's Degrees was removed from the regression equation because of its strong relationship to both Female SAT Scores (r = 0.78) and Public School Attendance (r = 0.79). Neither Adults with Bachelor's Degrees nor Public School Attendance is significant, and Female SAT Scores is significant only at the 80 to 90 percent confidence level. Pupil to Teacher Ratio and Average Class Size are also highly correlated (r = 0.88) and neither is significant. These conditions are strong indicators that multicollinearity exists and justify the removal of one of the highly correlated variables. Since reducing class size is a current educational goal, I chose to retain Average Class Size and remove Pupil to Teacher Ratio. The results of estimating the regression equation without the three highly correlated explanatory variables are reported in Table 4.3.
No correction for heteroskedasticity
Heteroskedasticity is frequently a problem in regression models with cross-sectional data and in those with a large difference in the size of the observations (i.e., a county's teenage birthrate) in the sample (Studenmund, 1997, pp. 366-399). Both of these conditions exist in the data utilized in this regression analysis, so heteroskedasticity may be a problem in the regression models. Heteroskedasticity exists when the variance of the residuals is not constant but differs depending on which observation is being discussed. A residual is the difference between the observed value of the dependent variable and the estimated or predicted value of the dependent variable. A regression equation suffers from heteroskedasticity when the variance of the residuals is related to a specific exogenous variable, generally referred to as a proportionality factor. If heteroskedasticity exists, the standard errors of the variables' coefficients tend to be underestimated with the result that some explanatory variables appear to be significant when in fact they are not significant. The policy consequence of heteroskedasticity is that it mistakenly focuses attention on causal factors that are not significant, factors that do not impact the dependent variable.
I used the Park test to determine if heteroskedasticity was a problem in the teenage birthrate or high school dropout rate model. The Park test is a three-step procedure. First, the uncorrected regression equation is estimated and the residuals from that estimation are calculated. These residuals are then squared and the log of the squared residuals is used as the dependent variable in a second regression. The log of the proportionality factor is the only explanatory variable in this second regression. I used the total population of each county as the proportionality factor. After this second regression is run, the results are checked to see if the proportionality factor is significant or not. If the proportionality factor is not significant, heteroskedasticity is not a problem in the regression model. For both the teenage birthrate and the high school dropout rate models, the results of the Park test indicated heteroskedasticity was not a problem. No correction was made to either regression equation for heteroskedasticity. The regression results reported in Table 4.3 are the estimations of the teenage birthrate model and the high school dropout rate model after corrections for both endogeneity and multicollinearity. The results are discussed and evaluated in the next section.
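The second regression of the Park test can be sketched as follows. The function takes the residuals from the first regression as given and regresses the log of the squared residuals on the log of the proportionality factor. The function name and the example numbers are hypothetical, and the sketch returns only the slope and its t-ratio. Judging significance still requires comparing the t-ratio with a critical value for n - 2 degrees of freedom.

```python
from math import log, sqrt


def park_test(residuals, proportionality_factor):
    """Regress ln(residual squared) on ln(Z); return (slope, t-ratio of the slope)."""
    y = [log(e ** 2) for e in residuals]
    x = [log(z) for z in proportionality_factor]
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Standard error of the slope in a regression with one explanatory variable.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_error = sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_error


# Hypothetical residuals and county populations for six observations.
residuals = [1.2, -0.8, 2.5, -1.9, 0.6, -3.1]
population = [12000, 45000, 150000, 600000, 30000, 2000000]

slope, t = park_test(residuals, population)
print(round(slope, 3), round(t, 3))
```

A t-ratio that falls short of the chosen critical value indicates that the proportionality factor is not significant, which is the outcome reported above for both models.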
VI. Regression Results: Corrected Models
The results of the regressions of the corrected teenage birthrate and high school dropout rate models are reported in this section. If the regression models are found to be reliable predictors of the effects of the explanatory variables on the dependent variable, they will have value for policymakers and researchers. Once the problems of endogeneity and multicollinearity have been corrected, the regression results will differ from, and ideally be more reliable than, the results of the regression of the uncorrected model. The R-squared is likely to be lower because some explanatory variables were removed from the regression equations to correct for multicollinearity. Different variables may be significant, the level of confidence for significance may increase or decrease, the signs for the direction of the relationship between the explanatory and the dependent variables may change, and the coefficients may have higher or lower values.
Results of regression of corrected birthrate model
As expected, the R-squared for the corrected birthrate regression model (with corrections made for endogeneity and multicollinearity) is slightly lower than in the uncorrected regression model. The difference, however, is less than one percent. With an R-squared of 0.90, the regression model explains 90 percent of the variation in teenage birthrates in California's counties. Eleven of the 16 explanatory variables are significant in each model, but in the corrected model Female SAT Scores becomes significant and Black Population is no longer significant. Urban County is significant at a higher level of confidence. None of the signs for the direction of the relationship between the explanatory variables and the dependent variable are different, and all but two are as I predicted. Both Spanish Speaking and Teens in Poverty continue to have a negative relationship to the teenage birthrate, which is the opposite of what I had predicted. The estimated coefficients of the explanatory variables are somewhat different than those for the uncorrected model, though the difference is unremarkable. Importantly, the coefficient for Fitted High School Dropout Rate is higher, increasing from 1.559 to 2.779. This increase in the coefficient suggests that even after correcting for the endogenous relationship between teenage birthrates and dropout rates, the dropout rate is an important factor for the teenage birthrate. Generally, the results of the regression of the corrected regression model for teenage birthrates indicate that the model is a sound one and that the results are reliable and have value for policymakers.
Table 4.3: Regression Results for Corrected Models (Teenage Birthrates and High School Dropout Rates Corrected for Endogeneity and Multicollinearity)
| Variable | Teenage Birthrate Model: Estimated Coefficient | Teenage Birthrate Model: Standard Error | Dropout Rate Model: Estimated Coefficient | Dropout Rate Model: Standard Error |
| Constant | 150.744 | (41.365) | 4.648 | (17.949) |
| Fitted Teenage Birthrate | (not used) | | 0.102*** | (0.032) |
| Hispanic Population | 3.272*** | (1.003) | 6.061E-02* | (0.039) |
| Black Population | -0.733 | (0.590) | -7.914E-03 | (0.104) |
| Asian Population | -0.497 | (0.458) | 0.106* | (0.071) |
| Spanish Speaking | -3.832*** | (1.131) | (not used) | |
| Families in Poverty | 3.844*** | (1.204) | -0.184 | (0.204) |
| Teens in Poverty | -1.690*** | (0.430) | 0.156** | (0.079) |
| Median Family Income | -5.137E-04** | (0.000) | 9.698E-06 | (0.000) |
| Fitted High School Dropout Rate | 2.779* | (1.765) | (not used) | |
| Female SAT Scores | -0.708** | (0.391) | 0.156** | (0.084) |
| Adults with Bachelor's Degrees | (not used) | | (not used) | |
| Public School Attendance | (not used) | | -5.398E-02 | (0.104) |
| Female-Headed Households | 1.546 | (1.821) | 6.923E-02 | (0.306) |
| Working Mothers | -0.921** | (0.460) | 0.104* | (0.075) |
| Rural Population | -0.553*** | (0.156) | 8.381E-02*** | (0.026) |
| Suburban County | -20.889*** | (6.064) | 3.019*** | (1.028) |
| Urban County | -18.157** | (8.963) | 3.995*** | (1.310) |
| Agricultural Employment | (not used) | | -0.179** | (0.083) |
| Manufacturing Employment | (not used) | | -1.716E-02 | (0.058) |
| Community Clinics | -5630.681 | (5978.013) | (not used) | |
| Physicians | -6.864E-03 | (0.024) | (not used) | |
| Per Pupil Expenditure | (not used) | | -1.548E-03 | (0.001) |
| Pupil to Teacher Ratio | (not used) | | (not used) | |
| Average Class Size | (not used) | | -0.451* | (0.334) |
| R-squared | .898 | | .549 | |
| F-statistic | 21.983*** | | 2.571*** | |
| Observations | 57 | | 57 | |
There is some cause for concern about the unexpected relationship between the variables Spanish Speaking and Teens in Poverty and the dependent variable. Both of these explanatory variables are negatively related to the dependent variable, Teenage Birthrate. Incorrect predictions do not necessarily mean that the regression model is unsound or that the theory behind it is erroneous. However, when variables have unexpected signs it is a good practice to reconsider the relationship.
One study of young Mexican-American women in California that compared the level of sexual activity of those born in Mexico with those born in the United States may help explain the negative relationship between the variable Spanish Speaking and a county's teenage birthrate. The young women who were born in Mexico were less likely to be sexually active than their American-born peers were (Powell, 1994, p. 14). Assuming the young women born in Mexico are more likely to speak Spanish in their homes than those born in America, it may be that the variable Spanish Speaking reflects some of the cultural differences that also account for lower levels of sexual activity among young Mexican-born, Mexican-American women.
An explanation for the negative relationship between the variable Teens in Poverty and birthrates is more difficult. It could be that poor teens are the target of more social programs to prevent pregnancy and childbirth. Another possible explanation is that the families of poor teens, already burdened with poverty, discourage birth and encourage abortion in response to a teen's pregnancy. The lack of an explanation for the negative relationship calls for an inquiry into the possible causes of such a relationship, not the reformulation of the regression model. A more thorough search of the literature that specifically addresses teen poverty may yield a plausible explanation.
Results of regression of corrected high school dropout rate model
The regression results for the corrected regression model for high school dropout rates call into question the reliability of the regression model and its value to policymakers. The R-squared, as expected, is lower than in the uncorrected model. At 0.55, it is relatively low even for cross-sectional data. Two variables, Hispanic Population and Average Class Size, are significant in the corrected regression model but were not significant in the uncorrected model. Public School Attendance is no longer significant. A total of 11 of the 18 explanatory variables are significant. The level of confidence for significance increased two levels for Fitted Teenage Birthrate, and one level for Female SAT Scores, Suburban County, and Agricultural Employment. The level of confidence decreased for Teens in Poverty. Only one sign for the direction of the relationship between an explanatory variable and the dependent variable changed: Median Family Income now has a positive relationship to the dependent variable, the opposite of the predicted relationship. Six other variables also have relationships that are the opposite of what was expected. The estimated coefficients differ between the two models, with some increasing and some decreasing. The coefficient for Fitted Teenage Birthrate increased considerably, from 0.041 to 0.102.
The relatively low R-squared and the high number of variables with unexpected relationships are indications that the high school dropout rate regression model, even after corrections for endogeneity and multicollinearity, is not as reliable as the teenage birthrate model. According to the regression, Families in Poverty is negatively related to the dependent variable, High School Dropout Rate, although the coefficient reported in Table 4.3 is not statistically significant. This means that as the number of families in poverty in a county increases, the number of dropouts decreases. The positive relationship between a county's median family income and its dropout rate means that as incomes rise, dropout rates also rise. Both of these conclusions run counter to a large body of literature that reports dropout rates are positively related to poverty. More importantly, the conclusions defy common sense. Another relationship that defies common sense is the reported positive relationship between SAT scores and dropout rates. According to the regression results, the more female students in a county who score higher than average on the SAT, the higher the dropout rate will be in that county.
The lack of a plausible explanation for any of these relationships suggests that there is a serious problem with the high school dropout rate model and, perhaps, with the theory underlying the model. With about 45 percent of the variation in dropout rates unexplained by the regression model, it is likely that the model suffers from omitted variable bias. The unexpected signs of several explanatory variables are further evidence that omitted variable bias exists. Omitted variable bias occurs when one or more significant explanatory variables are missing from the regression equation. The remedy for omitted variable bias is to rethink the theory underlying the regression model to try to determine what variables are missing and then include them in the model.
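To see why an omitted variable can produce an unexpected sign, consider a small constructed example. The data below are invented purely for illustration and have nothing to do with the county data set, and the function names are hypothetical. The outcome depends positively on both x1 and x2, but x2 falls as x1 rises, so a regression that leaves out x2 attributes its effect to x1 and reports a negative coefficient.

```python
def simple_slope(x, y):
    """OLS slope from regressing y on a single explanatory variable x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def two_variable_slopes(x1, x2, y):
    """OLS slopes (b1, b2) from regressing y on x1 and x2 with an intercept."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Invented data: y = 1*x1 + 3*x2, and x2 falls as x1 rises.
x1 = [float(i) for i in range(10)]
x2 = [-a + (i % 2) for i, a in enumerate(x1)]
y = [1.0 * a + 3.0 * b for a, b in zip(x1, x2)]

full_b1, full_b2 = two_variable_slopes(x1, x2, y)  # both regressors included
short_b1 = simple_slope(x1, y)                     # x2 omitted from the model
```

With both variables included, the estimated coefficient on x1 is positive, matching how the data were built. With x2 omitted, the coefficient on x1 turns negative, which is the kind of sign reversal that can indicate a missing variable.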
In rethinking the regression model for high school dropout rates, a number of possible omitted variables come to mind, but the desired data are not available at the county level. For example, the mobility of households with high school age teens was found to be related to high school dropout rates in one recent study (Kollars, 1999b). Other possible causal factors are the existence and effectiveness of dropout prevention programs, the degree to which the high school environment supports students, and the quality of high school teachers and programs. In addition to the problem of omitted variable bias, there may be problems with the reliability of the data for the dependent variable, High School Dropout Rate. Several recent articles on dropout rates have alleged that the dropout rate is improperly measured, unreliable, and undercounted (Kollars, 1999a; Schrag, 1999). If the dependent variable itself is questionable, the regression results for any regression model attempting to explain high school dropout rates would also be questionable.
Based on the relatively low R-squared, the implausible relationships between some of the explanatory variables and the dependent variable, and the likelihood of omitted variable bias, the high school dropout rate model is not reliable. It is of little value to policymakers because its regression results cannot be trusted to accurately estimate the impact of the explanatory variables on the dependent variable. It would therefore be inappropriate to interpret the regression results and to apply them to public policy. The high school dropout rate regression model is adequate as a necessary component of the endogeneity correction for the teenage birthrate model, which is its primary purpose for inclusion in this study. Beyond that role, the regression model is of little value and I will not attempt to interpret its regression results.
VII. Interpreting the Regression Results
The primary reason I used regression analysis to investigate the causal factors of teenage birthrates is its predictive value. The estimated coefficient of an explanatory variable tells us how much of an impact the variable has on the dependent variable. In order to compare the magnitude of the impact of the explanatory variables, which have different units of measurement, the estimated coefficients must be converted into elasticities. Elasticities are the percent changes in the dependent variable as a result of a one percent change in the explanatory variable. For simple linear regression analysis, the elasticities are calculated by dividing the mean of the explanatory variable by the mean of the dependent variable and multiplying the resulting quotient by the coefficient of the explanatory variable. The elasticities for the significant explanatory variables in the corrected teenage birthrate regression model are reported in Table 4.4.
Table 4.4: Elasticities for Significant Explanatory Variables in the Teenage Birthrate Model
| Significant Variable | Elasticity |
| Hispanic Population | 0.88*** |
| Spanish Speaking | -0.76*** |
| Families in Poverty | 0.57*** |
| Teens in Poverty | -0.49*** |
| Median Family Income | -0.27** |
| High School Dropout Rate | 0.19* |
| Female SAT Scores | -0.17** |
| Working Mothers | -0.88** |
| Rural Population | -0.29*** |
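The elasticity calculation described above (the coefficient multiplied by the ratio of the mean of the explanatory variable to the mean of the dependent variable) can be checked with a short script. This is a sketch rather than the original computation: the function name is hypothetical, the coefficients are taken from Table 4.3, and the means are the rounded "average county" values quoted later in this section, so a result may differ from Table 4.4 in the second decimal place.

```python
def elasticity(coefficient, mean_x, mean_y):
    """Elasticity at the means for a linear regression coefficient."""
    return coefficient * mean_x / mean_y


# Mean teenage birthrate (births per 1,000 women aged 15-19), as quoted in the text.
MEAN_BIRTHRATE = 66.32

# (estimated coefficient from Table 4.3, rounded mean of the variable from the text)
inputs = {
    "Hispanic Population": (3.272, 17.7),
    "Spanish Speaking": (-3.832, 13.1),
    "Families in Poverty": (3.844, 9.78),
    "Teens in Poverty": (-1.690, 19.07),
    "Median Family Income": (-5.137e-04, 34876),
    "High School Dropout Rate": (2.779, 4.44),
    "Female SAT Scores": (-0.708, 16.16),
    "Working Mothers": (-0.921, 63.43),
    "Rural Population": (-0.553, 34.85),
}

elasticities = {
    name: round(elasticity(coef, mean_x, MEAN_BIRTHRATE), 2)
    for name, (coef, mean_x) in inputs.items()
}

for name, value in elasticities.items():
    print(f"{name}: {value}")
```

With these rounded inputs, eight of the nine values match Table 4.4. Hispanic Population comes out at 0.87 rather than 0.88, presumably because the mean quoted in the text (17.7) is itself rounded.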
The absolute values of the elasticities of the significant explanatory variables range from 0.17 to 0.88. The magnitude of the impact of the explanatory variables with smaller elasticities is less than the impact of the variables with higher elasticities. The magnitude of the impact of an explanatory variable is an important policy consideration. No matter how significant a variable is, a small magnitude suggests that policy changes affecting that variable would have little effect compared to policy changes affecting variables with higher magnitudes. For example, a one percent increase in a county's dropout rate is expected to result in a 0.19 percent increase in the county's teenage birthrate. A one percent increase in the number of working mothers in a county is expected to have a much larger effect on the birthrate. It is expected to lower the birthrate by 0.88 percent.
One way to compare the magnitude of the impact of the significant variables is to compare the predicted effect of a ten percent increase in each variable. Using the mean value of each explanatory variable to represent the "average" county and the elasticity of the variable, the impact of such a change can be calculated. For example, an increase of ten percent in the Hispanic population of the average county (from 17.7 to 19.5 percent of the population) is expected to result in an 8.8 percent increase in the birthrate (from 66.32 to 72.16 births per 1,000 women aged 15-19). If the percent of families in poverty in the average county increases by ten percent (from 9.78 to 10.7 percent), the birthrate is expected to increase by 5.7 percent (to 70.1 births). The birthrate is predicted to increase to 67.58 births per 1,000 women aged 15-19 if the high school dropout rate increases by ten percent, from 4.44 to 4.8 percent of high school aged youth.
Furthermore, if the percent of working mothers in the average county increases by ten percent (from 63.43 to 69.7 percent) the predicted result would be an 8.8 percent decrease in the birthrate (from 66.32 to 60.48 births) in the average county. A ten percent increase in the percent of individuals who speak Spanish in their households (from 13.1 to 14.4 percent of households) is expected to result in a 7.6 percent decrease in birthrates (from 66.32 to 61.28 births). A ten percent increase in the percent of teens in poverty (from 19.07 to 21 percent) is predicted to lower birthrates by 4.9 percent (to 63.07 births). A ten percent increase in the rural population (from 34.85 to 38.4 percent of the population) is expected to result in a 2.9 percent decrease in the teenage birthrate (to 64.40 births). If the median family income in the average county increases from $34,876 to $38,364, a ten percent increase, the birthrate is expected to decrease by 2.7 percent (to 64.54 births). A ten percent increase in the variable Female SAT Scores (from 16.16 to 17.7 percent) is predicted to decrease the birthrate by 1.7 percent and lower the birthrate in the average county to 65.19 per 1,000 young women.
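The ten percent scenarios above all follow the same arithmetic: the percent change in the birthrate is the elasticity multiplied by the percent change in the explanatory variable. A minimal sketch, assuming the mean birthrate of 66.32 and the elasticities from Table 4.4, and treating each elasticity as constant over the ten percent change (the function name is hypothetical):

```python
def projected_birthrate(base_rate, elasticity, pct_change):
    """Birthrate after an explanatory variable changes by pct_change percent."""
    return base_rate * (1 + elasticity * pct_change / 100)


# Mean teenage birthrate in the "average" county, as quoted in the text.
MEAN_BIRTHRATE = 66.32

# Elasticities from Table 4.4; each scenario is a ten percent increase.
hispanic = round(projected_birthrate(MEAN_BIRTHRATE, 0.88, 10), 2)
working_mothers = round(projected_birthrate(MEAN_BIRTHRATE, -0.88, 10), 2)
teens_in_poverty = round(projected_birthrate(MEAN_BIRTHRATE, -0.49, 10), 2)

print(hispanic, working_mothers, teens_in_poverty)
```

The three results reproduce the figures given in the text for Hispanic Population, Working Mothers, and Teens in Poverty. The same function can be applied to any other row of Table 4.4.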
Knowing the magnitude of the impact of an explanatory variable can help policymakers prioritize efforts to lower teenage birthrates. Magnitude is not the only consideration, though; there are other issues, such as the feasibility of policies implemented to affect the explanatory variables. With an elasticity of -0.49, a ten percent increase in the number of teens living in poverty is expected to lower the birthrate by 4.9 percent. However, no reasonable policymaker would recommend that actions be taken to increase the number of teens in poverty. Counties can also do little to increase the number of Spanish speaking individuals in their communities. Other explanatory variables are more easily affected by policy changes: counties could implement policies to reduce the percent of high school dropouts and expect to reduce the teenage birthrate. In addition to reducing the teenage birthrate, reducing the dropout rate would have other beneficial effects. Since lowering teenage birthrates and lowering high school dropout rates are both socially desirable goals, implementing policies to reduce the dropout rate would serve both purposes. Implementing a policy to lower the number of high school dropouts is far more politically feasible than implementing a policy to increase the number of teens in poverty, though the magnitude of the latter change would be higher. In the final chapter I discuss some of the policy implications of the findings in this study, as well as some of the larger issues of teenage childbearing.