3. Regression

 

  • Regression

    • Equation of a Straight Line

      As you will see below, a regression line is a straight line that represents the relationship between an x-variable and a y-variable.  Recall that the equation of a straight line is y = m x + b.  The quantity m gives the slope (rise/run) of the line, and b gives the y-intercept (the y-coordinate at which the line crosses the y-axis).

      The line y = x has a slope of 1 and a y-intercept of 0 while the line y = 3x - 2 has a slope of 3 and a y-intercept of -2.  What are the slope and y-intercept of the line whose equation is 3x - 2y + 4 = 3?
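      Questions like the last one can be answered by solving the equation for y.  The sketch below assumes the line is written as p x + q y + c = d with q not equal to 0; the helper name slope_intercept and the parameter names p, q, c, d are hypothetical, introduced only for illustration.

```python
from fractions import Fraction

def slope_intercept(p, q, c, d):
    """Rewrite p*x + q*y + c = d as y = m*x + b and return (m, b).

    Assumes integer coefficients and q != 0 (a vertical line has no
    slope-intercept form).
    """
    if q == 0:
        raise ValueError("vertical line: slope is undefined")
    m = Fraction(-p, q)       # slope: coefficient of x after solving for y
    b = Fraction(d - c, q)    # y-intercept: constant term after solving for y
    return m, b

# y = 3x - 2, rewritten as 3x - y - 2 = 0
print(slope_intercept(3, -1, -2, 0))

# the exercise line 3x - 2y + 4 = 3
print(slope_intercept(3, -2, 4, 3))
```

      Exact fractions are used so that the slope and intercept are not blurred by rounding.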


    • Regression Line

      Statistical data often consists of related pairs of numbers, for example, height and weight, income and taxes paid, age and blood pressure, or advertising expenditures and income for a company.  You can think of these pairs of numbers as the x- and y-coordinates of points in the plane, and you can plot these data points on an x-y coordinate system.  For example, with the FOCUS dataset, choosing Scatter Plot under the Graphics menu and then selecting High School GPA (hsgpa) as the x-variable and Cumulative GPA (cumgpa) as the y-variable produces the following scatterplot.

      Clearly, you can't draw a single straight line that passes through all of these points.  However, you can draw a straight line that approximates the relationship between hsgpa and cumgpa.  Such a line is shown in the next graph.

      The equation of the regression line is shown in the title.  We will discuss R-sq later.  The line shows that generally as hsgpa increases, cumgpa increases.  How is the equation of this regression line determined?  The next graph shows four given data points (in black).  The points have coordinates (1,1), (2,3), (3,3), and (4,5).  No single line will pass through all four points.  However, you can try to find lines that come close to the points.

      The next four graphs show four lines (in purple) that come close to the given points.  The equations of the lines are y=0.75x+1.5 (upper left graph), y=1.5x-1 (upper right graph), y=1x+0.5 (lower left graph), and y=1.2x (lower right graph).  Closeness of the line to the points is measured by the sum of squares of the vertical distances (lengths of the red vertical segments) between the given points and points (in purple) lying on the line vertically above or below the given points.

      The sums of squares are 2.375 for the upper left hand graph, 1.5 for the upper right hand graph, 1 for the lower left hand graph, and 0.8 for the lower right hand graph.  Follow this link to see an Excel spreadsheet showing sum of squares calculations.
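      The sum of squares calculation can also be sketched in a few lines of Python.  This is only an illustration of the arithmetic described above, not the spreadsheet itself; the function name sum_of_squares is an assumption made for this example.

```python
# The four given data points.
points = [(1, 1), (2, 3), (3, 3), (4, 5)]

def sum_of_squares(slope, intercept, pts):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pts)

# (slope, intercept) of the four candidate lines, keyed by graph position.
candidates = {
    "upper left":  (0.75, 1.5),
    "upper right": (1.5, -1.0),
    "lower left":  (1.0, 0.5),
    "lower right": (1.2, 0.0),
}

for name, (m, b) in candidates.items():
    print(f"{name}: {sum_of_squares(m, b, points):.3f}")
```

      The printed values should agree with the sums of squares 2.375, 1.5, 1, and 0.8 quoted above.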

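      The least squares line of best fit, called the regression line for short, is the line that makes this sum of squares as small as possible.  For the example above, the regression line is the line shown in the lower right hand graph, the line y = 1.2x.  Calculus techniques can be used to derive formulas for the slope and y-intercept of the regression line, whose equation is denoted by y = b1 x + b0.  You will only need to apply the resulting formulas.  First compute (n is the number of data points):

      $$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}$$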

      For the given points, (1,1), (2,3), (3,3), and (4,5),

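      $$S_{xx} = (1^2+2^2+3^2+4^2) - \frac{(1+2+3+4)^2}{4} = 30 - \frac{10^2}{4} = 30 - 25 = 5,$$

      $$S_{yy} = (1^2+3^2+3^2+5^2) - \frac{(1+3+3+5)^2}{4} = 44 - \frac{12^2}{4} = 44 - 36 = 8,$$

      $$S_{xy} = (1 \cdot 1 + 2 \cdot 3 + 3 \cdot 3 + 4 \cdot 5) - \frac{(1+2+3+4)(1+3+3+5)}{4} = 36 - \frac{10 \cdot 12}{4} = 36 - 30 = 6$$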

      Then the slope is b1 = Sxy/Sxx = 6/5 = 1.2, and the y-intercept is b0 = (mean of y) - b1*(mean of x) = (12/4) - (1.2*(10/4)) = 0.  So the regression line is y = 1.2x, the line shown in the lower right hand graph above.
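      The same calculation can be sketched in Python.  The function name least_squares and the dictionary keys are assumptions made for this illustration; the arithmetic follows the Sxx, Syy, Sxy, b1, and b0 computations worked above.

```python
def least_squares(pts):
    """Return Sxx, Syy, Sxy and the regression coefficients b1, b0."""
    n = len(pts)
    sum_x = sum(x for x, _ in pts)
    sum_y = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts) - sum_x ** 2 / n
    syy = sum(y * y for _, y in pts) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in pts) - sum_x * sum_y / n
    b1 = sxy / sxx                      # slope
    b0 = sum_y / n - b1 * sum_x / n     # intercept: mean of y minus b1 * mean of x
    return {"Sxx": sxx, "Syy": syy, "Sxy": sxy, "b1": b1, "b0": b0}

result = least_squares([(1, 1), (2, 3), (3, 3), (4, 5)])
for name, value in result.items():
    print(f"{name} = {value:.4f}")
```

      For the four given points this should reproduce Sxx = 5, Syy = 8, Sxy = 6, b1 = 1.2, and b0 = 0.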

      Click on the orange area below to open the FOCUS dataset within Webstat.  Use Simple Linear Regression under the Stat menu to verify the regression calculations shown above.

Click on this link for an Excel spreadsheet that contains the FOCUS data, a scattergram with HSGPA on the horizontal axis and CUMGPA on the vertical axis, the regression line, the regression equation, and the coefficient of determination ($r^2$).

  • Coefficient of Determination and Correlation Coefficient

    • Look at the Scattergram of Your Data First

      You can use the equations shown above to find the regression line for any set of numbers.  For example, given the pairs of numbers, (16,42), (1,54), (14,60), (4,70), (0,48), (6,41), (2,59), (10,64), (0,45), (8,69), using the formulas shown above, the equation of the regression line is y = 0.115653 x + 54.4945.  However, look at the scattergram of these numbers.

      It appears that there isn't a linear relationship between the x and y variables.  So, before trying to compute a regression line, which should only be used when a linear relationship exists between the variables, make a scatterplot.  If the scatterplot makes it clear that there is not a linear relationship between variables, don't use linear regression.
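      A short numerical check points the same way as the scattergram.  The sketch below recomputes the regression line for the ten pairs and then compares the sum of squared vertical distances around that line with the sum around a horizontal line drawn at the mean of the y-values.  The variable and function names are assumptions made for this illustration.

```python
pairs = [(16, 42), (1, 54), (14, 60), (4, 70), (0, 48),
         (6, 41), (2, 59), (10, 64), (0, 45), (8, 69)]

n = len(pairs)
sum_x = sum(x for x, _ in pairs)
sum_y = sum(y for _, y in pairs)
sxx = sum(x * x for x, _ in pairs) - sum_x ** 2 / n
sxy = sum(x * y for x, y in pairs) - sum_x * sum_y / n

b1 = sxy / sxx                      # slope
b0 = sum_y / n - b1 * sum_x / n     # intercept
print(f"y = {b1:.6f} x + {b0:.4f}")

def sum_of_squares(slope, intercept):
    """Sum of squared vertical distances from the pairs to a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)

around_line = sum_of_squares(b1, b0)          # regression line
around_mean = sum_of_squares(0.0, sum_y / n)  # horizontal line at the mean of y
print(f"around regression line: {around_line:.2f}")
print(f"around horizontal line: {around_mean:.2f}")
```

      The two sums of squares are nearly equal, so the regression line fits these points hardly any better than a flat line does.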

    • What is the Coefficient of Determination

      Look at the next two scattergrams. 

        

      In both cases, it appears that there is a linear relationship between the x and y variables.  However, if you imagine regression lines atop the scatterplots, you can see that in the left graph the points will lie closer to the regression line than in the right graph.  The coefficient of determination is a number that measures the degree of closeness of points  to the regression line.  This degree of closeness is called 'goodness of fit' by statisticians.

      Variation in y-values is measured by the standard deviation of the y-values.  For a sample of n values x with mean $\bar{x}$, the standard deviation is given by

      $$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

      In defining the coefficient of determination, only the numerator inside the square root of this formula is used, and the x's are replaced by y's.  It is called the total sum of squares or SST, and is given by ($\bar{y}$, y with a bar over it, is the mean or average of the y-values):

      $$SST = \sum (y - \bar{y})^2$$

      It can be shown that SST can be expressed as the sum of two terms named SSR and SSE.  SSR, or the sum of squares due to regression, is given by the formula below ($\hat{y}$, y with a 'hat' over it, signifies the y-value found by applying the regression equation to an x-value):

      $$SSR = \sum (\hat{y} - \bar{y})^2$$

      SSR measures the amount of total variation in y-values explained by the regression line.  The amount of total variation not explained by the regression line is called the sum of squares for error and denoted by SSE.  The formula for it is:

      $$SSE = \sum (y - \hat{y})^2$$

      These three quantities are related by the formula (which can be shown algebraically):

      SST = SSR + SSE

      Dividing both sides of this formula by SST results in the equation

      1 = (SSR/SST) + (SSE/SST)

      In this equation SSR/SST is called the coefficient of determination.  It measures the proportion of variation in the y-values explained by the regression line.  Multiplying by 100 gives the percentage of the variation in y-values explained by the regression.  The coefficient of determination is denoted by $r^2$.  For the data shown above, the coefficient of determination is 3046.68/3187.67 = 0.96.  So about 96% of the variation in y-values is explained by the regression.  The other 4% is unexplained or error variation.
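      The three sums of squares can be checked numerically.  Because the data behind the 0.96 figure are not listed here, the sketch below reuses the four points (1,1), (2,3), (3,3), (4,5) and their regression line y = 1.2x from the Regression Line section; the variable names are assumptions made for this illustration.

```python
points = [(1, 1), (2, 3), (3, 3), (4, 5)]
b1, b0 = 1.2, 0.0    # regression line y = 1.2x found earlier for these points

y_bar = sum(y for _, y in points) / len(points)    # mean of the y-values

def y_hat(x):
    """y-value predicted by the regression line for a given x."""
    return b1 * x + b0

sst = sum((y - y_bar) ** 2 for _, y in points)          # total variation
ssr = sum((y_hat(x) - y_bar) ** 2 for x, _ in points)   # explained by the line
sse = sum((y - y_hat(x)) ** 2 for x, y in points)       # unexplained (error)

r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
print(f"coefficient of determination = {r_squared:.2f}")
```

      For these points SST = 8, SSR = 7.2, and SSE = 0.8, so SST = SSR + SSE holds and the coefficient of determination is 0.9.  Note that SSE is the same 0.8 found earlier as the sum of squares for the line y = 1.2x.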

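      Formulas for computing SST and SSR are:

      $$SST = S_{yy}, \qquad SSR = \frac{S_{xy}^2}{S_{xx}}$$

      where $S_{xx}$, $S_{yy}$, and $S_{xy}$ are the quantities computed in the Regression Line section above.
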

    • How is the Correlation Coefficient Computed

      Correlation measures the degree of linear relationship between two variables.  The correlation coefficient is denoted by r.  Its magnitude is the square root of the coefficient of determination, $r^2$, and its sign is the sign of the slope of the regression line.  Since $r^2$ can only lie between 0 and 1, r must lie between -1 and 1.  Also, since values of $r^2$ near 1 indicate that the regression line lies close to the data points, i.e. the regression line explains most of the variation in y-values, values of r near -1 or +1 also indicate a regression in which most of the variability in the y's is explained by the regression line.  Values of r near +1 indicate a regression line with positive slope, which implies that there is a direct linear relationship between the x- and y-variables, while values of r near -1 indicate a regression line with negative slope, implying an indirect or inverse linear relationship between the variables.

      Another formula for computing the correlation coefficient is:

      $$r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}$$

      where the symbols in the formula have been defined above.
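      As a rough check, the sketch below computes r as Sxy divided by the square root of Sxx times Syy, which is assumed here to be the formula intended.  It is applied to the four points used earlier and to a second, made-up set of points (the first set with its y-values reversed) to show a negative correlation.  The function name correlation and the second dataset are hypothetical.

```python
import math

def correlation(pts):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(pts)
    sum_x = sum(x for x, _ in pts)
    sum_y = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts) - sum_x ** 2 / n
    syy = sum(y * y for _, y in pts) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in pts) - sum_x * sum_y / n
    return sxy / math.sqrt(sxx * syy)

rising = [(1, 1), (2, 3), (3, 3), (4, 5)]     # points from the earlier example
falling = [(1, 5), (2, 3), (3, 3), (4, 1)]    # hypothetical: same points, y reversed

r_up = correlation(rising)
r_down = correlation(falling)

print(f"rising:  r = {r_up:.4f}, r squared = {r_up ** 2:.2f}")
print(f"falling: r = {r_down:.4f}, r squared = {r_down ** 2:.2f}")
```

      Both sets give the same r squared of 0.9, but r is positive for the rising points and negative for the falling ones, matching the sign of the regression slope.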

       

    • Relationship between the Correlation Coefficient, the Scattergram, and the Regression Line

      This link takes you to an interactive demonstration that shows the relationship between the correlation coefficient and the regression line.  When the page opens click on Interactive Scatterplot.  After the simulation for the scatterplot opens, you can place points on the display by clicking the mouse button.  After the 2nd point has been placed the regression line will be drawn.  In addition, the correlation coefficient and other statistics will be shown.

      Here is a link to another demonstration of the relationship between the scattergram or scatterplot and the correlation coefficient.  Once the page opens click on the + symbol next to Statistical Application--the display will change, then click on the + next to correlation, then click on the + next to applets, and finally click on correlation movie.  When the movie plays you will see the relationship between points in the plane and the correlation coefficient. 

    • Computation of Regression Lines and Coefficients of Determination

Make the following regression calculations on the FOCUS database using Webstat2.  You can open Webstat2 by pushing the orange button above.

1. Make a scatterplot of the points where SAT Math is the y-variable and SAT Verbal is the x-variable.  Then find and plot the regression line on the scatterplot of points, find the coefficient of determination and correlation coefficient.  Relate these coefficients to the regression line plotted on the graph.

2. Do the same as in number 1 but make HS GPA the x-variable and Cumulative GPA the y-variable.

3. Finally, answer the same questions as in 1 but use hours as the x-variable and hsgpa as the y-variable.