Comparing Two Quantitative Variables | STAT 800

As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display technique for comparing two quantitative variables. We plot the response variable on the y-axis and the explanatory (predictor) variable on the x-axis.

How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome, while the response variable measures the outcome of a study. One may even explore whether one variable causes the variation in another variable; for example, a popular research question is whether taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.

In summarizing the relationship between two quantitative variables, we need to consider:

  1. Association/Direction (i.e. positive or negative)
  2. Form (i.e. linear or non-linear)
  3. Strength (weak, moderate, strong)

Example

We will refer to the Exam Data set (Final.MTW or Final.XLS), which consists of a random sample of 50 students who took Stat200 last semester. The data consist of each student's semester average on the mastery quizzes and their score on the final exam. We construct a scatterplot showing the relationship between Quiz Average (the explanatory or predictor variable) and Final (the response variable). Thus, we are studying whether student performance on the mastery quizzes explains the variation in final exam scores; that is, can mastery quiz performance be considered a predictor of final exam score? We can create this graph using either Minitab or SPSS.

To create a scatterplot in Minitab:

  1. Open the Exam Data set.
  2. From the menu bar select Graph > Scatterplot > Simple
  3. In the text box under Y Variables enter Final and under X Variables enter Quiz Average
  4. Click OK

[Minitab scatterplot of Final versus Quiz Average]

To create a scatterplot in SPSS:

  1. Import the data set
  2. From the menu bar select Graphs > Legacy Dialogs > Scatter/Dot
  3. Select the square Simple Scatter and then click Define.
  4. Click on variable Final and enter this in the Y_Axis box.
  5. Click the variable Quiz Average and enter this in the X_Axis box.
  6. Click OK

This should result in the following scatterplot:

[SPSS scatterplot of Final versus Quiz Average]
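If you are working outside Minitab or SPSS, the same scatterplot can be drawn in Python. This is a minimal sketch, not part of the original lesson; the file name final.csv and the column names are assumptions about how the Exam Data were exported.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical CSV export of the Exam Data (Final.MTW / Final.XLS);
    # the file name and column names are assumptions.
    exam = pd.read_csv("final.csv")

    plt.scatter(exam["Quiz Average"], exam["Final"])
    plt.xlabel("Quiz Average (explanatory variable)")  # predictor on the x-axis
    plt.ylabel("Final (response variable)")            # response on the y-axis
    plt.title("Final Exam Score versus Quiz Average")
    plt.show()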

Association/Direction and Form

We can interpret from either graph that there is a positive association between Quiz Average and Final: low quiz averages are accompanied by low final scores, and high quiz averages by high final scores. If this relationship were reversed, with high quiz averages paired with low final scores, then the graph would have displayed a negative association. That is, the points in the graph would have decreased going from left to right.

The scatterplot can also be used to describe the form of the relationship. In this example the relationship appears linear: the points scatter about a straight line, with no curvature or change of direction.

Strength

To measure the strength of a linear relationship between two quantitative variables, we use the correlation coefficient, often simply called the correlation. We calculate the correlation in Minitab (using the Exam Data) as follows:

  1. From the menu bar select Stat > Basic Statistics > Correlation
  2. In the window box under Variables, enter Final and Quiz Average
  3. Click OK (for now we will disregard the p-value in the output)

The output gives us a Pearson correlation of r = 0.609.

Correlation Properties (NOTE: the symbol for correlation is r)

  1. Correlation is unit free. If we changed the final exam scores from percents to decimals, the correlation would remain the same (this is verified in the sketch following this list).
  2. Correlation, r, is limited to −1 ≤ r ≤ 1.
  3. For a positive association, r > 0; for a negative association, r < 0.
  4. Correlation, r, measures the linear association between two quantitative variables.
  5. Correlation measures the strength of a linear relationship only. Two variables can be strongly related yet have a correlation near 0; for example, points lying on the parabola y = x² over an interval centered at 0 have a correlation of 0 even though y is completely determined by x.
  6. The closer r is to 0, the weaker the relationship; the closer to 1 or −1, the stronger the relationship. The sign of the correlation provides direction only.
  7. Correlation can be affected by outliers.
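These properties are easy to verify numerically. The following minimal Python sketch, using made-up scores rather than the Exam Data, computes r with numpy and checks property 1 (correlation is unit free):

    import numpy as np

    # Made-up quiz averages and final exam percents, for illustration only.
    quiz  = np.array([70.0, 75.0, 80.0, 85.0, 90.0, 95.0])
    final = np.array([65.0, 72.0, 70.0, 80.0, 85.0, 88.0])

    r = np.corrcoef(quiz, final)[0, 1]
    print(round(r, 3))   # a strong positive correlation for these made-up scores

    # Property 1: correlation is unit free. Changing the final scores
    # from percents to proportions leaves r unchanged.
    print(np.isclose(r, np.corrcoef(quiz, final / 100)[0, 1]))   # True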

Equations of Straight Lines: Review

The equation of a straight line is given by y = a + bx. When x = 0, y = a, so a is the intercept of the line; b is the slope of the line: it measures the change in y per unit change in x.

Two examples:

    Data 1            Data 2
    x     y           x     y
    0     3           0    13
    1     5           1    11
    2     7           2     9
    3     9           3     7
    4    11           4     5
    5    13           5     3

For Data 1 the equation is y = 3 + 2x; the intercept is 3 and the slope is 2. The line slopes upward, indicating a positive relationship between x and y.

For Data 2 the equation is y = 13 - 2x; the intercept is 13 and the slope is -2. The line slopes downward, indicating a negative relationship between x and y.

[Plot for Data 1: y = 3 + 2x]

[Plot for Data 2: y = 13 - 2x]

The relationship between x and y is 'perfect' for these two examples: the points fall exactly on a straight line, and the value of y is determined exactly by the value of x. Our interest, however, will be in relationships between two variables that are not perfect. The correlation between x and y is r = 1.00 for Data 1 and r = -1.00 for Data 2.
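As a quick numerical check, the following Python sketch reproduces both correlations from the data above:

    import numpy as np

    # Data 1 and Data 2 from the tables above.
    x = np.array([0, 1, 2, 3, 4, 5])
    y1 = 3 + 2 * x     # Data 1: y = 3 + 2x
    y2 = 13 - 2 * x    # Data 2: y = 13 - 2x

    print(np.corrcoef(x, y1)[0, 1])   #  1.0 (perfect positive relationship)
    print(np.corrcoef(x, y2)[0, 1])   # -1.0 (perfect negative relationship)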

Regression analysis is concerned with finding the 'best' fitting line for predicting the average value of a response variable y using a predictor variable x.

Least Squares Regression

The best description of many relationships between two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, the term is associated with Sir Francis Galton, who in the late 1800s studied the phenomenon that the heights of children of tall parents tended to "regress" toward mediocrity, that is, toward the average.

Adjusting the algebraic line expression, the regression line is written as:

\(\hat{y}=b_0+b_1 x \)

Here, \(b_0\) is the y-intercept and \(b_1\) is the slope of the regression line.

Some questions to consider are:

  1. Is there only one “best” line?
  2. If so, how is this line found?
  3. Assuming we have properly fitted a line to the data, what does this line tell us?

By answering the third question we should gain insight into the first two questions.

We use the regression line to predict a value of \(\hat{y}\) for any given value of x. The "best" line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals, and these residuals represent the errors in predicting y. As in any prediction or estimation process, we want these errors to be as small as possible. To accomplish this goal of minimum error, we use the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and the sum of squared residuals appear as follows:

Residuals: \(y-\hat{y}\)

Sum of squared residuals: \(\sum{(y-\hat{y})^2}\)

A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. The calculus solution results in the following formulas for \(b_0\) and \(b_1\):

\(b_1=r\frac{S_y}{S_x}\) \(b_0=\bar{y}-b_1\bar{x}\)

Another way of looking at the least squares regression line is that when x takes its mean value, y also takes its mean value; that is, the regression line always passes through the point (\(\bar{x}\), \(\bar{y}\)). As for the other expressions in the slope equation, \(S_y\) is the sample standard deviation of y: the square root of the sum of squared deviations between the observed y-values and their mean, divided by n − 1. \(S_x\) is defined in the same way for x. (Because only the ratio \(S_y/S_x\) enters the slope, the common divisor cancels.)
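As a check on these formulas, here is a minimal Python sketch that computes \(b_1 = r S_y / S_x\) and \(b_0 = \bar{y} - b_1\bar{x}\) directly and recovers the line y = 3 + 2x from Data 1 in the earlier table:

    import numpy as np

    def least_squares_line(x, y):
        # b1 = r * Sy / Sx and b0 = ybar - b1 * xbar,
        # where Sx and Sy are sample standard deviations (ddof=1).
        r = np.corrcoef(x, y)[0, 1]
        b1 = r * y.std(ddof=1) / x.std(ddof=1)
        b0 = y.mean() - b1 * x.mean()   # forces the line through (xbar, ybar)
        return b0, b1

    # Check against Data 1 from the earlier table: should recover y = 3 + 2x.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = 3 + 2 * x
    print(least_squares_line(x, y))   # (3.0, 2.0), up to floating point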

Example: Exam Data set (Final.MTW or Final.XLS)

To perform a regression on the Exam Data we can use either Minitab or SPSS:

To perform the regression in Minitab:
  1. From the menu bar select Stat > Regression > Regression
  2. In the window box by Response enter the variable Final
  3. In the window box by Predictors enter the variable Quiz Average
  4. Click the Storage button and select Residuals and Fits (you do not have to do this in order to calculate the line in Minitab, but we are doing this here for further explanation)
  5. Click OK and OK again.

[Minitab regression output for Final versus Quiz Average]

In addition, the first five rows of the data in the worksheet, now including the stored fits and residuals, are shown below:

[First five rows of the worksheet, including the FITS1 and RESI1 columns]

To perform a regression analysis in SPSS:

  1. Import the data set
  2. From the menu bar select Analyze > Regression > Linear
  3. Click on variable Final and enter this in the Dependent box.
  4. Click the variable Quiz Average and enter this in the Independent box.
  5. Click OK

This should result in the following regression output:

[SPSS regression output: model summary, ANOVA table, and coefficients table]
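For readers who want to reproduce this analysis in Python, here is a minimal sketch using statsmodels; again, the file name final.csv and the column names are assumptions about how the Exam Data were exported.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical CSV export of the Exam Data; names are assumptions.
    exam = pd.read_csv("final.csv")

    X = sm.add_constant(exam["Quiz Average"])   # adds the intercept column
    model = sm.OLS(exam["Final"], X).fit()      # ordinary least squares fit

    print(model.summary())          # coefficients, R-squared, t- and p-values
    print(model.fittedvalues[:5])   # analogous to Minitab's FITS column
    print(model.resid[:5])          # analogous to Minitab's RESI column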

WOW! This is quite a bit of output. We will take this output apart, and you will see that these results are not too complicated. Also, if you hover your mouse over various parts of the output in Minitab, pop-ups will appear with explanations.

The Output

From the output we see:

  1. Fitted equation is “Final = 12.1 + 0.751 Quiz Average”.
  2. A value of R-square = 37.0%, which is the coefficient of determination (more on that later). Taking the square root of 0.370 gives 0.608, which matches, up to rounding, the correlation of 0.609 that we found previously for this data set.
  3. NOTE: Both a positive and a negative number square to the same value (0.608² = (−0.608)² = 0.370), so R-square alone does not give the direction of the relationship. The sign of the correlation matches the sign of the slope.

  4. The values under “T” and “P”, as well as the data under Analysis of Variance will be discussed in a future lesson.
  5. For the values under RESI1 and FITS1, the FITS are calculated by substituting the corresponding x-value in each row into the regression equation to obtain the corresponding fitted y-value.
  6. For example, substituting the first Quiz Average of 84.44 into the regression equation gives Final = 12.1 + 0.751 × 84.44 ≈ 75.51; Minitab's stored value of 75.5598 is the same calculation carried out with the unrounded coefficients. Using this fitted value, we compute the first residual under RESI1 as the difference between the observed y and the fitted \(\hat{y}\): 90 − 75.5598 = 14.4402. Similar calculations produce the remaining fitted values and residuals (see the sketch after this list).

  7. What does the slope of 0.751 tell us? The slope tells us how y changes as x changes. That is, for this example, as x, Quiz Average, increases by one percentage point we would expect, on average, that the Final percentage would increase by 0.751 percentage points, or by approximately three-quarters of a percent.
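Here is the same hand calculation as a short Python sketch, using the rounded coefficients from the fitted equation:

    # Rounded coefficients from "Final = 12.1 + 0.751 Quiz Average".
    b0, b1 = 12.1, 0.751
    quiz_avg, observed = 84.44, 90.0

    fit = b0 + b1 * quiz_avg      # fitted value, y-hat
    residual = observed - fit     # residual, y - y-hat
    print(round(fit, 2), round(residual, 2))   # ~75.51 and ~14.49
    # Minitab's stored FITS (75.5598) and RESI (14.4402) come from the
    # unrounded coefficients, so they differ slightly from this hand calculation.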

Coefficient of Determination, R²

The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), and we try to predict the value of y from the explanatory variable x. The proportion of the variation in the response variable that can be explained (i.e., accounted for) by the explanatory variable is denoted by R². In our Exam Data example this value is 37%, meaning that 37% of the variation in the final exam scores can be explained (now you know why x is also called an explanatory variable) by the quiz averages. Since this value appears in the output and is related to the correlation, we mention R² now; we will take a further look at this statistic in a future lesson.

Residuals or Prediction Error

As with most predictions, you expect some error; that is, you expect the prediction not to be exactly correct (e.g., when forecasting a final voting percentage you would expect the forecast to be close to, but not necessarily equal to, the actual final percentage). Also, in regression, observations that share the same x-value do not usually share the same y-value; as mentioned earlier, not every person of the same height (x-variable) has the same weight (y-variable). These errors in regression predictions are called prediction errors or residuals. A residual is calculated by taking the observed y-value minus its corresponding predicted y-value, \(y-\hat{y}\); therefore we have as many residuals as we do y observations. The goal in least squares regression is to select the line that minimizes the sum of the squared residuals: in essence, we create a best-fit line that has the least amount of error.
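To make the "least amount of error" idea concrete, the minimal Python sketch below (using made-up data, not the Exam Data) fits the least squares line and confirms that nudging either coefficient away from the fitted values only increases the sum of squared residuals:

    import numpy as np

    def sse(b0, b1, x, y):
        # Sum of squared residuals for the line y-hat = b0 + b1*x.
        residuals = y - (b0 + b1 * x)
        return np.sum(residuals ** 2)

    # Made-up data with an imperfect linear relationship.
    rng = np.random.default_rng(0)
    x = np.linspace(60, 100, 50)
    y = 12 + 0.75 * x + rng.normal(0, 8, size=50)

    b1, b0 = np.polyfit(x, y, 1)   # least squares slope and intercept

    # Any other line has a larger sum of squared residuals.
    print(sse(b0, b1, x, y) < sse(b0 + 1.0, b1, x, y))    # True
    print(sse(b0, b1, x, y) < sse(b0, b1 + 0.05, x, y))   # True

Because the least squares solution is the unique minimizer of the sum of squared residuals, any perturbation of the intercept or slope must increase it.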
