Elementary Statistics

Lesson 12.6 Outliers

Outliers, in General

Outliers are points far from the line of best fit. The difference between the actual y and the estimated y for an outlier is "large." Outliers should be examined closely. In some cases, they should be deleted from the set of data points. In other cases, they should not be deleted at all because they are the key to the population under study. You must carefully examine what causes a data point to be an outlier.

[Figure: Scatterplot showing one possible outlier]

In this course, you will learn one method for determining outliers. When you take higher level courses in Linear Regression, you will learn other methods for determining outliers.


Outlier Calculation

To calculate outliers:

  • Do linear regression.
  • Calculate each (actual y - estimated y) value, $y - \hat{y}$. These values are called residuals.
  • Calculate the SSE, which is the sum of the squares of all the residuals: $SSE = \sum (y - \hat{y})^2$
  • Calculate s, the standard deviation of the residuals: $s = \sqrt{\frac{SSE}{n - 2}}$, where n is the number of data points.
  • Multiply 1.9 by s.
  • Compare the absolute value of each residual to 1.9s.
  • If the absolute value of any residual is greater than or equal to 1.9s, that is, if $|y - \hat{y}| \geq 1.9s$, then the corresponding point is an outlier.
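The steps above can be sketched in Python as follows. This is a minimal sketch and not part of the original lesson: the function name `find_outliers` and its parameters are hypothetical, and it assumes the intercept and slope of the line of best fit have already been found by linear regression.

```python
import math


def find_outliers(points, intercept, slope, multiplier=1.9):
    """Return the points whose absolute residual is at least multiplier * s.

    points    -- list of (x, y) pairs
    intercept -- intercept a of the line yhat = a + b*x (assumed already fitted)
    slope     -- slope b of the same line
    """
    n = len(points)
    if n <= 2:
        raise ValueError("need more than 2 data points so that n - 2 > 0")

    # Residual for each point: actual y minus estimated y.
    residuals = [y - (intercept + slope * x) for x, y in points]

    # SSE: sum of the squared residuals.
    sse = sum(r ** 2 for r in residuals)

    # s: standard deviation of the residuals, dividing by n - 2.
    s = math.sqrt(sse / (n - 2))

    # A point is flagged when |residual| >= multiplier * s (1.9s by default).
    cutoff = multiplier * s
    return [p for p, r in zip(points, residuals) if abs(r) >= cutoff]
```

The multiplier defaults to 1.9 to match the rule used in this lesson. It is left as a parameter only so that the cutoff is easy to see in the code.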

Example: Linear regression produces the following line of best fit:

$\hat{y} = 3.5106 - 0.6723x$

The data points are

(1, 2), (3, 1.5), (4, 1), (2, 2), (3, 1), (5, 0.3), (1, 4).
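As a quick check, the line of best fit can be recomputed from these data points with the usual least-squares formulas. This is only a sketch. The variable names are illustrative and not part of the original lesson.

```python
# Data points from the example, as (x, y) pairs.
points = [(1, 2), (3, 1.5), (4, 1), (2, 2), (3, 1), (5, 0.3), (1, 4)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Least-squares slope: sum of (x - mean_x)(y - mean_y) over sum of (x - mean_x)^2.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
sxx = sum((x - mean_x) ** 2 for x, _ in points)
slope = sxy / sxx

# The least-squares line passes through the point (mean_x, mean_y).
intercept = mean_y - slope * mean_x

print(f"yhat = {intercept:.4f} + ({slope:.4f})x")
```

Rounded to four decimal places, the intercept and slope agree with the line given in the example.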

[Figure: Scatterplot showing a possible outlier]

The table contains the actual y values, the estimated y values calculated from the line of best fit, and the absolute value of the difference.

y                  2      1.5    1      2      1      0.3    4
$\hat{y}$          2.84   1.49   0.82   2.17   1.49   0.15   2.84
$|y - \hat{y}|$    0.84   0.01   0.18   0.17   0.49   0.15   1.16

$SSE = 0.84^2 + 0.01^2 + 0.18^2 + 0.17^2 + 0.49^2 + 0.15^2 + 1.16^2 = 2.38$

n = 7 data points

$s = \sqrt{\frac{2.38}{7 - 2}} = \sqrt{\frac{2.38}{5}} = 0.69$

Multiply s by 1.9: 1.9s = (1.9)(0.69) = 1.31.

Compare each value in the table below to 1.31.

0.84
0.01
0.18
0.17
0.49
0.15
1.16

No value is greater than or equal to 1.31, so no residual satisfies $|y - \hat{y}| \geq 1.9s$. Therefore, no point is an outlier.
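The worked example can be reproduced with a short script. The sketch below is an illustration rather than part of the original lesson, and its variable names are arbitrary. It keeps full precision instead of rounding each intermediate value to two decimal places, so SSE, s, and 1.9s can differ slightly from the hand calculation beyond the second decimal place. The conclusion is the same.

```python
import math

# Data points from the example, as (x, y) pairs.
points = [(1, 2), (3, 1.5), (4, 1), (2, 2), (3, 1), (5, 0.3), (1, 4)]

# Line of best fit given in the example: yhat = 3.5106 - 0.6723x.
intercept, slope = 3.5106, -0.6723

# Absolute value of each residual, |y - yhat|.
abs_residuals = [abs(y - (intercept + slope * x)) for x, y in points]

# SSE, s, and the cutoff 1.9s, all at full precision.
sse = sum(r ** 2 for r in abs_residuals)
s = math.sqrt(sse / (len(points) - 2))
cutoff = 1.9 * s

for (x, y), r in zip(points, abs_residuals):
    flag = "outlier" if r >= cutoff else "not an outlier"
    print(f"({x}, {y}): |y - yhat| = {r:.2f} -> {flag}")

print(f"SSE = {sse:.2f}, s = {s:.2f}, 1.9s = {cutoff:.2f}")
```

Even the largest residual, which belongs to the point (1, 4), stays below the cutoff, so the script labels every point as not an outlier.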


Think About It

Try problem number 87 in Chapter 12 of Introductory Statistics.

Please continue to the next section of this lesson.

 


Content Developed by Susan Dean and Barbara Illowsky, Licensed under a Creative Commons License
Published by the Sofia Open Content Initiative
© 2004 Foothill-De Anza Community College District & The William and Flora Hewlett Foundation