Sunday, October 14, 2012

Correlation and regression

When you are deciding which variables to try out in a multiple regression, looking at their correlation with the dependent variable is a good start. You can get the correlations from Excel, using Data Analysis, as I show in this Youtube.

Here is an example, where the dependent variable is Duration:


The number given is the 'r' value....Pearson Product Moment....which goes from - 1 to + 1. Mostly, a variable with an 'r' value of greater than + 0.7 or smaller than - 0.7 will work in the regression. So I would think that Length ( r = 0.779) would work. 

Points to note: careful of 'multicollinearity'....which is were the independent variables are explaining the same thing. Length doesn't seem to be collinear with the other independent variables (look along the row of Length).


Wednesday, October 19, 2011


   How can we solve part c in Excel?


     A 10-year study conducted by the American Heart Association provided data on how age, blood pressure, and smoking relate to the risk of strokes.  Assume that the following data are from a portion of this study.  Risk is interpreted as the probability (times 100) that the patient will have a stroke over the next 10-year period.  For the smoking variable, define a dummy variable with 1 indicating a smoker and 0 indicating a nonsmoker.

a)   Develop an estimated regression equation that relates risk of a stroke to the person’s age, blood pressure, and whether the person is a smoker.
b)  Is smoking a significant factor in the risk of a stroke?  Explain.  Use a  .05.
c)   What is the probability of a stroke over the next 10 years for Art Speen, a 68-year-old smoker who has blood pressure of 175?  What action might the physician recommend for this patient? (15/38)


a.     The Excel output is shown below:

Regression Statistics




Multiple R
0.9346




R Square
0.8735




Adjusted R Square
0.8498




Standard Error
5.7566




Observations
20










ANOVA






df
SS
MS
F
Significance F
Regression
3
3660.7396
1220.247
36.8230
2.06404E-07
Residual
16
530.2104
33.1382


Total
19
4190.95










Coefficients
Standard Error
t Stat
P-value

Intercept
-91.7595
15.2228
-6.0278
1.76E-05

Age
1.0767
0.1660
6.4878
7.49E-06

Pressure
0.2518
0.0452
5.5680
4.24E-05

Smoker
8.7399
3.0008
2.9125
0.0102


                 = -91.7595 + 1.0767 Age + .2518 Pressure + 8.7399 Smoker

         b.     Since the p-value corresponding to t = 2.9125 is .0102 < a = .05, smoking is a significant factor.

         c.     The point estimate is 34.27; the 95% prediction interval is 21.35 to 47.18.  Thus, the probability of a stroke (.2135 to .4718 at the 95% confidence level) appears to be quite high.  The physician would probably recommend that Art quit smoking and begin some type of treatment designed to reduce his blood pressure.

Part c: first calculate the point estimate: yhat = - 91.7595 + 10.767*68 + 0.2518*175 + 8.7399 = 34.27

The question says that the probabilities are mulliplied by 100 (see highlighted section above). So we would divide 34.27/100 = 0.3427. This is the risk that this patient runs. 

Outliers and residuals Part 3 Q 11

Question 11 in Part 3 asks to detect outliers. We are trying to find out if any of the assumptions regarding Ordinary Least Squares regression have been violated. The easiest way to do this is to look at the standardised residuals. You can find the standardised residuals if you check the appropriate box when organising the regression. Let's look at this question:

 The dataset AUTO2  reports the price, horsepower, and ¼ -mile speed for 16 popular sports and GT cars.

a)   Find the estimated regression equation, which uses price and horsepower to predict ¼ -mile speed.
b)  Plot the standardized residuals against the predicted value y. Does the residual plot support the assumption about ?  Explain.
c)   Check for any outliers.  What are your conclusions? (15/42)

Below is the column of standardised residuals from the regression output for this question:
Look at the column, which is z-scores for each residual. We are looking for a z score which is larger than + 3 or smaller than - 3. There aren't any. So no outliers. I will not ask you to plot the standardised residuals against the predicted y value, which is what b) is asking.



Tuesday, October 18, 2011

Prediction intervals

In Excel it is rather difficult to calculate prediction intervals. It can be done in StatTools, but we are no longer using StatTools. So there won't be any questions about prediction intervals. BUT you should be able to calculate a point prediction just by plugging in the numbers in a regression equation. So, in this question from Part 3: make sure you can do a) but forget about b).


Prediction Intervals

1.         The owner of Showtime Movie Theaters, Inc., used multiple regression analysis to predict gross revenue (y) as a function of television advertising (x1) and newspaper advertising (x2).  The estimated regression equation was

   83.2  2.29x1  1.30x2

a)   What is the gross revenue expected for a week when $3500 is spent on television advertising (x1  3.5) and $1800 is spent on newspaper advertising (x2  1.8)?

b)  Provide a 95% prediction interval for next week’s revenue, assuming that the advertising expenditures will be allocated as in part (a). (15/29)

a.     = 83.2 + 2.29(3.5) + 1.30(1.8) = 93.555 or $93,555

         b.     Using StatTools, the prediction interval estimate: 91.774 to 95.401 or $91,774 to $95,401

How did I get the EVPI in Q4 of Part 2

This is a question worth studying. The table above shows the EMV for the three possible decisions. How much is the most you should pay for perfect information? We can find this by looking at the biggest --- or in this case smallest payoff because these are costs -- and then multiply that payoff with the probabilities, then adding up. That would be the best possible expected value. Then subtract the highest (or in this case lowest) EMV to find the difference that having perfect information would make.

Here we go: 180*04 + 201.25*0.6 = 193 (rounded). The lowest EMV is 198 (rounded). The difference between them is 198 - 193 = 5. You should not pay more than 5 for perfect information. So in correct notation: EVPI = 5. The expected value of perfect information is 5. Note that in this example we are dealing with costs, so want the lowest. If the problem was concerning profits, we would want the highest.

Why do we square time?

Why do we square time when the curve looks “quadratic”. Sometimes the rate of increase of the dependent variable is not constant. If it is constant, then we have a linear trend. We can model that with a straight line and use Ordinary Least Squares . But sometimes the dependent variable is increasing/ decreasing  more rapidly as time goes on. For example, if your pay increased by 10% every month, you would soon be very rich. If we square time, the variable on the horizontal x axis, we can capture this effect. Example: you square ‘2’ and get ‘4’ which is only twice as big. You square ‘10’ you get ‘100’, which is ten times as big. Squaring time accommodates the increasing rate of change.

Changing the reference level dummy...any effect?

Does it matter which dummy you pick as the reference level? No, it doesn’t. Just remember which dummy is the reference. One especially alert and hardworking student changed the reference level dummy in the Mutual Funds dataset. Her results are below: top is what we did in class, below with a different reference level. Now, try to estimate the return for a fund which is ‘IE’. You will find the results are the same.