Wednesday, October 19, 2011


   How can we solve part c in Excel?


     A 10-year study conducted by the American Heart Association provided data on how age, blood pressure, and smoking relate to the risk of strokes.  Assume that the following data are from a portion of this study.  Risk is interpreted as the probability (times 100) that the patient will have a stroke over the next 10-year period.  For the smoking variable, define a dummy variable with 1 indicating a smoker and 0 indicating a nonsmoker.

a)   Develop an estimated regression equation that relates risk of a stroke to the person’s age, blood pressure, and whether the person is a smoker.
b)  Is smoking a significant factor in the risk of a stroke?  Explain.  Use α = .05.
c)   What is the probability of a stroke over the next 10 years for Art Speen, a 68-year-old smoker who has blood pressure of 175?  What action might the physician recommend for this patient? (15/38)


a.     The Excel output is shown below:

Regression Statistics

Multiple R           0.9346
R Square             0.8735
Adjusted R Square    0.8498
Standard Error       5.7566
Observations         20

ANOVA
               df          SS           MS           F        Significance F
Regression      3     3660.7396    1220.247     36.8230       2.06404E-07
Residual       16      530.2104      33.1382
Total          19     4190.95

              Coefficients   Standard Error    t Stat      P-value
Intercept       -91.7595         15.2228       -6.0278     1.76E-05
Age               1.0767          0.1660        6.4878     7.49E-06
Pressure          0.2518          0.0452        5.5680     4.24E-05
Smoker            8.7399          3.0008        2.9125     0.0102


                 yhat = -91.7595 + 1.0767 Age + .2518 Pressure + 8.7399 Smoker

         b.     Since the p-value corresponding to t = 2.9125 is .0102 < α = .05, smoking is a significant factor.

         c.     The point estimate is 34.27; the 95% prediction interval is 21.35 to 47.18.  Thus, the probability of a stroke (.2135 to .4718 at the 95% confidence level) appears to be quite high.  The physician would probably recommend that Art quit smoking and begin some type of treatment designed to reduce his blood pressure.

Part c: first calculate the point estimate: yhat = -91.7595 + 1.0767*68 + 0.2518*175 + 8.7399 ≈ 34.27

The question says that the risk is the probability multiplied by 100 (see the problem statement above). So we divide: 34.27/100 = 0.3427. This is the estimated probability that this patient will have a stroke over the next 10 years.
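If you want to check the arithmetic outside Excel, here is a minimal Python sketch of the same calculation, just plugging the rounded coefficients from the output above into the equation:

```python
# Point estimate for part (c): a 68-year-old smoker with blood pressure 175.
# Coefficients are the rounded values printed in the Excel output above.
age, pressure, smoker = 68, 175, 1

risk = -91.7595 + 1.0767 * age + 0.2518 * pressure + 8.7399 * smoker
print(round(risk, 2))        # 34.26 with these rounded coefficients (34.27 with full precision)
print(round(risk / 100, 4))  # 0.3426, i.e. roughly a 34% chance of a stroke over the next 10 years
```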

Outliers and residuals Part 3 Q 11

Question 11 in Part 3 asks you to detect outliers. We are trying to find out whether any of the assumptions behind Ordinary Least Squares regression have been violated. The easiest way to do this is to look at the standardised residuals. You can get the standardised residuals by checking the appropriate box when you set up the regression in Excel. Let's look at this question:

The dataset AUTO2 reports the price, horsepower, and ¼-mile speed for 16 popular sports and GT cars.

a)   Find the estimated regression equation, which uses price and horsepower to predict ¼-mile speed.
b)  Plot the standardized residuals against the predicted values ŷ. Does the residual plot support the assumption about ε?  Explain.
c)   Check for any outliers.  What are your conclusions? (15/42)

Below is the column of standardised residuals from the regression output for this question:
Look at the column, which gives a z-score for each residual. We are looking for a z-score larger than +3 or smaller than -3. There aren't any, so there are no outliers. I will not ask you to plot the standardised residuals against the predicted y values, which is what b) is asking.
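If you'd like to do the same check in code, here is a minimal sketch of the idea. The observed and predicted values below are made up for illustration (the AUTO2 numbers aren't reproduced here), and Excel's standardised residuals are computed a little differently, but the ±3 rule is the same:

```python
# Flag outliers using standardised residuals (a simple z-score version).
import numpy as np

y    = np.array([14.1, 13.2, 15.0, 14.6, 13.8, 16.2])   # observed quarter-mile speeds (hypothetical)
yhat = np.array([14.0, 13.5, 14.7, 14.4, 14.1, 15.8])   # predicted values (hypothetical)

residuals = y - yhat
z = (residuals - residuals.mean()) / residuals.std(ddof=1)   # z-score of each residual

print(z.round(2))
print("Any outliers beyond +/- 3?", bool((np.abs(z) > 3).any()))
```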



Tuesday, October 18, 2011

Prediction intervals

In Excel it is rather difficult to calculate prediction intervals. It can be done in StatTools, but we are no longer using StatTools, so there won't be any questions about prediction intervals. BUT you should be able to calculate a point prediction just by plugging the numbers into a regression equation. So, in this question from Part 3: make sure you can do a) but forget about b).


Prediction Intervals

1.         The owner of Showtime Movie Theaters, Inc., used multiple regression analysis to predict gross revenue (y) as a function of television advertising (x1) and newspaper advertising (x2).  The estimated regression equation was

   yhat = 83.2 + 2.29x1 + 1.30x2

a)   What is the gross revenue expected for a week when $3500 is spent on television advertising (x1 = 3.5) and $1800 is spent on newspaper advertising (x2 = 1.8)?

b)  Provide a 95% prediction interval for next week’s revenue, assuming that the advertising expenditures will be allocated as in part (a). (15/29)

a.     yhat = 83.2 + 2.29(3.5) + 1.30(1.8) = 93.555, or $93,555

         b.     Using StatTools, the prediction interval estimate: 91.774 to 95.401 or $91,774 to $95,401
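For part (a), the same plug-in calculation as a quick Python sketch (remembering that x1 and x2 are in thousands of dollars):

```python
# Point prediction of gross revenue from the estimated regression equation.
x1, x2 = 3.5, 1.8                        # television and newspaper advertising, in $000s
revenue = 83.2 + 2.29 * x1 + 1.30 * x2
print(round(revenue, 3))                 # 93.555, i.e. about $93,555
```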

How did I get the EVPI in Q4 of Part 2

This is a question worth studying. The table above shows the EMV for the three possible decisions. How much is the most you should pay for perfect information? To find out, take the best payoff under each state of nature (here the smallest, because these are costs), multiply each one by the probability of that state, and add them up. That gives the best possible expected value, i.e. the expected value with perfect information. Then take the difference between it and the best EMV (the highest for profits, the lowest here for costs): that difference is what perfect information would be worth.

Here we go: 180*0.4 + 201.25*0.6 = 193 (rounded). The lowest EMV is 198 (rounded). The difference between them is 198 - 193 = 5. You should not pay more than 5 for perfect information. So in correct notation: EVPI = 5. The expected value of perfect information is 5. Note that in this example we are dealing with costs, so we want the lowest. If the problem concerned profits, we would want the highest.
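Here is the same arithmetic as a small Python sketch, using only the numbers quoted above (180 and 201.25 are the best, i.e. lowest, costs under the two states of nature, with probabilities 0.4 and 0.6; 198 is the lowest EMV, all rounded):

```python
# EVPI for a cost table: expected value with perfect information versus the best EMV.
best_cost_per_state = [180, 201.25]      # lowest cost under each state of nature
probabilities       = [0.4, 0.6]

ev_with_perfect_info = sum(c * p for c, p in zip(best_cost_per_state, probabilities))
best_emv = 198                           # lowest EMV of the three decisions (rounded)

evpi = best_emv - ev_with_perfect_info   # for costs: best EMV minus EVwPI
print(round(ev_with_perfect_info, 2), round(evpi, 2))   # 192.75 and 5.25 -- about 193 and 5
```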

Why do we square time?

Why do we square time when the curve looks “quadratic”? Sometimes the rate of increase of the dependent variable is not constant. If it is constant, we have a linear trend, which we can model with a straight line using Ordinary Least Squares. But sometimes the dependent variable increases or decreases more and more rapidly as time goes on. For example, if your pay increased by 10% every month, you would soon be very rich. If we square time, the variable on the horizontal x axis, we can capture this effect: square 2 and you get 4, which is only twice as big; square 10 and you get 100, which is ten times as big. Squaring time accommodates the increasing rate of change.
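As a rough illustration, here is a minimal Python sketch of fitting a quadratic trend: the made-up series below grows by about 10% per period, and we regress it on both t and t²:

```python
# Fit a quadratic trend by regressing y on 1, t and t**2 (ordinary least squares).
import numpy as np

t = np.arange(1, 13)                     # time periods 1..12
y = 100 * 1.10 ** t                      # hypothetical series growing ~10% per period

X = np.column_stack([np.ones_like(t, dtype=float), t, t ** 2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

trend = X @ coeffs                       # fitted quadratic trend values
print(coeffs.round(3))                   # intercept, slope on t, slope on t-squared
```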

Changing the reference level dummy...any effect?

Does it matter which dummy you pick as the reference level? No, it doesn't. Just remember which dummy is the reference. One especially alert and hardworking student changed the reference-level dummy in the Mutual Funds dataset. Her results are below: the top output is what we did in class, the bottom uses a different reference level. Now try to estimate the return for a fund which is ‘IE’. You will find the results are the same.
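To see why in miniature, here is a sketch with a tiny made-up dataset (not the Mutual Funds data): the dummy coefficients change when you switch the reference level, but the fitted values, and hence any estimated return, do not.

```python
# Changing the reference-level dummy does not change the fitted values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fund_type": ["IE", "DE", "FI", "IE", "DE", "FI", "IE", "DE"],   # made-up fund types
    "return":    [12.1,  9.8,  5.3, 11.4, 10.2,  5.9, 12.7,  9.5],   # made-up returns
})

def fitted_values(df, reference):
    """OLS of return on fund-type dummies, omitting `reference` as the base level."""
    dummies = pd.get_dummies(df["fund_type"], dtype=float).drop(columns=reference)
    X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
    beta, *_ = np.linalg.lstsq(X, df["return"].to_numpy(), rcond=None)
    return X @ beta

print(np.allclose(fitted_values(df, "IE"), fitted_values(df, "DE")))   # True
```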


Saturday, October 1, 2011

Holding other variables constant in multiple regression

Someone asked me to explain what it means to "hold the other variables constant" in a multiple regression. One way of answering this great question is through a surface chart, a 3D representation. Below I've made a surface chart with two X independent variables, and the Y dependent variable on the vertical axis. So the height of the coloured surface shows the size of the dependent variable at any values of the two independent variables. I used the regression equation yhat = -18.36 + 2*X1 + 4.7*X2, which was Q2 in the in-class work. I calculated the value of Y for X1 and X2 from 0 to 50 and 0 to 15 respectively. Here is the plot:

and here is the data:

The Y values are in the internal cells, the two X variables along the edges. So, for example, if we held X2 constant at 3, then you can see that the change in the Y variable as X1 changes from 20 to 30 would be from 36.1 to 56.1. This is a simple example. Often the surface isn't flat but curved. HTH.
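A quick way to reproduce that grid yourself is the short Python sketch below, using the same equation (yhat = -18.36 + 2*X1 + 4.7*X2). The point it makes: hold X2 fixed at any value and a 10-unit step in X1 always changes yhat by 2 × 10 = 20, the X1 coefficient times the step.

```python
# Build the grid behind the surface chart: Y at every combination of X1 and X2.
import numpy as np

x1 = np.arange(0, 51, 10)            # X1 from 0 to 50
x2 = np.arange(0, 16, 3)             # X2 from 0 to 15
X1, X2 = np.meshgrid(x1, x2)

Y = -18.36 + 2 * X1 + 4.7 * X2       # height of the surface at each (X1, X2) pair

row = Y[x2 == 3].ravel()             # the row where X2 is held constant at 3
print(np.diff(row).round(2))         # every step is 20.0 -- the X1 coefficient times 10
```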

Wednesday, September 28, 2011

Regression Links

Here are links to YouTube videos I've made on various aspects of regression. I'll be putting up more as we progress through the technique.

Simple linear regression

This is an unrealistic example just to illustrate the theory.

Multiple regression with a dummy variable

Here I show how to convert a categorical variable which is either Yes or No into a 1/0 dummy using Excel's IF() function. The new variable can then be used as an independent variable in a multiple regression. Uses the 'stroke' data for predicting risk of a stroke from age, blood pressure and smoking history.
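For anyone working outside Excel, here is the equivalent step as a tiny Python sketch (the column values are made up; in Excel the same thing is a formula along the lines of =IF(A2="Yes",1,0)):

```python
# Turn a Yes/No column into a 1/0 dummy variable for use in a regression.
import pandas as pd

df = pd.DataFrame({"Smoker": ["Yes", "No", "No", "Yes"]})    # hypothetical values
df["SmokerDummy"] = (df["Smoker"] == "Yes").astype(int)      # 1 = smoker, 0 = nonsmoker
print(df)
```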

Monday, September 19, 2011

Project Ideas

On this page I am going to keep adding ideas I get for projects/assignments. I am not saying that you should do them or that any of them are any good; they are just examples of possible starting points. The ideas came from the data in the links I posted earlier.

1. Border crossings. Does the number of border crossings made by Canadians depend on the exchange rate? The weather? How can we predict border crossings and so optimise the staffing of the border posts?

2. Bike travel on Burrard Bridge. Data is available for bike traffic and car traffic around Hornby/Burrard. There has been a lot of controversy over the dedicated bike lanes. Does the amount of bike traffic depend on the weather (I guess if it is raining/very cold, bicyclists are possibly less likely to get into those ghastly Spandex suits?)...or perhaps the hours of daylight? or gas prices? Build a model that reflects these decisions made by bikers and give advice to City Hall. A good model could be used elsewhere for prediction.

more ideas to come....!

Software Challenge!

I've been thinking about the software we are using, and the fact that some of you might have a hard time using either the library or lab workstations to access Palisades. So I have rewritten parts of the course (eg Part 3 Regression, Part 4 Time Series) so that we use basic Excel. I think we can also use Excel for Part 5 but I am not sure yet. This will make it easier for you to cooperate on your projects/assignments.

However, Excel can't do the decision-tree work we have been performing with Precision Tree, and I don't think it can handle Part 6 Simulation. So....

There is a lot of free open-access software about. If you can, please take a look and see if there is anything we could use. It would need to be reliable and 'safe' of course. Thanks to anyone taking on this challenge!

Data Sources

You can either collect your own data or find external sources, such as the Internet. Here are some sources to get you started. If you find more that you'd like to share, please let me know and I'll add them.


Vancouver
World Bank
 BC Data
Kwantlen Library…business stats
http://www.kwantlen.ca/library/subjects/busi_stats.html

Weather -- including historical weather data for any city. Might be useful if you were doing the bike-riding example!
www.wunderground.com

Friday, September 9, 2011

Books on reserve and textbook link

I've placed the books below on reserve in the library. These are my personal copies, so I'd be grateful for your care and attention! The first four books are source books for Part 1. The fifth book is the course text (but see the notes below).


How we decide by Jonah Lehrer

De La Mettrie’s Ghost: The Story of Decisions by Chris Nunn

Flirting with Disaster by Marc Gerstein. How some really terrible decisions have been made and why. Chapter 2 is about why humans have difficulty with probability.

Risk by Dan Gardner. Explores the ‘gut’ and ‘head’ responses we discussed in Part 1

Practical Management Science by Winston and Albright. This is the third edition. The recommended edition for the class is the fourth edition. But for our purposes this edition is fine. Note that the chapter numbering is not the same as in the course presentation, which is based on the fourth edition. I have the CD. If you need it, please let me know. But I will be posting the datasets as and when necessary.

AND one of your fellow students has found that you can check out the e-version of the course text here:
http://search.barnesandnoble.com/Data-Analysis-and-Decision-Making/S-Christian-Albright/e/9780538476126