Department of Economics
Columbia University
S3412
Summer 2022
SOLUTION to Problem Set 3
Introduction to Econometrics
Seyhan Erden
Hprice1.dta is a data set collected from the real estate pages of the Boston Globe during 1990.
These are homes that sold in the Boston, MA area. Variables are explained in table 1.
In this problem set you will take a look at some empirical evidence on housing prices of 1990 in
Boston, MA area. Note that, to do this problem set, you will need to create (generate) some new
variables, which are functions of the variables in hprice1.dta.
1. Preliminary data analysis:
a) Produce the scatterplot of price v. lotsize.
[Scatterplot of price (house price, $1000s) against lotsize (size of lot in square feet)]
b) Produce the scatterplot of lprice v. llotsize.
[Scatterplot of lprice (log(price)) against llotsize (log(lotsize))]
c) Produce the scatterplot of price vs. sqrft.
[Scatterplot of price (house price, $1000s) against sqrft (size of house in square feet)]
d) Produce the scatterplot of price vs. lsqrft.
[Scatterplot of price (house price, $1000s) against lsqrft (log(sqrft))]
e) Using the scatterplots from (a) and (b), would you suggest using the variables (i) price
and lotsize or (ii) lprice and llotsize for modeling using linear regression?
The relation between price and lotsize looks nonlinear. Taking logs of both variables makes the
relation look much more like a scatter with a linear relation, the sort of thing that can be well
handled by conventional multiple linear regression methods.
f) Using the scatterplot from (c) and (d), does the relation between price and sqrft appear to
be linear or nonlinear? If nonlinear, what sort of nonlinear curve might you want to
explore (briefly explain)?
From (c), ignoring a couple of outliers, it looks like they have a linear relation.
g) Regress lprice on llotsize, lsqrft, bdrms and colonial. Interpret the coefficient of (i)
llotsize, (ii) lsqft and (iii) bdrms.
. reg lprice llotsize lsqrft bdrms colonial, r

Linear regression                                 Number of obs =      88
                                                  F(4, 83)      =   34.50
                                                  Prob > F      =  0.0000
                                                  R-squared     =  0.6491
                                                  Root MSE      =  .18412

------------------------------------------------------------------------------
             |               Robust
      lprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    llotsize |   .1678189   .0440356     3.81   0.000     .0802338    .2554041
      lsqrft |   .7071931   .1090447     6.49   0.000     .4903076    .9240787
       bdrms |   .0268305    .032718     0.82   0.415    -.0382444    .0919053
    colonial |   .0537962   .0489041     1.10   0.274    -.0434721    .1510645
       _cons |  -1.349589   .8115795    -1.66   0.100    -2.963788    .2646099
------------------------------------------------------------------------------
(i) A 10% increase in lot size is predicted to increase the house price by about 1.68%, keeping other variables constant (the elasticity of price with respect to lot size is 0.168).
(ii) A 10% increase in the size of the house is predicted to increase the price by about 7.07%, keeping other variables constant (the elasticity of price with respect to house size is 0.707).
(iii) An additional bedroom is predicted to increase the price of the house by about 2.68%, keeping other variables constant.
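These readings use the standard log approximation (coefficient times the percent change in the regressor). The exact implied percent change is easy to check; the sketch below is arithmetic only and is not part of the original Stata output:

```python
import math

def exact_pct_change(elasticity, pct_change_x):
    # In a log-log model, delta ln(price) = elasticity * ln(1 + pct_change_x),
    # so the exact percent change in price is 100 * (exp(delta ln(price)) - 1).
    return 100 * (math.exp(elasticity * math.log(1 + pct_change_x)) - 1)

approx_lot = 0.168 * 10                      # log approximation: 1.68% for a 10% rise
exact_lot = exact_pct_change(0.168, 0.10)    # about 1.61%
exact_house = exact_pct_change(0.707, 0.10)  # about 6.97%
```

For small changes the approximation and the exact value agree closely, which is why the elasticity reading is standard.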
h) Now regress lprice on llotsize, llotsize2, lsqrft, lsqrft2, bdrms and colonial. Interpret the
coefficient of (i) llotsize and (ii) lsqrft.
. reg lprice llotsize llotsize2 lsqrft lsqrft2 bdrms colonial, r

Linear regression                                 Number of obs =      88
                                                  F(6, 81)      =   27.46
                                                  Prob > F      =  0.0000
                                                  R-squared     =  0.6756
                                                  Root MSE      =  .17919

------------------------------------------------------------------------------
             |               Robust
      lprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    llotsize |    .049371   .4969611     0.10   0.921    -.9394256    1.038168
   llotsize2 |   .0053273   .0269443     0.20   0.844    -.0482834    .0589381
      lsqrft |    -8.4956   3.052306    -2.78   0.007    -14.56873   -2.422469
     lsqrft2 |   .6025779   .2002404     3.01   0.003     .2041623    1.000994
       bdrms |   .0104218   .0323804     0.32   0.748    -.0540051    .0748486
    colonial |   .0911863      .0481     1.90   0.062    -.0045176    .1868901
       _cons |   34.40863   12.08688     2.85   0.006     10.35953    58.45772
------------------------------------------------------------------------------
The interpretation of the coefficients on llotsize and lsqrft is now complicated by the inclusion of
the quadratic terms. One can say that the effect of proportional increases in lot size is increasing
at an increasing rate, except neither term is remotely significant. Similarly, one can say that the
effect of proportional increases in square footage is initially decreasing at a decreasing rate
before switching signs. Alternatively, one can find the square footage level that minimizes
lprice, holding other variables constant.
i) Compare the model specification in part (g) to the one in part (h)
Both of these regressions have problems with insignificant variables: in (g), bdrms and colonial are insignificant; in (h), llotsize and llotsize2 are insignificant as well. Overall fit measures are better in (h), but it nevertheless seems that a better model specification exists.
j) Regress price on lotsize, sqrft, bdrms and bdrms2. Is there an optimum number of
bedrooms that maximizes (or minimizes) the price of a house? (hint: check the sign of the
quadratic term)
. reg price lotsize sqrft bdrms bdrms2, r

Linear regression                                 Number of obs =      88
                                                  F(4, 83)      =   16.55
                                                  Prob > F      =  0.0000
                                                  R-squared     =  0.6794
                                                  Root MSE      =  59.544

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lotsize |   .0020749   .0012125     1.71   0.091    -.0003367    .0044866
       sqrft |   .1221177     .01709     7.15   0.000     .0881265    .1561089
       bdrms |   -40.2742   48.31517    -0.83   0.407    -136.3711    55.82273
      bdrms2 |   6.771337   6.532258     1.04   0.303    -6.221062    19.76374
       _cons |   81.67705   91.26611     0.89   0.373    -99.84758    263.2017
------------------------------------------------------------------------------
Since the coefficient on bdrms2 is positive, the fitted quadratic opens upward, so its turning point is a minimum: approximately 3 bedrooms (40.27/(2 × 6.77) ≈ 3) minimize the price of a house. There is no price-maximizing value!
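A quick check of the turning-point arithmetic, using the bdrms and bdrms2 coefficients from the output (with the sign on bdrms restored):

```python
# price = ... + b*bdrms + a*bdrms^2; the turning point is at bdrms = -b/(2a).
b = -40.2742   # coefficient on bdrms
a = 6.771337   # coefficient on bdrms2; a > 0 means the parabola opens upward (a minimum)
bdrms_min = -b / (2 * a)  # about 2.97, i.e. roughly 3 bedrooms
```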
2. Estimate the regressions in Table 2 and fill in the empty entries. You may write in the entries
by hand or type them using the .doc electronic version of the table on the course Web site.
3. Use the results in Table 2 to answer the following questions.
a) Using regression (1), test the hypothesis that the coefficient on lsqft is zero, against the
alternative that it is nonzero, at the 5% significance level. Explain in words what the
coefficient means.
t = 8.908, so the hypothesis is rejected at the 5% (and 1%) significance level. The estimated coefficient, 0.873, means that a 1% increase in sqrft is associated with a 0.873% increase in house prices.
b) Using regression (3), test the hypothesis that the coefficients on lsqft and lsqft2 are both
zero, against the alternative that one or the other coefficient is nonzero, at the 5%
significance level.
F = 55.68 with p-value 0.00, so the hypothesis that both coefficients are zero (holding constant llotsize and llotsize2) is rejected at the 5% (and 1%) significance level.
c) Using regression (3), is there evidence that the relationship between lprice and llotsize is
nonlinear?
No; the t-statistic testing the hypothesis that llotsize2 has a zero coefficient is 0.095, so the coefficient is not significant even at the 10% significance level.
d) Using regression (3), is there evidence that the relationship between lprice and lsqft is
nonlinear?
Yes; the t-statistic testing the hypothesis that lsqrft2 has a zero coefficient is 2.546, so the coefficient is significant at the 5% significance level.
e) Using regression (5), test the null hypothesis (at the 5% significance level) that the coefficients on the "style dummies" (Colonial and Victorian) are all zero, against the alternative hypothesis that at least one is nonzero. What is the number of restrictions q in your test? What is the critical value of your test?
F = 2.51 with p-value = 0.088, so the hypothesis is not rejected at the 5% significance level. The number of restrictions is the number of coefficients that are zero under the null, here q = 2 (the coefficients on Colonial and Victorian). The 5% critical value of the F(2,82) distribution is approximately 3.10.
Table 1
DATA DESCRIPTION, FILE: hprice1.dta
Variable     Definition
price        House price, in $1000.
assess       Assessed value in $1000.
bdrms        Number of bedrooms.
lotsize      Size of lot in square feet.
sqrft        Size of house in square feet.
victorian    = 1 if house is in Victorian style; = 0 otherwise.
colonial     = 1 if house is in Colonial style; = 0 otherwise.
lprice       Log(price)
lassess      Log(assess)
llotsize     Log(lotsize)
lsqft        Log(sqft)
Problem Set 3, Table 2
Determinants of Housing Prices
Dependent variable: lprice

Regressor                  (1)        (2)        (3)        (4)        (5)
lsqrft                    0.873**    0.762**   −6.796*     0.749**    0.752**
                          (0.098)    (0.077)   (2.974)    (0.081)    (0.083)
(lsqrft)2                   __         __       0.494*       __         __
                                               (0.194)
llotsize                    __       0.168**    0.185      0.056      0.163
                                     (0.038)   (0.382)    (0.562)    (0.544)
(llotsize)2                 __         __       0.002      0.006      0.001
                                               (0.021)    (0.031)    (0.030)
colonial                    __         __        __        0.068      0.008
                                                          (0.047)    (0.085)
victorian                   __         __        __          __       0.117
                                                                     (0.092)
Intercept                 0.975      1.640*    27.242*     1.070      1.553
                          (0.745)    (0.681)   (11.730)   (2.671)    (2.593)

F-statistics testing the hypothesis that the population coefficients on the indicated regressors are all zero:

lsqrft, (lsqrft)2           __         __       55.68        __         __
                                               (0.000)
llotsize, (llotsize)2       __         __       9.94       7.89       9.56
                                               (0.0001)   (0.001)    (0.000)
Style dummies               __         __        __          __       2.51
(Colonial and Victorian)                                             (0.088)

Regression summary statistics
R2                        0.553      0.635      0.655      0.646      0.656
Adjusted R2               0.548      0.627      0.639      0.629      0.635
SER                       0.204      0.185      0.183      0.185      0.183
n                           88         88         88         88         88

SER = (RMSE^2 × (88/86))^(1/2)
Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated coefficients, and p-values are given in parentheses under F-statistics. The F-statistics are heteroskedasticity-robust. Coefficients are significant at the +10%, *5%, **1% significance level.
4. US states differ in the generosity of their welfare programs. Here we wish to analyze which factors play a role in the level of benefits across different states. The data set TANF2.dta contains data from each of 49 states. The variables in the data set are given in the following table:
Table 3
DATA DESCRIPTION, FILE: TANF2.dta
Variable     Definition
tanfreal     State's real maximum benefit for single parent with three kids.
black        Percentage of state's population who are African Americans.
blue         Dummy variable, equals 1 if state voted Democratic in 2004 presidential election.
mdinc        State's median income.
west         = 1 if state is in West; = 0 otherwise.
south        = 1 if state is in South; = 0 otherwise.
midwest      = 1 if state is in Midwest; = 0 otherwise.
northeast    = 1 if state is in Northeast; = 0 otherwise.
Use data set TANF2.dta to examine whether Midwest states differ in their welfare programs from other
states. To do this, we will use the following regression model:
tanfreal = β0 + β1 black + β2 blue + β3 midwest + β4 (black×midwest) + β5 (blue×midwest) + u
Here, black*midwest is the product of the regressors black and midwest and so forth.
(a) Write the null hypothesis to test whether there is a difference between the welfare programs of
Midwest states and all other states, explain.
H0: β3 + β4 black + β5 blue = 0 for all values of black and blue, i.e., β3 = β4 = β5 = 0.
Under H0 the model reduces to tanfreal = β0 + β1 black + β2 blue + u.
That is, the expected level of benefits does not depend on whether the state is in the Midwest or not.
(b) Construct a new set of interaction regressors in STATA. Estimate the model above. Write your answer as a regression equation with standard errors in parentheses underneath each coefficient. Perform the test of the null hypothesis in part (a) with a robust F-test. What is your conclusion?
The fitted model is:
predicted tanfreal = 347.53 − 522.03 black + 31.76 blue + 141.42 midwest − 1420.53 (black×midwest) − 204.14 (blue×midwest)
For non-Midwest states (midwest = 0), the fitted expected benefit level therefore is:
predicted tanfreal = 347.53 − 522.03 black + 31.76 blue
So a larger black population has a negative effect on benefit levels, while Democratic states tend to give higher levels.
For Midwest states (midwest = 1):
predicted tanfreal = 488.95 − 1942.56 black − 172.38 blue
Again, the effect of the size of the black population is negative, but with a much larger impact than for non-Midwest states, and Midwest Democratic states tend to give lower benefits. With the command test, we test H0 in STATA and obtain F = 26.68. STATA reports a p-value of 0.00, based on the F(3,43) distribution, which is close to the F(3,∞) distribution. So we reject the hypothesis and conclude that Midwestern states have significantly different welfare programs compared with non-Midwest ones.
(c) Introduce a new variable nonmidwest = 1 − midwest. That is, nonmidwest = 1 if a state is not in the Midwest and zero otherwise. Consider the following alternative regression model:
tanfreal = γ1 nonmidwest + γ2 (black×nonmidwest) + γ3 (blue×nonmidwest) + γ4 midwest + γ5 (black×midwest) + γ6 (blue×midwest) + u
Write up the hypothesis of no differences in welfare programs in terms of γ1, …, γ6. What is the relationship between the parameters γ1, …, γ6 in this new model and β0, …, β5 in the previous model? Estimate the model in STATA and write the result in the usual regression-equation form with standard errors in parentheses underneath coefficients.
The hypothesis of no differences in welfare programs in terms of γ1, …, γ6 amounts to γ1 = γ4, γ2 = γ5, and γ3 = γ6. We can rewrite the model as
tanfreal = γ1 (1 − midwest) + γ2 (black − black×midwest) + γ3 (blue − blue×midwest) + γ4 midwest + γ5 (black×midwest) + γ6 (blue×midwest) + u
         = γ1 + γ2 black + γ3 blue + (γ4 − γ1) midwest + (γ5 − γ2)(black×midwest) + (γ6 − γ3)(blue×midwest) + u
That is, β0 = γ1, β1 = γ2, β2 = γ3, β3 = γ4 − γ1, β4 = γ5 − γ2, and β5 = γ6 − γ3. So the two models are just different parameterizations of the same equation. This relationship implies that the OLS estimates become
γ̂1 = β̂0 = 347.53, γ̂2 = β̂1 = −522.03, γ̂3 = β̂2 = 31.76,
γ̂4 = β̂3 + γ̂1 = 488.95, γ̂5 = β̂4 + γ̂2 = −1942.56, γ̂6 = β̂5 + γ̂3 = −172.38
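The mapping between the two parameterizations can be verified with simple arithmetic (a sketch using the estimates reported above):

```python
# beta-hats from the interaction model in part (b)
b0, b1, b2, b3, b4, b5 = 347.53, -522.03, 31.76, 141.42, -1420.53, -204.14

# implied gamma-hats: gamma1..3 equal beta0..2; gamma4..6 add the Midwest shifts
g1, g2, g3 = b0, b1, b2
g4 = b3 + g1  # Midwest intercept
g5 = b4 + g2  # Midwest slope on black
g6 = b5 + g3  # Midwest slope on blue
```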
(d) What happens if you include an intercept γ0 in the model in part (c)? Explain.
By including an intercept term, the model suffers from perfect multicollinearity, since midwest + nonmidwest = 1 for every state. Thus we cannot estimate the model by OLS: the model is overparameterized.
Following questions will not be graded, they are for you to practice and will be discussed at
the recitation:
8.1. This table contains the results from seven regressions that are referenced in these answers.
Data from 2008

Dependent variable:   (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
                      AHE        ln(AHE)    ln(AHE)    ln(AHE)    ln(AHE)    ln(AHE)    ln(AHE)    ln(AHE)
Age                   0.585**    0.027**               0.081      0.081      0.124*     0.112      0.146*
                      (0.037)    (0.002)               (0.043)    (0.043)    (0.06)     (0.059)    (0.069)
Age2                                                  −0.00091   −0.00091   −0.0015    −0.0016    −0.0020
                                                      (0.00073)  (0.00073)  (0.0010)   (0.001)    (0.0012)
ln(Age)                                     0.80**
                                            (0.05)
Female × Age                                                                −0.088                −0.093
                                                                            (0.087)               (0.088)
Female × Age2                                                                0.0012                0.0012
                                                                            (0.0015)              (0.0015)
Bachelor × Age                                                                         −0.064     −0.040
                                                                                       (0.087)    (0.088)
Bachelor × Age2                                                                         0.0014     0.0010
                                                                                       (0.0015)   (0.0015)
Female               −3.66**    −0.19**    −0.19**    −0.19**    −0.22**     1.31      −0.22**     1.43
                      (0.21)     (0.01)     (0.01)     (0.01)     (0.02)     (1.27)     (0.02)     (1.28)
Bachelor              8.08**     0.43**     0.43**     0.43**     0.40**     0.40**     1.08       0.69
                      (0.21)     (0.01)     (0.01)     (0.01)     (0.02)     (0.01)     (1.27)     (1.280)
Female × Bachelor                                                 0.069**    0.068**    0.072**    0.072**
                                                                  (0.022)    (0.021)    (0.022)    (0.022)
Intercept            −0.64       1.88**    −0.035      1.09       1.10       0.36       0.78       0.16
                      (0.64)     (0.06)     (0.185)    (0.63)     (0.87)     (0.88)     (1.01)     (1.08)

F-statistics and p-values on joint hypotheses:
(a) F-statistic on terms involving Age                109.8      111.13     59.49      60.69      43.88
                                                      (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
(b) Interaction terms with Age and Age2                                     10.79      10.77      12.13
                                                                            (0.00)     (0.00)     (0.00)

SER                   9.07       0.47       0.47       0.47       0.47       0.47       0.47       0.47
R2                    0.20       0.20       0.20       0.20       0.20       0.20       0.20       0.21

Significant at the *5% and **1% significance level.
(a) The regression results for this question are shown in column (1) of the table. If Age increases
from 25 to 26, earnings are predicted to increase by $0.585 per hour. If Age increases from
33 to 34, earnings are predicted to increase by $0.585 per hour. These values are the same
because the regression is a linear function relating AHE and Age.
(b) The regression results for this question are shown in column (2) of the table. If Age increases from
25 to 26, ln(AHE) is predicted to increase by 0.027. This means that earnings are predicted to
increase by 2.7%. If Age increases from 34 to 35, ln(AHE) is predicted to increase by 0.027.
This means that earnings are predicted to increase by 2.7%. These values, in percentage terms,
are the same because the regression is a linear function relating ln(AHE) and Age.
(c) The regression results for this question are shown in column (3) of the table. If Age increases from 25 to 26, then ln(Age) has increased by ln(26) − ln(25) = 0.0392 (or 3.92%). The predicted increase in ln(AHE) is 0.80 × 0.0392 = 0.031. This means that earnings are predicted to increase by 3.1%. If Age increases from 34 to 35, then ln(Age) has increased by ln(35) − ln(34) = 0.0290 (or 2.90%). The predicted increase in ln(AHE) is 0.80 × 0.0290 = 0.023. This means that earnings are predicted to increase by 2.3%.
(d) The regression results for this question are shown in column (4) of the table. When Age increases from 25 to 26, the predicted change in ln(AHE) is
(0.081 × 26 − 0.00091 × 26^2) − (0.081 × 25 − 0.00091 × 25^2) = 0.035.
This means that earnings are predicted to increase by 3.5%.
When Age increases from 34 to 35, the predicted change in ln(AHE) is
(0.081 × 35 − 0.00091 × 35^2) − (0.081 × 34 − 0.00091 × 34^2) = 0.018.
This means that earnings are predicted to increase by 1.8%.
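The quadratic's marginal effect can be evaluated directly (a quick arithmetic check using the column (4) Age coefficients, with the other regressors held fixed):

```python
def quad_part(age, b_age=0.081, b_age2=-0.00091):
    # Age terms of the fitted ln(AHE) equation in column (4)
    return b_age * age + b_age2 * age ** 2

change_25_26 = quad_part(26) - quad_part(25)  # about 0.035, i.e. ~3.5%
change_34_35 = quad_part(35) - quad_part(34)  # about 0.018, i.e. ~1.8%
```

Because the coefficient on Age2 is negative, the predicted earnings profile flattens as workers age.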
(e) The regressions differ in their choice of one of the regressors. They can be compared on the basis of the adjusted R2. The regression in (3) has a (marginally) higher adjusted R2, so it is preferred.
(f) The regression in (4) adds the variable Age2 to regression (2). The coefficient on Age2 is not statistically significant (t = −1.2) and the estimated coefficient is very close to zero. This suggests that (2) is preferred to (4), but the regressions are so similar that either may be used.
(g) The regressions differ in their choice of the regressors (ln(Age) in (3) and Age and Age2 in (4)). They can be compared on the basis of the adjusted R2. The regression in (3) has a (marginally) higher adjusted R2, so it is preferred.
(h) The regression functions are very similar, particularly for Age between 27 and 33 years. The quadratic regression shows somewhat more curvature than the log-log regression, but the difference is small. The regression functions for a female with a high school diploma will look just like these, but they will be shifted by the amount of the coefficient on the binary regressor Female. The regression functions for workers with a bachelor's degree will also look just like these, but they would be shifted by the amount of the coefficient on the binary variable Bachelor.
(i) This regression is shown in column (5). The coefficient on the interaction term Female × Bachelor shows the "extra effect" of Bachelor on ln(AHE) for women relative to the effect for men.
Predicted values of ln(AHE):
Alexis: 0.081 × 30 − 0.00091 × 30^2 − 0.22 × 1 + 0.40 × 1 + 0.069 × 1 + 1.1 = 2.96
Jane: 0.081 × 30 − 0.00091 × 30^2 − 0.22 × 1 + 0.40 × 0 + 0.069 × 0 + 1.1 = 2.49
Bob: 0.081 × 30 − 0.00091 × 30^2 − 0.22 × 0 + 0.40 × 1 + 0.069 × 0 + 1.1 = 3.11
Jim: 0.081 × 30 − 0.00091 × 30^2 − 0.22 × 0 + 0.40 × 0 + 0.069 × 0 + 1.1 = 2.71
Difference in ln(AHE): Alexis − Jane = 2.96 − 2.49 = 0.469
Difference in ln(AHE): Bob − Jim = 3.11 − 2.71 = 0.40
Notice that the difference in the two predicted effects is 0.469 − 0.40 = 0.069, which is the value of the coefficient on the interaction term.
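The four predictions can be reproduced from the column (5) estimates, and the difference-in-differences recovers the interaction coefficient (a quick check):

```python
def ln_ahe(age, female, bachelor):
    # Fitted ln(AHE) from column (5)
    return (0.081 * age - 0.00091 * age ** 2
            - 0.22 * female + 0.40 * bachelor
            + 0.069 * female * bachelor + 1.1)

alexis, jane = ln_ahe(30, 1, 1), ln_ahe(30, 1, 0)
bob, jim = ln_ahe(30, 0, 1), ln_ahe(30, 0, 0)
diff_in_diff = (alexis - jane) - (bob - jim)  # equals the interaction coefficient, 0.069
```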
(j) This regression is shown in (6), which includes two additional regressors: the interactions of Female with the age variables, Age and Age2. The F-statistic testing the restriction that the coefficients on these interaction terms are equal to zero is F = 10.79 with a p-value of 0.00. This implies that there is statistically significant evidence (at the 1% level) that there is a different effect of Age on ln(AHE) for men and women.
(k) This regression is shown in (7), which includes two additional regressors that are interactions of Bachelor with the age variables, Age and Age2. The F-statistic testing the restriction that the coefficients on these interaction terms are zero is 10.77 with a p-value of 0.00. This implies that there is statistically significant evidence (at the 1% level) that there is a different effect of Age on ln(AHE) for high school and college graduates.
(l) Regression (8) includes Age and Age2 and interaction terms involving Female and Bachelor. The figure below shows the regression's predicted values of ln(AHE) for males and females with high school and college degrees.
The estimated regressions suggest that earnings increase as workers age from 25 to 35, the range of ages studied in this sample. Gender and education are significant predictors of earnings, and there are statistically significant interaction effects between age and gender and between age and education. The table below summarizes the regression's predictions for increases in earnings as a person ages from 25 to 32 and from 32 to 35.
                        Predicted ln(AHE) at Age     Predicted Increase in ln(AHE)
                                                     (percent per year)
Gender, Education        25      32      35          25 to 32    32 to 35
Females, High School    2.44    2.50    2.50           0.85        0.07
Males, High School      2.55    2.76    2.80           3.06        1.03
Females, BA             2.82    3.01    3.09           2.62        2.85
Males, BA               2.87    3.21    3.32           4.83        3.82
Earnings for those with a college education are higher than those with a high school degree, and
earnings of the college educated increase more rapidly early in their careers (age 25â€“35).
Earnings for men are higher than those of women, and earnings of men increase more rapidly
early in their careers (age 25â€“35). For all categories of workers (men/women, high
school/college) earnings increase more rapidly from age 25â€“32 than from 32â€“35.
8.2. The regressions in the table are used in the answer to this question.
Dependent Variable = Course_Eval

Regressor            (1)        (2)        (3)        (4)
Beauty               0.166**    0.160**    0.231**    0.090*
                     (0.032)    (0.030)    (0.048)    (0.040)
Intro                0.011      0.002     −0.001     −0.001
                     (0.056)    (0.056)    (0.056)    (0.056)
OneCredit            0.635**    0.620**    0.657**    0.657**
                     (0.108)    (0.109)    (0.109)    (0.109)
Female              −0.173**   −0.188**   −0.173**   −0.173**
                     (0.049)    (0.052)    (0.050)    (0.050)
Minority            −0.167*    −0.180**   −0.135     −0.135
                     (0.067)    (0.069)    (0.070)    (0.070)
NNEnglish           −0.244**   −0.243*    −0.268**   −0.268**
                     (0.094)    (0.096)    (0.093)    (0.093)
Age                             0.020
                                (0.023)
Age2                           −0.0002
                                (0.0002)
Female × Beauty                           −0.141*
                                           (0.063)
Male × Beauty                                         0.141
                                                      (0.063)
Intercept            4.068**    3.677**    4.075**    4.075**
                     (0.037)    (0.550)    (0.037)    (0.037)

F-statistic and p-values on joint hypotheses
Age and Age2                    0.63
                                (0.53)

SER                  0.514      0.514      0.511      0.511
R2                   0.144      0.142      0.151      0.151

Significant at the *5% and **1% significance level.
(a) See Table
(b) The coefficient on Age2 is not statistically significant, so there is no evidence of a nonlinear effect. The coefficient on Age is not statistically significant either, and the F-statistic testing whether the coefficients on Age and Age2 are both zero does not reject the null hypothesis that the coefficients are zero. Thus, Age does not seem to be an important determinant of course evaluations.
(c) See regression (3), which adds the interaction term Female × Beauty to the base specification in (1). The coefficient on the interaction term is statistically significant at the 5% level. The magnitude of the coefficient is investigated in parts (d) and (e).
(d) Recall that the standard deviation of Beauty is 0.79. Thus Professor Smith's course rating is expected to increase by 0.231 × (2 × 0.79) = 0.37. The 95% confidence interval for the increase is (0.231 ± 1.96 × 0.048) × (2 × 0.79), or 0.22 to 0.51.
(e) Professor Smith's course rating is expected to increase by (0.231 − 0.141) × (2 × 0.79) = 0.14. To construct the 95% confidence interval, we need the standard error of the sum of coefficients βBeauty + βFemale×Beauty. How to get the standard error depends on the software that you are using. An easy way is to respecify the regression, replacing Female × Beauty with Male × Beauty. The resulting regression is shown in (4) in the table. Now the coefficient on Beauty is the effect of Beauty for females, and its standard error is given in the table. The 95% confidence interval is (0.090 ± 1.96 × 0.040) × (2 × 0.79), or 0.02 to 0.27.
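The interval arithmetic in (d) and (e) can be checked directly (a sketch; 0.79 is the standard deviation of Beauty, and the move considered is two standard deviations):

```python
delta = 2 * 0.79  # two standard deviations of Beauty

# (d) effect for men: Beauty coefficient 0.231 with SE 0.048 from regression (3)
ci_male = ((0.231 - 1.96 * 0.048) * delta, (0.231 + 1.96 * 0.048) * delta)

# (e) effect for women: Beauty coefficient 0.090 with SE 0.040 from regression (4)
ci_female = ((0.090 - 1.96 * 0.040) * delta, (0.090 + 1.96 * 0.040) * delta)
```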
Department of Economics
Columbia University
S3412
Summer 2022
SOLUTIONS to Problem Set 2
Introduction to Econometrics
Seyhan Erden
1. [graded] For many years, housing economists believed that households spend a constant
fraction of income on housing, as in
housing expenditure = β (income) + u
The file housing.dta contains housing expenditures (housing) and total expenditures
(total) for a sample of 19th century Belgian workers collected by Edouard Ducpetiaux 1.
The differences in housing expenditures from one observation to the next are in the
variables dhousing; the differences in total expenditures are in variable dtotal.
(a) Compute the means of total expenditure and housing expenditure in this sample
(b) Estimate β using total expenditure for total income.
(c) If income rises by 100 (it averages around 900 in this sample) what change in
estimated expected housing expenditure results according to your estimate in (b)?
(d) Interpret the R2
(e) What economic argument would you make against housing absorbing a constant
share of income?
(f) What are some determinants of housing captured by u?
Solution:
a)
. sum housing total

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     housing |       162    72.54259    57.26064        7.25     450.52
       total |       162    902.8239    411.6408      377.06    2822.54
b)
1 Edouard Ducpetiaux, Budgets Economiques de Classes de Ouvrieres en Belgique (Brussels, Hayaz 1855).
. reg housing total, noconstant

      Source |       SS       df       MS              Number of obs =     162
-------------+------------------------------          F(1, 161)      =  296.98
       Model |  895121.769     1  895121.769          Prob > F       =  0.0000
    Residual |  485275.167   161  3014.13147          R-squared      =  0.6485
-------------+------------------------------          Adj R-squared  =  0.6463
       Total |  1380396.94   162  8520.96874          Root MSE       =  54.901

------------------------------------------------------------------------------
     housing |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       total |   .0749545   .0043495    17.23   0.000     .0663651    .0835439
------------------------------------------------------------------------------
c) Housing expenditure is expected to increase by 7.49 (= 0.0749545 × 100).
d) Since this regression does not contain a constant, we cannot necessarily interpret
the R2 in the usual way (i.e. 64.6% of the variations in housing expenditures can be
explained by the variations in income). To see this, run the regression including a
constant; the R2 is now 12.3%!
e) The relationship is more likely to be nonlinear: as income rises, the share of income spent on housing is likely to fall rather than stay constant.
f) Price, mortgage interest rates, location, etc. (answers will vary here)
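The no-constant estimator used in (b) is β̂ = Σxy / Σx². A minimal sketch of this regression-through-the-origin formula on synthetic data (illustrative only; not the Ducpetiaux sample):

```python
def slope_through_origin(x, y):
    # OLS slope for a regression through the origin: beta = sum(x*y) / sum(x*x)
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# synthetic households spending exactly 7.5% of total expenditure on housing
total = [500.0, 900.0, 1500.0, 2500.0]
housing = [0.075 * t for t in total]
beta_hat = slope_through_origin(total, housing)  # recovers 0.075
```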
2. [graded] Use Table 2 to answer the following questions. Table 2 presents the results of four
regressions, one in each column. Estimate the indicated regressions and fill in the values
(you may either handwrite or type the entries in; if you choose to type up the table, an
electronic copy of Table 2 in .doc format is available on the course Web site). For example,
to fill in column (1), estimate the regression with colGPA as the dependent variable and
hsGPA and skipped as the independent variables, using the "robust" option, and fill in the
estimated coefficients
(a) Fill out the table with necessary numbers, some will be on STATA output some you
will need to calculate yourself.
(b) Common sense predicts that your high school GPA (hsGPA) and the number of
classes you skipped (skipped) are determinants of your college GPA (colGPA). Use
regression (2) to test the hypothesis (at the 5% significance level) that the coefficients
on these two economic variables are all zero, against the alternative that at least one
coefficient is nonzero.
H0: both coefficients are zero; H1: at least one coefficient is nonzero.
The p-value for the F-statistic (2.62) is .077 > .05, thus we cannot reject H0 at the 5% significance level. We tend to conclude that bgfriend and campus jointly have no explanatory power. Alternatively, we can use the Stata command di fprob(2, 135, 2.62) to find the associated p-value = .077 > .05; again we cannot reject H0 at the 5% significance level.
Table 1
Definitions of Variables in GPA4.dta (data is from Wooldridge textbook)
Variable
Definition
colGPA
Cumulative College Grade Point Average of a sample of 141 students
at Michigan State University in 1994.
hsGPA
High School GPA of students.
skipped
Average number of classes skipped per week.
PC
= 1 if the students owns a personal computer
= 0 otherwise.
bgfriend
= 1 if the student answered "yes" to the boy/girl friend question.
= 0 otherwise.
campus
= 1 if the student lives on campus.
= 0 otherwise.
Table 2
College GPA Results
Dependent variable: colGPA

Regressor        (1)       (2)       (3)       (4)
hsGPA           .458      .455      .460      .461
                (.094)    (.092)    (.093)    (.090)
skipped        −.077     −.065     −.065     −.071
                (.025)    (.025)    (.025)    (.026)
PC               __       .128      .130      .136
                          (.059)    (.059)    (.058)
bgfriend         __        __       .084      .085
                                    (.055)    (.054)
campus           __        __        __       .124
                                              (.078)
Intercept       1.579     1.526     1.469     1.490
                (.325)    (.321)    (.325)    (.317)

F-statistics testing the hypothesis that the population coefficients on the indicated regressors are all zero:

hsGPA, skipped                 20.90     19.34     19.42     21.19
                               (.00)     (.00)     (.00)     (.00)
hsGPA, skipped, PC              __       15.47     15.56     17.46
                                         (.00)     (.00)     (.00)
hsGPA, skipped, PC, bgfriend    __        __       12.07     13.62
                                                   (.00)     (.00)
bgfriend, campus                __        __        __       2.55
                                                             (.082)

Regression summary statistics
R2              .223      .250      .263      .278
Adjusted R2     .211      .234      .241      .252
Regression RMSE .331      .326      .324      .322
n               141       141       141       141

Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated coefficients, and p-values are given in parentheses under F-statistics. The F-statistics are heteroskedasticity-robust.
Following questions will not be graded, they are for you to practice and will be discussed at
the recitation by your teaching assistant:
1. SW Empirical Exercise 6.1
6.1. Regressions used in (a) and (b)
Regressor      Model (a)   Model (b)
Beauty          0.133       0.166
Intro            __         0.011
OneCredit        __         0.634
Female           __        −0.173
Minority         __        −0.167
NNEnglish        __        −0.244
Intercept       4.00        4.07
SER             0.545       0.513
R2              0.036       0.155
(a) The estimated slope is 0.133
(b) The estimated slope is 0.166. The coefficient does not change by a large amount. Thus, there
does not appear to be large omitted variable bias.
(c) The first step and second step are summarized in the table
                 Dependent Variable
Regressor        Beauty      Course_eval
Intro             0.12         0.03
OneCredit        −0.37         0.57
Female            0.19        −0.14
Minority          0.08        −0.15
NNEnglish         0.02        −0.24
Intercept        −0.11         4.05
Regressing the residual from step 2 onto the residual from step 1 yields a coefficient on Beauty that is equal to 0.166 (as in (b)).
(d) Professor Smith's predicted course evaluation = (0.166 × 0) + (0.011 × 0) + (0.634 × 0) − (0.173 × 0) − (0.167 × 1) − (0.244 × 0) + 4.068 = 3.901
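Since all of Professor Smith's other regressors are zero, the prediction in (d) reduces to the intercept plus the Minority coefficient (a quick check of the arithmetic):

```python
coefs = {"Beauty": 0.166, "Intro": 0.011, "OneCredit": 0.634,
         "Female": -0.173, "Minority": -0.167, "NNEnglish": -0.244}
smith = {"Beauty": 0, "Intro": 0, "OneCredit": 0,
         "Female": 0, "Minority": 1, "NNEnglish": 0}
pred = 4.068 + sum(coefs[k] * smith[k] for k in coefs)  # 3.901
```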
2. SW Empirical Exercises 7.1
Regressor      Model (a)   Model (b)
Age             0.60        0.59
                (0.04)      (0.04)
Female           __        −3.66
                            (0.21)
Bachelor         __         8.08
                            (0.21)
Intercept       1.08       −0.63
                (1.17)      (1.08)
SER             9.99        9.07
R2              0.029       0.200
Adjusted R2     0.029       0.199
(a) The estimated slope is 0.60. The estimated intercept is 1.08.
(b) The estimated marginal effect of Age on AHE is 0.59 dollars per year. The 95% confidence interval is 0.59 ± 1.96 × 0.04, or 0.51 to 0.66.
(c) The results are quite similar. Evidently the regression in (a) does not suffer from important omitted variable bias.
(d) Bob's predicted average hourly earnings = (0.59 × 26) + (−3.66 × 0) + (8.08 × 0) − 0.63 = $14.71. Alexis's predicted average hourly earnings = (0.59 × 30) + (−3.66 × 1) + (8.08 × 1) − 0.63 = $21.49.
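The two predictions follow from the column (b) estimates (a quick check of the arithmetic):

```python
def ahe(age, female, bachelor):
    # Fitted values from regression (b): AHE = 0.59*Age - 3.66*Female + 8.08*Bachelor - 0.63
    return 0.59 * age - 3.66 * female + 8.08 * bachelor - 0.63

bob = ahe(26, 0, 0)     # about $14.71
alexis = ahe(30, 1, 1)  # about $21.49
```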
(e) The regression in (b) fits the data much better. Gender and education are important predictors of earnings. The R2 and adjusted R2 are similar because the sample size is large (n = 7711).
(f) Gender and education are important. The F-statistic is 781, which is (much) larger than the 1% critical value of 4.61.
(g) The omitted variables must have nonzero coefficients and must be correlated with the included regressor. From (f), Female and Bachelor have nonzero coefficients; yet there does not seem to be important omitted variable bias, suggesting that the correlations of Age with Female and of Age with Bachelor are small. (The sample correlations are Corr(Age, Female) = −0.03 and Corr(Age, Bachelor) = 0.00.)
Department of Economics
Columbia University
S3412
Summer 2022
SOLUTIONS to Problem Set 1
Introduction to Econometrics
Seyhan Erden
"Calculator" was once a job description. This problem set gives you an opportunity to do some
calculations on the relation between smoking and lung cancer, using a (very) small sample of
five countries. The purpose of this exercise is to illustrate the mechanics of ordinary least
squares (OLS) regression. You will calculate the regression "by hand" using formulas from
class and the textbook. For these calculations, you may relive history and use long
multiplication, long division, and tables of square roots and logarithms; or you may use an
electronic calculator or a spreadsheet.
The data are summarized in the following table. The variables are per capita cigarette
consumption in 1930 (the independent variable, "X") and the death rate from lung cancer in 1950
(the dependent variable, "Y"). The cancer rates are shown for a later time period because it takes
time for lung cancer to develop and be diagnosed.
   Observation #   Country         Cigarettes consumed       Lung cancer deaths per
                                   per capita in 1930 (X)    million people in 1950 (Y)
   1               Switzerland      530                      250
   2               Finland         1115                      350
   3               Great Britain   1145                      465
   4               Canada           510                      150
   5               Denmark          380                      165
Source: Edward R. Tufte, Data Analysis for Politics and Management, Table 3.3.
1. Use a calculator, a spreadsheet, or "by hand" methods to compute the following; refer to the
textbook for the necessary formulas. (Note: if you use a spreadsheet, attach a printout.)
a) The sample means of X and Y, X̄ and Ȳ.
   X̄ = 736, Ȳ = 276
b) The standard deviations of X and Y, sX and sY.
   sX = 364.41, sY = 132.35
c) The correlation coefficient, r, between X and Y.
   r = 0.93
d) β̂1, the OLS estimated slope coefficient from the regression Yi = β0 + β1Xi + ui.
   β̂1 = 0.336418
e) β̂0, the OLS estimated intercept term from the same regression.
   β̂0 = 28.39656
f) Ŷi, i = 1, …, n, the predicted values for each country from the regression.

   Switzerland     206.6981
   Finland         403.5026
   Great Britain   413.5952
   Canada          199.9697
   Denmark         156.2354

g) ûi, the OLS residual for each country.

   Switzerland      43.3019
   Finland         −53.5026
   Great Britain    51.40483
   Canada          −49.9697
   Denmark           8.7646
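As a cross-check on the hand calculations above, the same quantities can be reproduced in a few lines of code. Here is a sketch in Python; the data come straight from the table in the problem statement:

```python
# Reproduce the "by hand" OLS calculations for the five-country data.
from math import sqrt

X = [530, 1115, 1145, 510, 380]  # cigarettes per capita, 1930
Y = [250, 350, 465, 150, 165]    # lung cancer deaths per million, 1950
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n                # sample means
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

sx, sy = sqrt(Sxx / (n - 1)), sqrt(Syy / (n - 1))  # sample std. deviations
r = Sxy / sqrt(Sxx * Syy)                          # correlation coefficient
b1 = Sxy / Sxx                                     # OLS slope
b0 = ybar - b1 * xbar                              # OLS intercept
yhat = [b0 + b1 * x for x in X]                    # predicted values
uhat = [y - f for y, f in zip(Y, yhat)]            # residuals

print(xbar, ybar)                  # 736.0 276.0
print(round(b1, 6), round(b0, 5))  # 0.336418 28.39656
```

These match the answers in (a)–(g) and the Stata output in question 3 below.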
2. On graph paper or using a spreadsheet, graph the scatterplot of the five data points and the
regression line. Be sure to label the axes, the data points, the residuals, and the slope and
intercept of the regression line.
[Figure: scatterplot of the five data points ("deaths") with the fitted OLS regression line ("Fitted values"). Horizontal axis: cigarettes per capita in 1930 (0 to 1500); vertical axis: death rate in 1950 (0 to 600). Caption: "Cigarette consumption, death rates, and OLS regression line".]
3. This time, please calculate the same statistics using STATA. On the STATA output file, find
and label the items.
a) The sample means of X and Y, X̄ and Ȳ.
b) The standard deviations of X and Y, sX and sY.
c) The correlation coefficient, r, between X and Y.
d) β̂1, the OLS estimated slope coefficient from the regression Yi = β0 + β1Xi + ui.
e) β̂0, the OLS estimated intercept term from the same regression.
f) Ŷi, i = 1, …, n, the predicted values for each country from the regression.
g) ûi, the OLS residual for each country.
STATA HINTS: First load STATA and type "edit," which brings up something that looks
like a spreadsheet. Enter the smoking and cancer values in the first two columns. Double-click
the column headers to enter variable names (e.g. "smoke", "death"). Close the editor
window when you are done. The following commands will be useful:

   list        lists the data (to be sure you typed it in correctly)
   summarize   computes sample means and standard deviations (the option
               ", detail" gives additional statistics, including the sample
               variance)
   correlate   produces correlation coefficients (with the option ", covariance"
               this command produces covariances)
   regress     estimates a regression by OLS
   predict     computes OLS predicted values and residuals
Note that STATA has online help.
Do not be concerned if you do not yet understand all the statistics shown in the output; we
will discuss them in class in due course.
Answers:
a) Listing of the data:

        +----------------------------+
        | country    cigs    deaths |
        |----------------------------|
     1. | Switz       530       250 |
     2. | Finland    1115       350 |
     3. | Britain    1145       465 |
     4. | Canada      510       150 |
     5. | Denmark     380       165 |
        +----------------------------+
b) Mean and standard deviation:

. summarize cigs deaths;

       Variable |  Obs     Mean    Std. Dev.    Min    Max
   -------------+---------------------------------------------
           cigs |    5      736     364.4071    380    1145
         deaths |    5      276     132.3537    150     465
c) Correlation coefficient:

. * ----- compute correlation -----;
. correlate cigs deaths;
(obs=5)

                |   cigs   deaths
   -------------+------------------
           cigs |  1.0000
         deaths |  0.9263   1.0000
d) OLS Regression:

. regress deaths cigs;

         Source |       SS       df       MS         Number of obs =       5
   -------------+------------------------------      F(  1,     3) =   18.12
          Model |  60116.1644     1  60116.1644      Prob > F      =  0.0238
       Residual |  9953.83564     3  3317.94521      R-squared     =  0.8579
   -------------+------------------------------      Adj R-squared =  0.8106
          Total |       70070     4     17517.5      Root MSE      =  57.602

   ---------------------------------------------------------------------------
         deaths |     Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
   -------------+-------------------------------------------------------------
           cigs |  .3364177   .0790347     4.26   0.024     .084894   .5879414
          _cons |  28.39656   63.61827     0.45   0.686   -174.0652   230.8583
   ---------------------------------------------------------------------------

β̂0 = 28.39656
β̂1 = .3364177
e) Predicted values and residuals

. predict dhat;
(option xb assumed; fitted values)

. generate uhat = deaths - dhat;

. list deaths dhat uhat;

        +--------------------------------+
        | deaths       dhat        uhat |
        |--------------------------------|
     1. |    250    206.698    43.30205 |
     2. |    350   403.5023   -53.50232 |
     3. |    465   413.5948    51.40515 |
     4. |    150   199.9696   -49.96959 |
     5. |    165   156.2353    8.764709 |
        +--------------------------------+

In this table, the predicted values are dhat and the residuals are uhat.
4. [graded] Using the "graph twoway" command in STATA, graph the scatterplot of the five data
points and the regression line. Interpret the sample slope and sample intercept.
Answers:

. graph twoway (scatter deaths cigs) (lfit deaths cigs)

[Figure: scatterplot of the five labeled data points (Switzerland, Finland, Great Britain, Canada, Denmark) with the fitted OLS regression line; annotations mark the predicted value and the residual for Finland. Horizontal axis: cigarettes per capita in 1930 (0 to 1500); vertical axis: death rate in 1950 (0 to 600). Caption: "Cigarette consumption, death rates, and OLS regression line".]
The estimated intercept, β̂0 = 28.4, is the value at which the regression line intercepts
the vertical axis. The slope of the regression line is 0.336, so an increase of one
cigarette per capita is associated with an increase in the death rate of 0.336 lung
cancer deaths per million.
STATA .do file

#delimit ;
clear all;
*************************************************************;
* ps1.do;
* STATA calculations for S3412, problem set #1;
*************************************************************;
log using ps1.log, replace;
set more 1;
*************************************************************;
* read in data;
input str8 country cigs deaths;
"Switz" 530 250;
"Finland" 1115 350;
"Britain" 1145 465;
"Canada" 510 150;
"Denmark" 380 165;
end;
*;
list;
* ----- compute mean and variance -----;
summarize cigs deaths;
* ----- compute correlation -----;
correlate cigs deaths;
* ----- regression of death rate on cigarettes per capita -----;
regress deaths cigs;
* ----- compute predicted values and residuals -----;
predict dhat;
generate uhat = deaths - dhat;
list deaths dhat uhat;
* ----- scatterplot and regression line -----;
graph twoway (scatter deaths cigs) (lfit deaths cigs);
log close;
clear;
exit;
5. [graded] Using the data file birthweight_smoking, which contains data for a random sample
of babies born in Pennsylvania in 1989, answer the following questions. The data include the
baby's birth weight together with various characteristics of the mother, including whether she
smoked during the pregnancy. Let Yi denote the baby's birth weight (in grams) for mother i
and Xi an indicator variable that equals one if the mother smoked during pregnancy and zero
otherwise. Consider the linear regression model

   Yi = β0 + β1 Xi + ui,   i = 1, …, n.

(a) Run a regression of Yi on Xi. Report your estimation result in the following form:

   BirthWeight-hat = ??? + ??? Smoker,
                    (???)   (???)

where the numbers in the parentheses are standard errors.
Answer:

   BirthWeight-hat = 3432 − 253 Smoker
                     (12)    (27)
(b) In view of your estimation result in part (a), what is the predicted value of the birth weight
for mothers who do not smoke? What is the predicted value of the birth weight for those who
smoke?
Answer: 3432 grams for mothers who do not smoke; 3179 grams for those who smoke.
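Part (b) uses a general property of regression on a binary variable: the OLS intercept equals the sample mean of Y in the X = 0 group, and the intercept plus the slope equals the mean in the X = 1 group. A minimal sketch in Python with hypothetical birth weights (not the actual data set) illustrates this:

```python
# OLS on a binary regressor recovers the two group means (hypothetical data).
y_nonsmoker = [3400.0, 3500.0, 3450.0]  # hypothetical X = 0 group
y_smoker = [3150.0, 3250.0]             # hypothetical X = 1 group

X = [0] * len(y_nonsmoker) + [1] * len(y_smoker)
Y = y_nonsmoker + y_smoker
n = len(Y)

xbar, ybar = sum(X) / n, sum(Y) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
      / sum((x - xbar) ** 2 for x in X))
b0 = ybar - b1 * xbar

mean0 = sum(y_nonsmoker) / len(y_nonsmoker)
mean1 = sum(y_smoker) / len(y_smoker)
print(b0, b0 + b1)  # matches (mean0, mean1) up to rounding
print(b1)           # matches mean1 - mean0, the difference in means
```

This is why the two predicted values in (b) are simply the intercept (3432) and the intercept plus the slope (3432 − 253 = 3179).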
(c) Compute the sample correlation coefficient between the birth weight and education.
Interpret your estimation result.
Answer: 0.105; education is positively correlated with the birth weight.
(d) In view of part (c), what does the term ui represent here? Why do different mothers have
different values of ui?
Answer: Education may be a part of ui. Mothers have different levels of schooling, which
imply different values of ui.
(e) In view of part (c), do you think that E[ui | Smokeri] = 0?
Answer: It is unlikely that E[ui | Smokeri] = 0 if more educated mothers smoke less.
(f) The regression error term ui is homoskedastic if the conditional variance of ui given Xi = x
does not depend on x. When you computed your standard error in part (a), did you report
homoskedasticity-only standard errors or heteroskedasticity-robust standard errors? Justify
your choice briefly.
Answer: Heteroskedasticity-robust standard errors are used, since they remain valid under
heteroskedasticity and there is no clear reason to believe the homoskedasticity assumption is
satisfied in this example.
(g) Using your preferred standard errors, report the 95% confidence interval for β1. Using your
confidence interval, carry out the hypothesis test for the null hypothesis H0 that smoking is
associated with a decrease of 300 grams in the birth weight.
Answer: [−306, −201]; since −300 lies inside this interval, we fail to reject H0.
(h) What is the value of R²? A friend of yours claims that a very low R² means that the estimated
coefficient of β1 is insignificant. Would you agree? Explain briefly.
Answer: 0.0286. I would not agree, since it is possible to have a significant coefficient when
R² is very low. Note that the t-value is −9.45 in this example.
The following questions will not be graded; they are for you to practice and will be discussed at
the recitation by your teaching assistant:
1. [Practice question, not graded] SW Exercise 4.1
(a) The predicted average test score is
   TestScore-hat = 520.4 − 5.82 × 22 = 392.36
(b) The predicted decrease in the classroom average test score is
   ΔTestScore-hat = (−5.82 × 19) − (−5.82 × 23) = 23.28
or the predicted change is
   ΔTestScore-hat = (−5.82 × 23) − (−5.82 × 19) = −23.28
(c) Using the formula for β̂0, the sample average of the test scores across the 100
classrooms is
   TestScore-bar = β̂0 + β̂1 × CS-bar = 520.4 − 5.82 × 21.4 = 395.85
(d) Use the formula for the standard error of the regression (SER) to get the sum of squared
residuals:
   SSR = (n − 2) × SER² = (100 − 2) × 11.5² = 12961
Use the formula for R² to get the total sum of squares:
   TSS = SSR / (1 − R²) = 12961 / (1 − 0.08) = 14088
The sample variance is s²_Y = TSS/(n − 1) = 14088/99 = 142.3. Thus, the standard
deviation is s_Y = sqrt(s²_Y) = 11.9.
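The arithmetic in part (d) can be sketched in a few lines of Python; the inputs n = 100, SER = 11.5, and R² = 0.08 are from the exercise:

```python
# Back out SSR, TSS, and the sample standard deviation of Y
# from n, the SER, and R^2 (inputs from SW Exercise 4.1).
from math import sqrt

n, SER, R2 = 100, 11.5, 0.08

SSR = (n - 2) * SER ** 2  # sum of squared residuals
TSS = SSR / (1 - R2)      # total sum of squares
s2_Y = TSS / (n - 1)      # sample variance of Y
s_Y = sqrt(s2_Y)          # sample standard deviation of Y

# exact values are 12960.5, 14087.5, 11.93; the solution reports
# the rounded figures 12961, 14088, and 11.9
print(SSR, TSS, s_Y)
```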
2. [Practice question, not graded] Let KIDS denote the number of children born to a woman,
and let EDUC denote years of education for the woman. A simple model relating fertility to
years of education is
   KIDS = a + b × EDUC + u,
where u is the unobserved residual.
(a) What kinds of factors are contained in u? Are these likely to be correlated with level of
education?
Income, age, and family background (such as number of siblings) are just a few
possibilities. It seems that each of these could be correlated with years of
education. (Income and education are probably positively correlated; age and
education may be negatively correlated because women in more recent cohorts have,
on average, more education; and number of siblings and education are probably
negatively correlated.)
(b) Will a simple regression of KIDS on EDUC uncover the ceteris paribus ('all else equal') effect
of education on fertility? Explain.
Not if the factors we listed in part (a) are correlated with EDUC. Because we would
like to hold these factors fixed, they are part of the error term. But if u is correlated
with EDUC, then E(u|EDUC) is not zero, and thus OLS Assumption (A2) fails.
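The omitted-variable problem in (b) can be made concrete with a small numerical sketch in Python (entirely made-up numbers): when an omitted variable, here "income", is correlated with EDUC and affects KIDS, the simple-regression slope picks up both effects.

```python
# Omitted-variable bias sketch (hypothetical numbers, not real data).
# True model: kids = 5 - 0.2*educ + 0.1*income, with income = 2*educ.
# The short regression of kids on educ alone should therefore recover
# -0.2 + 0.1*2 = 0 -- not the ceteris paribus effect of -0.2.

educ = [10.0, 12.0, 14.0, 16.0]
income = [2 * e for e in educ]  # omitted variable, correlated with educ
kids = [5 - 0.2 * e + 0.1 * i for e, i in zip(educ, income)]

n = len(educ)
xbar = sum(educ) / n
ybar = sum(kids) / n
short_slope = (sum((x - xbar) * (y - ybar) for x, y in zip(educ, kids))
               / sum((x - xbar) ** 2 for x in educ))

# OVB formula: short slope = true slope + (effect of omitted variable)
#                            * (slope of omitted variable on educ)
print(short_slope)  # 0.0
```

With real data the bias would rarely be this clean, but the mechanics are the same: the short-regression coefficient mixes the direct effect of EDUC with the indirect effect that runs through the omitted variable.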
3. [Practice question, not graded] SW Exercise 5.1
(a) The 95% confidence interval for β1 is {−5.82 ± 1.96 × 2.21}, that is,
−10.152 ≤ β1 ≤ −1.4884.
(b) Calculate the t-statistic:
   t_act = (β̂1 − 0)/SE(β̂1) = −5.82/2.21 = −2.6335.
The p-value for the test H0: β1 = 0 vs. H1: β1 ≠ 0 is
   p-value = 2Φ(−|t_act|) = 2Φ(−2.6335) = 2 × 0.0042 = 0.0084.
The p-value is less than 0.01, so we can reject the null hypothesis at the 5% significance
level, and also at the 1% significance level.
(c) The t-statistic is
   t_act = (β̂1 − (−5.6))/SE(β̂1) = −0.22/2.21 = −0.10.
The p-value for the test H0: β1 = −5.6 vs. H1: β1 ≠ −5.6 is
   p-value = 2Φ(−|t_act|) = 2Φ(−0.10) = 0.92.
The p-value is larger than 0.10, so we cannot reject the null hypothesis at the 10%, 5%,
or 1% significance level. Because β1 = −5.6 is not rejected at the 5% level, this value is
contained in the 95% confidence interval.
(d) The 99% confidence interval for β0 is {520.4 ± 2.58 × 20.4}, that is, 467.7 ≤ β0 ≤ 573.0.
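The calculations in (a)–(d) can be reproduced with the standard-normal CDF; a minimal sketch in Python using only the standard library (Φ built from math.erf):

```python
# Confidence intervals and two-sided normal p-values for SW Exercise 5.1.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

beta1_hat, se1 = -5.82, 2.21

# (a) 95% confidence interval for beta_1
ci_95 = (beta1_hat - 1.96 * se1, beta1_hat + 1.96 * se1)

# (b) two-sided test of H0: beta_1 = 0
t_b = (beta1_hat - 0) / se1
p_b = 2 * phi(-abs(t_b))

# (c) two-sided test of H0: beta_1 = -5.6
t_c = (beta1_hat - (-5.6)) / se1
p_c = 2 * phi(-abs(t_c))

# (d) 99% confidence interval for beta_0
beta0_hat, se0 = 520.4, 20.4
ci_99 = (beta0_hat - 2.58 * se0, beta0_hat + 2.58 * se0)

print(ci_95)     # about (-10.152, -1.488)
print(t_b, p_b)  # about -2.63 and 0.0084
print(t_c, p_c)  # about -0.10 and 0.92
print(ci_99)     # 99% CI for the intercept
```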
4. [Practice question, not graded] Any of the Empirical Exercises at the end of Chapters 4 and
5. (The teaching assistant will go over one of the empirical exercises in the recitation as
practice using Stata.)