
I’m working on a data analytics discussion question and need an explanation and answer to help me learn.

What is an instrumental variable in linear regression? How do instrumental variables improve causal inference? Please answer both of these questions and give an example of an instrumental variable.

Chapter 8
Advanced Methods for Establishing Causal Inference
© 2019 McGraw-Hill Education. All rights reserved. Authorized only for instructor use in the classroom. No reproduction or distribution without the prior written consent of McGraw-Hill Education
Learning Objectives
1. Explain how instrumental variables can improve causal inference in regression analysis
2. Execute two-stage least squares regression
3. Judge which types of variables may be used as instrumental variables
4. Identify a difference-in-differences regression
5. Execute regression incorporating fixed effects
6. Distinguish the dummy variable approach from a within estimator for a fixed effect regression model
Instrumental Variables
• Instrumental variable: in the context of regression analysis, a variable that allows us to isolate the causal effect of a treatment on an outcome because it is correlated with the treatment but uncorrelated with the unobserved factors affecting the outcome
• Can improve causal inference in regression analysis
Instrumental Variables: An Example
• A firm is attempting to determine how its sales depend on the price it charges for its product
• Beginning with a simple data-generating process:
$\text{Sales}_i = \alpha + \beta_1 \text{Price}_i + U_i$
• If local demand depends on local income, then local income is a confounding factor:
$\text{Sales}_i = \alpha + \beta_1 \text{Price}_i + \beta_2 \text{Income}_i + U_i$
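A small simulation can make the confounding problem concrete. The sketch below uses made-up parameter values (a true price effect of -2 and an income effect of 3) and a hypothetical helper `ols`; none of these numbers come from the chapter. It compares the estimated price coefficient with and without Income as a control.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5000
income = rng.normal(50, 10, n)                       # confounding factor
price = 5 + 0.1 * income + rng.normal(0, 1, n)       # price rises with income (assumed)
sales = 100 - 2.0 * price + 3.0 * income + rng.normal(0, 5, n)

without_income = ols(sales, price)          # Income left in the error term
with_income = ols(sales, price, income)     # Income included as a control

print("Price coefficient, Income omitted: ", round(without_income[1], 2))
print("Price coefficient, Income included:", round(with_income[1], 2))
```

In this setup the regression that omits Income attributes part of the income effect to Price, while the regression that includes Income recovers a coefficient close to the assumed -2.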
Instrumental Variables: An Example
• Including income in the model removes local income as a confounding factor
• Does its inclusion ensure that no other confounding factors still exist?
• Many possibilities may come to mind, including local competition, market size, and market growth rate
Instrumental Variables
• We may be unable to collect data on all confounding factors or find suitable proxies
• Then we are unable to remove the endogeneity problem by including controls and/or proxy variables
• A widely used method for measuring causality that can circumvent this problem involves instrumental variables
Instrumental Variables
• Suppose we know price differences across some of the stores were solely due to differences in fuel costs
• When two locations have different prices, we generally cannot attribute differences in sales to price differences, since these two locations likely differ in local competition
• Rather than use all of the variation in price across the stores to measure the effect of price on sales, we focus on the subset of price movements due to variation in fuel costs
Instrumental Variables: An Example
WHEN TWO LOCATIONS HAVE
DIFFERENT PRICES ONLY BECAUSE
THEIR FUEL COSTS DIFFER, ANY
DIFFERENCE IN SALES CAN BE
ATTRIBUTED TO PRICE, SINCE FUEL
COSTS DON’T IMPACT SALES PER SE
Instrumental Variables
• Suppose we have the following data-generating function:
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + U_i$
• Variable Z is a valid instrument for $X_1$ if Z is both exogenous and relevant:
1. Exogenous: Z has no effect on the outcome variable beyond the combined effects of all variables in the determining function ($X_1, \dots, X_K$)
2. Relevant: For the assumed data-generating process, Z is relevant as an instrumental variable if it is correlated with $X_1$ after controlling for $X_2, \dots, X_K$
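Using the notation that appears later in the chapter (the correlation condition from the exogeneity discussion and the first-stage equation from 2SLS), the two requirements can be restated compactly. This is only a restatement; expressing relevance as a nonzero first-stage coefficient $\delta_1$ is the one added step.

```latex
\text{Exogenous: } \operatorname{Corr}(Z_i, U_i) = 0
\qquad
\text{Relevant: } \delta_1 \neq 0 \ \text{ in } \
X_{1i} = \gamma + \delta_1 Z_i + \delta_2 X_{2i} + \dots + \delta_K X_{Ki} + V_i
```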
Two-Stage Least Squares Regression
• Two-stage least squares regression (2SLS) is the process of using two regressions to measure the causal effect of a variable while utilizing an instrumental variable
• The first stage of 2SLS determines the subset of variation in Price that can be attributed to changes in fuel costs; we can call the variable that tracks this variation $\widehat{\text{Price}}$
• The second stage determines how Sales change with the movements of $\widehat{\text{Price}}$
• This means that if we see Sales correlate with $\widehat{\text{Price}}$, there is reason to interpret this co-movement as the causal effect of Price
Two-Stage Least Squares Regression
• For an assumed data-generating process:
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + U_i$
• Suppose $X_1$ is endogenous and Z is a valid instrument for $X_1$. To execute 2SLS, in the first stage we assume:
$X_{1i} = \gamma + \delta_1 Z_i + \delta_2 X_{2i} + \dots + \delta_K X_{Ki} + V_i$
• Then regress $X_1$ on $Z, X_2, \dots, X_K$ and calculate predicted values for $X_1$, defined as:
$\hat{X}_{1i} = \hat{\gamma} + \hat{\delta}_1 Z_i + \hat{\delta}_2 X_{2i} + \dots + \hat{\delta}_K X_{Ki}$
Two-Stage Least Squares Regression
• In the second stage, regress Y on $\hat{X}_1, X_2, \dots, X_K$
• From the second stage regression, the estimated coefficient for $\hat{X}_1$ is a consistent estimate for $\beta_1$ (the causal effect of $X_1$ on Y) and the estimated coefficient on $X_2$ is a consistent estimate for $\beta_2$
• In short, run two consecutive regressions, using the predictions from the first as an independent variable in the second
• Statistical software combines this process into a single command
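The two stages can be sketched by hand on simulated data for the Sales and Price example. Everything numeric here is an assumption for illustration: the true price effect (-2), the unobserved `competition` variable placed in the error term, and the hypothetical helper `ols`. Fuel cost plays the role of Z and Income the role of $X_2$.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 20000
income = rng.normal(50, 10, n)
competition = rng.normal(0, 1, n)        # unobserved confounder (assumed)
fuel_cost = rng.normal(3, 0.5, n)        # instrument: shifts Price, not Sales directly
price = (2 + 0.05 * income + 1.5 * fuel_cost
         - 0.8 * competition + rng.normal(0, 0.5, n))
u = -4.0 * competition + rng.normal(0, 2, n)   # error term contains the confounder
sales = 100 - 2.0 * price + 3.0 * income + u

# Stage 1: regress Price on the instrument and the other regressor, then predict
g = ols(price, fuel_cost, income)
price_hat = g[0] + g[1] * fuel_cost + g[2] * income

# Stage 2: regress Sales on predicted Price and Income
b_2sls = ols(sales, price_hat, income)

# Ordinary regression on actual Price, for comparison
b_ols = ols(sales, price, income)

print("OLS price coefficient: ", round(b_ols[1], 2))
print("2SLS price coefficient:", round(b_2sls[1], 2))
```

Running the second stage by hand reproduces the 2SLS coefficients, but the standard errors an ordinary regression routine would report for that second stage are not the correct 2SLS standard errors, which is one practical reason to prefer a dedicated single command.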
[Table: 2SLS Estimates for Y Regressed on X1, X2, and X3]
Two-Stage Least Squares Regression
• Summary of 2SLS where we have J endogenous variables and $L \geq J$ instrumental variables:
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + U_i$
• Suppose $X_1, \dots, X_J$ are endogenous and $Z_1, \dots, Z_L$ are valid instruments for $X_1, \dots, X_J$
• Execution of 2SLS proceeds as follows:
Two-Stage Least Squares Regression
1. Regress $X_1, \dots, X_J$ on $Z_1, \dots, Z_L, X_{J+1}, \dots, X_K$ in J separate regressions
2. Obtain predicted values $\hat{X}_1, \dots, \hat{X}_J$ using the corresponding estimated regression equations in Step 1. This concludes “Stage 1”
3. Regress Y on $\hat{X}_1, \dots, \hat{X}_J, X_{J+1}, \dots, X_K$, which yields consistent estimates for $\alpha, \beta_1, \dots, \beta_K$. This is “Stage 2”
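The three steps above can be sketched as one function. The function name `two_stage_least_squares`, its argument layout, and the simulated example (two endogenous variables, three instruments, one exogenous regressor, all coefficient values) are assumptions for illustration, not the chapter's software.

```python
import numpy as np

def two_stage_least_squares(y, endog, exog, instruments):
    """2SLS point estimates with J endogenous and L >= J instrument columns.

    Returns [alpha, coefficients on endog columns, coefficients on exog columns].
    """
    n = len(y)
    const = np.ones((n, 1))
    endog = np.asarray(endog, dtype=float).reshape(n, -1)
    exog = np.asarray(exog, dtype=float).reshape(n, -1)
    instruments = np.asarray(instruments, dtype=float).reshape(n, -1)
    if instruments.shape[1] < endog.shape[1]:
        raise ValueError("need at least as many instruments as endogenous variables")

    # Step 1: J separate first-stage regressions (solved together, column by column)
    W = np.hstack([const, instruments, exog])
    first_stage, *_ = np.linalg.lstsq(W, endog, rcond=None)
    # Step 2: predicted values of the endogenous variables
    endog_hat = W @ first_stage
    # Step 3: regress Y on the predictions and the exogenous regressors
    X_hat = np.hstack([const, endog_hat, exog])
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta

# Simulated example with assumed coefficients
rng = np.random.default_rng(2)
n = 20000
Z = rng.normal(size=(n, 3))              # L = 3 instruments
x3 = rng.normal(size=n)                  # exogenous regressor
c = rng.normal(size=n)                   # unobserved confounder
x1 = Z[:, 0] + 0.5 * Z[:, 2] + 0.3 * x3 + c + rng.normal(size=n)
x2 = Z[:, 1] - 0.5 * Z[:, 2] + c + rng.normal(size=n)
u = 2.0 * c + rng.normal(size=n)
y = 1.0 + 1.5 * x1 - 0.7 * x2 + 0.4 * x3 + u

beta = two_stage_least_squares(y, np.column_stack([x1, x2]), x3, Z)
print(np.round(beta, 2))   # order: alpha, x1, x2, x3
```

The sketch returns point estimates only; it does not compute 2SLS standard errors.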
Evaluating Instruments
• An instrumental variable must be exogenous and relevant, and if so, we can use 2SLS to get consistent estimates for the parameters of the determining function
• Can we assess whether the instrumental variable possesses these two characteristics?
Exogeneity
• An instrumental variable is exogenous if it is uncorrelated with unobservables affecting the dependent variable
• For a data-generating process $Y_i = \alpha + \beta_1 X_{1i} + \dots + \beta_K X_{Ki} + U_i$, an instrumental variable Z must have $\text{Corr}(Z, U) = 0$
• To try to check this, we might regress Y on $X_1, \dots, X_K$ and calculate the residuals as: $e_i = Y_i - \hat{\alpha} - \hat{\beta}_1 X_{1i} - \dots - \hat{\beta}_K X_{Ki}$
• We could then calculate the sample correlation between Z and the residuals, believing this to be an estimate for the correlation between Z and U
Exogeneity
• The problem is that the residuals were calculated using a regression with an endogenous variable
• Our parameter estimates are not consistent, meaning the sample correlation between Z and the residuals generally is not an estimator for the correlation between Z and U
• If the number of instrumental variables is equal to the number of endogenous variables, there is no way to test for exogeneity
• If the number of instrumental variables is greater than the number of endogenous variables, there are tests that can be performed to find evidence that at least some instrumental variables are not exogenous, but there is no way to test that all are exogenous
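The first two points can be illustrated with a simulation in which Z is exogenous by construction, yet its correlation with the OLS residuals is clearly nonzero. The data-generating values below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50000
z = rng.normal(size=n)                 # instrument, drawn independently of u
c = rng.normal(size=n)                 # unobserved confounder
x = z + c + rng.normal(size=n)         # endogenous regressor
u = c + rng.normal(size=n)             # true error term, contains the confounder
y = 1.0 + 2.0 * x + u

# OLS of Y on X: inconsistent because X is correlated with U
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

corr_z_u = np.corrcoef(z, u)[0, 1]           # what we would like to know
corr_z_e = np.corrcoef(z, residuals)[0, 1]   # what the residual check reports

print("Corr(Z, U):        ", round(corr_z_u, 3))
print("Corr(Z, residuals):", round(corr_z_e, 3))
```

Because the OLS slope is biased, the residuals absorb part of X, and X is correlated with Z. The residual-based correlation therefore says little about Corr(Z, U) even when the instrument is valid.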
Relevance
• Testing for relevance is simple and can be added when conducting 2SLS
• For a data-generating process $Y_i = \alpha + \beta_1 X_{1i} + \dots + \beta_K X_{Ki} + U_i$ where $X_1$ is endogenous, Z is relevant if it is correlated with $X_1$ after controlling for $X_2, \dots, X_K$
• We can assess whether this is true by regressing $X_1$ on $Z, X_2, \dots, X_K$
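The relevance check amounts to inspecting the coefficient on Z in the first-stage regression. A minimal sketch on simulated Price, Income, and Fuel Cost data follows; the data, the coefficient values, and the helper name `ols_with_t` are assumptions for illustration, and the t-statistics use classical (homoskedastic) standard errors.

```python
import numpy as np

def ols_with_t(y, X):
    """OLS coefficients and classical t-statistics for design matrix X."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # coefficient covariance matrix
    return coef, coef / np.sqrt(np.diag(cov))

rng = np.random.default_rng(4)
n = 2000
income = rng.normal(50, 10, n)
fuel_cost = rng.normal(3, 0.5, n)
price = 2 + 0.05 * income + 1.5 * fuel_cost + rng.normal(0, 1, n)

# First stage: Price on Fuel Cost (the instrument) and Income (the control)
X = np.column_stack([np.ones(n), fuel_cost, income])
coef, t_stats = ols_with_t(price, X)

print("Fuel cost coefficient:", round(coef[1], 3))
print("Fuel cost t-statistic:", round(t_stats[1], 1))
```

With a single instrument, the first-stage F-statistic for the instrument equals the square of this t-statistic. A commonly cited rule of thumb (not taken from this chapter) treats a first-stage F below about 10 as a warning sign of a weak instrument.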
[Table: Regression Output for Price Regressed on Income and Fuel Costs]
Relevance
• It is important to establish convincing evidence that an instrumental variable (or set of instrumental variables) is relevant
• Doing so avoids a common criticism of instrumental variables centered on the usage of weak instruments
• A weak instrument is an instrumental variable that has little partial correlation with the endogenous variable whose causal effect on an outcome it is meant to measure
[Table: Regression Results for $X_1$ Regressed on $X_2$, $X_3$, $Z_1$, and $Z_2$]
[Table: Regression Results for Y Regressed on $\hat{X}_1$, $X_2$, and $X_3$]
Classical Applications of Instrumental Variables
• Cost variables are popular choices as instrumental variables, particularly in demand estimations
• Any variable that affects the costs of producing the good or service (input prices, cost per unit, etc.) can be a valid instrument for Price
• Prices charged typically depend on costs
• Cost variables are often both relevant and exogenous when used to instrument for Price in a demand equation
Classical Applications of Instrumental Variables
• Policy change is another popular choice as an instrumental variable
• Local sales tax and/or price regulations can serve as instrumental variables for Price in a demand equation
• Labor laws can serve as instrumental variables for wages when seeking to measure the effect of wages on productivity
• Policy changes often affect business decisions (making them relevant) but often occur for reasons not related to business outcomes (exogenous)
Panel Data Method
• With panel data we are able to observe the same cross-sectional unit multiple times at different points in time
• Difference-in-differences regression
• Fixed-effects model
  • Dummy variable estimation
  • Within estimation
Difference-in-Differences
• Consider an individual who owns a large number of liquor stores in the states of Indiana and Michigan
• Suppose the Indiana state government decides to increase the sales tax on liquor sales by 3%
• The owner may want to know the effect of this tax increase on her profit
Difference-in-Differences
• To learn the effect of the tax increase on profit, the store owner collects data for two years as shown below:
Difference-in-Differences
• To assess the effect of a tax hike on profit, the store owner may assume the following data-generating process:
$\text{Profits}_{it} = \alpha + \beta\,\text{TaxHike}_{it} + U_{it}$
• $\text{Profits}_{it}$ is the profit of store i during Year t, and $\text{TaxHike}_{it}$ equals 1 if the 3% tax hike was in place for store i during Year t and 0 otherwise
• We could regress Profits on TaxHike, but it is difficult to argue that TaxHike is not endogenous
• TaxHike equals 1 for a specific group of stores at a specific time; this method of administering the treatment may be correlated with unobserved factors affecting Profits
Difference-in-Differences
• Control for a cross-sectional group (g = Indiana, Michigan) and for time (t = 2016, 2017)
• Assume the following model:
$\text{Profits}_{igt} = \alpha + \beta_1 \text{Indiana}_g + \beta_2 \text{Year}_t + \beta_3 \text{TaxHike}_{gt} + U_{igt}$
• The data-generating process can also be written as:
$\text{Profits}_{igt} = \alpha + \beta_1 \text{Indiana}_g + \beta_2 \text{Year}_t + \beta_3 \text{Indiana}_g \times \text{Year}_t + U_{igt}$
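The two forms of the model agree because the treatment dummy is the product of the group and year dummies. Assuming $\text{Indiana}_g = 1$ for Indiana stores (0 for Michigan) and $\text{Year}_t = 1$ for 2017 (0 for 2016), which is the coding the later derivation implies, the four group-year cells work out as follows:

```latex
\text{TaxHike}_{gt} = \text{Indiana}_g \times \text{Year}_t =
\begin{cases}
0 \times 0 = 0 & \text{Michigan, 2016} \\
0 \times 1 = 0 & \text{Michigan, 2017} \\
1 \times 0 = 0 & \text{Indiana, 2016} \\
1 \times 1 = 1 & \text{Indiana, 2017}
\end{cases}
```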
Difference-in-Differences
• $\beta_3$ is the diff-in-diff for profits in this example
• Expected difference in profits between 2017 and 2016 for Indiana (the error terms are assumed to average out within each group and year):
$(\alpha + \beta_1 + \beta_2 + \beta_3) - (\alpha + \beta_1) = \beta_2 + \beta_3$
• Expected difference in profits between 2017 and 2016 for Michigan:
$(\alpha + \beta_2) - \alpha = \beta_2$
• Take the difference between the change in profits in Indiana and Michigan to get the diff-in-diff:
$(\beta_2 + \beta_3) - \beta_2 = \beta_3$
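The derivation above can be checked numerically: the difference of the two before-and-after changes in mean profits equals the coefficient on the interaction term in the regression. The sketch uses simulated profits with assumed parameter values (including a true $\beta_3$ of -4); it is not the store owner's data.

```python
import numpy as np

rng = np.random.default_rng(5)
n_per_cell = 500          # stores per state-year cell (assumed)
true_effect = -4.0        # assumed beta_3

indiana = np.repeat([0, 0, 1, 1], n_per_cell)   # 0 = Michigan, 1 = Indiana
year = np.repeat([0, 1, 0, 1], n_per_cell)      # 0 = 2016, 1 = 2017
tax_hike = indiana * year                       # 1 only for Indiana in 2017
profits = (50 + 6 * indiana + 3 * year + true_effect * tax_hike
           + rng.normal(0, 2, indiana.size))

def cell_mean(g, t):
    """Mean profits for state g in year t."""
    return profits[(indiana == g) & (year == t)].mean()

# Diff-in-diff from the four cell means
change_indiana = cell_mean(1, 1) - cell_mean(1, 0)
change_michigan = cell_mean(0, 1) - cell_mean(0, 0)
did_means = change_indiana - change_michigan

# Diff-in-diff as the interaction coefficient in the regression
X = np.column_stack([np.ones(indiana.size), indiana, year, tax_hike])
coef, *_ = np.linalg.lstsq(X, profits, rcond=None)
did_regression = coef[3]

print("Diff-in-diff from means:     ", round(did_means, 3))
print("Diff-in-diff from regression:", round(did_regression, 3))
```

The two numbers coincide because the regression has one parameter for each of the four state-year cells, so it fits the cell means exactly.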
[Table: Difference-in-Differences for Liquor Profits in Indiana and Michigan]
Difference-in-Differences
• Difference-in-differences (diff-in-diff) is the difference in the temporal change for the outcome between the treated and untreated group
• Diff-in-diff is highly effective and applies to dichotomous treatments spanning two periods
The Fixed-Effects Model
• A fixed effects model is a data-generating process for panel data that includes controls for cross-sectional groups
• The controls for cross-sectional groups are called fixed effects
• For a data-generating process to be characterized as a fixed effects model, it need only have controls for the cross-sectional groups
• Can control for time periods by including time trends
• $\text{Outcome}_{igt} = \alpha + \delta_2 \text{Group2}_g + \dots + \delta_G \text{GroupG}_g + \gamma\,\text{Time}_t + \beta\,\text{Treatment}_{gt} + U_{igt}$
The Fixed-Effects Model
• By controlling for the groups and periods, many possible confounding factors in the data-generating process are eliminated
• Can add controls ($X_{igt}$’s) beyond the fixed effects and time dummies to help eliminate some of the remaining confounding factors
• Two ways of estimating the fixed-effects model include: dummy variable estimation and within estimation
The Fixed-Effects Model: Dummy Variable Estimation
• Dummy variable estimation uses regression analysis to estimate all of the parameters in the fixed effects data-generating process
• Regress the Outcome on dummy variables for each cross-sectional group (except the base group), dummy variables for each period (except the base period), and the treatment
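A minimal sketch of dummy variable estimation on a simulated balanced panel follows. The panel sizes, the true treatment effect of -1.5, and the way Treatment is made to correlate with the group effects are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
G, T, N = 5, 4, 30        # groups, periods, units per group (assumed sizes)
beta = -1.5               # assumed true treatment effect

# One row per unit i, group g, period t
group = np.repeat(np.arange(G), T * N)
period = np.tile(np.repeat(np.arange(T), N), G)

group_effect = rng.normal(0, 5, G)
time_effect = rng.normal(0, 2, T)
# Treatment varies by group and period and is correlated with the group effect
treatment_gt = 5 + 0.5 * group_effect[:, None] + rng.normal(0, 1, (G, T))
treatment = treatment_gt[group, period]
outcome = (20 + group_effect[group] + time_effect[period]
           + beta * treatment + rng.normal(0, 1, group.size))

# Dummy variables, omitting the base group (0) and the base period (0)
group_dummies = (group[:, None] == np.arange(1, G)).astype(float)
period_dummies = (period[:, None] == np.arange(1, T)).astype(float)

X = np.column_stack([np.ones(group.size), group_dummies, period_dummies, treatment])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

fixed_effects_hat = coef[1:G]     # estimated group effects relative to the base group
beta_hat = coef[-1]               # estimated treatment effect
print("Estimated treatment effect:", round(beta_hat, 3))
```

The regression returns one coefficient per non-base group and per non-base period in addition to the treatment coefficient, so the number of estimated parameters grows with the number of groups.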
[Table: Subset of Dummy Variable Estimation Results for Sales Regressed on Tax Rate]
The Fixed-Effects Model: Dummy Variable Estimation
• Interpreting the table from the previous slide:
• Each state coefficient measures the effect on a store’s profits of moving the store from the base state (State 1) to that alternative state, for a given year and tax rate
• Each year coefficient measures the effect on a store’s profits of moving the store from the base year (Year 1) to that alternative year, for a given state and tax rate
• The coefficient on Tax Rate measures the effect on a store’s profits of changing the Tax Rate, for a given state and year
The Fixed-Effects Model: Within Estimation
• Within estimation uses regression analysis of within-group differences in variables to estimate the parameters in the fixed effects data-generating process, except for those corresponding to the fixed effects (and the constant)
• Eliminates the need to estimate the coefficient for each fixed effect
The Fixed-Effects Model: Within Estimation
$\text{Outcome}_{igt} = \alpha + \delta_2 \text{Group2}_g + \dots + \delta_G \text{GroupG}_g + \gamma\,\text{Time}_t + \beta\,\text{Treatment}_{gt} + U_{igt}$
• We estimate the period coefficients $\gamma_2, \dots, \gamma_T$ (one per period dummy) and $\beta$ via within estimation:
1. Determine the cross-sectional groups and calculate group-level means, where $N_g$ is the number of units in group g:
$\overline{\text{Outcome}}_g = \frac{1}{N_g T} \sum_{i=1}^{N_g} \sum_{t=1}^{T} \text{Outcome}_{igt}$ and $\overline{\text{Treatment}}_g = \frac{1}{N_g T} \sum_{i=1}^{N_g} \sum_{t=1}^{T} \text{Treatment}_{gt}$
2. Create new variables: $\text{Outcome}^*_{igt} = \text{Outcome}_{igt} - \overline{\text{Outcome}}_g$, $\text{Treatment}^*_{igt} = \text{Treatment}_{gt} - \overline{\text{Treatment}}_g$
3. Regress Outcome* on Treatment* and the Period dummy variables
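The three steps can be sketched on a simulated balanced panel and compared with a regression that includes the group dummies directly. The panel sizes, parameter values, and the helper name `group_mean` are assumptions for illustration; the exact agreement shown here relies on the panel being balanced.

```python
import numpy as np

rng = np.random.default_rng(7)
G, T, N = 5, 4, 30        # groups, periods, units per group (assumed sizes)
beta = -1.5               # assumed true treatment effect

group = np.repeat(np.arange(G), T * N)
period = np.tile(np.repeat(np.arange(T), N), G)
group_effect = rng.normal(0, 5, G)
time_effect = rng.normal(0, 2, T)
treatment_gt = 5 + 0.5 * group_effect[:, None] + rng.normal(0, 1, (G, T))
treatment = treatment_gt[group, period]
outcome = (20 + group_effect[group] + time_effect[period]
           + beta * treatment + rng.normal(0, 1, group.size))

def group_mean(values):
    """Mean of `values` within each group, repeated on every row of that group."""
    sums = np.bincount(group, weights=values, minlength=G)
    counts = np.bincount(group, minlength=G)
    return (sums / counts)[group]

# Steps 1 and 2: subtract group-level means from Outcome and Treatment
outcome_star = outcome - group_mean(outcome)
treatment_star = treatment - group_mean(treatment)

# Step 3: regress Outcome* on Treatment* and the period dummies
period_dummies = (period[:, None] == np.arange(1, T)).astype(float)
X_within = np.column_stack([np.ones(group.size), period_dummies, treatment_star])
coef_within, *_ = np.linalg.lstsq(X_within, outcome_star, rcond=None)
beta_within = coef_within[-1]

# For comparison: the same model estimated with explicit group dummies
group_dummies = (group[:, None] == np.arange(1, G)).astype(float)
X_dummy = np.column_stack([np.ones(group.size), group_dummies, period_dummies, treatment])
coef_dummy, *_ = np.linalg.lstsq(X_dummy, outcome, rcond=None)
beta_dummy = coef_dummy[-1]

print("Within estimate of beta:        ", round(beta_within, 4))
print("Dummy variable estimate of beta:", round(beta_dummy, 4))
```

The within regression never estimates the group coefficients, yet it arrives at the same treatment coefficient as the regression with group dummies.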
Comparing Estimation Methods
• Dummy variable estimation provides estimates for the fixed effects (the effects of switching groups on the outcome), whereas within estimation does not
• For dummy variable estimation, R-squared is often misleadingly high, suggesting a very strong fit
• For within estimation, R-squared is more indicative that the variation in Treatment is explaining variation in the Outcome
• Both estimation methods eliminate confounding factors that are fixed across periods for the groups or are fixed across groups over time
• Both estimation methods could yield inaccurate estimates if there are unobserved factors that vary within a group over time