Chapter 8
Linear regression analysis

In the previous chapters, which covered the whole of ECON 221 and about half of ECON 222, we studied the fundamentals of Probability theory and the key theory and toolset of Statistical inference. Remember that we focused solely on understanding statistical distributions and estimating their parameters. In your future scientific, technical and professional practice, this body of knowledge will prove quite fruitful.

Now, we are ready to study the theoretical background and applied dimensions of the 'curve-fitting' problem. To this end, in this chapter, we will consider linear regression models. Notice that what we will do here accounts for the first half of a traditionally designed 'Introductory Econometrics' course.

The term regression was coined by Francis Galton to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average, which is also known as regression toward the mean. For Galton, regression had only this biological meaning. His work was later extended by Udny Yule and Karl Pearson, and later by Fisher (in a way that comes closer to Gauss's 1821 formulation of the problem). If you research it, you will enjoy the history of this line of research.

As its main pillar, regression analysis takes us to the rich analytical world of Econometrics. The literal meaning of the term econometrics (econo+metrics) is 'measurement in economics'. Econometrics is 'the branch of economics concerned with the use of statistical methods in describing and quantifying economic systems' (Oxford Dictionary). From a broader perspective, econometrics is a shared sub-field of Statistics (hence of Mathematics) and Economics. In that, our tools in Econometrics are the tools of Statistics as shaped and augmented by our knowledge of Economics. (One of the founders of the Econometric Society, and another pioneer of the field, Ragnar Frisch is credited with coining the term 'econometrics'.)

Renowned academic Badi Baltagi says "An econometrician has to be a competent mathematician and statistician who is an economist by training. Fundamental knowledge of mathematics, statistics and economic theory are a necessary prerequisite for this field".

Our starting point is a scientific urge to find/formulate, measure and test the relationship between, say, two variables y and x. These variables may belong to the natural sciences, the social sciences or even the humanities; that is not something to worry about. What matters is that the linkage between our variables may not be (and mostly is not) a perfect relationship like y = mx + n (we indeed prefer a notation like y = β0 + β1x, where n is β0 and m is β1). We rather observe deviations from a perfect relationship, as seen earlier in ECON 221. In that, actual y values are connected to actual x values through a relationship like y = β0 + β1x + e, where e stands for a sequence of statistical errors (disturbances).
[Figure: scatter of (x, y) observations around a fitted line, with the deviations from the line marking the errors.]

The error sequence e may stem from the random actions/choices of humans, unexpected shocks to socio-economic systems, misspecification of models, improper choices of mathematical functional forms or imprecision of the data. Note that this picture is not specific to the social sciences: in natural-science experiments, too, there is a multiplicity of sources of uncertainty (hence of statistical errors or disturbances).

The goals of econometrics, as we understand it, will be (1) to find the relation between variables y and x, encapsulated in (β0, β1), (2) to validate and quantify theory, and (3) to forecast.

Purpose of modeling and Simplicity
Deferring a detailed discussion of it to class gatherings, we will say here that ’a model is a downsized yet realistic representation of reality’. An immediate analogy from architecture would be useful: on an architectural model of a building we see things ’only as needed’. While we may not see the doorknobs (depending on the scale) on a model, we see the proportionality of distances clearly. After all, the purpose of the model is to give a broad yet accurate idea of/about things.

A similar idea applies in other disciplines. In business models we do not see every tiny detail of the workplace or the manufacturing environment. In economic models we tend not to include all potential explanatory variables at once. We just try to remain 'accurate enough'.

Using our models we can present our scientific grasp of nature, the universe or society. Once the model is well parametrized and quantified, we can develop forecasts of the future, or we can (depending on the type of our model) develop counterfactual and/or scenario analyses. While presenting our scientific view and forecasting the future are fairly pragmatic ends, a third use of a scientific model, testing and validating/invalidating theories, calls for a more than pragmatic spirit. Regardless of the purpose, though, a model (any model) should display a certain level of simplicity. Before proceeding, recall Albert Einstein saying "Everything should be made as simple as possible, but no simpler."

In our practice of statistical/econometric modeling, the 'principle of parsimony' guides us. Equipped with a rich toolset of formal statistical tests and her judgmental skills, a good researcher tries to come up with an "as simple as possible but no simpler" model. Common sense says the essentials should be included in a model while all the inessentials should be omitted from it. The bad news is that every researcher hits a few bumps while developing such a sense in practice; the good news is that honest, hard work pays off.

In Philosophy (and Science) there are several 'razors' to shave away the redundancies in models (or in scientific explanations). Here we will maintain Occam's razor (or Ockham's razor), attributed to William of Ockham, an English philosopher of the 13th-14th centuries. Occam's razor is a principle of parsimony stating that, among the explanations addressing the same thing, the simplest is to be picked! (William of Baskerville of The Name of the Rose by Umberto Eco is a tribute to William of Ockham.) (Arthur Conan Doyle's Sherlock Holmes once utters "When you have eliminated the impossible, whatever remains, however improbable, must be the truth.")

Occam's razor reads in Latin as "pluralitas non est ponenda sine necessitate", which translates into English as "plurality should not be posited without necessity". The principle thus calls for parsimony in 'deductive thinking'.

Although what we do in applied statistics/econometrics is not purely (maybe not at all) deductive thinking, and we rather try to reach an inference to the best explanation via a formal sequence of estimations/tests/calculations, Occam's razor still sheds good light for us to see things clearly in this practice.

In the world of project development you may hear the same principle as an acronym of ’KISS’. Referring to a model, KISS reads as ’Keep It Small and Simple’ or sometimes as ’Keep It Simple, Stupid’. (Search yourself for its relevance to the US Navy)

In the remainder of this chapter, we will study/learn the theory of elementary econometrics and a rich enough toolset pertaining to it, along with a selection of applied problems.

8.1 EXERCISES

1. 

Refer to our in-class discussions to explain/discuss the following:

i. Occam’s razor

ii. Principle of parsimony

iii. ’Keep It Small and Simple’, i.e., KISS

iv. Purpose of modeling

v. Come up with a synthesis of the terms/phrases referred to above.

Solution: Left as self-exercise.

Checkpoint No: 86

8.1 Overview of linear models


The specific meaning of linearity here is 'the linearity of a model in terms of (with respect to) its parameters'. In that:

y = β0+ β1x1+ β2x2+ e

is a linear model. So is

y = β0 + β1x1² + β2x2³ + e

However,

y = β0 + β1²x1 + β1β3x2 + β3x3 + e

is not considered to be a linear model. Neither is

y = β0 + β1x1+ β2x2+ β1β2x3+ e.

In your future practice, you will be able to settle this issue in a crystal clear fashion.

Why do we resort to linear models? This is a very legitimate question once we observe that a number of relationships in nature and in societal life are, indeed, nonlinear (not linear). A straightforward answer reads as 'linear models are easy to use'. So, simplicity matters. Simplicity brings practicality to researchers: linear models are easy to compute, to interpret and to communicate. More importantly, as noted earlier, our linear regression models are linear with respect to their parameters while the independent variables of our models can be of any nonlinear form. All in all, one can establish/form 'models that are nonlinear in their variables' using 'models that are linear in their parameters'. The good thing about models that are linear in parameters is that such a structure allows us to use the tools of linear algebra effectively in our computations.

Our curious nature often forces us to include many explanatory variables in a model:

y = β0+ β1x1+ β2x2+ ⋅⋅⋅+ βkxk

However, a minimalist design is also possible:

y = β0 + β1x

Even this may be a good enough model (think when):

y = β0

The process of inference begins with the specification of an economic model. Then a statistical model describes the sampling process that we visualize was used to produce the sample data. See the structure below:

Economic model:

y = β0+ β1x

Statistical model:

y = β0 + β1x + e

The random error term (e) serves three main purposes:

1.
e captures the combined effect of all other influences other than x. These other effects are assumed to be unobservable, otherwise they would be included in the model.
2.
e captures any approximation error that arises because of the linear functional form
3.
e captures any element of random behavior present in each individual observation.

See the structures below:

Case 1: Unconditional model of mean

Economic model:

y = β0

Statistical model:

y = β0+ e

Case 2: Simple Linear model

Economic model:

y = β0+ β1x

Statistical model:

y = β0 + β1x + e

Case 3: Multiple Linear model

Economic model:

y = β0+ β1x1+ β2x2+ ⋅⋅⋅+ βkxk

Statistical model:

y = β0 + β1x1 + β2x2 + ⋅⋅⋅ + βkxk + e

Checkpoint No: 87

8.2 Transformations and functional forms

In economics and finance, as in other quantitative disciplines, we attribute a great deal of importance to measuring the impact of a change in one variable on another. Considering y = f(x) as a relationship between the variables y (dependent) and x (independent), the derivative dy/dx = f′(x) describes that impact. When we consider y = f(x1, x2, ..., xk), the impact of an independent variable xi on the dependent variable y is better described by the partial derivative ∂y/∂xi. Having formed and estimated a proper statistical/econometric model, then, a researcher gains a good grasp of the issues embedded in the research problem at hand.

Note that, as economists and finance specialists, we like to learn about a special class of impact measurements, namely the elasticities. Recall from your introductory economics classes that 'the elasticity of y with respect to x is the percentage change in y against a one percent change in x'. In formal terms:

$$\eta_{y,x} = \frac{\%\Delta y}{\%\Delta x} = \frac{\Delta y / y}{\Delta x / x} = \frac{\Delta y}{\Delta x}\cdot\frac{x}{y}$$

So, as long as we can estimate Δy/Δx, we can come up with an estimate of ηy,x by substituting appropriate values of x and y into x/y. We will see several examples as we progress through this chapter, where we will see that estimating an elasticity is possible under a wide array of functional forms of f(·) in the expression y = f(x).

One of the functional forms, i.e., the Log-Log form, yields elasticities directly as:

$$\eta_{y,x} = \frac{\Delta \ln y}{\Delta \ln x}.$$

We will discuss this topic further in our classes.

Functional form: Linear

$$y_i = \beta_0 + \beta_1 x_i + e_i$$

Nonlinear form: None

Impact at margin:
$$\frac{dy}{dx} = \beta_1$$

Elasticity:
$$\frac{dy}{dx}\cdot\frac{x}{y} = \beta_1\frac{x}{y}$$

Functional form: Reciprocal

$$y_i = \beta_0 + \beta_1\frac{1}{x_i} + e_i$$

Nonlinear form: None

Impact at margin:
$$\frac{dy}{dx} = -\beta_1\frac{1}{x^2}$$

Elasticity:
$$\frac{dy}{dx}\cdot\frac{x}{y} = -\beta_1\frac{1}{xy}$$

Functional form: Log-Log

$$\ln y_i = \beta_0 + \beta_1\ln x_i + e_i$$

Nonlinear form:
$$y_i = \alpha x_i^{\beta_1}e^{e_i}, \qquad \alpha = e^{\beta_0}$$

Impact at margin:
$$\frac{dy}{dx} = \beta_1\frac{y}{x}$$

Elasticity:
$$\frac{dy}{dx}\cdot\frac{x}{y} = \beta_1$$

Functional form: Log-Linear (exponential)

$$\ln y_i = \beta_0 + \beta_1 x_i + e_i$$

Nonlinear form:
$$y_i = e^{\beta_0 + \beta_1 x_i + e_i}$$

Impact at margin:
$$\frac{dy}{dx} = \beta_1 y$$

Elasticity:
$$\frac{dy}{dx}\cdot\frac{x}{y} = \beta_1 x$$

Functional form: Linear-Log (semilog)

$$y_i = \beta_0 + \beta_1\ln x_i + e_i$$

Nonlinear form:
$$e^{y_i} = e^{\beta_0}x_i^{\beta_1}e^{e_i}$$

Impact at margin:
$$\frac{dy}{dx} = \beta_1\frac{1}{x}$$

Elasticity:
$$\frac{dy}{dx}\cdot\frac{x}{y} = \beta_1\frac{1}{y}$$

Functional form: Log-Inverse

$$\ln y_i = \beta_0 - \beta_1\frac{1}{x_i} + e_i$$

Nonlinear form:
$$y_i = e^{\beta_0 - \beta_1/x_i + e_i}$$

Impact at margin:
$$\frac{dy}{dx} = \beta_1\frac{y}{x^2}$$

Elasticity:
$$\frac{dy}{dx}\cdot\frac{x}{y} = \beta_1\frac{1}{x}$$
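As a quick numerical illustration of the Log-Log row (our own sketch, not part of the original text): for y = e^β0 · x^β1, the elasticity (dy/dx)(x/y) should come out as β1. A minimal Python check with hypothetical parameter values:

    import numpy as np

    # Minimal sketch: verify numerically that the log-log form has a constant
    # elasticity equal to beta_1. The parameter values below are hypothetical.
    b0, b1 = 0.5, 1.7
    x = 4.0
    y = np.exp(b0) * x**b1

    h = 1e-6                                         # finite-difference step
    dydx = (np.exp(b0) * (x + h)**b1 - np.exp(b0) * (x - h)**b1) / (2 * h)
    print(dydx * x / y)                              # ~1.7, i.e. beta_1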

Checkpoint No: 88

8.3 Our approach to teaching/learning

In the remainder of this chapter, we will maintain an approach which may slightly differ from the approaches of others. Sticking to this approach will facilitate better learning. Our approach unfolds as:

Taking a pit stop here: the sequence of topics above will provide us with a solid understanding of the mechanical workings of our linear regression universe.

Once we have learned these, we will move to:

In many, maybe all, books the Gauss-Markov assumptions are covered before anything else. Our approach, though, maintains a different pedagogical perspective: we take up the Gauss-Markov assumptions, which are crucial in econometric theory and practice, only after gaining a clear view of the working environment. After that we will move to:

Note that the above order of topics requires us to stick to it, without interruption or gaps, for successful learning.
An artificial data set:

In our subsequent discussions we will be referring to the following data set frequently. While we can show a data set as an actual set (with proper mathematical notation) like:

A={(2,1),   (2,3),   (3,2),   (3,3),    (3,4),   (5,3),
 (5,4),    (5,6),   (8,5),   (8,8),   (10,6),   (11,8),

(11,10),  (12,8), (14,10), (15,11),  (15,17), (16,13),
(19,15),  (21,16), (23,18), (23,19),  (23,20), (25,18),
(25,20),  (26,21), (27,24), (28,21),  (28,24), (28,25)}

it may be more practical to use a tabular listing of the data. A tabular structure improves visibility and exposition:







Observation i   xi   yi      Observation i   xi   yi
1               2    1       16              15   11
2               2    3       17              15   17
3               3    2       18              16   13
4               3    3       19              19   15
5               3    4       20              21   16
6               5    3       21              23   18
7               5    4       22              23   19
8               5    6       23              23   20
9               8    5       24              25   18
10              8    8       25              25   20
11              10   6       26              26   21
12              11   8       27              27   24
13              11   10      28              28   21
14              12   8       29              28   24
15              14   10      30              28   25
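For readers who want to reproduce the computations of the following sections on a computer, here is a minimal Python (numpy) sketch of the artificial data set above; the array names x and y are our own choice:

    import numpy as np

    # The artificial data set of this chapter, typed in for later computations.
    x = np.array([2, 2, 3, 3, 3, 5, 5, 5, 8, 8, 10, 11, 11, 12, 14,
                  15, 15, 16, 19, 21, 23, 23, 23, 25, 25, 26, 27, 28, 28, 28], dtype=float)
    y = np.array([1, 3, 2, 3, 4, 3, 4, 6, 5, 8, 6, 8, 10, 8, 10,
                  11, 17, 13, 15, 16, 18, 19, 20, 18, 20, 21, 24, 21, 24, 25], dtype=float)
    n = len(y)   # 30 observations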






8.4 Building and estimating an Unconditional Model of Mean: A model which is a non-model

Consider a variable y that is modeled as:

y = β0+ e

If we have a sample y1, y2, ..., yn, this relationship can also be written as

yi = β0 + ei,  i = 1, 2, ..., n

It is clear that our model does not include any independent (explanatory) variables on the right-hand side, i.e., values of y are scattered around β0 (if they are not all accidentally equal to β0).

Supposing there are K potential independent variables, x1, x2, ..., xK, that might explain y, the unconditional model of mean can be viewed as:

yi = β0+ 0 ⋅x1i+ 0 ⋅x2i+ ⋅⋅⋅+ 0⋅xki+ ei

where the researcher places zero weight on x1, x2, ..., xK. In that, this model of mean turns out to be the simplest possible model, or more like a non-model. When we plot yi against one of the x's (say xki), the model of mean appears as a horizontal line (as the model disregards the x's). This is simply the orange line displayed below (observe that along the orange line dy/dx = 0):
[Figure: yi plotted against x, with the unconditional mean of y drawn as a horizontal orange line.]

To estimate β0 in y = β0 + e we need two main ingredients:

Now, suppose our estimator is β̂0. Then, the estimated values of yi (denoted as ŷi) are written as:

ŷi = β̂0

Actual values of yi, on the other hand, are:

yi = β̂0 + êi

equivalently

yi = ŷi + êi

The differences between yi and ŷi are the estimated error terms:

êi = yi − ŷi
êi = yi − β̂0

Consider the function S :

$$S = \sum_{i=1}^{n}\hat e_i^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0\right)^2$$

The Least Squares method instructs us to minimize S by optimally choosing β̂0:

$$\min_{\hat\beta_0}\ \sum_{i=1}^{n}\left(y_i - \hat\beta_0\right)^2$$

The F.O.C. for this problem is:

$$\frac{dS}{d\hat\beta_0} = \sum_{i=1}^{n} 2\left(y_i - \hat\beta_0\right)(-1) = 0$$

which is followed by:

$$\sum_{i=1}^{n}\left(y_i - \hat\beta_0\right) = 0$$
$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\hat\beta_0 = 0$$
$$\sum_{i=1}^{n} y_i - n\hat\beta_0 = 0$$
$$\hat\beta_0 = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar y$$

So, not surprisingly, and as may be recalled from our discussion of point estimators, the sample mean is the least squares estimator of the population mean. Namely, β̂0 = ȳ estimates β0.
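A minimal numerical illustration (a sketch of our own, reusing the y array defined with the artificial data set above): the sample mean of y is the least squares choice of β̂0, and S evaluated at the mean is smaller than at a nearby alternative value.

    import numpy as np

    # Least squares estimate of beta_0 in y = beta_0 + e is the sample mean.
    beta0_hat = y.mean()

    # S(b) = sum of squared deviations; the sample mean minimizes it.
    S = lambda b: np.sum((y - b) ** 2)
    print(beta0_hat, S(beta0_hat), S(beta0_hat + 1.0))   # S is larger away from the mean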

A note on the function S may be useful here: as β̂0 is the estimated mean of yi, the function S equals the variance of the estimated error terms multiplied by n. This is good to keep in mind: the least squares estimator is a 'minimum variance estimator', as we will formally discuss later. Statistical properties of the error terms ei will also be covered in detail.

Returning to the qualities of the sample mean β̂0 = ȳ as an estimator of the population mean β0, one can be intellectually stunned by the beauty generated by simplicity. There are a couple of things to mention:

Checkpoint No: 89

8.5 Building and estimating a Simple Linear Regression model

Consider a variable y which we believe is explained by another variable x via a linear relationship like:

y = β0 + β1x + e

in this expression,

Below, the green line is a good candidate to be a Simple Linear regression line:
[Figure: scatter plot of the data with a candidate simple linear regression line drawn in green.]

Notice that we need to estimate two parameters, β0 and β1, this time. The Least Squares method is again applicable. Let us go over its steps below:

ŷi = β̂0 + β̂1xi
yi = β̂0 + β̂1xi + êi
êi = yi − β̂0 − β̂1xi

$$S = \sum_{i=1}^{n}\hat e_i^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2$$

$$\min_{\{\hat\beta_0,\hat\beta_1\}}\ \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2$$

$$\frac{\partial S}{\partial\hat\beta_0} = \sum_{i=1}^{n} 2\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)(-1) = 0$$
$$\frac{\partial S}{\partial\hat\beta_1} = \sum_{i=1}^{n} 2\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)(-x_i) = 0$$

$$\sum\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0$$
$$\sum\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)x_i = 0$$

$$\sum y_i - n\hat\beta_0 - \left(\sum x_i\right)\hat\beta_1 = 0$$
$$\sum x_iy_i - \left(\sum x_i\right)\hat\beta_0 - \left(\sum x_i^2\right)\hat\beta_1 = 0$$

$$n\hat\beta_0 + \left(\sum x_i\right)\hat\beta_1 = \sum y_i$$
$$\left(\sum x_i\right)\hat\beta_0 + \left(\sum x_i^2\right)\hat\beta_1 = \sum x_iy_i$$

Multiplying the first normal equation by $\sum x_i^2/\sum x_i$:

$$\frac{n\sum x_i^2}{\sum x_i}\hat\beta_0 + \left(\sum x_i^2\right)\hat\beta_1 = \frac{\sum x_i^2\sum y_i}{\sum x_i}$$
$$\left(\sum x_i\right)\hat\beta_0 + \left(\sum x_i^2\right)\hat\beta_1 = \sum x_iy_i$$

and subtracting the second from the first:

$$\left(\frac{n\sum x_i^2}{\sum x_i} - \sum x_i\right)\hat\beta_0 = \frac{\sum x_i^2\sum y_i}{\sum x_i} - \sum x_iy_i$$
$$\left(\frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{\sum x_i}\right)\hat\beta_0 = \frac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}{\sum x_i}$$

$$\hat\beta_0 = \frac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

Alternatively, from the first condition:

$$\sum y_i - n\hat\beta_0 - \left(\sum x_i\right)\hat\beta_1 = 0$$
$$n\hat\beta_0 = \sum y_i - \left(\sum x_i\right)\hat\beta_1$$
$$\hat\beta_0 = \bar y - \hat\beta_1\bar x$$

Substituting this into the second condition:

$$\sum x_iy_i - \left(\sum x_i\right)\left(\bar y - \hat\beta_1\bar x\right) - \left(\sum x_i^2\right)\hat\beta_1 = 0$$
$$\sum x_iy_i - \bar y\sum x_i + \hat\beta_1\bar x\sum x_i - \hat\beta_1\sum x_i^2 = 0$$
$$\frac{\sum x_iy_i}{n} - \bar y\frac{\sum x_i}{n} + \hat\beta_1\bar x\frac{\sum x_i}{n} - \hat\beta_1\frac{\sum x_i^2}{n} = 0$$
$$\frac{\sum x_iy_i}{n} - \bar x\bar y + \hat\beta_1\bar x^2 - \hat\beta_1\frac{\sum x_i^2}{n} = 0$$

$$\hat\beta_1 = \frac{\frac{\sum x_iy_i}{n} - \bar x\bar y}{\frac{\sum x_i^2}{n} - \bar x^2} = \frac{n\sum x_iy_i - n^2\bar x\bar y}{n\sum x_i^2 - n^2\bar x^2}$$

$$\hat\beta_1 = \frac{n\sum x_iy_i - \sum x_i\sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

Now, reconsider that β̂0 = ȳ − β̂1x̄ and that Σxiyi − β̂0Σxi − β̂1Σxi² = 0. Substituting the first one into the second:

$$\sum x_iy_i - \bar y\sum x_i + \hat\beta_1\bar x\sum x_i - \hat\beta_1\sum x_i^2 = 0$$
$$\hat\beta_1\left(\sum x_i^2 - \bar x\sum x_i\right) = \sum x_i(y_i - \bar y)$$
$$\hat\beta_1\sum x_i(x_i - \bar x) = \sum x_i(y_i - \bar y)$$
$$\hat\beta_1 = \frac{\sum x_i(y_i - \bar y)}{\sum x_i(x_i - \bar x)}$$

Now notice the following:

$$\sum(x_i-\bar x)(y_i-\bar y) = \sum(x_i-\bar x)y_i = \sum(y_i-\bar y)x_i$$

as,

$$\sum(x_i-\bar x)(y_i-\bar y) = \sum(x_i-\bar x)y_i - \bar y\underbrace{\sum(x_i-\bar x)}_{0} = \sum(y_i-\bar y)x_i - \bar x\underbrace{\sum(y_i-\bar y)}_{0}$$

and, as the sum of the deviations from the mean is zero, i.e.,

$$\sum(x_i-\bar x) = \sum x_i - \sum\bar x = \sum x_i - n\bar x = 0$$

and

∑ (yi− ¯y) = ∑ yi− ∑ ¯y = ∑ yi− n¯y = 0

The same logic applies in:

$$\sum(x_i-\bar x)^2 = \sum(x_i-\bar x)(x_i-\bar x) = \sum(x_i-\bar x)x_i - \bar x\underbrace{\sum(x_i-\bar x)}_{0}$$

At the end, the above-derived expression for β̂1, i.e.,

$$\hat\beta_1 = \frac{\sum x_i(y_i-\bar y)}{\sum x_i(x_i-\bar x)}$$

can be rewritten as:

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}$$

and, so, can also be written as:

$$\hat\beta_1 = \frac{\frac{1}{n}\sum(x_i-\bar x)(y_i-\bar y)}{\frac{1}{n}\sum(x_i-\bar x)^2} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$$

To sum up, our Least Squares estimators β̂0 and β̂1 for the model parameters β0 and β1 are found to be:

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$$

and

$$\hat\beta_0 = \bar y - \hat\beta_1\bar x$$
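As a sanity check (our own sketch, continuing the numpy snippets above with the artificial data set), the two equivalent formulas for β̂1 and the formula for β̂0 can be computed as follows:

    import numpy as np

    # Least squares estimates for y = beta_0 + beta_1 x + e, using the formulas above.
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()
    print(beta0_hat, beta1_hat)

    # Equivalent 'Cov/Var' form (moments divided by n in both numerator and denominator).
    beta1_alt = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    assert np.isclose(beta1_hat, beta1_alt)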

In the graph given below, try to observe why the green line is superior to others in representing our data:
[Figure: scatter plot of the data with the fitted (green) regression line and alternative candidate lines.]

Assumptions of the Simple Linear regression model

SLR1. The value of y, for each value of x, is y = β0 + β1x + e.

SLR2. The average value of the random error e is E(e) = 0, since we assume that E(y) = β0 + β1x.

SLR3. The variance of the random error e is Var(e) = σ² = Var(y).

SLR4. The covariance between any pair of random errors ei and ej is Cov(ei, ej) = Cov(yi, yj) = 0, i ≠ j.

SLR5. The variable x is not random and must take at least two different values.

SLR6. The values of e are normally distributed about their mean: e ∼ Normal(0, σ²).

Checkpoint No: 90

8.6 Building and estimating a Multiple Linear Regression model: An increase in dimensionality

Consider

y = β0 + β1x1 + β2x2 + ⋅⋅⋅ + βkxk + e

where

ŷi = β̂0 + β̂1xi1 + β̂2xi2 + ⋅⋅⋅ + β̂KxiK
yi = β̂0 + β̂1xi1 + β̂2xi2 + ⋅⋅⋅ + β̂KxiK + êi
êi = yi − β̂0 − β̂1xi1 − ⋅⋅⋅ − β̂KxiK

$$S = \sum_{i=1}^{n}\hat e_i^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_K x_{iK}\right)^2$$
$$\min_{\{\hat\beta_0,\hat\beta_1,\cdots,\hat\beta_K\}} S$$

As before, this minimization problem will give us β̂0, β̂1, ..., β̂K, i.e., the estimators of β0, β1, ..., βK.

For future ease, let us restate our Multiple Linear model using matrix notation. To do this, let us first write our model equation for every single observation (for each i = 1, 2, ..., n):

y1 = β0 + β1x11 + β2x12 + ⋅⋅⋅ + βKx1K + e1
y2 = β0 + β1x21 + β2x22 + ⋅⋅⋅ + βKx2K + e2
⋅⋅⋅
yn = β0 + β1xn1 + β2xn2 + ⋅⋅⋅ + βKxnK + en

In matrix notation:

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{y\ (n\times 1)} = \underbrace{\begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1K} \\ 1 & x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & & & & \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nK} \end{bmatrix}}_{X\ (n\times(K+1))} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}}_{\beta\ ((K+1)\times 1)} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}}_{e\ (n\times 1)}$$

can be written. Then,

y = Xβ + e

It is also possible to write each explanatory variable as a separate vector like:

$$x_0 = \begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix},\quad x_1 = \begin{bmatrix}x_{11}\\x_{21}\\\vdots\\x_{n1}\end{bmatrix},\quad \ldots,\quad x_K = \begin{bmatrix}x_{1K}\\x_{2K}\\\vdots\\x_{nK}\end{bmatrix}$$

so the model looks like:

y = x0β0+ x1β1+ ⋅⋅⋅+ xKβK + e

When the matrix expression y = Xβ + e is maintained, the function S becomes

S = e′e

where e′ is the transpose of e.

Returning to our minimization problem written in classical notation, the following first order conditions are written:

$$\frac{\partial S}{\partial\hat\beta_0} = \sum_{i=1}^{n} -2\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_K x_{iK}\right) = 0$$
$$\frac{\partial S}{\partial\hat\beta_1} = \sum_{i=1}^{n} -2\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_K x_{iK}\right)x_{i1} = 0$$
$$\cdots$$
$$\frac{\partial S}{\partial\hat\beta_K} = \sum_{i=1}^{n} -2\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_K x_{iK}\right)x_{iK} = 0$$

Simplifying a little:

$$\sum\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_K x_{iK}\right) = 0$$
$$\sum\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_K x_{iK}\right)x_{i1} = 0$$
$$\cdots$$
$$\sum\left(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_K x_{iK}\right)x_{iK} = 0$$

Reorganizing the terms:

$$n\hat\beta_0 + \left(\sum x_{i1}\right)\hat\beta_1 + \cdots + \left(\sum x_{iK}\right)\hat\beta_K = \sum y_i$$
$$\left(\sum x_{i1}\right)\hat\beta_0 + \left(\sum x_{i1}^2\right)\hat\beta_1 + \cdots + \left(\sum x_{i1}x_{iK}\right)\hat\beta_K = \sum x_{i1}y_i$$
$$\cdots$$
$$\left(\sum x_{iK}\right)\hat\beta_0 + \left(\sum x_{i1}x_{iK}\right)\hat\beta_1 + \cdots + \left(\sum x_{iK}^2\right)\hat\beta_K = \sum x_{iK}y_i$$

Notice that this last set of equations can be written as:

$$\begin{bmatrix} n & \sum x_{i1} & \cdots & \sum x_{iK} \\ \sum x_{i1} & \sum x_{i1}^2 & \cdots & \sum x_{i1}x_{iK} \\ \vdots & & & \vdots \\ \sum x_{iK} & \sum x_{i1}x_{iK} & \cdots & \sum x_{iK}^2 \end{bmatrix}\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\vdots\\\hat\beta_K\end{bmatrix} = \begin{bmatrix}\sum y_i\\\sum x_{i1}y_i\\\vdots\\\sum x_{iK}y_i\end{bmatrix}$$

In terms of our earlier definitions of X and y, as well as β, what we have obtained is

X′Xβ̂ = X′y

So,

β̂ = (X′X)⁻¹X′y

solves our minimization problem, and β̂ = [β̂0, β̂1, β̂2, ⋅⋅⋅, β̂K] contains our parameter estimates.
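A minimal matrix-form sketch (our own, continuing the numpy snippets above): for the simple one-regressor design built from the artificial data, β̂ = (X′X)⁻¹X′y can be computed as below. We call np.linalg.solve instead of forming an explicit inverse, a standard numerical precaution rather than anything required by the formula itself:

    import numpy as np

    # n x (K+1) design matrix: a column of ones plus the single regressor x.
    X = np.column_stack([np.ones_like(x), x])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) beta = X'y
    print(beta_hat)                                 # [beta0_hat, beta1_hat]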

8.2 EXERCISES

1. 

Reconsider the Simple Linear model y = β0 + β1x + e and show that the β̂ = (X ′X ) 1Xy works in estimating β0 and β1 (ie., while finding ̂β 0 and β̂ 1 ). Solution:

$$X'X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix},\qquad X'y = \begin{bmatrix} \sum y_i \\ \sum x_iy_i \end{bmatrix}$$
$$(X'X)^{-1} = \frac{1}{n\sum x_i^2 - (\sum x_i)^2}\begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}$$

So,

$$\hat\beta = \frac{1}{n\sum x_i^2 - (\sum x_i)^2}\begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}\begin{bmatrix} \sum y_i \\ \sum x_iy_i \end{bmatrix}$$

So,

$$\hat\beta_0 = \frac{\sum x_i^2\sum y_i - \sum x_i\sum x_iy_i}{n\sum x_i^2 - (\sum x_i)^2}$$
$$\hat\beta_1 = \frac{-\sum x_i\sum y_i + n\sum x_iy_i}{n\sum x_i^2 - (\sum x_i)^2} = \frac{n\sum x_iy_i - \sum x_i\sum y_i}{n\sum x_i^2 - (\sum x_i)^2}$$

Checking against our earlier solution, we verify that β̂ = (X′X)⁻¹X′y works well.

2. 

Question: Write and solve the Least squares estimation problem for

yi = β0+ β 1xi1+ β2xi2 + ei

that is, a model with a constant term (β0) and two explanatory variables.

Solution: Left as self-study.

As you have noticed, we used/devised the term (X′X)⁻¹ in solving our estimation problem. Think about what ensures the invertibility of X′X. In your future learning and practice this will be a central technical issue to address many times.

Assumptions of the Multiple Linear regression model

MLR1. yi = β0 + β1xi1 + β2xi2 + ⋅⋅⋅ + βkxik + ei

MLR2. E(yi) = β0 + β1xi1 + β2xi2 + ⋅⋅⋅ + βkxik ⟺ E(ei) = 0

MLR3. Var(yi) = Var(ei) = σ²

MLR4. Cov(yi, yj) = Cov(ei, ej) = 0, i ≠ j

MLR5. The values of xik are not random and are not exact linear functions of the other explanatory variables.

MLR6. yi ∼ Normal(β0 + β1xi1 + ⋅⋅⋅ + βKxiK, σ²) ⟺ ei ∼ Normal(0, σ²)

Checkpoint No: 91

8.7 Goodness of fit

Suppose we have the following model:

$$y_i = \underbrace{\hat\beta_0 + \hat\beta_1 x_i}_{\hat y_i} + e_i = \hat y_i + e_i$$

Observe that

yi− y¯= ̂yi− ¯y+ ei

and consider the quantity Σ(yi − ȳ)². This quantity is called the 'Total Sum of Squares'. In what follows, we decompose it into other useful quantities:

$$\sum(y_i-\bar y)^2 = \sum(\hat y_i-\bar y)^2 + 2\sum(\hat y_i-\bar y)e_i + \sum e_i^2$$

Reordering the terms in the last expression:

$$\sum(y_i-\bar y)^2 = \sum(\hat y_i-\bar y+e_i)^2 = \sum(\hat y_i-\bar y)^2 + \sum e_i^2 + 2\underbrace{\sum(\hat y_i-\bar y)e_i}_{0}$$
$$\sum(y_i-\bar y)^2 = \sum(\hat y_i-\bar y)^2 + \sum e_i^2$$

is obtained. In this expression,

$$\underbrace{\sum(y_i-\bar y)^2}_{TSS} = \underbrace{\sum(\hat y_i-\bar y)^2}_{ESS} + \underbrace{\sum e_i^2}_{RSS}$$

TSS, ESS and RSS stand for the Total Sum of Squares, the Explained Sum of Squares and the Residual Sum of Squares, respectively.

Notice that the Total Sum of Squares Σ(yi − ȳ)² is nothing but the variance of y multiplied by n:

$$TSS = \sum(y_i-\bar y)^2 = n\left(\frac{1}{n}\sum(y_i-\bar y)^2\right)$$

The Explained Sum of Squares Σ(ŷi − ȳ)² measures the sum of squared deviations of our estimated values of y (namely ŷi) from ȳ (namely the unconditional mean of our dependent variable y). As the ŷi values are implied by our model's explanatory variables (x1, x2, ..., xK), the ESS measures the portion of TSS that we have explained. The Residual Sum of Squares, then, measures the portion of TSS that could not be explained. The Coefficient of Determination R² is the fraction of variation in y explained by our knowledge of x:

$$R^2 = \frac{ESS}{TSS} = \frac{\sum(\hat y_i-\bar y)^2}{\sum(y_i-\bar y)^2} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum\hat e_i^2}{\sum(y_i-\bar y)^2}$$

Note that, if the model does not have a constant term (that is β0 is omitted), then the measure R2 is not appropriate anymore. When the constant term is omitted,

$$\sum(y_i-\bar y)^2 \neq \sum(\hat y_i-\bar y)^2 + \sum e_i^2$$

A bad habit of R² is that it never decreases, and typically increases, upon the inclusion of additional explanatory variables in a model (in fact, even the adjusted measure introduced below rises whenever the added variable's t-statistic exceeds 1 in absolute value, as we will see in subsequent sections). Does this mean we should continue adding more and more explanatory variables to our model 'just to push up R²'? The answer is quite the opposite: we must see the inclusion of more variables as a cost (after all, we want to come up with a parsimonious model). We then need to balance the benefit of more explanatory variables (enhanced ESS) against the cost of including them.

The Adjusted Coefficient of Determination (R̄²) serves that purpose:

$$\bar R^2 = 1 - \frac{RSS/(n-K-1)}{TSS/(n-1)}$$

Notice that:

$$\bar R^2 = 1 - \left(1-R^2\right)\left(\frac{n-1}{n-K-1}\right)$$

Also keep in mind that neither R² nor R̄² has a statistical distribution, so they are not directly and formally testable. A simple arithmetic reorganization of R², though, yields an F test statistic, as we will consider very soon.
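Continuing the numpy sketch above (our own illustration, with X and beta_hat as defined earlier), the goodness-of-fit quantities can be computed directly from their definitions:

    import numpy as np

    y_hat = X @ beta_hat
    e_hat = y - y_hat
    K = X.shape[1] - 1                    # number of slope coefficients
    n = len(y)

    TSS = np.sum((y - y.mean()) ** 2)
    ESS = np.sum((y_hat - y.mean()) ** 2)
    RSS = np.sum(e_hat ** 2)

    R2 = 1 - RSS / TSS
    R2_adj = 1 - (RSS / (n - K - 1)) / (TSS / (n - 1))
    print(R2, R2_adj)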

Checkpoint No: 92

8.8 Handling statistical uncertainty: calculation of variances and covariances associated with a Multiple Linear Regression model

As stated before in 'Our approach to teaching/learning', up to here we have maintained a naive and mechanical view of linear regression modeling. In that, we deliberately avoided calculating and discussing the measures of dispersion or co-dispersion associated with our models. Now it is time to turn to reality. After all, the ei sequence has a certain statistical distribution, and so does yi. As we will formally study under the heading of 'Ideal econometric conditions: Gauss-Markov assumptions', the ei terms have:

ei ∼ Normal(0, σ²)

that is, a Normal (Gaussian) distribution with a mean of zero (0) and constant (and preferably finite) variance.

As a consequence yi values have:

yi ∼ Normal(β0 + β1x1 + ⋅⋅⋅ + βKxK, σ²)

Intuitively, the mean of yi depends on (is conditional on) x1, x2, ..., xK (along with their parameters), while the variance of y simply mimics that of e (by the very construction of our analytical framework).

The key thing to understand now is the variability of our parameter estimates: since they are obtained from a stochastic/random data set, it is natural to expect each of our estimators to have a nonzero variance and each pair of our estimators to have a covariance.

We devote this section to some rigorous treatment of what we call a ’variance-covariance’ matrix.

Let us begin from e ∼ Normal(0, σ²). Once we assume the error terms to have a Normal distribution with a mean of zero (0) and a variance of σ², we may proceed to the following Q&A style mathematical elaboration:

Q: Do we know the value of σ2 ?

A: No, it belongs to the population of ei’s. But, we only have a sample of ei ’s, namely êi ’s.

Q: Can we use those êi ’s to estimate σ2, that is to obtain ̂σ2 ?

A: Yes, the formula for σ̂² is:

$$\hat\sigma^2 = \frac{\sum\hat e_i^2}{n-(K+1)}$$

Q: Can we express σ̂² using matrix notation?

A: Yes, the expression is:

$$\hat\sigma^2 = \frac{\hat e'\hat e}{n-(K+1)} = \frac{(y-X\hat\beta)'(y-X\hat\beta)}{n-(K+1)}$$

Q: What about the Cov(β̂i, β̂j) values, can we calculate them?

A: Sure, in matrix notation,

$$Cov(\hat\beta) = E\left((\hat\beta-\beta)(\hat\beta-\beta)'\right) = \sigma^2(X'X)^{-1}$$

Q: What about the distribution of β̂?

A:

$$\hat\beta \sim Normal\left(\beta,\ \sigma^2(X'X)^{-1}\right)$$

Q: What does this mean?

A: First, each parameter estimate is unbiased, E(β̂) = β. Second, the variances and covariances are governed by σ²(X′X)⁻¹.

Q: What is the structure of the variance-covariance matrix?

A:

$$Cov(\hat\beta) = E\left[(\hat\beta-\beta)(\hat\beta-\beta)'\right] = \begin{bmatrix} Var(\hat\beta_0) & Cov(\hat\beta_0,\hat\beta_1) & \cdots & Cov(\hat\beta_0,\hat\beta_K) \\ Cov(\hat\beta_0,\hat\beta_1) & Var(\hat\beta_1) & \cdots & Cov(\hat\beta_1,\hat\beta_K) \\ \vdots & \vdots & & \vdots \\ Cov(\hat\beta_0,\hat\beta_K) & Cov(\hat\beta_1,\hat\beta_K) & \cdots & Var(\hat\beta_K) \end{bmatrix}$$

$$= \begin{bmatrix} E(\hat\beta_0-\beta_0)^2 & E(\hat\beta_0-\beta_0)(\hat\beta_1-\beta_1) & \cdots & E(\hat\beta_0-\beta_0)(\hat\beta_K-\beta_K) \\ E(\hat\beta_0-\beta_0)(\hat\beta_1-\beta_1) & E(\hat\beta_1-\beta_1)^2 & \cdots & E(\hat\beta_1-\beta_1)(\hat\beta_K-\beta_K) \\ \vdots & \vdots & & \vdots \\ E(\hat\beta_0-\beta_0)(\hat\beta_K-\beta_K) & E(\hat\beta_1-\beta_1)(\hat\beta_K-\beta_K) & \cdots & E(\hat\beta_K-\beta_K)^2 \end{bmatrix}$$

$$= \sigma^2(X'X)^{-1} = \sigma^2\begin{bmatrix} n & \sum x_{i1} & \cdots & \sum x_{iK} \\ \sum x_{i1} & \sum x_{i1}^2 & \cdots & \sum x_{i1}x_{iK} \\ \vdots & & & \vdots \\ \sum x_{iK} & \sum x_{i1}x_{iK} & \cdots & \sum x_{iK}^2 \end{bmatrix}^{-1}$$

Q: But, we do not know the value of σ2?

A: Then, substitute σ̂² for it:

$$\widehat{Cov}(\hat\beta) = \hat\sigma^2(X'X)^{-1}$$

Q: Does that mean we will be using the estimated values of variances and covariances?

A: Sure. This is what we have been doing since the beginning of our ECON 222 journey.

Q: Are we now ready to dive into the fascinating world of statistical inference over our estimated models?

A: Very much, indeed.

Q: Are you an AI?

A: No. Are you?
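Continuing the same numpy sketch (our own illustration, with X, e_hat, n and K defined earlier), σ̂² and the estimated variance-covariance matrix can be obtained as:

    import numpy as np

    # Estimated error variance and variance-covariance matrix of the LS estimator.
    sigma2_hat = np.sum(e_hat ** 2) / (n - (K + 1))
    cov_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)

    se_beta = np.sqrt(np.diag(cov_beta_hat))   # standard errors of beta0_hat, beta1_hat
    print(sigma2_hat, se_beta)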

Checkpoint No: 93

8.9 Statistical inference

We have studied/learned up to this point:

Now, we are ready to place our estimated models under some serious scrutiny. Using the inferential tools that we learned, we will evaluate, test and scientifically question our regression models.

In a bold fashion, we can say that what we did up to here (i.e., estimating regression models) is no more than the half of the job. To have the job actually done, we need to delve into the following tasks:

1.
Estimating confidence intervals for individual model parameters βi
2.
Estimating confidence intervals for linear combinations of (more than one) model parameters
3.
Conducting hypothesis tests for individual model parameters βi
4.
Conducting hypothesis tests for linear combinations of (more than one) model parameters
5.
Conducting hypothesis tests for all of our model parameters at once
6.
Conducting hypothesis tests for specific subsets of our model parameters at once

Now, let us give examples to each category of tasks listed above. To do this, suppose we have the following economic model:

yi = β0 + β1xi1+ β2xi2+ β3xi3 +β 4xi4

Recall that, this is our model written for the population and we turn it into a statistical model (written again for the population) by introducing the statistical error (disturbance, sometimes ’shock’) terms:

yi = β0+ β1xi1+ β 2xi2+ β3xi3 + β4xi4+ ei

where ei ∼ Normal(0, σ²). As you know well by now, we do not know the true values (population values) of the βj's. So, we will estimate the model using a sample of n observations and the Least Squares technique.

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + e_i,\quad i = 1,2,\ldots,n,\quad e_i \sim Normal(0,\sigma^2)$$

Provided that everything goes well on the paper and in the computer, we will end up with a rich set of estimates:

Now, suppose the following claims and/or questions come from an academic/technical colleague. (Needless to say, even when there is no criticizing colleague around, we need to pose such claims on our own and heavily test our models):

1.
Is 0.4 a viable value for β1, with respect to a 95% confidence interval of β1 ?
2.
Is 0.7 a viable value for β1 + β2, with respect to a 95% confidence interval of β1 + β2 ?
3.
Is β3 equal to zero or not; how do we know x3 is an important/significant explanatory variable?
4.
Is β3 + β4 equal to one or not?
5.
Is β1 = β2 = β3 = β4 = 0; how do we know our explanatory variables x1, x2, x3 and x4 matter as a whole?
6.
Is β1 = β2 = 0; how do we know the explanatory variables x1 and x2 matter together?

Our road map to assess these questions begins with formulating these questions/claims in some formal notation:

Following the same order as above:

1.
We will calculate a 95% C.I. for β1 and will check if 0.4 belongs to the calculated interval. This is simply done as:
$$P\left(\hat\beta_1 - t_c\,se(\hat\beta_1) \le \beta_1 \le \hat\beta_1 + t_c\,se(\hat\beta_1)\right) = 1-\alpha$$
2.
We will calculate a 95% C.I. for β1 + β2 and will check if 0.7 belongs to the calculated interval.
$$P\left(\hat\beta_1+\hat\beta_2 - t_c\,se(\hat\beta_1+\hat\beta_2) \le \beta_1+\beta_2 \le \hat\beta_1+\hat\beta_2 + t_c\,se(\hat\beta_1+\hat\beta_2)\right) = 1-\alpha$$

Here, we apparently need to calculate Var(β̂1 + β̂2). Using our knowledge from ECON 221:

$$Var(\hat\beta_1+\hat\beta_2) = Var(\hat\beta_1) + 2Cov(\hat\beta_1,\hat\beta_2) + Var(\hat\beta_2)$$

where Var(β̂1), Cov(β̂1, β̂2) and Var(β̂2) are straightforwardly obtained during the estimation of the model. Once Var(β̂1 + β̂2) is at hand, se(β̂1 + β̂2) = √Var(β̂1 + β̂2) yields the required standard error.

3.
We will conduct the test
H0: β3 = 0
H1: β3 ≠ 0

Distribution of the test statistic:

$$\frac{\hat\beta_3-\beta_3}{\sqrt{Var(\hat\beta_3)}} \sim t_{(n-K-1)}$$

Calculation of the test statistic:

$$\frac{\hat\beta_3-\beta_3^0}{se(\hat\beta_3)} \sim t_{(n-K-1)}, \qquad \frac{\hat\beta_3-0}{se(\hat\beta_3)} \sim t_{(n-K-1)}$$
4.
We will conduct the test
H0: β3 + β4 = 1
H1: β3 + β4 ≠ 1

$$\frac{\hat\beta_3+\hat\beta_4-(\beta_3+\beta_4)}{\sqrt{Var(\hat\beta_3+\hat\beta_4)}} \sim t_{(n-K-1)}$$
$$\frac{\hat\beta_3+\hat\beta_4-(\beta_3+\beta_4)^0}{se(\hat\beta_3+\hat\beta_4)} \sim t_{(n-K-1)}$$
$$\frac{\hat\beta_3+\hat\beta_4-1}{se(\hat\beta_3+\hat\beta_4)} \sim t_{(n-K-1)}$$

Var(β̂3 + β̂4) will be treated as outlined above for the case of Var(β̂1 + β̂2). Note that this test can also be conducted as an F test, as we will cover in our class discussions.

5.
We will conduct the test
H0: β1 = β2 = β3 = β4 = 0
H1: ∃ βi ≠ 0

The total sum of squares TSS being Σ(yi − ȳ)², the explained sum of squares ESS being Σ(ŷi − ȳ)² and the residual sum of squares RSS being Σêi²:

$$\frac{(TSS-RSS)/K}{RSS/(n-K-1)} \sim F_{(K,\,n-K-1)}$$
6.
We will conduct the test
H0: β1 = β2 = 0
H1: β1 ≠ 0 or β2 ≠ 0

J being the number of joint hypotheses, RSSR being the RSS for the restricted model and RSSU being the RSS for the unrestricted model:

$$\frac{(RSS_R-RSS_U)/J}{RSS_U/(n-K-1)} \sim F_{(J,\,n-K-1)}$$

Note again that RSSU is the RSS value of the unrestricted, i.e., full, model, which is:

yi = β0+ β1xi1+ β 2xi2+ β3xi3 + β4xi4+ ei

whereas RSSR is the RSS value of the restricted model, which is:

yi = β0+ β 3xi3+ β4xi4 + ei

equivalently of:

yi = β0 + 0⋅xi1+ 0 ⋅xi2+ β 3xi3+ β4xi4 + ei

Returning to our previous hypothesis test:

H0: β1 = β2 = β3 = β4 = 0
H1: ∃ βi ≠ 0

you will notice that the restricted model is:

yi = β0+ ei

or

yi = β0+ 0 ⋅xi1+ 0⋅xi2+ 0⋅xi3+ 0⋅xi4+ ei

against the unrestricted (full) model of:

yi = β0+ β1xi1+ β 2xi2+ β3xi3 + β4xi4+ ei

Herein, RSSR becomes the TSS of the full model (verify yourself), RSSU becomes the RSS of the full model (should be trivial) and J becomes K. Then, the equivalence between

$$\frac{(RSS_R-RSS_U)/J}{RSS_U/(n-K-1)} \sim F_{(J,\,n-K-1)}$$

and

$$\frac{(TSS-RSS)/K}{RSS/(n-K-1)} \sim F_{(K,\,n-K-1)}$$

becomes apparent.

We will now use a model estimated on a computer to exemplify each of the cases above:
[To be distributed as a handout]
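Until the handout is available, here is a rough Python sketch of our own (continuing the numpy example above and additionally assuming scipy is available) of how tasks of type (1), (3) and (5) are computed for the simple one-regressor model; the handout will do the same for the richer model of this section:

    import numpy as np
    from scipy import stats

    t_c = stats.t.ppf(0.975, df=n - K - 1)

    # 95% confidence interval for beta_1 (task 1 style)
    ci = (beta_hat[1] - t_c * se_beta[1], beta_hat[1] + t_c * se_beta[1])

    # t test of H0: beta_1 = 0 (task 3 style)
    t_stat = beta_hat[1] / se_beta[1]
    p_t = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - K - 1))

    # Overall F test of H0: all slope coefficients are zero (task 5 style)
    F_stat = ((TSS - RSS) / K) / (RSS / (n - K - 1))
    p_F = 1 - stats.f.cdf(F_stat, K, n - K - 1)
    print(ci, t_stat, p_t, F_stat, p_F)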

Checkpoint No: 94

8.10 Essence of the Gauss-Markov assumptions

Having studied the mechanical aspects of Linear Regression models, it is now time to establish the conditions under which a linear regression model is viable, with workable results. As we often call them 'ideal econometric conditions', the Gauss-Markov assumptions level the field for us. If a model abides by these assumptions, i.e., if a model has been formed so as to satisfy the Gauss-Markov assumptions, then it is a good econometric model.

Gauss-Markov assumptions

A1. Linearity in parameters: The derivative of y with respect to the parameters should not be a function of the parameters. Analytical solutions for the parameter estimates (coefficients) require this assumption; without it, one needs numerical methods to solve for the coefficients.

A2. Random sampling (non-stochastic x): The sample should be randomly picked from the population so that it is representative of the population. This has two advantages:

∙ The results we get from the sample can be generalized to the whole population.
∙ Our knowledge of x about the population can be applied in the sample, so it is as if we know the sample x too.

A3. Variation in x: Econometrics analyzes how y changes with respect to x; for this, x needs to change.

A4. Exogeneity: E(e | x) = E(e) = 0. Knowledge of x does not improve the expectation of e, as they are independent of each other.

A5. The shocks to each observation come from the same distribution, independently and identically: Var(ei) = σ² for all i, and Cov(ei, ej) = 0 for all i ≠ j.

A6. ei ∼ Normal(0, σ²) ⇒ yi ∼ Normal(β0 + β1xi, σ²).

Now we can review how good our LS estimator is under these conditions. Consider the Simple Linear regression model yi = β0 + β1xi + ei together with the Gauss-Markov assumptions. Recall that β̂1 is:

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}$$

which can also be written as:

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)y_i}{\sum(x_i-\bar x)x_i} = \frac{\sum(x_i-\bar x)(\beta_0+\beta_1 x_i+e_i)}{\sum(x_i-\bar x)x_i} = \frac{\beta_0\sum(x_i-\bar x) + \beta_1\sum(x_i-\bar x)x_i + \sum(x_i-\bar x)e_i}{\sum(x_i-\bar x)x_i}$$

As Σ(xi − x̄) = 0 (shown before), the expression becomes:

$$\hat\beta_1 = \beta_1 + \frac{\sum(x_i-\bar x)e_i}{\sum(x_i-\bar x)x_i} = \beta_1 + \frac{\sum(x_i-\bar x)e_i}{\sum(x_i-\bar x)^2}$$

Then,

$$E(\hat\beta_1 \mid x) = E\left(\beta_1 + \frac{\sum(x_i-\bar x)e_i}{\sum(x_i-\bar x)^2}\,\Big|\,x\right) = \beta_1 + E\left(\frac{\sum(x_i-\bar x)e_i}{\sum(x_i-\bar x)^2}\,\Big|\,x\right) = \beta_1 + \frac{\sum(x_i-\bar x)E(e_i\mid x)}{\sum(x_i-\bar x)^2}$$

can be written since our x (independent variable, explanatory variable) is non-stochastic.

We also know by the Gauss-Markov assumptions that E(ei | x) = 0, i.e., our knowledge of x does not improve expectation of e. So,

E(β̂1 | x) = β1

equivalently saying that β̂1 is an unbiased estimator of β1.

What about E(̂β0 | x) ?

$$y_i = \beta_0 + \beta_1 x_i + e_i \ \Rightarrow\ \bar y = \beta_0 + \beta_1\bar x + \bar e$$
$$\hat\beta_0 = \bar y - \hat\beta_1\bar x = \beta_0 + \beta_1\bar x + \bar e - \hat\beta_1\bar x = \beta_0 - (\hat\beta_1-\beta_1)\bar x + \bar e$$

Then,

$$E(\hat\beta_0\mid x) = E\left(\beta_0 - (\hat\beta_1-\beta_1)\bar x + \bar e \mid x\right) = \beta_0 - \bar x\underbrace{E\left(\hat\beta_1-\beta_1\mid x\right)}_{0} + \underbrace{E\left(\frac{\sum e_i}{n}\,\Big|\,x\right)}_{0}$$

So,

E(β̂0 | x) = β0

equivalently saying that β̂0 is an unbiased estimator of β0.

$$Var(\hat\beta_1) = E\left(\left(\hat\beta_1 - \underbrace{E(\hat\beta_1)}_{\beta_1}\right)^2\right) = E\left(\left(\frac{\sum(x_i-\bar x)e_i}{\sum(x_i-\bar x)^2}\right)^2\right)$$

Expanding the expression and rearranging its terms:

$$Var(\hat\beta_1) = E\left(\frac{\sum(x_i-\bar x)^2e_i^2 + \sum\sum_{i\neq j}(x_i-\bar x)(x_j-\bar x)e_ie_j}{\left(\sum(x_i-\bar x)^2\right)^2}\,\Bigg|\,x\right) = \frac{\sum(x_i-\bar x)^2E(e_i^2\mid x) + \sum\sum_{i\neq j}(x_i-\bar x)(x_j-\bar x)E(e_ie_j\mid x)}{\left(\sum(x_i-\bar x)^2\right)^2}$$

As E(ei² | x) = σ² and E(eiej | x) = 0,

$$Var(\hat\beta_1) = \frac{\sigma^2\sum(x_i-\bar x)^2}{\left(\sum(x_i-\bar x)^2\right)^2} = \frac{\sigma^2}{\sum(x_i-\bar x)^2} = \frac{\sigma^2}{n\,Var(x)}$$

This expression is a Noise/Signal (i.e., a noise-to-signal) ratio expression.

Examining

$$Var(\hat\beta_1) = \frac{\sigma^2}{n\,Var(x)}$$

we see that, to decrease Var(β̂1), a larger sample size n, a larger Var(x) and a smaller σ² would help. Among these, the researcher's choice of the sample data affects n and Var(x); σ², on the other hand, is out of the researcher's reach.
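A small Monte Carlo sketch (our own illustration, with hypothetical true parameter values and the artificial x values treated as fixed) can be used to see that the simulated variance of β̂1 indeed settles near σ²/Σ(xi − x̄)²:

    import numpy as np

    rng = np.random.default_rng(0)
    beta0_true, beta1_true, sigma = 1.0, 0.8, 2.0      # hypothetical values
    Sxx = np.sum((x - x.mean()) ** 2)

    draws = []
    for _ in range(20000):
        y_sim = beta0_true + beta1_true * x + rng.normal(0.0, sigma, size=len(x))
        draws.append(np.sum((x - x.mean()) * (y_sim - y_sim.mean())) / Sxx)

    print(np.var(draws), sigma ** 2 / Sxx)             # the two numbers should be close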

$$Var(\hat\beta_0) = E\left(\left(\hat\beta_0 - E(\hat\beta_0)\right)^2\right)$$

As β̂0 = β0 − (β̂1 − β1)x̄ + ē:

$$Var(\hat\beta_0) = E\left(\left(\beta_0 - (\hat\beta_1-\beta_1)\bar x + \bar e - E(\hat\beta_0)\right)^2\right) = E\left(\left(\beta_0 - (\hat\beta_1-\beta_1)\bar x + \bar e - \beta_0\right)^2\right) = E\left(\left(-(\hat\beta_1-\beta_1)\bar x + \bar e\right)^2\right)$$
$$= \bar x^2E\left((\hat\beta_1-\beta_1)^2\right) + E(\bar e^2) - 2\bar xE\left((\hat\beta_1-\beta_1)\bar e\right)$$

To simplify this expression observe/elaborate:

(1)

$$E\left((\hat\beta_1-\beta_1)^2\right) = Var(\hat\beta_1)$$

(2)

$$E(\bar e^2) = E\left(\left(\frac{\sum e_i}{n}\right)^2\right) = \frac{1}{n^2}E\left(\left(\sum e_i\right)^2\right) = \frac{1}{n^2}\left(E\left(\sum e_i^2\right) + \underbrace{E\left(\sum\sum_{i\neq j}e_ie_j\right)}_{0}\right) = \frac{E(\sum e_i^2)}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

(3) E((β̂1 − β1)ē) = 0

Then,

$$Var(\hat\beta_0) = \frac{\bar x^2\sigma^2}{\sum(x_i-\bar x)^2} + \frac{\sigma^2}{n}$$

is reached. Rearranging:

$$Var(\hat\beta_0) = \frac{\sigma^2\left(n\bar x^2 + \sum(x_i-\bar x)^2\right)}{n\sum(x_i-\bar x)^2} = \frac{\sigma^2}{n\sum(x_i-\bar x)^2}\left(n\bar x^2 + \sum x_i^2 - 2\bar x\sum x_i + \sum\bar x^2\right) = \frac{\sigma^2}{n\sum(x_i-\bar x)^2}\left(n\bar x^2 + \sum x_i^2 - 2n\bar x^2 + n\bar x^2\right)$$
$$Var(\hat\beta_0) = \sigma^2\left(\frac{\sum x_i^2}{n\sum(x_i-\bar x)^2}\right)$$

is obtained.

$$Var(\hat\beta_0) = \sigma^2\left(\frac{\sum x_i^2}{n\sum(x_i-\bar x)^2}\right),\qquad Var(\hat\beta_1) = \frac{\sigma^2}{\sum(x_i-\bar x)^2},\qquad Cov(\hat\beta_0,\hat\beta_1) = \sigma^2\left(\frac{-\bar x}{\sum(x_i-\bar x)^2}\right)$$

(1) The larger the value of σ2 the larger will be the variances of the estimators.

(2) Var(β̂1) will be smaller, the larger the value of Σ(xi − x̄)². This is also true for Var(β̂0), but it is less evident, as Σxi² appears in the numerator of the Var(β̂0) expression.

(3) Because the number of terms in Σ(xi − x̄)² increases in n (the sample size), an increase in n generally leads to an increase in precision.

Checkpoint No: 95

8.11 Model Specification

There are two main approaches to model specification:

Regarding either of the approaches, we need a good methodological basis. The material of the section entitled 'Statistical inference', luckily, provides us with the toolset to establish that. The task of model specification involves a systematic sequence of hypothesis tests and an evaluation of models with respect to some ad hoc criteria. While the t tests and F tests equip us to assess our models, R², R̄², and the AIC, BIC (or SIC) and HQ information criteria further strengthen our hand to come up with parsimonious model specifications.

Akaike Information Criterion:

$$AIC = \ln(\hat\sigma^2) + \frac{2k}{n}$$

Bayesian Information Criterion or Schwarz Information Criterion or Schwarz Criterion or Schwarz-Bayesian Criterion:

$$BIC = SIC = SC = SBC = \ln(\hat\sigma^2) + \frac{k\ln n}{n}$$

Hannan-Quinn Criterion:

$$HQ = \ln(\hat\sigma^2) + \frac{k\ln(\ln n)}{n}$$

Among the rival models, the ones with lower information criterion values are preferable to others. Therein, it is a good practice to use the same sample size while comparing models via information criteria.
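Continuing the numpy sketch above (our own illustration; note that we compute the σ̂² inside these formulas as RSS/n, a common software convention, which is an assumption on our part), the three criteria can be evaluated as:

    import numpy as np

    k = K + 1                          # number of estimated coefficients
    sigma2_ic = RSS / n                # variance estimate used inside the criteria (assumed RSS/n)

    AIC = np.log(sigma2_ic) + 2 * k / n
    BIC = np.log(sigma2_ic) + k * np.log(n) / n
    HQ  = np.log(sigma2_ic) + k * np.log(np.log(n)) / n
    print(AIC, BIC, HQ)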

As of this point, we have a sufficient knowledge base to proceed to our first econometric practice. In what follows through the lecture notes, we mobilize our statistical knowledge in the field. Note that some new formulations and/or theoretical elements can be introduced if a need arises.

Checkpoint No: 96

8.12 Regression analysis at work

In this section we will put our theoretical knowledge into practice. The modeling exercises that we will consider maintain a manageable pedagogical standard; they are somewhat downsized and sometimes oversimplified. Yet, they are designed to deliver the intended message of the chapter with regard to applied statistical/econometric research.

The cases we will consider are as follows:

While these cases are being examined, we will concurrently be learning the use of ’Dummy variables’ in an embedded fashion: The theoretical knowledge needed will be provided when/as necessary.
Checkpoint No: 97

Cross-section versus Time series data

Our choice of theoretical exposition in ECON 222 maintained/kept cross-section data at a central position. In that, we often referred to our observations yi, xi1, xi2, ..., xiK using the observation index 'i'. When this is the case, note that there is no natural ordering of observations. For example, writing the USA's inflation rate in Row 2 of a data file while writing the UK's inflation rate in Row 7 for the same year, and then switching their rows, does not yield different results.

Time series data, on the other hand, do have a natural ordering of observations, merely by the definition of time: before comes before now, now comes before tomorrow, so tomorrow comes after both. This underlines the importance of time as the primary key of our dataset when analyzing time series data, especially when we do so via dynamic models. The indifference/silence of this book, ECON 221 and ECON 222 about time series notation and data was of course intentional from a pedagogical viewpoint. Once you proceed to ECON 301 and ECON 302 (the Econometrics sequence), be prepared to replace 'i' with 't' as your new (and naturally ordered, t = 1, 2, ⋅⋅⋅, T) observation index. Note that all our formulations are rock solid / robust to this change.

In the set of cases/exercises of this section, we make use of cross-section data sets.
NOTICE: Until a proper typeset is prepared, the cases/exercises of this section will be handled using Handouts. These Handouts will follow and summarize what is to be done in class lectures and they are available through the “Handouts” link under sites.google.com/view/erayyucel/teaching. To have the latest available material and stay informed, keep a keen eye on this page.
Checkpoint No: 98

8.13 Frisch-Waugh-Lovell theorem (FWL theorem)

The FWL theorem shows how to decompose a regression of y on a set of variables x into two pieces. If we divide x into two sets of variables x1 and x2 and regress y on x1 and x2, the coefficient estimates on x2 can also be obtained through the following steps:

1.
Regress all variables in x2 on x1 and take the residuals
2.
Regress y on x1 and take the residuals
3.
Regress the residuals from step (2) on the residuals from step (1).
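Before turning to the worked example, here is a minimal numpy sketch of the three steps (our own illustration with simulated data, not the Case02 file); the coefficient on x2 from the full regression matches the slope obtained from the residual-on-residual regression:

    import numpy as np

    def ols_resid(y, X):
        """Residuals from an OLS regression of y on X (X should include a constant)."""
        b = np.linalg.solve(X.T @ X, X.T @ y)
        return y - X @ b

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)
    y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

    const = np.ones(n)
    full_X = np.column_stack([const, x1, x2])
    b_full = np.linalg.solve(full_X.T @ full_X, full_X.T @ y)   # [b0, b1, b2]

    e_x2 = ols_resid(x2, np.column_stack([const, x1]))          # step (1)
    e_y  = ols_resid(y,  np.column_stack([const, x1]))          # step (2)
    b_fwl = np.sum(e_x2 * e_y) / np.sum(e_x2 ** 2)              # step (3), slope only
    print(b_full[2], b_fwl)                                     # these two should match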

To demonstrate what the FWL theorem says, consider our Case02 (Home prices) again:

Dependent Variable: LP

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             0.652384      0.350431     1.861662      0.0655
LS            0.521313      0.085217     6.117444      0.0000
LT            0.368324      0.065693     5.606762      0.0000

In case02.wf1, page case02s2, we have the regression equation that

LP = β0+ β1LS + β2LT + e

estimated as above. So,

β̂0 = 0.6523
β̂1 = 0.5213
β̂2 = 0.3683

Focus on β̂2 = 0.3683, i.e., the coefficient estimate of taxes (LT).

As to our application of the FWL theorem,

x = {LS,LT}
x1 = {LS }

x2 = {LT }

and y = LP.

(1) Here, we regress x2 on x1 that is LT on LS and extract the residuals and name it E_LT_ON_LS. You can view this series in case02.wf1.


(2) Here, we regress y on x1 that is LP on LS and extract the residuals and name it E_LP_ON_LS. You can view this series in case02.wf1.

Dependent Variable: LT

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             1.473070      0.500340     2.944136      0.0040
LS            1.095477      0.067801     16.15724      0.0000

Finally, we regress E_LP_ON_LS on E_LT_ON_LS and obtain the coefficient estimate for E_LT_ON_LS as 0.3683.

Dependent Variable: E_LP_ON_LS

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             0.007499      0.013445     0.557758      0.5782
E_LT_ON_LS    0.368324      0.065399     5.631914      0.0000

Notice that the coefficient estimate of LT in the very first regression is identical to the coefficient estimate of E_LT_ON_LS on this page.

This is how the FWL theorem functions.

Checkpoint No: 99

Checkpoint No: 100
The End