As discussed in the opening lectures, an analytical and evidence-based approach to policymaking is a must for modern, complex societies. As the famous engineer W. Edwards Deming put it, “Without data, you’re just another person with an opinion.” Data and its analysis are what distinguish well-designed policies from arbitrary or flawed ones. Understanding the features and structure of data is therefore an indispensable step of analysis.
Practically every sort of empirical analysis begins with a need to describe a data set and its various elements, so we begin our journey into probability theory and statistics with this simple yet crucial task. Recall from the class discussion that data may come in two main forms: qualitative and quantitative. While qualitative data qualifies ‘things’, quantitative data quantifies ‘things’, as the terms suggest. Qualitative data often have a categorical nature. If the values of a categorical variable are orderable (sortable), the variable is called an ‘ordinal’ categorical variable; otherwise, it is a ‘nominal’ categorical variable. While the responses in a satisfaction survey are ordinal (consider 1: least liked to 5: most liked), indicators of gender are nominal (F: female and M: male). Note that the judgment is not always trivial: while we can treat the age categories ‘young, middle-aged and old’ for people, or the class/year categories ‘freshman, sophomore, junior and senior’ for students, as ordinal, another researcher may choose to treat them as nominal. What matters is whether we can sort the categories with a clear understanding stripped of value judgments. For instance, one cannot simply put one of the genders on top of the others, regardless of the underlying way of thinking.
Quantitative data is by definition numerical. It can be either discrete, as in the number of automobiles owned by households (one cannot own a fractional automobile), or continuous, as in the daily spending of households. Household size, i.e., the number of people forming the household, is discrete; the number of cities in a country is discrete; and so on. The case of people’s ages measured in years can be a little confusing: think about it.
Classifying data
An important point regarding continuous data is the distinction between ‘interval data’ and ‘ratio data’. A simple rule of thumb: if the possible values of a data series have an ‘absolute zero’, it is ‘ratio data’; in the absence of an absolute zero, it is ‘interval data’. A classic example is temperature measured on the Kelvin (K) versus the Celsius (°C) scale. While the Kelvin scale has an absolute zero, i.e., 0 K, the Celsius scale does not. The freezing point of pure water (under certain conditions) is 0 °C, yet this is not the lowest attainable temperature: there are some 273.15 °C more to go down until that point, and −273.15 °C is defined as 0 K, the lowest possible temperature in the Universe. While 200 K is two times 100 K, 200 °C is not two times 100 °C.
An easier example of ratio data is the measurement of ‘mass’ (in kilograms, say). Mass has an absolute zero, namely 0 kg, and a 20 kg object has twice the mass of a 10 kg object.
It is possible and often necessary to convert a data series from numerical to categorical. For instance, age data measured in ‘years lived’ can be expressed in terms of the qualifiers ‘young, middle-aged and old’. Note that this transformation results in some loss of information: clearly, a numerical age series tells more about the people surveyed than a simple categorization does. Still, when properly made, a good categorization of numerical values may prove very useful in statistical (or econometric) analysis.
In the Oxford English Dictionary, ‘frequency’ is defined as “the rate at which something occurs over a particular period of time or in a given sample”. Our understanding covers cases of ‘being’ in addition to ‘occurring’ or happening: frequency is the numerical measure of how often something happens or how often some specific way of being is observed. Just as we can count car accidents in a certain hour, we can also count the people who survived a certain accident. So, we can count ‘things’ in time (we can call this temporal counting) and in space (we can call this spatial counting).
A frequency distribution is a tabular summary of how the numerical values in a data series are distributed across classes.
First, determine the number of classes k, according to:
Number of observations | k
<50 | 5−7
50−100 | 7−8
101−500 | 8−10
501−1,000 | 10−11
1,001−5,000 | 11−14
>5,000 | 14−20
The table gives a rule of thumb; applying it often requires your professional judgment.
Second, determine the class width, w:
$$w = \frac{\text{Maximum} - \text{Minimum}}{k}$$
where Maximum is the largest observation and Minimum is the smallest observation. Always round the result up to find w.
Third, construct the k classes; they are to be inclusive and non-overlapping.
Fourth, allocate your observations to classes and get the count of each class.
At the end, present your result as a table. What is obtained is a "frequency distribution table".
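These four steps are easy to mechanize. Below is a minimal Python sketch; the helper name `frequency_table` is our own invention, and the sample ages are illustrative values chosen to be consistent with the class counts used in the example that follows.

```python
import math

def frequency_table(data, k):
    """Build k inclusive, non-overlapping classes of equal width
    and count the observations falling into each one."""
    lo, hi = min(data), max(data)
    w = math.ceil((hi - lo) / k)               # class width, rounded up
    edges = [lo + i * w for i in range(k + 1)]
    counts = [0] * k
    for x in data:
        # classes are [edge, edge + w), except the last, which is closed
        j = min((x - lo) // w, k - 1)
        counts[int(j)] += 1
    return edges, counts

ages = [11, 12, 13, 13, 14, 15, 16, 16, 17, 18,
        19, 20, 20, 21, 21, 22, 22, 22, 23, 24]   # illustrative data
print(frequency_table(ages, 5))
# ([11, 14, 17, 20, 23, 26], [4, 4, 3, 7, 2])
```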
Consider the age data of 20 people (20 subjects) measured in years:
While summarizing the age data, it seems appropriate to use 5 classes following the rule of thumb given before. The max of our data series is 24 and the min is 11. The class width w is then calculated as w = (24 − 11)/5 = 2.6, which is rounded up to 3.
Having calculated the class width, beginning from the min value (11 here) we establish our classes as [11, 14), [14, 17), [17, 20), [20, 23) and [23, 26]. Pay attention to openness and closedness of classes (intervals) on the left and on the right.
Once the classes are ready, we carefully count the data values falling into each interval and prepare the following table, which we call the ‘frequency table’.
The final step is to prepare (draw) the histogram of our data.
[Histogram of the age data; the x-axis values from the beginning of the first bar read 11, 14, 17, 20, 23, 26.]
Consider another researcher who arbitrarily prefers to use 2 classes. In this case, the class width will be w = (24 − 11)/2 = 6.5, rounded up to 7.
The classes will be [11, 18) and [18, 25), so our resulting frequency table will look like:
The final step is to prepare (draw) the histogram of our data again.
[Histogram of the 2-class summary; the x-axis values from the beginning of the first bar read 11, 18, 25.]
Which histogram (or frequency table) gives a better summary of the data? Avoid any confusion: the first histogram is the winner of the contest. It summarizes our data and conveys a tangible message. The second histogram, on the other hand, suffers from ‘oversummarizing’. Now take our discussion to its limit and consider a third researcher who prefers to use 1 class only. Why would that be nonsense?
What about categorical (qualitative) data? Consider the following data series, which consists of category markings for 20 people (20 subjects), where Y, M and O stand for ‘young’, ‘middle-aged’ and ‘old’, respectively.
This time, forming a frequency table must be easier: we do not (indeed, we cannot) establish classes; we simply count the frequency of each category:
The final step is to prepare (draw) the bar chart of our data. It is more than trivial; do it yourself.
Once the counts in a frequency distribution table are divided by the total number of observations and expressed as "percentages" or as ‘fractions between 0 and 1’, the resulting table is called a "relative frequency distribution table". By construction, the relative frequencies of all classes add up to 100% or 1.
Once the frequencies (counts) in a frequency distribution table are accumulated across classes, one row at a time and from the smallest to largest class, the resulting table is called a ‘cumulative frequency distribution table’.
Once the relative frequencies in a relative frequency distribution table are accumulated across classes, one row at a time and from the smallest to largest class, the resulting table is called a ‘cumulative relative frequency distribution table’.
In order to see the linkages between ‘frequency’, ‘relative frequency’, ‘cumulative frequency’ and ‘cumulative relative frequency’, examine the following table:
Class | Frequency | Relative frequency | Cumulative frequency | Cumulative relative frequency
[10, 17) | 500 | 0.333 | 500 | 0.333
[17, 24) | 250 | 0.167 | 750 | 0.500
[24, 31) | 150 | 0.100 | 900 | 0.600
[31, 38] | 600 | 0.400 | 1500 | 1.000
Total | 1500 | 1.000 | N.A. | N.A.
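The four columns are mechanical to produce from the raw counts; here is a minimal Python sketch (the variable names are our own):

```python
freqs = [500, 250, 150, 600]
total = sum(freqs)

rel = [f / total for f in freqs]        # relative frequencies

cum = []                                # cumulative frequencies
running = 0
for f in freqs:
    running += f
    cum.append(running)

cum_rel = [c / total for c in cum]      # cumulative relative frequencies

for row in zip(freqs, rel, cum, cum_rel):
    print("%5d  %.3f  %5d  %.3f" % row)
```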
A histogram is a graph that consists of vertical bars constructed on a horizontal line on which intervals are marked for the variable being displayed.
The height of each bar is the frequency or relative frequency associated with that interval.
Histograms are traditionally used for continuous numerical data. When the midpoints of the top segment of each bar in a histogram are connected with line segments, what we obtain is called a frequency polygon. Note that a ‘bar chart’ resembles a histogram yet it differs in two main aspects: first, it is for categorical data & second, the bars in a bar chart are separated by a visible gap. Examples are provided in the upcoming exercises.
The shape of a distribution is said to be symmetric if the observations are balanced, or approximately evenly distributed, about its center. A distribution is skewed, or asymmetric, if the observations are not symmetrically distributed on either side of the center. A skewed-right distribution (sometimes called positively skewed) has a tail that extends farther to the right. A skewed-left distribution (sometimes called negatively skewed) has a tail that extends farther to the left.
An O-give, also called a cumulative line graph, is a line that connects points whose heights are the cumulative percentages of observations below the upper limit of each class (interval) in a cumulative frequency distribution. Even when not explicitly stated, an O-give presents cumulative figures. The beginning vertical value of an O-give is always 0 and the ending vertical value is 1 (or 100%). Examples are provided in the upcoming exercises.
1.1 EXERCISES____________________________________________
Fill the empty cells in the following table:
Interval | Freq. | Rel. freq. (%) | Cum. freq. | Rel. cum. freq. (%)
[0, 20] | 20 | 10 | |
(20, 40] | | | 80 |
(40, 60] | 30 | | |
(60, 80] | | | |
(80, 100] | | 20 | | 80
(100, 120] | | | |
Solution: The complete table is as follows:
Interval | Freq. | Rel. freq.(%) | Cum. freq. | Rel. cum. freq.(%) |
[0, 20] | 20 | 10 | 20 | 10 |
(20, 40] | 60 | 30 | 80 | 40 |
(40, 60] | 30 | 15 | 110 | 55 |
(60, 80] | 10 | 5 | 120 | 60 |
(80, 100] | 40 | 20 | 160 | 80 |
(100, 120] | 40 | 20 | 200 | 100 |
Consider the frequency histogram displayed below:
[Frequency histogram; the x-axis values from the beginning of the first bar read 1.5, 3, 4.5, 6, 7.5, 9.]
Draw the corresponding (relative frequency) o-give.
Solution: Prepare your Cartesian plane. The origin is (0, 0). Mark the following points on your graph space, paying attention to proportions: (0, 0), (1.5, 0.22), (3.0, 0.57), (4.5, 0.68), (6.0, 0.73), (7.5, 0.95), (9.0, 1.00). Then, connect these points with line segments from left to right. Once done, you will have a properly drawn O-give. Make sure you have named the axes.
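If you prefer to draw the O-give programmatically, here is a minimal sketch using matplotlib, with the points taken from the solution above:

```python
import matplotlib.pyplot as plt

xs = [0, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0]
ys = [0, 0.22, 0.57, 0.68, 0.73, 0.95, 1.00]

plt.plot(xs, ys, marker="o")   # connect the points with line segments
plt.xlabel("X")
plt.ylabel("Cumulative relative frequency")
plt.title("O-give")
plt.show()
```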
Consider the relative frequency o-give displayed below:
ii. What can you say about the percentage of observations that takes a value less than or equal to 6.5 (if you need to estimate it what would be a reasonable estimate)?
iii. What can you say about the percentage of observations that takes a value greater than or equal to 4.3 (if you need to estimate it what would be a reasonable estimate)?
Solution:
Consider the relative frequency o-give displayed below:
What can you say about the percentage of observations that takes a value between 4.5 and 9.5?
Solution: Use the same approach. 0.4/4 + 0.1 + 0 + 0.3/4 yields 0.275.
A doctor’s office staff studied the waiting times for patients who arrive at the office with a request for emergency service. The following data with waiting times in minutes were collected over a one-month period:
2, 5, 10, 12, 4, 4, 5, 17, 11, 8, 9, 8, 12, 21, 6, 8, 7, 13, 18, 117
ii. Construct a histogram for this data set after excluding the value of 117. Note that you still have to show it on your histogram (but how?)
iii. Which histogram is more informative? Why?
Solution:
Consider the relative frequency o-give displayed below:
Based on the information above, estimate the median and the 3rd quartile (Q3).
Solution: Hint: For the median, find the horizontal value at which the O-give has the value of 50. For Q3, find the horizontal value at which the O-give has the value of 75.
In a data set, the relative frequency of the interval (0, 10] is 0.10, that of (10, 20] is 0.20, that of (20, 30] is 0.30 and that of (30, 40] is 0.40. Construct the relative frequency O-give and calculate Q3 for this data set.
Solution: Solving this must be straightforward now. Do it and discuss with classmates.
Measures of central tendency or measures of concentration indicate ‘where’ on the real number line our data series is located. The three terms connote:
As you’ll see in the upcoming classes, the knowledge of this is critically important to make several statistical assessments.
The "mode", whenever exists, is the most frequently occurring value in a data series.
Note that, the mode is commonly used with (but not restricted to) categorical data.
Consider,
where N = 10. Among the values of X, the most frequent (most repeated) value is 3, so we say Mode = 3. If X included another 4, like:
where N = 11, then we would say the Modes are 3 and 4.
For a population data series x1, x2, ..., xN:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

is called the "population mean".
In addition, for a sample x1, x2, ..., xn:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

is called the ‘sample mean’.
Considering,
where N = 11, the mean is calculated as:

$$\mu = \frac{1}{11}\sum_{i=1}^{11} x_i = \frac{42}{11} \approx 3.81$$
In another case for X, suppose the last value, i.e., 7, is replaced by 42; let’s call this data series as X′:
where N = 11 again, the mean becomes:

$$\mu = \frac{1}{11}\sum_{i=1}^{11} x'_i = \frac{77}{11} = 7$$
As this example suggests, the mean (μ) is sensitive to outliers/extreme values. However, this sensitivity does not imply that μ is a meaningless or useless measure. On the contrary, it is a fundamental measure with many good statistical properties, as we will see in the upcoming chapters.
Consider,
X | Frequency | Relative frequency |
[11, 14) | 4 | 4/20 |
[14, 17) | 4 | 4/20 |
[17, 20) | 3 | 3/20 |
[20, 23) | 7 | 7/20 |
[23, 26] | 2 | 2/20 |
Using the midpoints and frequencies of classes:
$$\mu = \frac{12.5 \cdot 4 + 15.5 \cdot 4 + 18.5 \cdot 3 + 21.5 \cdot 7 + 24.5 \cdot 2}{20} = \frac{367}{20} = 18.35$$

is obtained.
Equivalently, one may use the relative frequencies of classes to calculate the same:
$$\mu = 12.5 \cdot \frac{4}{20} + 15.5 \cdot \frac{4}{20} + 18.5 \cdot \frac{3}{20} + 21.5 \cdot \frac{7}{20} + 24.5 \cdot \frac{2}{20} = 18.35$$
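Both versions of the grouped-data calculation are easy to check in Python; a small sketch using the midpoints and frequencies from the table above:

```python
midpoints = [12.5, 15.5, 18.5, 21.5, 24.5]
freqs = [4, 4, 3, 7, 2]
n = sum(freqs)

# Using frequencies:
mu = sum(m * f for m, f in zip(midpoints, freqs)) / n
print(mu)       # 18.35

# Equivalently, using relative frequencies:
mu_alt = sum(m * (f / n) for m, f in zip(midpoints, freqs))
print(mu_alt)   # 18.35
```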
The "p-th" percentile in a data series is the smallest value which is greater than p% of observations. If there are N observations, we find
|
ordered position and read the observation at this position as the p-th percentile.
The "d-th" decile in a data series is the smallest value which is greater than d tenths of observations. The "q-th" quartile in a data series is the smallest value which is greater than q quarters of observations.
While we imagine slicing our ordered data set into 100 parts when finding percentiles, we slice it into 10 when finding deciles, and into 4 when finding quartiles.
By definition P0, Q0, D0 are equal to the minimum observation and P100, Q4, D10 are equal to the maximum observation.
0th percentile | → | 0th decile | → | 0th quartile (Q0) | → | Minimum |
10th percentile | → | 1st decile | ||||
20th percentile | → | 2nd decile | ||||
25th percentile | → | → | → | 1st quartile (Q1) | ||
30th percentile | → | 3rd decile | ||||
40th percentile | → | 4th decile | ||||
50th percentile | → | 5th decile | → | 2nd quartile (Q2) | → | Median |
60th percentile | → | 6th decile | ||||
70th percentile | → | 7th decile | ||||
75th percentile | → | → | → | 3rd quartile (Q3) | ||
80th percentile | → | 8th decile | ||||
90th percentile | → | 9th decile | ||||
100th percentile | → | 10th decile | → | 4th quartile (Q4) | → | Maximum |
Among the many percentiles, the “Median” has a special place. The "Median" of a data series is the smallest value which is greater than 50% of observations. It simply divides a data series into two equally sized halves. Numerically, the median is nothing but Q2, P50 and D5, which are all the same.
Consider now a variable X,
1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th | 10th | 11th |
1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
To find, for instance, the 30th percentile of X we calculate:

$$\frac{30}{100}\,(11+1) = 3.6\text{th}$$
Then, we find the observation value at the 3.6th position of the ordered data series. As seen here, there may not be such a physical position in the data. As an approximation, we take the value of X in the 3rd position and add 0.6 times the difference between the value in the 4th position and the value in the 3rd position, i.e.,

$$P_{30} = 2 + (3 - 2) \cdot 0.6 = 2.6$$
is our 30th percentile.
To find the 80th percentile of X we calculate:
$$\frac{80}{100}\,(11+1) = 9.6\text{th}$$

Then, interpolating as before:

$$P_{80} = 34 + (55 - 34) \cdot 0.6 = 46.6$$
So, 46.6 is our 80th percentile.
To find the Median, i.e., the 50th percentile of X, we calculate:

$$\frac{50}{100}\,(11+1) = 6\text{th}$$
Without further calculations, 8 is our Median.
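The whole positional recipe, including the interpolation step, can be sketched in Python; the helper name `percentile` is our own:

```python
def percentile(data, p):
    """p-th percentile via the (p/100)(N + 1)-th ordered position,
    interpolating linearly when the position is fractional."""
    xs = sorted(data)
    pos = (len(xs) + 1) * p / 100     # 1-based ordered position
    lo = int(pos)                     # integer part of the position
    frac = pos - lo                   # fractional part
    lo = min(max(lo, 1), len(xs))     # clamp to existing positions
    if frac == 0 or lo == len(xs):
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

x = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
print(percentile(x, 30))   # 2.6
print(percentile(x, 80))   # 46.6
print(percentile(x, 50))   # 8 (the Median)
```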
Consider another data series:
Solve yourself to see that the Median lies at the 20.5th ordered position of this data series and is equal to 20.
Before moving forward, consider finally:
Did you notice anything?
When we give the five descriptive measures

$$(\text{Minimum},\ Q_1,\ \text{Median},\ Q_3,\ \text{Maximum})$$

it is called a "five-number summary". This is a somewhat old yet still useful tool to summarize data sets.
For a variable X given as:
the five-number summary is (2, 10.25, 20, 24, 49).
To make use of our knowledge gained up to this point, consider a data set that is summarized with the following relative frequency o-give:
Based on the information above, estimate the median, the mean, and the 3rd quartile.
To reach a good solution, note that the median is the 50th percentile and the 3rd quartile is the 75th percentile. Under the assumption that the data is uniformly distributed over each class interval, a relative cumulative frequency o-give tells us the percentage of observations that take a value less than or equal to any given number, so we can use it to estimate the median and the 3rd quartile. On the graph we mark the points that correspond to 50% and 75%.
From similarity of triangles we have:
which yields
Similarly
yields
We will use CM_l to denote the class mark of the l-th class interval (the center of the l-th interval), and RF_l to denote the relative frequency of the observations that take values in the l-th class interval.
The assumption that the data is uniformly distributed over each class interval implies that the following formula yields a “reasonable” estimate for the mean:

$$\mu \approx \sum_{l} CM_l \cdot RF_l$$
Thus our estimate for the mean is

$$\mu \approx (0.2 - 0)\,CM_1 + \cdots + (0.9 - 0.7)\,CM_{L-1} + (1.0 - 0.9)\,CM_L = 54$$

where each relative frequency is the difference between consecutive o-give heights, and the omitted middle terms follow the same pattern.
Measures of dispersion or measures of variation indicate how ‘spread out’ on the real number line our data series is. The four terms connote:
Without properly assessing dispersion, the knowledge of location means only a little.
$$\text{Range} = \text{Maximum} - \text{Minimum}$$

or

$$\text{Range} = x_{\max} - x_{\min}$$
Range measures the length of the interval on the real number line spanned by our data set.
The interquartile range (IQR) is defined as:
$$IQR = Q_3 - Q_1$$
IQR measures the length of the interval on the real number line spanned by the "central 50%" of our data set.
The five-number summary presented as a graph is called a Box-Whisker plot. Sometimes, near outliers and far outliers can also be added while constructing these plots.
For a population x1, x2, ..., xN:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

is called the "population variance", and for a sample x1, x2, ..., xn:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

is called the ‘sample variance’.
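A minimal Python sketch of both definitions, using for concreteness the data series of the example that follows:

```python
def pop_variance(xs):
    """Population variance: average squared deviation, divided by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Sample variance: divide by n - 1 instead of n."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

data = [1, 3, 6, 10, 15, 21, 28]
print(pop_variance(data))      # 84.0
print(sample_variance(data))   # 98.0
```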
Consider the data series 1, 3, 6, 10, 15, 21, 28, so N = 7.
For this series, μ = 12 (calculate yourself) and the variance is calculated as follows:
$$\sigma^2 = \frac{1}{7}\sum_{i=1}^{7} (x_i - 12)^2 = \frac{588}{7} = 84$$
A tabular approach may also be preferred:
i | xi | xi − μ | (xi − μ)²
1 | 1 | −11 | 121
2 | 3 | −9 | 81
3 | 6 | −6 | 36
4 | 10 | −2 | 4
5 | 15 | 3 | 9
6 | 21 | 9 | 81
7 | 28 | 16 | 256
Total | | | 588

σ² = 588/7 = 84
Finally, one may calculate the sum of squares as 1596 and the mean as 12, and calculate the variance as σ² = 1596/7 − 12² = 84.
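This shortcut is easy to verify in Python:

```python
data = [1, 3, 6, 10, 15, 21, 28]
N = len(data)

sum_of_squares = sum(x * x for x in data)   # 1596
mu = sum(data) / N                          # 12.0
print(sum_of_squares / N - mu ** 2)         # 84.0, matching the direct method
```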
For a population x1, x2, ..., xN:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$

is called the "population standard deviation", and for a sample x1, x2, ..., xn:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

is called the ‘sample standard deviation’.
Consider the following data series:
We are now asked to describe this data series, including its mean, five-number summary, range, interquartile range and variance. For ease in calculating the positional measures (quartiles here), it is a good practice to order the observations from the smallest to the largest, i.e., in ascending order:
The following are then found:
For the same data series (call it X), the Box-Whisker plot looks like:
Population coefficient of variation:

$$CV = \frac{\sigma}{\mu}$$

Sample coefficient of variation:

$$cv = \frac{s}{\bar{x}}$$
When we deal with one variable in our analysis, it is a case of "univariate" data. When we are concerned with patterns of change of two variables together, it is a case involving "bivariate" data. In these lecture notes, x1, x2, ..., xn and y1, y2, ..., yn indicate univariate data, but the pairs (x1, y1), (x2, y2), ..., (xn, yn) indicate bivariate data.
Notice that, bivariate data come in "pairs", so one cannot change the correspondence between x’s and y’s.
Covariance is a measure of the linear relationship between two variables.
For a population of pairs (x1, y1), ..., (xN, yN), the population covariance is

$$\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$

For a sample of pairs (x1, y1), ..., (xn, yn), the sample covariance is

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

The population correlation is given by

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

and the sample correlation by

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
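A small Python sketch of the population versions; the weight/height pairs are hypothetical illustration data:

```python
import math

def pop_cov(xs, ys):
    """Population covariance of paired data."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pop_corr(xs, ys):
    """Population correlation: covariance scaled by both std deviations."""
    sx = math.sqrt(pop_cov(xs, xs))   # cov(x, x) is the variance of x
    sy = math.sqrt(pop_cov(ys, ys))
    return pop_cov(xs, ys) / (sx * sy)

weights = [55, 60, 72, 80, 93]        # kg (hypothetical)
heights = [160, 165, 172, 178, 190]   # cm (hypothetical)
print(pop_cov(weights, heights))      # unit: kg.cm
print(pop_corr(weights, heights))     # about 0.996: strong positive relation
```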
When |rxy| is large enough, we say the (linear) relationship is strong (or significant). Notice that it is always the case that

−1 ≤ ρxy ≤ 1 and −1 ≤ rxy ≤ 1.
Although not paid enough attention by economists and business administration people, almost every data series comes with a unit and a scale. For instance, if my income is TRY 96,000, the unit is TRY (the international code for the Turkish lira) and the scale is not explicitly stated. If we write it as TRY 96K, the unit is again TRY and the scale is "thousands", so 96 means 96,000 here.
Consider pairs (x1, y1), ..., (xN, yN) where x is the body weight in kilograms (kg) and y is the height in centimeters (cm). Then,

Measure | Unit
μx | kg
μy | cm
σx² | kg²
σx | kg
σy² | cm²
σy | cm
cvx = σx/μx | Unitless
cvy = σy/μy | Unitless
σxy | kg·cm
ρxy | Unitless
Similarly, quartiles, deciles and percentiles of a variable have the same unit as the variable. Range of a variable has the same unit as the variable. Interquartile range of a variable has the same unit as the variable. As a rule of thumb, linear operators do not alter the units.
Use of numerical scales is often a matter of practicality or convenience. Nobody likes to write 123,000,000,000,000 (except politicians) instead of 123 trillion or 123·10¹². One may need to learn two important practices of scaling numbers:
In this edition, these are left to readers as individual study.
1.2 EXERCISES____________________________________________
Consider a population with data values of:
5, 6, 3, 3, 6, 9, 10, 4, 10, 4
Compute the mean, range, standard deviation, median, and Q1.
Solution: μ = 6, Range = max − min = 10 − 3 = 7, σ ≈ 2.61, Q2 = 5.5 and Q1 = 3.75.
Find the mean, median, mode(s), variance, range, 1st quartile, and
the 80th percentile of the data given below:
9, 13, 6, 7, 8, 6, 6, 9, 13, 13
Solution: μ = 9, Q2 = 8.5, modes are 6 and 13, σ² = 8, Range = 7, Q1 = 6 and P80 = 13.
A population has a range of R and it consists of two observations only. Calculate the variance of this data set.
Solution: x1 and x2 are the only two observations, and x2 − x1 = R is given (suppose x2 > x1). Then x2 − μ = R/2 and x1 − μ = −R/2, so

$$\sigma^2 = \frac{(R/2)^2 + (-R/2)^2}{2} = \frac{R^2}{4}$$
A researcher argues that median equals the simple average of the first and third quartiles. By giving a numerical example, show that this is incorrect.
Let a and b be any given real numbers. Let x1, x2, ..., xN and y1, y2, ..., yN be two data sets such that yi = a·xi + b for every i = 1, 2, ..., N.
ii. What is the relation between the variance of the y-values and the variance of x-values?
Solution: Needs some careful and patient elaboration; working through the definitions, you should find that σy² = a²·σx².
Consider a bivariate data consisting of the 1st midterm and 2nd midterm grades of 216 students. It is known that the 1st midterm grade of each student is 8% less than his 2nd midterm grade. If the mean of the 2nd midterm grades is 64 and the variance is 9 what can you say about the correlation coefficient of this data?
If we remove a data point from a data series, variance decreases. True or false? Explain.
Solution: For a logical statement to be true, it must be true without any exceptions. Consider first {1, 5, 9} and second {1, 9}. Which set of values has a larger variance? What is your conclusion?
When we multiply each point in a data series by the same factor, the variance increases. True or false? Explain.
Solution: Consider yi = k·xi, i = 1, 2, …, N. You have seen before that σy² = k²·σx². Then, σy² is greater than σx² only when k² > 1, i.e., |k| > 1. So, the given statement is false (as we are able to find a counterexample).
Below is the distribution of a variable X based on a sample of 40 observations. Compute the coefficient of variation.
X | Frequency |
10 − 14 | 8 |
15 − 19 | 16 |
20 − 24 | u |
25 − 29 | 4 |
30 − 34 | 2 |
Solution: Since the frequencies must total 40, u = 10. Then μ = 19, σ² = 28.5 and σ ≈ 5.34. So, cv = σ/μ = 5.34/19 ≈ 0.28.
Consider the two populations of bivariate data:
iii. Standardize the x and y values in each population and plot the scatter plots for the standardized values.
Solution: First find the means μ1,x, μ1,y, μ2,x and μ2,y of the two populations; from these, compute the covariances Cov1 and Cov2. To find the correlation coefficients, first find the standard deviations σ1,x, σ1,y, σ2,x and σ2,y; then ρ1 = Cov1/(σ1,x·σ1,y) and ρ2 = Cov2/(σ2,x·σ2,y).
The standardized values are plotted below:
Note that even though the original populations were on lines with different slopes, the standardized values are on a line with slope 1.
(Chebyshev’s theorem.) For any data set with a mean of μ and variance of σ², and any k > 1, at least

$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$

of the observations will take a value in the interval [μ − kσ, μ + kσ].
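The theorem is easy to check empirically. Here is a Python sketch with simulated, deliberately skewed data (the seed, sample size and distribution are arbitrary choices):

```python
import random

random.seed(1)
data = [random.expovariate(1.0) for _ in range(10_000)]   # skewed data

N = len(data)
mu = sum(data) / N
sigma = (sum((x - mu) ** 2 for x in data) / N) ** 0.5

for k in (1.5, 2, 3):
    inside = sum(mu - k * sigma <= x <= mu + k * sigma for x in data) / N
    bound = 1 - 1 / k ** 2
    print(f"k={k}: observed {inside:.3f} >= guaranteed {bound:.3f}")
```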
1.3 EXERCISES ___________________________________________________________
Consider a population with a mean of 4 and variance of 36. Using Chebyshev’s theorem find an interval that contains at least 70% of the observations.
Solution: μ = 4 and σ² = 36, so σ = 6. Setting 1 − 1/k² = 0.70 gives k = √(10/3) ≈ 1.826, and kσ ≈ 10.95.
So, the requested interval is [4 − 10.95, 4 + 10.95] = [−6.95, 14.95].
The monthly charges for credit card holders at a department store have a mean of $250 and a standard deviation of $100. Use Chebyshev’s theorem to answer the following questions:
i. What can you say, for sure, about the percentage of card holders who will have monthly charges between $100 and $400?
ii. Provide a range for the credit card charges that will include at least 80% of all credit card customers.
Solution: i. The interval [$100, $400] is μ ± 1.5σ, so with k = 1.5 Chebyshev’s theorem guarantees at least 1 − 1/1.5² = 5/9 ≈ 55.6% of all observations. ii. Setting 1 − 1/k² = 0.80 gives k = √5 ≈ 2.24, so the range $250 ± 2.24 · $100, i.e., approximately [$26, $474], includes at least 80% of all credit card customers.
In a stock exchange average return over a year turns out to be 1% with a standard deviation of 2%. Over the same year, average exam grade in a university is 60 points with a standard deviation of 24 points. Which one has a higher variability, the returns or the grades?
Solution: CV for returns is 2%/1% = 2 and CV for grades is 24 points/60 points = 0.4. So, returns have the higher variability.
When we replace a positive data value with its "additive inverse" in a data set, variance increases. Is this claim true or false? Either prove that it is true, or provide a counter example to show that it is false. Make sure you have used a formal mathematical notation.
Solution: Consider {5, −7} and {−5, −7}. Which pair of values has higher variance? Then, come up with a conclusion.
If you have N numbers x1, x2, ..., xN, the sum of these numbers, S, is:

$$S = x_1 + x_2 + \cdots + x_N$$

In expressing sums like this, we always write the first two terms, then three periods, then the last term. A shorter way to write S is:

$$S = \sum_{i=1}^{N} x_i$$

where i is the index of x, running from 1 to N.

Unless otherwise specified, i increases by 1 every time, from 1 to N. So, $S = \sum_{i=1}^{N} x_i$ is read as "the sum of xi, for i from 1 to N".
For example,

$$\sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4$$
If N is a number well known in a problem:

$$\sum_i x_i$$

is a valid expression and it means "consider all xi".
Notice that

$$\sum_{i=1}^{N} x_i = \sum_{j=1}^{N} x_j = \sum_{k=1}^{N} x_k$$

and so forth: the summation index is just a dummy label.
In case we want to write a partial sum such as

$$x_3 + x_4 + \cdots + x_{17}$$

using our summation operator, we can write it as:

$$\sum_{i=3}^{17} x_i$$
As you’ve seen, wisely using i solves many problems.
Consider:

$$S = c\,x_1 + c\,x_2 + \cdots + c\,x_N$$

which is equivalent to

$$S = \sum_{i=1}^{N} c\,x_i$$

S, then, can be written as

$$S = c \sum_{i=1}^{N} x_i$$
So, if each number in the sequence x1, x2, ..., xN is multiplied by the same value which is not a function of i, this value can be taken out of the summation sign Σ.
Consider:

$$\sum_{i=1}^{N} (x_i + y_i)$$

Notice that this is the same thing as:

$$\sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i$$
Consider:

$$\sum_{i=1}^{N} x_i\,y_i$$

Notice that,

$$\sum_{i=1}^{N} x_i\,y_i \neq \left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)$$

Sum of the products is not equal to product of the sums. Expand (write long) the expressions to see why not.
Consider a constant a added to itself N times:

$$a + a + \cdots + a$$

How can we write this in short?

$$\sum_{i=1}^{N} a$$

Notice that:

$$\sum_{i=1}^{N} a = N\,a$$
Consider:

$$\sum_{i=1}^{N} (x_i - \bar{y})$$

Since ȳ is not indexed with i, the expression is equal to:

$$\sum_{i=1}^{N} x_i - \sum_{i=1}^{N} \bar{y}$$

and

$$\sum_{i=1}^{N} \bar{y} = N\,\bar{y}, \quad\text{so}\quad \sum_{i=1}^{N} (x_i - \bar{y}) = \sum_{i=1}^{N} x_i - N\,\bar{y}$$
Consider:

$$\sum_{i=1}^{N} x_i^2$$

Notice that

$$\sum_{i=1}^{N} x_i^2 \neq \left(\sum_{i=1}^{N} x_i\right)^2$$

Sum of the squares is not equal to square of the sum.
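All of these summation rules can be verified numerically; a small Python sketch:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
c = 3

# A constant factor can be taken out of the summation sign:
assert sum(c * xi for xi in x) == c * sum(x)

# A sum of pairwise sums splits into two sums:
assert sum(xi + yi for xi, yi in zip(x, y)) == sum(x) + sum(y)

# But summation does NOT commute with products or squares:
print(sum(xi * yi for xi, yi in zip(x, y)), sum(x) * sum(y))   # 110 vs 450
print(sum(xi ** 2 for xi in x), sum(x) ** 2)                   # 55 vs 225
```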
If you have N numbers x1, x2, ..., xN, the product of these numbers, P, is:

$$P = x_1 \cdot x_2 \cdots x_N$$

In expressing products like this, we always write the first two terms, then three periods, then the last term. A shorter way to write P is:

$$P = \prod_{i=1}^{N} x_i$$
where i is the index of x, running from 1 to N.
Consider:

$$\prod_{i=1}^{N} i$$

What’s this?

It is nothing but $P = \prod_{i=1}^{N} x_i$ with xi = i. So P equals:

$$1 \cdot 2 \cdot 3 \cdots N = N!$$
Regarding our future purposes, an important property to remember is:

$$\ln\left(\prod_{i=1}^{N} x_i\right) = \sum_{i=1}^{N} \ln x_i$$

as we’ll use while writing Likelihood functions in ECON 222.
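A quick numerical check of this property (the sample values are arbitrary; math.prod requires Python 3.8+):

```python
import math

x = [0.5, 1.2, 2.0, 3.3]

log_of_product = math.log(math.prod(x))           # log(x1 * x2 * ... * xN)
sum_of_logs = sum(math.log(xi) for xi in x)       # log(x1) + ... + log(xN)

print(abs(log_of_product - sum_of_logs) < 1e-12)  # True, up to rounding
```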
Finally, consider:

$$\sum_{k=0}^{N} \binom{N}{k} x^k y^{N-k}$$

What’s this? Expand it to see:

$$\binom{N}{0} y^N + \binom{N}{1} x\,y^{N-1} + \cdots + \binom{N}{N} x^N$$

which is nothing but the binomial expansion of

$$(x + y)^N$$

as we’ll use while studying the Binomial and Poisson distributions in ECON 221.
1.4 EXERCISES ___________________________________________________________
Consider the expression for the population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

and simplify it until you see:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

Solution: Expand the square and use the summation rules above:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i^2 - 2\mu x_i + \mu^2\right) = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - 2\mu\,\frac{1}{N}\sum_{i=1}^{N} x_i + \mu^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - 2\mu^2 + \mu^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

So, variance can be calculated by subtracting ‘the square of the mean of observations’ from ‘the mean of squared observations’.