Chapter 1
Describing data

As discussed in the opening lectures, an analytical and evidence-based approach to policymaking is a must for modern, complex societies. As the famous statistician W. Edwards Deming put it, "Without data, you're just another person with an opinion." Data and its analysis are what distinguish well-designed policies from arbitrary and/or flawed ones. So, understanding the features and structure of data is an indispensable step of analysis.

1.1 A taxonomy of data types

Practically every sort of empirical analysis begins with a need to describe a data set and its various elements. So, we begin our journey into probability theory and statistics with this simple yet crucial task. Recall from the class discussion that data may come in two main forms: qualitative and quantitative. While qualitative data qualifies 'things', quantitative data quantifies 'things', as the terms suggest. Qualitative data thus often have a categorical nature. If the values of a categorical variable are orderable (sortable), then this categorical variable is called an 'ordinal' categorical variable. Otherwise, it is a 'nominal' categorical variable. While the responses in a satisfaction survey are ordinal (consider 1: least liked to 5: most liked), indicators of gender are nominal (F: female and M: male). Note that it is not always trivial to come up with a judgment: while we can treat age categories of 'young, middle-aged and old' for people, or class/year categories of 'freshman, sophomore, junior and senior' for students, as 'ordinal', another researcher may choose to treat them as nominal. What matters is whether we can sort a categorical variable on a clear basis stripped of value judgments. For instance, one cannot simply rank one gender above the others, regardless of the underlying way of thinking.

Quantitative data is by definition numerical. It can be either discrete, as in the case of the number of automobiles owned by households (one cannot own a fractional automobile), or continuous, as in the case of daily spending by households. Household size, i.e., the number of people forming the household, is discrete; the number of cities in a country is discrete; and so on. The case of people's ages measured in years can be a little confusing: think about it.

Classifying data

[Figure: a tree classifying Data into Qualitative (Categorical): Nominal, Ordinal; and Quantitative (Numerical): Discrete, Continuous, with continuous data further split into Interval and Ratio]

An important point regarding continuous data is the distinction between 'interval data' and 'ratio data'. A simple rule of thumb is: if there is an 'absolute zero' among the possible values of a data series, it is 'ratio data'; in the absence of an 'absolute zero' it is named 'interval data'. A classic example of this is temperature measured on the Kelvin (K) versus the Celsius (C) scale. While the Kelvin scale has an absolute zero, i.e., 0K, the Celsius scale does not. The freezing point of pure water (under certain conditions) is 0C, yet this is not the lowest attainable temperature. Indeed, there are some 273.15 degrees more to go down until that point: −273.15C is defined as 0K, and it is the lowest possible temperature in the Universe. While 200K is two times 100K, 200C is not two times 100C.

An easier example of ratio data is the measurement of 'mass' (in kilograms, say). Mass has an absolute zero, which is 0 kg, and a 20 kg object is two times as heavy as a 10 kg object (assuming there is gravity).

1.2 What is a "data set"?

It is possible, and often necessary, to convert a data series 'from numerical to categorical'. For instance, age data measured in 'years lived' can be expressed in terms of the qualifiers 'young, middle-aged and old'. Note that this transformation results in some loss of information. Clearly, a numerical age series tells more about the people surveyed than a simple categorization. Still, when properly made, a good categorization of numerical values may prove very useful in statistical (or econometric) analysis.

Search & explore: Conversion 'from categorical to numerical' may not be so straightforward: work out your own answer.

Checkpoint No: 7

1.3 Frequency

In the Oxford English Dictionary, 'frequency' is defined as "the rate at which something occurs over a particular period of time or in a given sample". Our understanding covers the cases of 'being' in addition to 'occurring' or happening: frequency is the numerical measure of 'how often something happens or how often some specific way of being is observed'. Thus, just as we can count car accidents in a certain hour, we can also count the people who survived a certain accident. So, we can count 'things' in time (we can call this temporal counting) and in space (we can call this spatial counting).

In a nutshell: In our learning and practice of probability theory and statistics we will be 'counting the things', simply using our fingertips at the beginning, and using more sophisticated techniques later.

1.3.1 Frequency distribution

A frequency distribution is a tabular summary of how the values in a data series are distributed across classes.

First, determine the number of classes k, according to:

Number of observations     k
< 50                       5-7
50-100                     7-8
101-500                    8-10
501-1,000                  10-11
1,001-5,000                11-14
> 5,000                    14-20

The table gives a rule of thumb and often requires your professional judgment as well.

In a nutshell: A 'rule of thumb' is a broadly accurate guide or principle, based on practice rather than theory.

Second, determine the class width, w:

w = (Maximum − Minimum) / k

where Maximum is the 'largest observation' and Minimum is the 'smallest observation'. Always round the formula's result up to find w.

Third, construct the k classes; they are to be inclusive and non-overlapping.

Fourth, allocate your observations to classes and get the count of each class.

At the end, present your result as a table. What is obtained is a "frequency distribution table".

Consider the age data of 20 people (20 subjects) measured in years:

12  11 19  20  20
15  15 24  15  12
20  18 17  20  20
20  22 24  12  14

While summarizing the age data, it seems appropriate to use 5 classes following the rule of thumb given before. The max of our data series is 24 and the min is 11. Class width w, then, is calculated as:

w = (24 − 11) / 5 = 2.6; always rounding up, w = 3

Having calculated the class width, beginning from the min value (11 here) we establish our classes as [11, 14), [14, 17), [17, 20), [20, 23) and [23, 26]. Pay attention to openness and closedness of classes (intervals) on the left and on the right.

Once the classes are ready, we carefully count the data values falling into each interval and prepare the following table, a table that we call the ’frequency table’.

Class   Frequency
[11,14)      4
[14,17)      4
[17,20)      3
[20,23)      7
[23,26]      2
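The four-step procedure above can be sketched in Python (a minimal illustration; the function name and layout are my own, not part of the notes):

```python
import math

def frequency_distribution(data, k):
    """Build k inclusive, non-overlapping classes and count the
    observations falling into each one."""
    w = math.ceil((max(data) - min(data)) / k)  # class width, rounded up
    lo = min(data)
    edges = [lo + i * w for i in range(k + 1)]  # class boundaries
    counts = [0] * k
    for x in data:
        # classes are [edge, edge + w), except the last, which is closed
        counts[min((x - lo) // w, k - 1)] += 1
    return edges, counts

ages = [12, 11, 19, 20, 20, 15, 15, 24, 15, 12,
        20, 18, 17, 20, 20, 20, 22, 24, 12, 14]
edges, counts = frequency_distribution(ages, 5)
# edges  -> [11, 14, 17, 20, 23, 26]
# counts -> [4, 4, 3, 7, 2]
```

Reproducing the table this way is a quick check that the hand count is right.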

The final step is to prepare (draw) the histogram of our data.

[Histogram of the age data; bar edges on the x-axis at 11, 14, 17, 20, 23, 26]

Checkpoint No: 8

Consider another researcher who prefers arbitrarily to use 2 classes. In this case, the class width (w) will be:

w = (24 − 11) / 2 = 6.5; always rounding up, w = 7

The classes will be [11, 18) and [18, 25], so our resulting frequency table will look like:

Class      Frequency
[11, 18)       9
[18, 25]      11

The final step is again to prepare (draw) the histogram of our data.

[Histogram of the two-class summary; bar edges on the x-axis at 11, 18, 25]

Which histogram (or frequency table) gives a better summary of the data? Avoid any confusion: the first histogram is the winner of the contest. It summarizes our data and conveys a tangible message. The second histogram, on the other hand, suffers from 'oversummarizing'. Now take our discussion to its limit and consider a third researcher who prefers to use 1 class only. Why would that be nonsense?

In a nutshell: In order to summarize numerical (quantitative) data we use 'frequency tables' and 'histograms'. The columns belonging to consecutive nonempty classes must touch each other while drawing a histogram.

What about categorical (qualitative) data? Consider the following data series, which consists of category markings for 20 people (20 subjects), where Y, M and O stand for 'young', 'middle-aged' and 'old', respectively.

Y   Y   O  M   Y
O   Y   O  O   Y
M   M   Y  M   O
O   M   Y  Y   M

This time, forming a frequency table must be easier: we do not (indeed, we cannot) establish classes & simply count the frequency of each category:

Category   Frequency
Y              8
M              6
O              6

The final step is to prepare (draw) the bar chart of our data. It is more than trivial; do it yourself.
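Counting categories needs no class construction; this can be sketched with Python's standard Counter (an illustrative sketch, not part of the notes):

```python
from collections import Counter

labels = ["Y", "Y", "O", "M", "Y",  "O", "Y", "O", "O", "Y",
          "M", "M", "Y", "M", "O",  "O", "M", "Y", "Y", "M"]
freq = Counter(labels)   # frequency of each category
# freq["Y"] -> 8, freq["M"] -> 6, freq["O"] -> 6
```

The same counts feed directly into the bar chart.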

In a nutshell: In order to summarize categorical (qualitative) data we use 'frequency tables' and 'bar charts'. The bars never touch each other while drawing a bar chart.

Checkpoint No: 9

1.3.2 Relative frequency distribution

Once the counts in a frequency distribution table are divided by the total number of observations & expressed as "percentages" or as ‘fractions between 0 and 1’, the resulting table is called a "relative frequency distribution table". By construction, relative frequencies of all classes add up to 100% or 1.

1.3.3 Cumulative frequency distribution

Once the frequencies (counts) in a frequency distribution table are accumulated across classes, one row at a time and from the smallest to largest class, the resulting table is called a ‘cumulative frequency distribution table’.

1.3.4 Relative cumulative frequency distribution

Once the relative frequencies in a relative frequency distribution table are accumulated across classes, one row at a time and from the smallest to largest class, the resulting table is called a ‘relative cumulative frequency distribution table’.

In order to see the linkages between ‘frequency’, ‘relative frequency’, ‘cumulative frequency’ and ‘cumulative relative frequency’, examine the following table:

Class      Frequency   Rel. freq.   Cum. freq.   Cum. rel. freq.
[10, 17)      500        0.333         500           0.333
[17, 24)      250        0.167         750           0.500
[24, 31)      150        0.100         900           0.600
[31, 38]      600        0.400        1500           1.000
Total        1500        1.000        N.A.           N.A.
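The four columns of the table are mechanically linked; a short sketch (values taken from the table above, variable names my own):

```python
from itertools import accumulate

freqs = [500, 250, 150, 600]       # class frequencies
n = sum(freqs)                     # 1500 observations in total
rel = [f / n for f in freqs]       # relative frequencies
cum = list(accumulate(freqs))      # cumulative frequencies
cum_rel = [c / n for c in cum]     # cumulative relative frequencies
# cum -> [500, 750, 900, 1500]; cum_rel rounds to [0.333, 0.5, 0.6, 1.0]
```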

1.4 Representation of distributions

1.4.1 Histogram and relative frequency polygon

A histogram is a graph that consists of vertical bars constructed on a horizontal line on which intervals are marked for the variable being displayed.

Histograms are traditionally used for continuous numerical data. When the midpoints of the top segment of each bar in a histogram are connected with line segments, what we obtain is called a frequency polygon. Note that a ‘bar chart’ resembles a histogram yet it differs in two main aspects: first, it is for categorical data & second, the bars in a bar chart are separated by a visible gap. Examples are provided in the upcoming exercises.

1.4.2 Symmetry and skewness

The shape of a distribution is said to be symmetric if the observations are balanced, or approximately evenly distributed, about its center. A distribution is skewed, or asymmetric, if the observations are not symmetrically distributed on either side of the center. A skewed-right distribution (sometimes called positively skewed) has a tail that extends farther to the right. A skewed-left distribution (sometimes called negatively skewed) has a tail that extends farther to the left.

Checkpoint No: 10

1.4.3 O-give

An O-give, also called a cumulative line graph, is a line that connects the points whose heights are the cumulative percentages of observations below the upper limit of each class (interval) in a cumulative frequency distribution. Even when not stated, an O-give presents cumulative percentage figures. The beginning vertical value in an O-give is always 0 and the ending vertical value is 100% (equivalently, 1). Examples are provided in the upcoming exercises.

1.1 EXERCISES____________________________________________     

1. 

Fill the empty cells in the following table:

Interval     Freq.   Rel. freq. (%)   Cum. freq.   Rel. cum. freq. (%)
[0, 20]       20          10
(20, 40]                                   80
(40, 60]      30
(60, 80]
(80, 100]                 20                                 80
(100, 120]

Solution: The complete table is as follows:

Interval     Freq.   Rel. freq. (%)   Cum. freq.   Rel. cum. freq. (%)
[0, 20]       20          10               20               10
(20, 40]      60          30               80               40
(40, 60]      30          15              110               55
(60, 80]      10           5              120               60
(80, 100]     40          20              160               80
(100, 120]    40          20              200              100
2. 

Consider the frequency histogram displayed below:

[Frequency histogram; bar edges on the x-axis at 1.5, 3, 4.5, 6, 7.5, 9]

Draw the corresponding (relative frequency) o-give.

Solution: Prepare your Cartesian plane. The origin is (0, 0). Mark the following points on your graph space by paying attention to proportions: (0, 0), (1.5, 0.22), (3.0, 0.57), (4.5, 0.68), (6.0, 0.73), (7.5, 0.95), (9.0, 1.00). Then, connect these points with line segments from left to right. Once this is done, you will observe a properly drawn O-give. Make sure you have named the axes.

3. 

Consider the relative frequency o-give displayed below:

[Relative cumulative frequency o-give; x-axis marks at 2, 4, 6, 8, 10; y-axis: Rel. Cum. Freq. (%)]

i. Draw the corresponding histogram.

ii. What can you say about the percentage of observations that takes a value less than or equal to 6.5 (if you need to estimate it what would be a reasonable estimate)?

iii. What can you say about the percentage of observations that takes a value greater than or equal to 4.3 (if you need to estimate it what would be a reasonable estimate)?

Solution:

i. Consider the classes [0, 2], (2, 4], (4, 6], (6, 8] and (8, 10]. Taking simple differences reveals that the relative frequencies of these classes are 0, 0.3, 0, 0.6 and 0.1. Locate these numbers on the Cartesian plane to obtain the histogram. Recall that for two successive classes which are non-empty, the bars must touch each other.

ii. The relative frequency of (6, 8] is 0.6, and (6.5 − 6)/(8 − 6) = 0.25. So, approximately 0.25 × 0.6, i.e., 0.15 is the relative frequency of (6, 6.5]. Since the relative frequency of [0, 6] is 0.3, the relative frequency of [0, 6.5] becomes 0.45. If the given O-give has been drawn properly, then we expect our estimate to be reliable.

iii. Use the same approach. The solution should yield 0.70.
4. 

Consider the relative frequency o-give displayed below:

[Relative cumulative frequency o-give; y-axis: Rel. Cum. Freq.]

What can you say about the percentage of observations that takes a value between 4.5 and 9.5?

Solution: Use the same approach. 0.4/4 + 0.1 + 0 + 0.3/4 yields 0.275.

5. 

A doctor’s office staff studied the waiting times for patients who arrive at the office with a request for emergency service. The following data with waiting times in minutes were collected over a one-month period:

2, 5, 10, 12, 4, 4, 5, 17, 11, 8, 9, 8, 12, 21, 6, 8, 7, 13, 18, 117

i. Construct a histogram for this data set by including all observations.

ii. Construct a histogram for this data set after excluding the value of 117. Note that you still have to show it on your histogram (but how?)

iii. Which histogram is more informative? Why?

Solution:

i. For ease in processing the data, sort/order the values as 2, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, 10, 11, 12, 12, 13, 17, 18, 21 and 117. The minimum is 2, the maximum is 117 and N is 20. 5 classes would work well here. The class width then becomes (117 − 2)/5 = 23; rounding up (always) gives 24. This means that our classes will be [2, 26], (26, 50], (50, 74], (74, 98] and (98, 122]. When drawn (do it), this will turn out to be a funny histogram, as 19 observations fall into the first class and only one observation (117) falls into the last one.

ii. When 117 is kept aside, the maximum becomes 21. Using 5 classes again, the class width becomes (21 − 2)/5 = 3.8; rounding up (always) gives 4. Our classes will be [2, 6], (6, 10], (10, 14], (14, 18] and (18, 22]. The respective frequencies of these are 6, 6, 4, 2 and 1. When drawn, we will observe a neatly drawn histogram. (What about 117?)

iii. The second histogram is more informative. It gives us more details, like the shape of the distribution.
6. 

Consider the relative frequency o-give displayed below:

[Relative cumulative frequency o-give; y-axis: Rel. Cum. Freq. (%)]

Based on the information above estimate the median and the 3rd quartile (Q3).

Solution: Hint: For the median, find the horizontal value at which the O-give has the value of 50. For Q3, find the horizontal value at which the O-give has the value of 75.

7. 

In a data set, the relative frequency of the interval (0, 10] is 0.10, the relative frequency of (10, 20] is 0.20, the relative frequency of (20, 30] is 0.30 and the relative frequency of (30, 40] is 0.40. Construct the relative frequency O-give and calculate Q3 for this data set.

Solution: Solving this must be straightforward now. Do it and discuss with classmates.

Checkpoint No: 11

1.5 Measures of central tendency

Measures of central tendency, or measures of concentration, indicate 'where' on the real number line our data series is located.

As you'll see in the upcoming classes, this knowledge is critically important for making several statistical assessments.

1.5.1 Measures of central tendency: Mode

The "mode", whenever it exists, is the most frequently occurring value in a data series.

Note that the mode is commonly used with (but not restricted to) categorical data.

Consider,

X : 1,2,3,3,3,4,4,5,6,7

where N = 10. Among the values of X, the most frequent (most repeated) value is 3, so we say Mode = 3. If X included another 4, like:

X : 1,2,3,3,3,4,4,4,5,6,7

where N = 11, then we would say the Modes are 3 and 4.
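The counting logic behind the mode can be sketched as follows (illustrative code and function name, not from the notes):

```python
from collections import Counter

def modes(data):
    """Return every value that attains the maximum frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

modes([1, 2, 3, 3, 3, 4, 4, 5, 6, 7])     # -> [3]
modes([1, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7])  # -> [3, 4]
```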

1.5.2 Measures of central tendency: Mean (Arithmetic mean)

For {x_i}, i = 1, …, N:

μ = ∑_{i=1}^{N} x_i / N

is called the "population mean".

In addition, for {x_i}, i = 1, …, n:

x̄ = ∑_{i=1}^{n} x_i / n

is called the 'sample mean'.

Considering,

X : 1,2,3,3,3,4,4,4,5,6,7

where N = 11, the mean is calculated as:

μ = ∑_{i=1}^{N} x_i / N
  = (1 + 2 + 3 + 3 + 3 + 4 + 4 + 4 + 5 + 6 + 7) / 11
  ≈ 3.82

In another case, suppose the last value of X, i.e., 7, is replaced by 42; let's call this new data series X′:

X′ : 1,2,3,3,3,4,4,4,5,6,42

where N = 11 again, the mean becomes:

μ = ∑_{i=1}^{N} x_i / N
  = (1 + 2 + 3 + 3 + 3 + 4 + 4 + 4 + 5 + 6 + 42) / 11
  = 7

As this example suggests, mean (μ) is sensitive to outliers/extreme values. However, this sensitivity does not imply that μ is a meaningless or a useless measure. On the contrary, it is a fundamental measure with many good statistical properties, as we will see in the upcoming chapters.
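The sensitivity is easy to verify numerically (a small sketch; Xp stands for the series X′ with the outlier):

```python
X  = [1, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7]
Xp = X[:-1] + [42]            # replace the last value, 7, by 42

mean_X  = sum(X)  / len(X)    # 42/11, about 3.82
mean_Xp = sum(Xp) / len(Xp)   # 77/11 = 7.0: one outlier nearly doubles it
```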

Checkpoint No: 12

In a nutshell: Writing mathematics
Good mathematical writing involves:

∙ Using a relevant & consistent notation

∙ Flowing logically through the solution or proof steps

∙ Including verbal explanations & necessary definitions between steps

∙ Putting things gravitationally, i.e., top to bottom

∙ Keeping only the essentials, removing everything redundant, avoiding leftover scratch work

Checkpoint No: 13

In a nutshell: Working with grouped data
In many situations, we the researchers are provided with a grouped summary of a data set, rather than the full data itself. Grouped data sets mostly come in the form of classes with their corresponding frequencies or relative frequencies, like in a histogram. Given population data of N observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

μ = ∑_{i=1}^{K} f_i m_i / N

can be written, where

∑_{i=1}^{K} f_i = N

Given sample data of n observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

x̄ = ∑_{i=1}^{K} f_i m_i / n

can be written, where

∑_{i=1}^{K} f_i = n

Checkpoint No: 14

Consider,

X          Frequency   Relative frequency
[11, 14)       4            4/20
[14, 17)       4            4/20
[17, 20)       3            3/20
[20, 23)       7            7/20
[23, 26]       2            2/20

Using the midpoints and frequencies of classes:

μ = [ (11+14)/2 · 4 + (14+17)/2 · 4 + (17+20)/2 · 3 + (20+23)/2 · 7 + (23+26)/2 · 2 ] / (4 + 4 + 3 + 7 + 2)
  = 18.35

is obtained.

Equivalently, one may use the relative frequencies of classes to calculate the same:

μ = (11+14)/2 · 4/20 + (14+17)/2 · 4/20 + (17+20)/2 · 3/20 + (20+23)/2 · 7/20 + (23+26)/2 · 2/20
  = 18.35
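The grouped-mean formula translates directly into code (a sketch; midpoints and frequencies taken from the table above):

```python
mids  = [12.5, 15.5, 18.5, 21.5, 24.5]  # class midpoints, e.g. (11+14)/2
freqs = [4, 4, 3, 7, 2]                 # class frequencies

N  = sum(freqs)                          # 20 observations
mu = sum(f * m for f, m in zip(freqs, mids)) / N
# mu -> 18.35, matching the hand calculation
```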

Checkpoint No: 15

1.5.3 Measures of central tendency: Percentiles, Deciles and Quartiles

The "p-th" percentile in a data series is the smallest value which is greater than p% of observations. If there are N observations, we find the

(p/100) · (N + 1)-th

ordered position and read the observation at this position as the p-th percentile.

In a nutshell: Order (sort) the observations in ascending order first. Without ordering the data, you'll not get correct results.

The "d-th" decile in a data series is the smallest value which is greater than d tenths of observations. The "q-th" quartile in a data series is the smallest value which is greater than q quarters of observations.

We imagine slicing our ordered data set into 100 parts when finding percentiles; we slice it into 10 when finding deciles, and into 4 when finding quartiles.

By definition P0, Q0, D0 are equal to the minimum observation and P100, Q4, D10 are equal to the maximum observation.

0th percentile      0th decile      0th quartile (Q0)    Minimum
10th percentile     1st decile
20th percentile     2nd decile
25th percentile                     1st quartile (Q1)
30th percentile     3rd decile
40th percentile     4th decile
50th percentile     5th decile      2nd quartile (Q2)    Median
60th percentile     6th decile
70th percentile     7th decile
75th percentile                     3rd quartile (Q3)
80th percentile     8th decile
90th percentile     9th decile
100th percentile    10th decile     4th quartile (Q4)    Maximum

1.5.4 Measures of central tendency: Median

Among the many percentiles, the "Median" has a special place. The "Median" of a data series is the smallest value which is greater than 50% of observations. It simply divides a data series into two equal halves. Numerically, the median is nothing but Q2, P50 and D5, which are all the same.

Consider now a variable X,

Position:  1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  11th
Value:      1    1    2    3    5    8   13   21   34    55    89

To find, for instance, the 30th percentile of X we calculate:

(p/100) · (N + 1) = (30/100) · (11 + 1) = 3.6, i.e., the 3.6th position

Then, we find the observation value in the 3.6th position of the ordered data series. As seen here, there may not be such a physical position in data. As an approximation, we take the value of X in the 3rd position and add 0.6 times the difference between the value in 4th position and the value in 3rd position, i.e.,

P30 = 2 + (3 − 2) · 0.6 = 2.6

is our 30th percentile.

To find the 80th percentile of X we calculate:

(p/100) · (N + 1) = (80/100) · (11 + 1) = 9.6, i.e., the 9.6th position

P80 = 34 + (55 − 34) · 0.6 = 46.6

So, 46.6 is our 80th percentile.

To find the Median, i.e., the 50th percentile of X we calculate:

(p/100) · (N + 1) = (50/100) · (11 + 1) = 6, i.e., the 6th position

Without further calculations, 8 is our Median.
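The position-then-interpolate recipe can be sketched as a small function (illustrative; the function name is mine, not from the notes):

```python
def percentile(data, p):
    """p-th percentile via the (p/100)(N + 1)-th ordered position,
    interpolating linearly when that position is fractional."""
    xs = sorted(data)                  # ordering first is essential
    pos = p / 100 * (len(xs) + 1)      # 1-indexed position
    k = int(pos)                       # integer part of the position
    frac = pos - k                     # fractional part
    if k == 0:
        return xs[0]
    if k >= len(xs):
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

X = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
percentile(X, 30)   # position 3.6 -> about 2.6
percentile(X, 80)   # position 9.6 -> about 46.6
percentile(X, 50)   # position 6   -> 8, the Median
```

Note that other software (spreadsheets, statistics packages) may use slightly different interpolation conventions.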
Consider another data series:

2   3   4   6   7
8   9   9  10  10
11  11 11  11  13

14  18 19  19  19
21  21 22  22  23
23  23 24  24  24
24  25 25  25  26
26  26 26  26  49

Solve yourself to see that the Median is the 20.5th value of this data series and it is equal to 20.
Before moving forward, consider finally:

Variable   1st  2nd  3rd  4th  5th  6th  7th  8th  9th    Mean
X           1    3    6   10   15   21   28   36  1000   124.4
X′          1    3    6   10   15   21   28   36    45    18.3

Did you notice anything?

In a nutshell: Unlike the mean (μ), the Median (Q2) is not sensitive to outliers/extreme values. Equivalently, we say the Median is robust to the presence of outliers/extreme values in a data series.

Checkpoint No: 16

1.5.5 Where is my data? Five-number summary

When we give the five descriptive measures

minimum ≤ Q1 ≤ median ≤ Q3 ≤ maximum

it is called a "five-number summary". This is a somewhat old but still useful tool to summarize data sets.
For a variable X given as:

2   3   4   6   7
8   9   9  10  10
11  11 11  11  13
14  18 19  19  19

21  21 22  22  23
23  23 24  24  24
24  25 25  25  26
26  26 26  26  49

the five-number summary is (2, 10.25, 20, 24, 49).
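Under the same positional method used for percentiles, the five-number summary of X can be reproduced in code (a sketch; the helper names are mine):

```python
def percentile(xs, p):
    """(p/100)(N + 1)-th ordered position with linear interpolation."""
    xs = sorted(xs)
    pos = p / 100 * (len(xs) + 1)
    k, frac = int(pos), pos - int(pos)
    if k == 0:
        return xs[0]
    if k >= len(xs):
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

def five_number_summary(xs):
    return (min(xs), percentile(xs, 25), percentile(xs, 50),
            percentile(xs, 75), max(xs))

X = [2, 3, 4, 6, 7, 8, 9, 9, 10, 10, 11, 11, 11, 11, 13,
     14, 18, 19, 19, 19, 21, 21, 22, 22, 23, 23, 23, 24,
     24, 24, 24, 25, 25, 25, 26, 26, 26, 26, 26, 49]
five_number_summary(X)   # -> (2, 10.25, 20.0, 24.0, 49)
```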
To make use of our knowledge gained up to this point, consider a data set that is summarized with the following relative frequency o-give:

[Relative cumulative frequency o-give passing through the points (40, 20%), (60, 70%), (80, 90%), (100, 90%), (120, 100%)]

Based on the information above estimate the median, the mean, and the 3rd quartile.

In order to reach a good solution, note that the median is the 50th percentile and the 3rd quartile is the 75th percentile. A relative cumulative frequency o-give tells us the percentage of observations that take a value less than or equal to a given number. Under the assumption that the data are uniformly distributed over each class interval, we can therefore use the o-give to estimate the median and the 3rd quartile. On the graph we mark the points that correspond to 50% and 75%.

[The same o-give with the 50% and 75% levels marked, meeting the curve at m and v, respectively]

From similarity of triangles we have:

(m − 40)/(50 − 20) = (60 − 40)/(70 − 20)

which yields

m = 40 + 30 · (20/50) = 52
          50

Similarly,

(v − 60)/(75 − 70) = (80 − 60)/(90 − 70)

yields

v = 60 + 5 · (20/20) = 65

We will use CM_l to denote the class mark of the l-th class interval (the center of the l-th interval), and RF_l to denote the relative frequency of the observations that take values in the l-th class interval.

The assumption that the data are uniformly distributed over each class interval ensures that the following formula yields a "reasonable" estimate for the mean:

RF_1 CM_1 + RF_2 CM_2 + RF_3 CM_3 + RF_4 CM_4 + RF_5 CM_5

Thus our estimate for the mean is

(0.2 − 0) · (0 + 40)/2 + (0.7 − 0.2) · (40 + 60)/2 + (0.9 − 0.7) · (60 + 80)/2 + (0.9 − 0.9) · (80 + 100)/2 + (1.0 − 0.9) · (100 + 120)/2 = 54
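The three graphical estimates can be reproduced numerically under the same uniformity assumption (a sketch; the knot list encodes the o-give as read in this example, and the function name is mine):

```python
# (upper class limit, cumulative %) pairs read off the o-give
knots = [(0, 0), (40, 20), (60, 70), (80, 90), (100, 90), (120, 100)]

def inverse_ogive(knots, level):
    """x-value at which the o-give reaches the given cumulative %."""
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if y0 <= level <= y1 and y0 != y1:
            # linear interpolation within the class, i.e. uniformity
            return x0 + (x1 - x0) * (level - y0) / (y1 - y0)
    raise ValueError("level outside the o-give range")

median = inverse_ogive(knots, 50)   # -> 52.0
q3     = inverse_ogive(knots, 75)   # -> 65.0

# mean estimate: sum of (relative frequency) x (class midpoint)
mean = sum((y1 - y0) / 100 * (x0 + x1) / 2
           for (x0, y0), (x1, y1) in zip(knots, knots[1:]))
# mean -> about 54
```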

Checkpoint No: 17

1.6 Measures of dispersion

Measures of dispersion, or measures of variation, indicate how 'spread out' on the real number line our data series is.

Without properly assessing dispersion, the knowledge of location means only a little.

1.6.1 Measures of dispersion: Range

Range = Largest observation − Smallest observation

or

Range = Max − Min

Range measures the length of the interval on the real number line spanned by our data set.

1.6.2 Measures of dispersion: Interquartile range

The interquartile range (IQR) is defined as:

IQR = Q3 − Q1

IQR measures the length of the interval on the real number line spanned by the "central 50%" of our data set.

Checkpoint No: 18

1.6.3 Measures of dispersion: Box-Whisker plots

The five-number summary presented as a graph is called a Box-Whisker plot. Sometimes, near outliers and far outliers can also be added while constructing these plots.

1.6.4 Measures of dispersion: Variance

For {x_i}, i = 1, …, N:

σ² = ∑_{i=1}^{N} (x_i − μ)² / N

is called the "population variance", and for {x_i}, i = 1, …, n:

s² = ∑_{i=1}^{n} (x_i − x̄)² / (n − 1)

is called the 'sample variance'.

In a nutshell: A practical way to calculate the population variance uses σ² = (1/N) ∑_{i=1}^{N} x_i² − μ²:
1. Calculate the mean of squares: (1/N) ∑_{i=1}^{N} x_i²
2. Calculate the square of the mean: μ² = ((1/N) ∑_{i=1}^{N} x_i)²
3. Subtract the second from the first

Consider:

1  3  6  10  15  21 28

For this series, μ = 12 (calculate yourself) and the variance is calculated as follows:

σ² = ∑_{i=1}^{N} (x_i − μ)² / N
   = [ (1−12)² + (3−12)² + (6−12)² + (10−12)² + (15−12)² + (21−12)² + (28−12)² ] / 7
   = 588/7
   = 84

A tabular approach may also be preferred:

i   x_i   x_i − μ   (x_i − μ)²
1     1     −11         121
2     3      −9          81
3     6      −6          36
4    10      −2           4
5    15       3           9
6    21       9          81
7    28      16         256
                 Total: 588

σ² = 588/7 = 84

Finally, one may calculate the sum of squares as 1596 and the mean as 12, and calculate the variance σ² as 1596/7 − 12² = 228 − 144, which is 84.
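Both routes to the variance are easy to check numerically (a sketch using the same seven values):

```python
X = [1, 3, 6, 10, 15, 21, 28]
N = len(X)
mu = sum(X) / N                                   # 12.0

# definition: mean squared deviation from the mean
var_def = sum((x - mu) ** 2 for x in X) / N       # 588/7 = 84.0

# shortcut: mean of squares minus square of the mean
var_short = sum(x * x for x in X) / N - mu ** 2   # 1596/7 - 144 = 84.0
```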

Checkpoint No: 19

1.6.5 Measures of dispersion: Standard deviation

For {x_i}, i = 1, …, N:

σ = √σ² = √( ∑_{i=1}^{N} (x_i − μ)² / N )

is called the "population standard deviation", and for {x_i}, i = 1, …, n:

s = √s² = √( ∑_{i=1}^{n} (x_i − x̄)² / (n − 1) )

is called the 'sample standard deviation'.

Consider the following data series:

11  23  58 13  21  34

55  89  14 42  33  37
76  10  98 71  59  72
58  44  18 16  76  51
9   46  17 71  12  86
57  46  36 87  50  25
12  13  93 19  64  18

31  78  11 51  42  29

We are now asked to describe this data series, including its mean, five-number summary, range, interquartile range and variance. For ease in calculating the positional measures (quartiles here), it is a good practice to order the observations from the smallest to the largest, i.e., in ascending order:

9   10  11 11  12  12
13  13  14 16  17  18
18  19  21 23  25  29
31  33  34 36  37  42
42  44  46 46  50  51

51  55  57 58  58  59
64  71  71 72  76  76
78  86  87 89  93  98

The required descriptive measures can then be computed from this ordered series.

For the same data series (call it X), the Box-Whisker plot looks like:

[Box-Whisker plot of X; axis marks at 20, 40, 60, 80, 100]

In a nutshell: Working with grouped data
Given population data of N observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

μ = ∑_{i=1}^{K} f_i m_i / N,  where ∑_{i=1}^{K} f_i = N

σ² = ∑_{i=1}^{K} f_i (m_i − μ)² / N

Given sample data of n observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

x̄ = ∑_{i=1}^{K} f_i m_i / n,  where ∑_{i=1}^{K} f_i = n

s² = ∑_{i=1}^{K} f_i (m_i − x̄)² / (n − 1)
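Applying the grouped population formulas to the age data of Section 1.3.1 can be sketched as (midpoints and frequencies from that example):

```python
mids  = [12.5, 15.5, 18.5, 21.5, 24.5]   # class midpoints
freqs = [4, 4, 3, 7, 2]                  # class frequencies

N   = sum(freqs)
mu  = sum(f * m for f, m in zip(freqs, mids)) / N             # 18.35
var = sum(f * (m - mu) ** 2 for f, m in zip(freqs, mids)) / N
# var -> about 15.73 (population version)
```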

1.6.6 Measures of dispersion: Coefficient of variation

Population coefficient of variation:

cv = (σ/μ) · 100%, if μ ≠ 0

Sample coefficient of variation:

cv = (s/x̄) · 100%, if x̄ ≠ 0
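The coefficient of variation expresses the standard deviation as a percentage of the mean; a quick sketch with the series used above for the variance:

```python
import math

X = [1, 3, 6, 10, 15, 21, 28]
mu = sum(X) / len(X)                                        # 12.0
sigma = math.sqrt(sum((x - mu) ** 2 for x in X) / len(X))   # sqrt(84)
cv = sigma / mu * 100                                       # about 76.4 %
```

Being unitless, cv lets us compare the spread of series measured in different units.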

Checkpoint No: 20

1.7 Measures of association for bivariate data

When we deal with one variable in our analysis, it is a case with "univariate" data. When we are concerned with patterns of change of two variables together, it is a case involving "bivariate" data. In these lecture notes,

{x_i}, i = 1, …, n (for a variable X) and {y_i}, i = 1, …, n (for a variable Y)

indicate univariate data, but

{(x_i, y_i)}, i = 1, …, n

indicates bivariate data.

Notice that, bivariate data come in "pairs", so one cannot change the correspondence between x’s and y’s.

1.7.1 Measures of association for bivariate data: Covariance

Covariance is a measure of the linear relationship between two variables. For {(x_i, y_i)}, i = 1, …, N,

Cov(x, y) = σ_xy = ∑_{i=1}^{N} (x_i − μ_X)(y_i − μ_Y) / N

For {(x_i, y_i)}, i = 1, …, n,

Cov(x, y) = s_xy = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1)

1.7.2 Measures of association for bivariate data: Correlation

The correlation for {(x_i, y_i)}, i = 1, …, N is given by

ρ_xy = σ_xy / (σ_x σ_y)

and for {(x_i, y_i)}, i = 1, …, n,

r_xy = s_xy / (s_x s_y)

When |r| ≥ 2/√n, we say the (linear) relationship is strong enough (or significant). Notice that it is always the case that

−1 ≤ ρ_xy ≤ 1

−1 ≤ r_xy ≤ 1
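The population versions of these formulas can be sketched as follows (illustrative function names; a perfectly linear pair should give correlation 1):

```python
import math

def pop_cov(xs, ys):
    """Population covariance: average product of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def pop_corr(xs, ys):
    """Population correlation: covariance rescaled into [-1, 1]."""
    sx = math.sqrt(pop_cov(xs, xs))   # a variable's covariance with
    sy = math.sqrt(pop_cov(ys, ys))   # itself is its variance
    return pop_cov(xs, ys) / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # y = 2x: a perfect linear relationship
pop_corr(x, y)            # -> about 1.0
```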

Checkpoint No: 21

1.8 Issues of unit and scale

Although not paid enough attention by economists and business administration people, almost every data series comes with a unit and a scale. For instance, if my income is TRY 96,000, the unit is TRY (the international code for the Turkish lira) and the scale is not explicitly stated. If we write it as TRY 96K, the unit is again TRY and the scale is "thousands", so 96 means 96,000 here.

Consider \(\{(x_i, y_i)\}_{i=1}^{N}\) where x is body weight in kilograms (kg) and y is height in centimeters (cm). Then,

Measure            Unit
μx                 kg
μy                 cm
σx²                kg²
σx                 kg
σy²                cm²
σy                 cm
cvx = σx/μx        Unitless
cvy = σy/μy        Unitless
σxy                kg·cm
ρxy                Unitless

Similarly, quartiles, deciles and percentiles of a variable have the same unit as the variable; so do its range and interquartile range. As a rule of thumb, linear operators do not alter units.

Use of numerical scales is often a matter of practicality or convenience. Nobody likes to write 123,000,000,000,000 (except politicians) instead of writing 123 trillion or 123·10¹². One may need to learn two important practices of scaling numbers:

In this edition, these are left to readers as individual study.

1.2 EXERCISES

1. 

Consider a population with data values of:
5, 6, 3, 3, 6, 9, 10, 4, 10, 4

Compute the mean, range, standard deviation, median, and Q1.

Solution: μ = 6, Range = max − min = 10 − 3 = 7, σ = 2.61, Q2 = 5.5 and Q1 = 3.75.

2. 

Find the mean, median, mode(s), variance, range, 1st quartile, and the 80th percentile of the data given below:
9, 13, 6, 7, 8, 6, 6, 9, 13, 13

Solution: μ = 9, Q2 = 8.5, modes are 6 and 13, σ² = 8, Range = 7, Q1 = 6 and P80 = 13.

3. 

A population has a range of R and it consists of two observations only. Calculate the variance of this data set.

Solution: Let x1 and x2 be the only two observations, with x2 − x1 = R (suppose x1 < x2). Then x2 − μ = R/2 and x1 − μ = −R/2.

\[
\sigma^2 = \frac{1}{2}\left[(x_1 - \mu)^2 + (x_2 - \mu)^2\right]
= \frac{1}{2}\left[(-R/2)^2 + (R/2)^2\right]
= \frac{1}{2} \cdot \frac{R^2}{2}
\]

So \(\sigma^2 = R^2/4\).
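A quick numerical check of this result, with two arbitrary observations:

```python
# Numerical check of sigma^2 = R^2/4 for a two-point population (arbitrary values).
x1, x2 = 3.0, 11.0       # any two observations
R = x2 - x1              # range = 8.0
mu = (x1 + x2) / 2       # 7.0
var = ((x1 - mu) ** 2 + (x2 - mu) ** 2) / 2
print(var, R ** 2 / 4)   # 16.0 16.0
```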
4. 

A researcher argues that median equals the simple average of the first and third quartiles. By giving a numerical example, show that this is incorrect.

Solution: Find/make up your own example.

5. 

Let a and b be any given real numbers. Let \(x_1, x_2, \ldots, x_N\) and \(y_1, y_2, \ldots, y_N\) be two data sets such that, for each i = 1, 2, ..., N, \(y_i = ax_i + b\).

i. What is the relation between the mean of the y-values and the mean of x-values?

ii. What is the relation between the variance of the y-values and the variance of x-values?

Solution: Needs some careful and patient elaboration.

1.
\[
\begin{aligned}
\mu_y &= \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\sum (ax_i + b) \\
&= \frac{1}{N}\sum ax_i + \frac{1}{N}\sum b \\
&= a \cdot \frac{1}{N}\sum x_i + \frac{1}{N} N b \\
&= a\mu_x + b
\end{aligned}
\]

So \(\mu_y = a\mu_x + b\).

2.
\[
\begin{aligned}
\sigma_y^2 &= \frac{1}{N}\sum_{i=1}^{N} (y_i - \mu_y)^2 \\
&= \frac{1}{N}\sum (ax_i + b - a\mu_x - b)^2 \\
&= \frac{1}{N} a^2 \sum (x_i - \mu_x)^2 \\
&= a^2 \sigma_x^2
\end{aligned}
\]

So \(\sigma_y^2 = a^2 \sigma_x^2\).
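Both identities can be verified numerically; the data and the constants a, b below are arbitrary:

```python
# Checking mu_y = a*mu_x + b and var_y = a^2 * var_x on made-up data.
a, b = 3.0, 5.0
x = [1.0, 4.0, 7.0, 8.0]
y = [a * xi + b for xi in x]
N = len(x)
mu_x = sum(x) / N
mu_y = sum(y) / N
var_x = sum((xi - mu_x) ** 2 for xi in x) / N  # population variance
var_y = sum((yi - mu_y) ** 2 for yi in y) / N
print(mu_y == a * mu_x + b, var_y == a ** 2 * var_x)  # True True
```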
6. 

Consider bivariate data consisting of the 1st midterm and 2nd midterm grades of 216 students. It is known that the 1st midterm grade of each student is 8% less than his or her 2nd midterm grade. If the mean of the 2nd midterm grades is 64 and the variance is 9, what can you say about the correlation coefficient of this data?

Solution: Without any calculations we can say it is 1. Why?

7. 

If we remove a data point from a data series, variance decreases. True or false? Explain.

Solution: For a logical statement to be true, it must be true without any exceptions. Consider first {1, 5, 9} and second {1, 9}. Which set of values has a larger variance? What is your conclusion?

8. 

When we multiply each point in a data series by the same factor, the variance increases. True or false? Explain.

Solution: Consider \(y_i = kx_i\), \(i = 1, 2, \ldots, N\). You have seen before that

\[
\sigma_y^2 = k^2 \sigma_x^2
\]

Then, \(\sigma_y^2\) is greater than \(\sigma_x^2\) only when \(k^2 > 1\), i.e., \(|k| > 1\). So, the given statement is false (as we are able to find a counterexample).

9. 

Below is the distribution of a variable X based on a sample of 40 observations. Compute the coefficient of variation.

X        Frequency
10–14    8
15–19    16
20–24    10
25–29    4
30–34    2

Solution: μ = 19, σ² = 28.5 and σ = 5.34. So,

\[
CV = \frac{\sigma}{\mu} = \frac{5.34}{19} = 0.28
\]
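The grouped-data computation can be replicated in Python. The class midpoints and frequencies are taken from the table above, with the 20–24 frequency equal to 10 so that the frequencies sum to n = 40:

```python
# Grouped-data mean, variance, and CV for the frequency table above.
from math import sqrt

midpoints = [12, 17, 22, 27, 32]
freqs = [8, 16, 10, 4, 2]
n = sum(freqs)                                                      # 40
mu = sum(f * m for f, m in zip(freqs, midpoints)) / n               # 19.0
var = sum(f * (m - mu) ** 2 for f, m in zip(freqs, midpoints)) / n  # 28.5
cv = sqrt(var) / mu
print(round(cv, 2))  # 0.28
```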
10. 

Consider the two populations of bivariate data:

Population 1          Population 2
  x      y              x      y
  2      2             2.9    3.8
  6      3             −1     −4
 10      4            −1.9   −5.8
  4     2.5             4      6
 −2      1              6     10

i. Find the covariance and correlation coefficient for each population.

ii. Plot the scatter plot for both populations.

iii. Standardize the x and y values in each population and plot the scatter plots for the standardized values.

Solution:

1.
To find the covariance we first find the mean of x and y values for both populations:
\[
\begin{aligned}
\mu_{1,x} &= \tfrac{1}{5}\,(2 + 6 + 10 + 4 + (-2)) = 4 \\
\mu_{1,y} &= \tfrac{1}{5}\,(2 + 3 + 4 + 2.5 + 1) = 2.5 \\
\mu_{2,x} &= \tfrac{1}{5}\,(2.9 + (-1) + (-1.9) + 4 + 6) = 2 \\
\mu_{2,y} &= \tfrac{1}{5}\,(3.8 + (-4) + (-5.8) + 6 + 10) = 2
\end{aligned}
\]

Thus

\[
\begin{aligned}
\operatorname{Cov}_1 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{1,i} - \mu_{1,x})(y_{1,i} - \mu_{1,y}) = 4 \\
\operatorname{Cov}_2 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{2,i} - \mu_{2,x})(y_{2,i} - \mu_{2,y}) = 18.008 \approx 18
\end{aligned}
\]

To find the correlation coefficient we first find the variances:

\[
\begin{aligned}
\sigma_{1,x}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{1,i} - \mu_{1,x})^2 = 16 \\
\sigma_{1,y}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (y_{1,i} - \mu_{1,y})^2 = 1 \\
\sigma_{2,x}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{2,i} - \mu_{2,x})^2 = 9.004 \approx 9 \\
\sigma_{2,y}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (y_{2,i} - \mu_{2,y})^2 = 36.016 \approx 36
\end{aligned}
\]

Thus

\[
\rho_1 = \frac{\operatorname{Cov}_1}{\sigma_{1,x}\sigma_{1,y}} = \frac{4}{4 \cdot 1} = 1,
\qquad
\rho_2 = \frac{\operatorname{Cov}_2}{\sigma_{2,x}\sigma_{2,y}} = \frac{18}{3 \cdot 6} = 1
\]
2.
Population 1 is represented with a “x” and population 2 with a “+” in the scatter plot below:

[Scatter plot of Population 1 ("x" markers) and Population 2 ("+" markers)]

3.
Standardized values are obtained by subtracting the corresponding mean from each value and dividing the result by the standard deviation:
Population 1          Population 2
  x      y              x      y
−0.5   −0.5            0.3    0.3
 0.5    0.5            −1     −1
 1.5    1.5           −1.3   −1.3
  0      0             0.6    0.6
−1.5   −1.5            1.3    1.3

The standardized values are plotted below:

[Scatter plot of the standardized values of both populations]

Note that even though the original populations were on lines with different slopes, the standardized values are on a line with slope 1.
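A short Python sketch of the standardization step for Population 1, confirming that the standardized x and y values coincide (which is why the points fall on the 45-degree line):

```python
# Standardizing Population 1 from the exercise above.
from math import sqrt

def standardize(v):
    n = len(v)
    mu = sum(v) / n
    sd = sqrt(sum((vi - mu) ** 2 for vi in v) / n)  # population formula
    return [(vi - mu) / sd for vi in v]

x1, y1 = [2, 6, 10, 4, -2], [2, 3, 4, 2.5, 1]
zx, zy = standardize(x1), standardize(y1)
print([round(a, 1) for a in zx])  # [-0.5, 0.5, 1.5, 0.0, -1.5]
print(all(abs(a - b) < 1e-9 for a, b in zip(zx, zy)))  # True
```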

Picking the appropriate statistic or visual
∙  Nominal data: Mode, Bar chart, Column chart

∙  Ordinal data: Median, Mode, Bar chart, Column chart

∙  Interval or Ratio data:

   In a nutshell:

   – To describe center: Mean, Median, Mode, (Midrange), (Geometric Mean), (Midhinge), Histogram, Box plot

   – To describe variability: Range, IQR, Standard deviation, CV, (z-values), Histogram, Box plot

   – To describe shape: Mean vs Median, Skewness, (Kurtosis), Histogram, Box plot

Checkpoint No: 22

1.9 Chebyshev’s theorem (Chebyshev’s inequality)

For any data set with a mean of μ and a variance of σ², and any k > 1, at least

\[
\left(1 - \frac{1}{k^2}\right) \cdot 100\%
\]

of the observations will take a value in the interval [μ − kσ, μ + kσ].
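The theorem is easy to check empirically. The data set below is made up; for k = 2 the bound guarantees at least 75% of observations inside the interval:

```python
# Empirical check of Chebyshev's bound on an arbitrary data set.
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9, 1, 9]
N = len(data)
mu = sum(data) / N                                   # 5.0
sigma = sqrt(sum((x - mu) ** 2 for x in data) / N)   # population sd
k = 2.0
inside = sum(1 for x in data if mu - k * sigma <= x <= mu + k * sigma) / N
bound = 1 - 1 / k ** 2  # at least 75% for k = 2
print(inside >= bound)  # True
```

Here every observation happens to fall inside the interval, which is consistent with (but stronger than) Chebyshev's guarantee.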

1.3 EXERCISES

1. 

Consider a population with a mean of 4 and variance of 36. Using Chebyshev’s theorem find an interval that contains at least 70% of the observations.

Solution: μ = 4 and σ2 = 36, so σ = 6.

\[
\begin{aligned}
\left(1 - \frac{1}{k^2}\right) 100\% &= 70\% \\
1 - \frac{1}{k^2} &= 0.70 \\
\frac{1}{k^2} &= 0.30 \\
k^2 &= \frac{1}{0.30} \\
k &= 1.83
\end{aligned}
\]

So, the requested interval is:

\[
[\mu - k\sigma,\; \mu + k\sigma] = [4 - 6(1.83),\; 4 + 6(1.83)] \approx [-6.95,\; 14.95]
\]
2. 

The monthly charges for credit card holders at a department store have a mean of $250 and a standard deviation of $100. Use Chebyshev’s theorem to answer the following questions:

i. What can you say, for sure, about the percentage of card holders who will have monthly charges between $100 and $400?

ii. Provide a range for the credit card charges that will include at least 80% of all credit card customers.

Solution:

1.
Use the same approach. For the interval [100, 400], k = 1.5. Reveal why. Then, this interval contains at least
\[
\left(1 - \frac{1}{1.5^2}\right) \cdot 100\% = (100 - 44)\% = 56\%
\]

of all observations.

2.
[26.4, 473.6] is the answer.
3. 

In a stock exchange, the average return over a year turns out to be 1% with a standard deviation of 2%. Over the same year, the average exam grade in a university is 60 points with a standard deviation of 24 points. Which one has higher variability, the returns or the grades?

Solution: CV for returns is 2%/1% = 2 and CV for grades is 24 points/60 points = 0.4. So, returns have a higher variability.

4. 

When we replace a positive data value with its "additive inverse" in a data set, variance increases. Is this claim true or false? Either prove that it is true, or provide a counter example to show that it is false. Make sure you have used a formal mathematical notation.

Solution: Consider {5, 7} and {−5, 7}. Which pair of values has higher variance? Then, come up with a conclusion.

Checkpoint No: 23

1.10 Adding and multiplying terms over an index

If you have N numbers \(x_1, x_2, \ldots, x_N\), the sum of these numbers, S, is:

\[
S = x_1 + x_2 + \ldots + x_N
\]

In expressing sums like this, we always write the first two terms, then three periods, then the last term. A shorter way to write S is:

\[
S = \sum_{i=1}^{N} x_i
\]

where i is the index of x, running from 1 to N.

Unless otherwise specified, i increases by 1 each time, from 1 to N. So, \(S = \sum_{i=1}^{N} x_i\) is read as "sum of \(x_i\), i from 1 to N".

For example,

\[
S = \Sigma_{i=1}^{4} x_i = \sum_{i=1}^{4} x_i = \sum_{i=1}^{i=4} x_i = x_1 + x_2 + x_3 + x_4
\]

If N is a number well-known in a problem:

\[
S = \sum x_i
\]

is a valid expression and it means "sum over all \(x_i\)".

Notice that

\[
\sum_{i=1}^{N} x_i = \left(\sum_{i=1}^{N-1} x_i\right) + x_N = \left(\sum_{i=1}^{N-2} x_i\right) + x_{N-1} + x_N
\]

and so forth.

In case we want to write

\[
x_1 + x_3 + \ldots + x_{2k-1}
\]

using our summation operator, we can write it as:

\[
\sum_{i=1}^{k} x_{2i-1}
\]

As you have seen, using the index i wisely solves many problems.

Consider:

\[
S = 2 \cdot x_1 + 2 \cdot x_2 + \ldots + 2 \cdot x_N
\]

which is equivalent to

\[
\sum_{i=1}^{N} 2x_i
\]

S, then, can be written as

\[
S = 2 \sum_{i=1}^{N} x_i
\]

So, if each number in the sequence x1, x2, ..., xN is multiplied by the same value which is not a function of i, this value can be taken out of the summation sign Σ.

Consider:

\[
S = x_1 + y_1 + x_2 + y_2 + \ldots + x_N + y_N
\]

Notice that this is the same thing as:

\[
\sum_{i=1}^{N} (x_i + y_i) = \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i
\]

Consider:

\[
S = \sum_{i=1}^{N} x_i y_i = x_1 y_1 + x_2 y_2 + \ldots + x_N y_N
\]

Notice that,

\[
S = \sum_{i=1}^{N} x_i y_i \neq \left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)
\]

Sum of the products is not equal to product of the sums. Expand (write long) the expressions to see why not.
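A tiny numerical example makes the point; the numbers are arbitrary:

```python
# Sum of products vs product of sums (arbitrary numbers).
x = [1, 2, 3]
y = [4, 5, 6]
sum_of_products = sum(a * b for a, b in zip(x, y))  # 1*4 + 2*5 + 3*6 = 32
product_of_sums = sum(x) * sum(y)                   # 6 * 15 = 90
print(sum_of_products, product_of_sums)             # 32 90
```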

Consider:

\[
S = x_1 + 2x_2 + 3x_3 + \ldots + Nx_N
\]

How can we write this in short?

\[
S = \sum_{i=1}^{N} i\,x_i
\]

Notice that:

\[
S = \sum_{i=1}^{N} i\,x_i \neq \left(\sum_{i=1}^{N} i\right)\left(\sum_{i=1}^{N} x_i\right)
\]

Consider:

\[
S = \sum_{i=1}^{N} (x_i \bar{y} + y_i \bar{x})
\]

Since \(\bar{x}\) and \(\bar{y}\) are not indexed with i, the expression is equal to:

\[
\bar{y} \sum_{i=1}^{N} x_i + \bar{x} \sum_{i=1}^{N} y_i
\]

Consider:

\[
\sum_{i=1}^{N} x_i^2.
\]

Notice that

\[
\sum_{i=1}^{N} x_i^2 \neq \left(\sum_{i=1}^{N} x_i\right)^2
\]

Sum of the squares is not equal to square of the sum.

If you have N numbers \(x_1, x_2, \ldots, x_N\), the product of these numbers, P, is:

\[
P = x_1 \cdot x_2 \cdots x_N
\]

In expressing products like this, we always write the first two terms, then three periods, then the last term. A shorter way to write P is:

\[
P = \prod_{i=1}^{N} x_i
\]

where i is the index of x, running from 1 to N.

Consider:

\[
P = \prod_{i=1}^{N} i
\]

What’s this?

It is nothing but \(P = \prod_{i=1}^{N} x_i\) with \(x_i = i\). So P equals:

\[
1 \cdot 2 \cdot 3 \cdots N = N!
\]
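In Python, this is a one-line check (N = 6 is arbitrary):

```python
# Product over the index equals the factorial.
from math import factorial, prod

N = 6
P = prod(range(1, N + 1))  # 1*2*...*6
print(P, factorial(N))     # 720 720
```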

Regarding our future purposes, an important property to remember is:

\[
\ln \prod_{i=1}^{N} x_i = \sum_{i=1}^{N} \ln x_i
\]

as we’ll use while writing Likelihood functions in ECON 222.
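This identity is easy to verify numerically; the values below are arbitrary positive numbers:

```python
# log of a product equals the sum of logs -- the identity behind log-likelihoods.
from math import log, prod, isclose

x = [0.5, 2.0, 4.0, 0.25]
lhs = log(prod(x))             # log(1.0) = 0.0
rhs = sum(log(v) for v in x)   # sum of logs, equal up to rounding
print(isclose(lhs, rhs, abs_tol=1e-12))  # True
```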

Finally, consider:

\[
\sum_{i=0}^{N} \binom{N}{i} x^{N-i} y^{i}
\]

What’s this? Expand it to see:

\[
= \binom{N}{0} x^{N-0} y^0 + \binom{N}{1} x^{N-1} y^1 + \ldots + \binom{N}{N} x^{N-N} y^N
\]

which is nothing but the binomial expansion

\[
(x + y)^N
\]

as we’ll use while studying the Binomial and Poisson distributions in ECON 221.
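A numerical check of the expansion, with arbitrary x, y and N, using Python's `math.comb`:

```python
# Verifying the binomial expansion numerically.
from math import comb

x, y, N = 3, 2, 5
expansion = sum(comb(N, i) * x ** (N - i) * y ** i for i in range(N + 1))
print(expansion, (x + y) ** N)  # 3125 3125
```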

1.4 EXERCISES

1. 

Consider the expression for the population variance:

\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\]

and simplify it until you see:

\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2
\]

Solution:

\[
\begin{aligned}
\sigma^2 &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \\
&= \frac{1}{N} \sum \left(x_i^2 - 2\mu x_i + \mu^2\right) \\
&= \frac{1}{N} \sum x_i^2 - \frac{2\mu}{N} \sum x_i + \frac{1}{N} \sum \mu^2 \\
&= \frac{1}{N} \sum x_i^2 - 2\mu^2 + \frac{1}{N} N\mu^2 \\
&= \frac{1}{N} \sum x_i^2 - \mu^2.
\end{aligned}
\]

So, variance can be calculated by subtracting ’the square of the mean of observations’ from ’the mean of squared observations’.
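The shortcut can be verified on any small data set; the values below are arbitrary:

```python
# Checking sigma^2 = (mean of squares) - (square of the mean) on made-up data.
x = [2.0, 4.0, 6.0, 8.0]
N = len(x)
mu = sum(x) / N                                   # 5.0
direct = sum((xi - mu) ** 2 for xi in x) / N      # definition of variance
shortcut = sum(xi ** 2 for xi in x) / N - mu ** 2
print(direct, shortcut)  # 5.0 5.0
```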

Checkpoint No: 24