Chapter 1
Describing data

As discussed in the opening lectures, an analytical and evidence-based approach to policymaking is a must for modern, complex societies. As the famous statistician W. Edwards Deming put it, "Without data, you're just another person with an opinion." Data and its analysis are what distinguish well-designed policies from arbitrary and/or flawed ones. So, understanding the features and structure of data is an indispensable step of analysis.

1.1 A taxonomy of data types

Practically every sort of empirical analysis begins with a need to describe a data set and its various elements. So, we begin our journey into probability theory and statistics with this simple yet crucial task. Recall from the class discussion that data may come in two main forms: qualitative and quantitative. While qualitative data qualifies 'things', quantitative data quantifies 'things', as the terms suggest. Qualitative data thus often have a categorical nature. If the values of a categorical variable are orderable (sortable), then this categorical variable is called an 'ordinal' categorical variable. Otherwise, it is a 'nominal' categorical variable. While the responses in a satisfaction survey are ordinal (consider 1: least liked to 5: most liked), indicators of gender are nominal (F: female and M: male). Note that it is not always trivial to come up with a judgment: while we can treat age categories of 'young, middle-aged and old' for people, or class/year categories of 'freshman, sophomore, junior and senior' for students, as 'ordinal', another researcher may choose to treat them as nominal. What matters is whether we can sort a categorical variable on a clear basis stripped of value judgments. For instance, one cannot simply rank one gender above the others, regardless of the underlying way of thinking.

Quantitative data is by definition numerical. It can be either discrete, as in the case of the number of automobiles owned by households (one cannot own a fractional automobile), or continuous, as in the case of daily spending by households. Household size, i.e., the number of people forming the household, is discrete; the number of cities in a country is discrete; and so on. The case of people's ages measured in years can be a little confusing: think about it.

Classifying data

[Figure: a tree classifying Data into Qualitative (Categorical): Nominal, Ordinal; and Quantitative (Numerical): Discrete, Continuous, with continuous data further split into Interval and Ratio]

An important point regarding continuous data is the distinction between 'interval data' and 'ratio data'. A simple rule of thumb is: if there is an 'absolute zero' among the possible values of a data series, it is 'ratio data'; in the absence of an 'absolute zero' it is named 'interval data'. A classic example of this is temperature measured on the Kelvin (K) versus the Celsius (C) scale. While the Kelvin scale has an absolute zero, i.e., 0K, the Celsius scale does not. The freezing point of pure water (under certain conditions) is 0C, yet this is not the lowest attainable temperature. Indeed, there are some 273.15 degrees more to go down until that point: −273.15C is defined as 0K, and it is the lowest possible temperature in the Universe. While 200K is two times 100K, 200C is not two times 100C.

An easier example of ratio data is the measurement of 'mass' (in kilograms, say). Mass has an absolute zero, which is 0 kg, and a 20 kg object is two times as heavy as a 10 kg object (assuming there is gravity).

1.2 What is a "data set"?

It is possible, and often necessary, to convert a data series 'from numerical to categorical'. For instance, age data measured in 'years lived' can be expressed in terms of the qualifiers 'young, middle-aged and old'. Note that this transformation results in some loss of information. Clearly, a numerical age series tells more about the people surveyed than a simple categorization. Still, when properly made, a good categorization of numerical values may prove very useful in statistical (or econometric) analysis.

Search & explore: Conversion 'from categorical to numerical' may not be so straightforward: work out your own answer.

Checkpoint No: 7

1.3 Frequency

In the Oxford English Dictionary, 'frequency' is defined as "the rate at which something occurs over a particular period of time or in a given sample". Our understanding covers the cases of 'being' in addition to 'occurring' or happening: frequency is the numerical measure of 'how often something happens or how often some specific way of being is observed'. Thus, just as we can count car accidents in a certain hour, we can also count the people who survived a certain accident. So, we can count 'things' in time (we can call this temporal counting) and in space (we can call this spatial counting).

In a nutshell: In our learning and practice of probability theory and statistics we will be 'counting the things', simply using our fingertips at the beginning, and using more sophisticated techniques later.

1.3.1 Frequency distribution

A frequency distribution is a tabular summary of how the values in a data series are distributed across classes.

First, determine the number of classes k, according to:

Number of observations     k
< 50                       5-7
50-100                     7-8
101-500                    8-10
501-1,000                  10-11
1,001-5,000                11-14
> 5,000                    14-20

The table gives a rule of thumb and often requires your professional judgment as well.

In a nutshell: A 'rule of thumb' is a broadly accurate guide or principle, based on practice rather than theory.

Second, determine the class width, w:

w = (Maximum − Minimum) / k

where Maximum is the 'largest observation' and Minimum is the 'smallest observation'. Always round the formula's result up to find w.

Third, construct the k classes; they are to be inclusive and non-overlapping.

Fourth, allocate your observations to classes and get the count of each class.

At the end, present your result as a table. What is obtained is a "frequency distribution table".

Consider the age data of 20 people (20 subjects) measured in years:

12  11 19  20  20
15  15 24  15  12
20  18 17  20  20
20  22 24  12  14

While summarizing the age data, it seems appropriate to use 5 classes following the rule of thumb given before. The max of our data series is 24 and the min is 11. Class width w, then, is calculated as:

w = (24 − 11) / 5 = 2.6; always rounding up, w = 3

Having calculated the class width, beginning from the min value (11 here) we establish our classes as [11, 14), [14, 17), [17, 20), [20, 23) and [23, 26]. Pay attention to openness and closedness of classes (intervals) on the left and on the right.

Once the classes are ready, we carefully count the data values falling into each interval and prepare the following table, a table that we call the ’frequency table’.

Class   Frequency
[11,14)      4
[14,17)      4
[17,20)      3
[20,23)      7
[23,26]      2
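The four-step procedure above can be sketched in Python (a minimal illustration; the function name and layout are my own, not part of the notes):

```python
import math

def frequency_distribution(data, k):
    """Build k inclusive, non-overlapping classes and count the
    observations falling into each one."""
    w = math.ceil((max(data) - min(data)) / k)  # class width, rounded up
    lo = min(data)
    edges = [lo + i * w for i in range(k + 1)]  # class boundaries
    counts = [0] * k
    for x in data:
        # classes are [edge, edge + w), except the last, which is closed
        counts[min((x - lo) // w, k - 1)] += 1
    return edges, counts

ages = [12, 11, 19, 20, 20, 15, 15, 24, 15, 12,
        20, 18, 17, 20, 20, 20, 22, 24, 12, 14]
edges, counts = frequency_distribution(ages, 5)
# edges  -> [11, 14, 17, 20, 23, 26]
# counts -> [4, 4, 3, 7, 2]
```

Reproducing the table this way is a quick check that the hand count is right.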

The final step is to prepare (draw) the histogram of our data.

[Histogram of the age data; bar edges on the x-axis at 11, 14, 17, 20, 23, 26]

Checkpoint No: 8

Consider another researcher who prefers arbitrarily to use 2 classes. In this case, the class width (w) will be:

w = (24 − 11) / 2 = 6.5; always rounding up, w = 7

The classes will be [11, 18) and [18, 25], so our resulting frequency table will look like:

Class      Frequency
[11, 18)       9
[18, 25]      11

The final step is again to prepare (draw) the histogram of our data.

[Histogram of the two-class summary; bar edges on the x-axis at 11, 18, 25]

Which histogram (or frequency table) gives a better summary of the data? Avoid any confusion: the first histogram is the winner of the contest. It summarizes our data and conveys a tangible message. The second histogram, on the other hand, suffers from 'oversummarizing'. Now take our discussion to its limit and consider a third researcher who prefers to use 1 class only. Why would that be nonsense?

In a nutshell: In order to summarize numerical (quantitative) data we use 'frequency tables' and 'histograms'. The columns belonging to consecutive nonempty classes must touch each other while drawing a histogram.

What about categorical (qualitative) data? Consider the following data series, which consists of category markings for 20 people (20 subjects), where Y, M and O stand for 'young', 'middle-aged' and 'old', respectively.

Y   Y   O  M   Y
O   Y   O  O   Y
M   M   Y  M   O
O   M   Y  Y   M

This time, forming a frequency table must be easier: we do not (indeed, we cannot) establish classes & simply count the frequency of each category:

Category   Frequency
Y              8
M              6
O              6

The final step is to prepare (draw) the bar chart of our data. It is more than trivial; do it yourself.
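Counting categories needs no class construction; this can be sketched with Python's standard Counter (an illustrative sketch, not part of the notes):

```python
from collections import Counter

labels = ["Y", "Y", "O", "M", "Y",  "O", "Y", "O", "O", "Y",
          "M", "M", "Y", "M", "O",  "O", "M", "Y", "Y", "M"]
freq = Counter(labels)   # frequency of each category
# freq["Y"] -> 8, freq["M"] -> 6, freq["O"] -> 6
```

The same counts feed directly into the bar chart.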

In a nutshell: In order to summarize categorical (qualitative) data we use 'frequency tables' and 'bar charts'. The bars never touch each other while drawing a bar chart.

Checkpoint No: 9

1.3.2 Relative frequency distribution

Once the counts in a frequency distribution table are divided by the total number of observations & expressed as "percentages" or as ‘fractions between 0 and 1’, the resulting table is called a "relative frequency distribution table". By construction, relative frequencies of all classes add up to 100% or 1.

1.3.3 Cumulative frequency distribution

Once the frequencies (counts) in a frequency distribution table are accumulated across classes, one row at a time and from the smallest to largest class, the resulting table is called a ‘cumulative frequency distribution table’.

1.3.4 Relative cumulative frequency distribution

Once the relative frequencies in a relative frequency distribution table are accumulated across classes, one row at a time and from the smallest to largest class, the resulting table is called a ‘relative cumulative frequency distribution table’.

In order to see the linkages between ‘frequency’, ‘relative frequency’, ‘cumulative frequency’ and ‘cumulative relative frequency’, examine the following table:

Class      Frequency   Rel. freq.   Cum. freq.   Cum. rel. freq.
[10, 17)      500        0.333         500           0.333
[17, 24)      250        0.167         750           0.500
[24, 31)      150        0.100         900           0.600
[31, 38]      600        0.400        1500           1.000
Total        1500        1.000        N.A.           N.A.
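The four columns of the table are mechanically linked; a short sketch (values taken from the table above, variable names my own):

```python
from itertools import accumulate

freqs = [500, 250, 150, 600]       # class frequencies
n = sum(freqs)                     # 1500 observations in total
rel = [f / n for f in freqs]       # relative frequencies
cum = list(accumulate(freqs))      # cumulative frequencies
cum_rel = [c / n for c in cum]     # cumulative relative frequencies
# cum -> [500, 750, 900, 1500]; cum_rel rounds to [0.333, 0.5, 0.6, 1.0]
```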

1.4 Representation of distributions

1.4.1 Histogram and relative frequency polygon

A histogram is a graph that consists of vertical bars constructed on a horizontal line on which intervals are marked for the variable being displayed.

Histograms are traditionally used for continuous numerical data. When the midpoints of the top segment of each bar in a histogram are connected with line segments, what we obtain is called a frequency polygon. Note that a ‘bar chart’ resembles a histogram yet it differs in two main aspects: first, it is for categorical data & second, the bars in a bar chart are separated by a visible gap. Examples are provided in the upcoming exercises.

1.4.2 Symmetry and skewness

The shape of a distribution is said to be symmetric if the observations are balanced, or approximately evenly distributed, about its center. A distribution is skewed, or asymmetric, if the observations are not symmetrically distributed on either side of the center. A skewed-right distribution (sometimes called positively skewed) has a tail that extends farther to the right. A skewed-left distribution (sometimes called negatively skewed) has a tail that extends farther to the left.

Checkpoint No: 10

1.4.3 O-give

An O-give, also called a cumulative line graph, is a line that connects the points whose heights are the cumulative percentages of observations below the upper limit of each class (interval) in a cumulative frequency distribution. Even when not stated, an O-give presents cumulative percentage figures. The beginning vertical value in an O-give is always 0 and the ending vertical value is 100% (equivalently, 1). Examples are provided in the upcoming exercises.

1.1 EXERCISES____________________________________________     

1. 

Fill the empty cells in the following table:

Interval     Freq.   Rel. freq. (%)   Cum. freq.   Rel. cum. freq. (%)
[0, 20]       20          10
(20, 40]                                   80
(40, 60]      30
(60, 80]
(80, 100]                 20                                 80
(100, 120]

Solution: The complete table is as follows:

Interval     Freq.   Rel. freq. (%)   Cum. freq.   Rel. cum. freq. (%)
[0, 20]       20          10               20               10
(20, 40]      60          30               80               40
(40, 60]      30          15              110               55
(60, 80]      10           5              120               60
(80, 100]     40          20              160               80
(100, 120]    40          20              200              100
2. 

Consider the frequency histogram displayed below:

[Frequency histogram; bar edges on the x-axis at 1.5, 3, 4.5, 6, 7.5, 9]

Draw the corresponding (relative frequency) o-give.

Solution: Prepare your Cartesian plane. The origin is (0, 0). Mark the following points on your graph space by paying attention to proportions: (0, 0), (1.5, 0.22), (3.0, 0.57), (4.5, 0.68), (6.0, 0.73), (7.5, 0.95), (9.0, 1.00). Then, connect these points with line segments from left to right. Once this is done, you will observe a properly drawn O-give. Make sure you have named the axes.

3. 

Consider the relative frequency o-give displayed below:

[Relative cumulative frequency o-give; x-axis marks at 2, 4, 6, 8, 10; y-axis: Rel. Cum. Freq. (%)]

i. Draw the corresponding histogram.

ii. What can you say about the percentage of observations that takes a value less than or equal to 6.5 (if you need to estimate it what would be a reasonable estimate)?

iii. What can you say about the percentage of observations that takes a value greater than or equal to 4.3 (if you need to estimate it what would be a reasonable estimate)?

Solution:

i. Consider the classes [0, 2], (2, 4], (4, 6], (6, 8] and (8, 10]. Taking simple differences reveals that the relative frequencies of these classes are 0, 0.3, 0, 0.6 and 0.1. Locate these numbers on the Cartesian plane to obtain the histogram. Recall that for two successive classes which are non-empty, the bars must touch each other.

ii. The relative frequency of (6, 8] is 0.6, and (6.5 − 6)/(8 − 6) = 0.25. So, approximately 0.25 × 0.6, i.e., 0.15 is the relative frequency of (6, 6.5]. Since the relative frequency of [0, 6] is 0.3, the relative frequency of [0, 6.5] becomes 0.45. If the given O-give has been drawn properly, then we expect our estimate to be reliable.

iii. Use the same approach. The solution should yield 0.70.
4. 

Consider the relative frequency o-give displayed below:

[Relative cumulative frequency o-give; y-axis: Rel. Cum. Freq.]

What can you say about the percentage of observations that takes a value between 4.5 and 9.5?

Solution: Use the same approach. 0.4/4 + 0.1 + 0 + 0.3/4 yields 0.275.

5. 

A doctor’s office staff studied the waiting times for patients who arrive at the office with a request for emergency service. The following data with waiting times in minutes were collected over a one-month period:

2, 5, 10, 12, 4, 4, 5, 17, 11, 8, 9, 8, 12, 21, 6, 8, 7, 13, 18, 117

i. Construct a histogram for this data set by including all observations.

ii. Construct a histogram for this data set after excluding the value of 117. Note that you still have to show it on your histogram (but how?)

iii. Which histogram is more informative? Why?

Solution:

i. For ease in processing the data, sort/order the values as 2, 4, 4, 5, 5, 6, 7, 8, 8, 8, 9, 10, 11, 12, 12, 13, 17, 18, 21 and 117. The minimum is 2, the maximum is 117 and N is 20. 5 classes would work well here. The class width then becomes (117 − 2)/5 = 23; rounding up (always) gives 24. This means that our classes will be [2, 26], (26, 50], (50, 74], (74, 98] and (98, 122]. When drawn (do it), this will turn out to be a funny histogram, as 19 observations fall into the first class and only one observation (117) falls into the last one.

ii. When 117 is kept aside, the maximum becomes 21. Using 5 classes again, the class width becomes (21 − 2)/5 = 3.8; rounding up (always) gives 4. Our classes will be [2, 6], (6, 10], (10, 14], (14, 18] and (18, 22]. The respective frequencies of these are 6, 6, 4, 2 and 1. When drawn, we will observe a neatly drawn histogram. (What about 117?)

iii. The second histogram is more informative. It gives us more details, like the shape of the distribution.
6. 

Consider the relative frequency o-give displayed below:

[Relative cumulative frequency o-give; y-axis: Rel. Cum. Freq. (%)]

Based on the information above estimate the median and the 3rd quartile (Q3).

Solution: Hint: For the median, find the horizontal value at which the O-give has the value of 50. For Q3, find the horizontal value at which the O-give has the value of 75.

7. 

In a data set, the relative frequency of the interval (0, 10] is 0.10, the relative frequency of (10, 20] is 0.20, the relative frequency of (20, 30] is 0.30 and the relative frequency of (30, 40] is 0.40. Construct the relative frequency O-give and calculate Q3 for this data set.

Solution: Solving this must be straightforward now. Do it and discuss with classmates.

Checkpoint No: 11

1.5 Measures of central tendency

Measures of central tendency, or measures of concentration, indicate 'where' on the real number line our data series is located.

As you'll see in the upcoming classes, this knowledge is critically important for making several statistical assessments.

1.5.1 Measures of central tendency: Mode

The "mode", whenever it exists, is the most frequently occurring value in a data series.

Note that the mode is commonly used with (but not restricted to) categorical data.

Consider,

X : 1,2,3,3,3,4,4,5,6,7

where N = 10. Among the values of X, the most frequent (most repeated) value is 3, so we say Mode = 3. If X included another 4, like:

X : 1,2,3,3,3,4,4,4,5,6,7

where N = 11, then we would say the Modes are 3 and 4.
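The counting logic behind the mode can be sketched as follows (illustrative code and function name, not from the notes):

```python
from collections import Counter

def modes(data):
    """Return every value that attains the maximum frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

modes([1, 2, 3, 3, 3, 4, 4, 5, 6, 7])     # -> [3]
modes([1, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7])  # -> [3, 4]
```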

1.5.2 Measures of central tendency: Mean (Arithmetic mean)

For {x_i}, i = 1, …, N:

μ = ∑_{i=1}^{N} x_i / N

is called the "population mean".

In addition, for {x_i}, i = 1, …, n:

x̄ = ∑_{i=1}^{n} x_i / n

is called the 'sample mean'.

Considering,

X : 1,2,3,3,3,4,4,4,5,6,7

where N = 11, the mean is calculated as:

μ = ∑_{i=1}^{N} x_i / N
  = (1 + 2 + 3 + 3 + 3 + 4 + 4 + 4 + 5 + 6 + 7) / 11
  ≈ 3.82

In another case, suppose the last value of X, i.e., 7, is replaced by 42; let's call this new data series X′:

X′ : 1,2,3,3,3,4,4,4,5,6,42

where N = 11 again, the mean becomes:

μ = ∑_{i=1}^{N} x_i / N
  = (1 + 2 + 3 + 3 + 3 + 4 + 4 + 4 + 5 + 6 + 42) / 11
  = 7

As this example suggests, mean (μ) is sensitive to outliers/extreme values. However, this sensitivity does not imply that μ is a meaningless or a useless measure. On the contrary, it is a fundamental measure with many good statistical properties, as we will see in the upcoming chapters.
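The sensitivity is easy to verify numerically (a small sketch; Xp stands for the series X′ with the outlier):

```python
X  = [1, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7]
Xp = X[:-1] + [42]            # replace the last value, 7, by 42

mean_X  = sum(X)  / len(X)    # 42/11, about 3.82
mean_Xp = sum(Xp) / len(Xp)   # 77/11 = 7.0: one outlier nearly doubles it
```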

Checkpoint No: 12

In a nutshell: Writing mathematics
Good mathematical writing involves:

∙ Using a relevant & consistent notation

∙ Flowing logically through the solution or proof steps

∙ Including verbal explanations & necessary definitions between steps

∙ Putting things gravitationally, i.e., top to bottom

∙ Keeping only the essentials, removing everything redundant, avoiding leftover scratch work

Checkpoint No: 13

In a nutshell: Working with grouped data
In many situations, we the researchers are provided with a grouped summary of a data set, rather than the full data itself. Grouped data sets mostly come in the form of classes with their corresponding frequencies or relative frequencies, like in a histogram. Given population data of N observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

μ = ∑_{i=1}^{K} f_i m_i / N

can be written, where

∑_{i=1}^{K} f_i = N

Given sample data of n observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

x̄ = ∑_{i=1}^{K} f_i m_i / n

can be written, where

∑_{i=1}^{K} f_i = n

Checkpoint No: 14

Consider,

X          Frequency   Relative frequency
[11, 14)       4            4/20
[14, 17)       4            4/20
[17, 20)       3            3/20
[20, 23)       7            7/20
[23, 26]       2            2/20

Using the midpoints and frequencies of classes:

μ = [ (11+14)/2 · 4 + (14+17)/2 · 4 + (17+20)/2 · 3 + (20+23)/2 · 7 + (23+26)/2 · 2 ] / (4 + 4 + 3 + 7 + 2)
  = 18.35

is obtained.

Equivalently, one may use the relative frequencies of classes to calculate the same:

μ = (11+14)/2 · 4/20 + (14+17)/2 · 4/20 + (17+20)/2 · 3/20 + (20+23)/2 · 7/20 + (23+26)/2 · 2/20
  = 18.35
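The grouped-mean formula translates directly into code (a sketch; midpoints and frequencies taken from the table above):

```python
mids  = [12.5, 15.5, 18.5, 21.5, 24.5]  # class midpoints, e.g. (11+14)/2
freqs = [4, 4, 3, 7, 2]                 # class frequencies

N  = sum(freqs)                          # 20 observations
mu = sum(f * m for f, m in zip(freqs, mids)) / N
# mu -> 18.35, matching the hand calculation
```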

Checkpoint No: 15

1.5.3 Measures of central tendency: Percentiles, Deciles and Quartiles

The "p-th" percentile in a data series is the smallest value which is greater than p% of observations. If there are N observations, we find the

(p/100) · (N + 1)-th

ordered position and read the observation at this position as the p-th percentile.

In a nutshell: Order (sort) the observations in ascending order first. Without ordering the data, you'll not get correct results.

The "d-th" decile in a data series is the smallest value which is greater than d tenths of observations. The "q-th" quartile in a data series is the smallest value which is greater than q quarters of observations.

We imagine slicing our ordered data set into 100 parts when finding percentiles; we slice it into 10 when finding deciles, and into 4 when finding quartiles.

By definition P0, Q0, D0 are equal to the minimum observation and P100, Q4, D10 are equal to the maximum observation.

0th percentile      0th decile      0th quartile (Q0)    Minimum
10th percentile     1st decile
20th percentile     2nd decile
25th percentile                     1st quartile (Q1)
30th percentile     3rd decile
40th percentile     4th decile
50th percentile     5th decile      2nd quartile (Q2)    Median
60th percentile     6th decile
70th percentile     7th decile
75th percentile                     3rd quartile (Q3)
80th percentile     8th decile
90th percentile     9th decile
100th percentile    10th decile     4th quartile (Q4)    Maximum

1.5.4 Measures of central tendency: Median

Among the many percentiles, the "Median" has a special place. The "Median" of a data series is the smallest value which is greater than 50% of observations. It simply divides a data series into two equal halves. Numerically, the median is nothing but Q2, P50 and D5, which are all the same.

Consider now a variable X,

Position:  1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  11th
Value:      1    1    2    3    5    8   13   21   34    55    89

To find, for instance, the 30th percentile of X we calculate:

(p/100) · (N + 1) = (30/100) · (11 + 1) = 3.6, i.e., the 3.6th position

Then, we find the observation value in the 3.6th position of the ordered data series. As seen here, there may not be such a physical position in data. As an approximation, we take the value of X in the 3rd position and add 0.6 times the difference between the value in 4th position and the value in 3rd position, i.e.,

P30 = 2 + (3 − 2) · 0.6 = 2.6

is our 30th percentile.

To find the 80th percentile of X we calculate:

(p/100) · (N + 1) = (80/100) · (11 + 1) = 9.6, i.e., the 9.6th position

P80 = 34 + (55 − 34) · 0.6 = 46.6

So, 46.6 is our 80th percentile.

To find the Median, i.e., the 50th percentile of X we calculate:

(p/100) · (N + 1) = (50/100) · (11 + 1) = 6, i.e., the 6th position

Without further calculations, 8 is our Median.
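The position-then-interpolate recipe can be sketched as a small function (illustrative; the function name is mine, not from the notes):

```python
def percentile(data, p):
    """p-th percentile via the (p/100)(N + 1)-th ordered position,
    interpolating linearly when that position is fractional."""
    xs = sorted(data)                  # ordering first is essential
    pos = p / 100 * (len(xs) + 1)      # 1-indexed position
    k = int(pos)                       # integer part of the position
    frac = pos - k                     # fractional part
    if k == 0:
        return xs[0]
    if k >= len(xs):
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

X = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
percentile(X, 30)   # position 3.6 -> about 2.6
percentile(X, 80)   # position 9.6 -> about 46.6
percentile(X, 50)   # position 6   -> 8, the Median
```

Note that other software (spreadsheets, statistics packages) may use slightly different interpolation conventions.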
Consider another data series:

2   3   4   6   7
8   9   9  10  10
11  11 11  11  13

14  18 19  19  19
21  21 22  22  23
23  23 24  24  24
24  25 25  25  26
26  26 26  26  49

Solve yourself to see that the Median is the 20.5th value of this data series and it is equal to 20.
Before moving forward, consider finally:

Variable   1st  2nd  3rd  4th  5th  6th  7th  8th  9th    Mean
X           1    3    6   10   15   21   28   36  1000   124.4
X′          1    3    6   10   15   21   28   36    45    18.3

Did you notice anything?

In a nutshell: Unlike the mean (μ), the Median (Q2) is not sensitive to outliers/extreme values. Equivalently, we say the Median is robust to the presence of outliers/extreme values in a data series.

Checkpoint No: 16

1.5.5 Where is my data? Five-number summary

When we give the five descriptive measures

minimum ≤ Q1 ≤ median ≤ Q3 ≤ maximum

it is called a "five-number summary". This is a somewhat old but still useful tool to summarize data sets.
For a variable X given as:

2   3   4   6   7
8   9   9  10  10
11  11 11  11  13
14  18 19  19  19

21  21 22  22  23
23  23 24  24  24
24  25 25  25  26
26  26 26  26  49

the five-number summary is (2, 10.25, 20, 24, 49).
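Under the same positional method used for percentiles, the five-number summary of X can be reproduced in code (a sketch; the helper names are mine):

```python
def percentile(xs, p):
    """(p/100)(N + 1)-th ordered position with linear interpolation."""
    xs = sorted(xs)
    pos = p / 100 * (len(xs) + 1)
    k, frac = int(pos), pos - int(pos)
    if k == 0:
        return xs[0]
    if k >= len(xs):
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

def five_number_summary(xs):
    return (min(xs), percentile(xs, 25), percentile(xs, 50),
            percentile(xs, 75), max(xs))

X = [2, 3, 4, 6, 7, 8, 9, 9, 10, 10, 11, 11, 11, 11, 13,
     14, 18, 19, 19, 19, 21, 21, 22, 22, 23, 23, 23, 24,
     24, 24, 24, 25, 25, 25, 26, 26, 26, 26, 26, 49]
five_number_summary(X)   # -> (2, 10.25, 20.0, 24.0, 49)
```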
To make use of our knowledge gained up to this point, consider a data set that is summarized with the following relative frequency o-give:

[Relative cumulative frequency o-give passing through the points (40, 20%), (60, 70%), (80, 90%), (100, 90%), (120, 100%)]

Based on the information above estimate the median, the mean, and the 3rd quartile.

In order to reach a good solution, note that the median is the 50th percentile and the 3rd quartile is the 75th percentile. A relative cumulative frequency o-give tells us the percentage of observations that take a value less than or equal to a given number. Under the assumption that the data are uniformly distributed over each class interval, we can therefore use the o-give to estimate the median and the 3rd quartile. On the graph we mark the points that correspond to 50% and 75%.

[The same o-give with the 50% and 75% levels marked, meeting the curve at m and v, respectively]

From similarity of triangles we have:

(m − 40)/(50 − 20) = (60 − 40)/(70 − 20)

which yields

m = 40 + 30 · (20/50) = 52
          50

Similarly,

(v − 60)/(75 − 70) = (80 − 60)/(90 − 70)

yields

v = 60 + 5 · (20/20) = 65

We will use CM_l to denote the class mark of the l-th class interval (the center of the l-th interval), and RF_l to denote the relative frequency of the observations that take values in the l-th class interval.

The assumption that the data are uniformly distributed over each class interval ensures that the following formula yields a "reasonable" estimate for the mean:

RF_1 CM_1 + RF_2 CM_2 + RF_3 CM_3 + RF_4 CM_4 + RF_5 CM_5

Thus our estimate for the mean is

(0.2 − 0) · (0 + 40)/2 + (0.7 − 0.2) · (40 + 60)/2 + (0.9 − 0.7) · (60 + 80)/2 + (0.9 − 0.9) · (80 + 100)/2 + (1.0 − 0.9) · (100 + 120)/2 = 54
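The three graphical estimates can be reproduced numerically under the same uniformity assumption (a sketch; the knot list encodes the o-give as read in this example, and the function name is mine):

```python
# (upper class limit, cumulative %) pairs read off the o-give
knots = [(0, 0), (40, 20), (60, 70), (80, 90), (100, 90), (120, 100)]

def inverse_ogive(knots, level):
    """x-value at which the o-give reaches the given cumulative %."""
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if y0 <= level <= y1 and y0 != y1:
            # linear interpolation within the class, i.e. uniformity
            return x0 + (x1 - x0) * (level - y0) / (y1 - y0)
    raise ValueError("level outside the o-give range")

median = inverse_ogive(knots, 50)   # -> 52.0
q3     = inverse_ogive(knots, 75)   # -> 65.0

# mean estimate: sum of (relative frequency) x (class midpoint)
mean = sum((y1 - y0) / 100 * (x0 + x1) / 2
           for (x0, y0), (x1, y1) in zip(knots, knots[1:]))
# mean -> about 54
```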

Checkpoint No: 17

1.6 Measures of dispersion

Measures of dispersion, or measures of variation, indicate how 'spread out' on the real number line our data series is.

Without properly assessing dispersion, the knowledge of location means only a little.

1.6.1 Measures of dispersion: Range

Range = Largest observation − Smallest observation

or

Range = Max − Min

Range measures the length of the interval on the real number line spanned by our data set.

1.6.2 Measures of dispersion: Interquartile range

The interquartile range (IQR) is defined as:

IQR = Q3 − Q1

IQR measures the length of the interval on the real number line spanned by the "central 50%" of our data set.

Checkpoint No: 18

1.6.3 Measures of dispersion: Box-Whisker plots

The five-number summary presented as a graph is called a Box-Whisker plot. Sometimes, near outliers and far outliers can also be added while constructing these plots.

1.6.4 Measures of dispersion: Variance

For {x_i}, i = 1, …, N:

σ² = ∑_{i=1}^{N} (x_i − μ)² / N

is called the "population variance", and for {x_i}, i = 1, …, n:

s² = ∑_{i=1}^{n} (x_i − x̄)² / (n − 1)

is called the 'sample variance'.

In a nutshell: A practical way to calculate the population variance uses σ² = (1/N) ∑_{i=1}^{N} x_i² − μ²:
1. Calculate the mean of squares: (1/N) ∑_{i=1}^{N} x_i²
2. Calculate the square of the mean: μ² = ((1/N) ∑_{i=1}^{N} x_i)²
3. Subtract the second from the first

Consider:

1  3  6  10  15  21 28

For this series, μ = 12 (calculate yourself) and the variance is calculated as follows:

σ² = ∑_{i=1}^{N} (x_i − μ)² / N
   = [ (1−12)² + (3−12)² + (6−12)² + (10−12)² + (15−12)² + (21−12)² + (28−12)² ] / 7
   = 588/7
   = 84

A tabular approach may also be preferred:

i   x_i   x_i − μ   (x_i − μ)²
1     1     −11         121
2     3      −9          81
3     6      −6          36
4    10      −2           4
5    15       3           9
6    21       9          81
7    28      16         256
                 Total: 588

σ² = 588/7 = 84

Finally, one may calculate the sum of squares as 1596 and the mean as 12, and calculate the variance σ² as 1596/7 − 12² = 228 − 144, which is 84.
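Both routes to the variance are easy to check numerically (a sketch using the same seven values):

```python
X = [1, 3, 6, 10, 15, 21, 28]
N = len(X)
mu = sum(X) / N                                   # 12.0

# definition: mean squared deviation from the mean
var_def = sum((x - mu) ** 2 for x in X) / N       # 588/7 = 84.0

# shortcut: mean of squares minus square of the mean
var_short = sum(x * x for x in X) / N - mu ** 2   # 1596/7 - 144 = 84.0
```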

Checkpoint No: 19

1.6.5 Measures of dispersion: Standard deviation

For {x_i}, i = 1, …, N:

σ = √σ² = √( ∑_{i=1}^{N} (x_i − μ)² / N )

is called the "population standard deviation", and for {x_i}, i = 1, …, n:

s = √s² = √( ∑_{i=1}^{n} (x_i − x̄)² / (n − 1) )

is called the 'sample standard deviation'.

Consider the following data series:

11  23  58 13  21  34

55  89  14 42  33  37
76  10  98 71  59  72
58  44  18 16  76  51
9   46  17 71  12  86
57  46  36 87  50  25
12  13  93 19  64  18

31  78  11 51  42  29

We are now asked to describe this data series, including its mean, five-number summary, range, interquartile range and variance. For ease in calculating the positional measures (quartiles here), it is a good practice to order the observations from the smallest to the largest, i.e., in ascending order:

9   10  11 11  12  12
13  13  14 16  17  18
18  19  21 23  25  29
31  33  34 36  37  42
42  44  46 46  50  51

51  55  57 58  58  59
64  71  71 72  76  76
78  86  87 89  93  98

The required descriptive measures can then be computed from this ordered series.

For the same data series (call it X), the Box-Whisker plot looks like:

[Box-Whisker plot of X; axis marks at 20, 40, 60, 80, 100]

In a nutshell: Working with grouped data
Given population data of N observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

μ = ∑_{i=1}^{K} f_i m_i / N,  where ∑_{i=1}^{K} f_i = N

σ² = ∑_{i=1}^{K} f_i (m_i − μ)² / N

Given sample data of n observations grouped into K classes, with frequencies f_1, f_2, …, f_K, if the midpoints of these classes are m_1, m_2, …, m_K, then

x̄ = ∑_{i=1}^{K} f_i m_i / n,  where ∑_{i=1}^{K} f_i = n

s² = ∑_{i=1}^{K} f_i (m_i − x̄)² / (n − 1)
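Applying the grouped population formulas to the age data of Section 1.3.1 can be sketched as (midpoints and frequencies from that example):

```python
mids  = [12.5, 15.5, 18.5, 21.5, 24.5]   # class midpoints
freqs = [4, 4, 3, 7, 2]                  # class frequencies

N   = sum(freqs)
mu  = sum(f * m for f, m in zip(freqs, mids)) / N             # 18.35
var = sum(f * (m - mu) ** 2 for f, m in zip(freqs, mids)) / N
# var -> about 15.73 (population version)
```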

1.6.6 Measures of dispersion: Coefficient of variation

Population coefficient of variation:

cv = (σ/μ) · 100%, if μ ≠ 0

Sample coefficient of variation:

cv = (s/x̄) · 100%, if x̄ ≠ 0
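The coefficient of variation expresses the standard deviation as a percentage of the mean; a quick sketch with the series used above for the variance:

```python
import math

X = [1, 3, 6, 10, 15, 21, 28]
mu = sum(X) / len(X)                                        # 12.0
sigma = math.sqrt(sum((x - mu) ** 2 for x in X) / len(X))   # sqrt(84)
cv = sigma / mu * 100                                       # about 76.4 %
```

Being unitless, cv lets us compare the spread of series measured in different units.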

Checkpoint No: 20

1.7 Measures of association for bivariate data

When we deal with one variable in our analysis, it is a case with "univariate" data. When we are concerned with patterns of change of two variables together, it is a case involving "bivariate" data. In these lecture notes,

{x_i}, i = 1, …, n (for a variable X) and {y_i}, i = 1, …, n (for a variable Y)

indicate univariate data, but

{(x_i, y_i)}, i = 1, …, n

indicates bivariate data.

Notice that, bivariate data come in "pairs", so one cannot change the correspondence between x’s and y’s.

1.7.1 Measures of association for bivariate data: Covariance

Covariance is a measure of the linear relationship between two variables. For {(x_i, y_i)}, i = 1, …, N,

Cov(x, y) = σ_xy = ∑_{i=1}^{N} (x_i − μ_X)(y_i − μ_Y) / N

For {(x_i, y_i)}, i = 1, …, n,

Cov(x, y) = s_xy = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (n − 1)

1.7.2 Measures of association for bivariate data: Correlation

The correlation for {(x_i, y_i)}, i = 1, …, N is given by

ρ_xy = σ_xy / (σ_x σ_y)

and for {(x_i, y_i)}, i = 1, …, n,

r_xy = s_xy / (s_x s_y)

When |r| ≥ 2/√n, we say the (linear) relationship is strong enough (or significant). Notice that it is always the case that

−1 ≤ ρ_xy ≤ 1

−1 ≤ r_xy ≤ 1
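The population versions of these formulas can be sketched as follows (illustrative function names; a perfectly linear pair should give correlation 1):

```python
import math

def pop_cov(xs, ys):
    """Population covariance: average product of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def pop_corr(xs, ys):
    """Population correlation: covariance rescaled into [-1, 1]."""
    sx = math.sqrt(pop_cov(xs, xs))   # a variable's covariance with
    sy = math.sqrt(pop_cov(ys, ys))   # itself is its variance
    return pop_cov(xs, ys) / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # y = 2x: a perfect linear relationship
pop_corr(x, y)            # -> about 1.0
```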

Checkpoint No: 21

1.8 Issues of unit and scale

Although not paid enough attention by economists and business administration people, almost every data series comes with a unit and a scale. For instance, if my income is TRY 96,000, the unit is TRY (the international code for the Turkish lira) and the scale is not explicitly stated. If we write it as TRY 96K, the unit is again TRY and the scale is "thousands", so 96 means 96,000 here.

Consider \(\{(x_i, y_i)\}_{i=1}^{N}\) where x is body weight in kilograms (kg) and y is height in centimeters (cm). Then,

Measure            Unit
μx                 kg
μy                 cm
σx²                kg²
σx                 kg
σy²                cm²
σy                 cm
cvx = σx/μx        Unitless
cvy = σy/μy        Unitless
σxy                kg·cm
ρxy                Unitless

Similarly, quartiles, deciles and percentiles of a variable have the same unit as the variable; so do its range and interquartile range. As a rule of thumb, linear operators do not alter units.

Use of numerical scales is often a matter of practicality or convenience. Nobody likes to write 123,000,000,000,000 (except politicians) instead of writing 123 trillion or 123·10¹². One may need to learn two important practices of scaling numbers:

In this edition, these are left to readers as individual study.

1.2 EXERCISES

1. 

Consider a population with data values of:
5, 6, 3, 3, 6, 9, 10, 4, 10, 4

Compute the mean, range, standard deviation, median, and Q1.

Solution: μ = 6, Range = max − min = 10 − 3 = 7, σ = 2.61, Q2 = 5.5 and Q1 = 3.75.

2. 

Find the mean, median, mode(s), variance, range, 1st quartile, and the 80th percentile of the data given below:
9, 13, 6, 7, 8, 6, 6, 9, 13, 13

Solution: μ = 9, Q2 = 8.5, modes are 6 and 13, σ² = 8, Range = 7, Q1 = 6 and P80 = 13.

3. 

A population has a range of R and it consists of two observations only. Calculate the variance of this data set.

Solution: Let x1 and x2 be the only two observations, with x2 − x1 = R (suppose x1 < x2). Then x2 − μ = R/2 and x1 − μ = −R/2.

\[
\sigma^2 = \frac{1}{2}\left[(x_1 - \mu)^2 + (x_2 - \mu)^2\right]
= \frac{1}{2}\left[(-R/2)^2 + (R/2)^2\right]
= \frac{1}{2} \cdot \frac{R^2}{2}
\]

So \(\sigma^2 = R^2/4\).
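A quick numerical check of this result, with two arbitrary observations:

```python
# Numerical check of sigma^2 = R^2/4 for a two-point population (arbitrary values).
x1, x2 = 3.0, 11.0       # any two observations
R = x2 - x1              # range = 8.0
mu = (x1 + x2) / 2       # 7.0
var = ((x1 - mu) ** 2 + (x2 - mu) ** 2) / 2
print(var, R ** 2 / 4)   # 16.0 16.0
```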
4. 

A researcher argues that median equals the simple average of the first and third quartiles. By giving a numerical example, show that this is incorrect.

Solution: Find/make up your own example.

5. 

Let a and b be any given real numbers. Let \(x_1, x_2, \ldots, x_N\) and \(y_1, y_2, \ldots, y_N\) be two data sets such that, for each i = 1, 2, ..., N, \(y_i = ax_i + b\).

i. What is the relation between the mean of the y-values and the mean of x-values?

ii. What is the relation between the variance of the y-values and the variance of x-values?

Solution: Needs some careful and patient elaboration.

1.
\[
\begin{aligned}
\mu_y &= \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\sum (ax_i + b) \\
&= \frac{1}{N}\sum ax_i + \frac{1}{N}\sum b \\
&= a \cdot \frac{1}{N}\sum x_i + \frac{1}{N} N b \\
&= a\mu_x + b
\end{aligned}
\]

So \(\mu_y = a\mu_x + b\).

2.
\[
\begin{aligned}
\sigma_y^2 &= \frac{1}{N}\sum_{i=1}^{N} (y_i - \mu_y)^2 \\
&= \frac{1}{N}\sum (ax_i + b - a\mu_x - b)^2 \\
&= \frac{1}{N} a^2 \sum (x_i - \mu_x)^2 \\
&= a^2 \sigma_x^2
\end{aligned}
\]

So \(\sigma_y^2 = a^2 \sigma_x^2\).
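Both identities can be verified numerically; the data and the constants a, b below are arbitrary:

```python
# Checking mu_y = a*mu_x + b and var_y = a^2 * var_x on made-up data.
a, b = 3.0, 5.0
x = [1.0, 4.0, 7.0, 8.0]
y = [a * xi + b for xi in x]
N = len(x)
mu_x = sum(x) / N
mu_y = sum(y) / N
var_x = sum((xi - mu_x) ** 2 for xi in x) / N  # population variance
var_y = sum((yi - mu_y) ** 2 for yi in y) / N
print(mu_y == a * mu_x + b, var_y == a ** 2 * var_x)  # True True
```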
6. 

Consider bivariate data consisting of the 1st midterm and 2nd midterm grades of 216 students. It is known that the 1st midterm grade of each student is 8% less than his or her 2nd midterm grade. If the mean of the 2nd midterm grades is 64 and the variance is 9, what can you say about the correlation coefficient of this data?

Solution: Without any calculations we can say it is 1. Why?

7. 

If we remove a data point from a data series, variance decreases. True or false? Explain.

Solution: For a logical statement to be true, it must be true without any exceptions. Consider first {1, 5, 9} and second {1, 9}. Which set of values has a larger variance? What is your conclusion?

8. 

When we multiply each point in a data series by the same factor, the variance increases. True or false? Explain.

Solution: Consider \(y_i = kx_i\), \(i = 1, 2, \ldots, N\). You have seen before that

\[
\sigma_y^2 = k^2 \sigma_x^2
\]

Then, \(\sigma_y^2\) is greater than \(\sigma_x^2\) only when \(k^2 > 1\), i.e., \(|k| > 1\). So, the given statement is false (as we are able to find a counterexample).

9. 

Below is the distribution of a variable X based on a sample of 40 observations. Compute the coefficient of variation.

X        Frequency
10–14    8
15–19    16
20–24    10
25–29    4
30–34    2

Solution: μ = 19, σ² = 28.5 and σ = 5.34. So,

\[
CV = \frac{\sigma}{\mu} = \frac{5.34}{19} = 0.28
\]
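The grouped-data computation can be replicated in Python. The class midpoints and frequencies are taken from the table above, with the 20–24 frequency equal to 10 so that the frequencies sum to n = 40:

```python
# Grouped-data mean, variance, and CV for the frequency table above.
from math import sqrt

midpoints = [12, 17, 22, 27, 32]
freqs = [8, 16, 10, 4, 2]
n = sum(freqs)                                                      # 40
mu = sum(f * m for f, m in zip(freqs, midpoints)) / n               # 19.0
var = sum(f * (m - mu) ** 2 for f, m in zip(freqs, midpoints)) / n  # 28.5
cv = sqrt(var) / mu
print(round(cv, 2))  # 0.28
```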
10. 

Consider the two populations of bivariate data:

Population 1          Population 2
  x      y              x      y
  2      2             2.9    3.8
  6      3             −1     −4
 10      4            −1.9   −5.8
  4     2.5             4      6
 −2      1              6     10

i. Find the covariance and correlation coefficient for each population.

ii. Plot the scatter plot for both populations.

iii. Standardize the x and y values in each population and plot the scatter plots for the standardized values.

Solution:

1.
To find the covariance we first find the mean of x and y values for both populations:
\[
\begin{aligned}
\mu_{1,x} &= \tfrac{1}{5}\,(2 + 6 + 10 + 4 + (-2)) = 4 \\
\mu_{1,y} &= \tfrac{1}{5}\,(2 + 3 + 4 + 2.5 + 1) = 2.5 \\
\mu_{2,x} &= \tfrac{1}{5}\,(2.9 + (-1) + (-1.9) + 4 + 6) = 2 \\
\mu_{2,y} &= \tfrac{1}{5}\,(3.8 + (-4) + (-5.8) + 6 + 10) = 2
\end{aligned}
\]

Thus

\[
\begin{aligned}
\operatorname{Cov}_1 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{1,i} - \mu_{1,x})(y_{1,i} - \mu_{1,y}) = 4 \\
\operatorname{Cov}_2 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{2,i} - \mu_{2,x})(y_{2,i} - \mu_{2,y}) = 18.008 \approx 18
\end{aligned}
\]

To find the correlation coefficient we first find the variances:

\[
\begin{aligned}
\sigma_{1,x}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{1,i} - \mu_{1,x})^2 = 16 \\
\sigma_{1,y}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (y_{1,i} - \mu_{1,y})^2 = 1 \\
\sigma_{2,x}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (x_{2,i} - \mu_{2,x})^2 = 9.004 \approx 9 \\
\sigma_{2,y}^2 &= \tfrac{1}{5} \sum_{i=1}^{5} (y_{2,i} - \mu_{2,y})^2 = 36.016 \approx 36
\end{aligned}
\]

Thus

\[
\rho_1 = \frac{\operatorname{Cov}_1}{\sigma_{1,x}\sigma_{1,y}} = \frac{4}{4 \cdot 1} = 1,
\qquad
\rho_2 = \frac{\operatorname{Cov}_2}{\sigma_{2,x}\sigma_{2,y}} = \frac{18}{3 \cdot 6} = 1
\]
2.
Population 1 is represented with a “x” and population 2 with a “+” in the scatter plot below:

[Scatter plot of Population 1 ("x" markers) and Population 2 ("+" markers)]

3.
Standardized values are obtained by subtracting the corresponding mean from each value and dividing the result by the standard deviation:
Population 1          Population 2
  x      y              x      y
−0.5   −0.5            0.3    0.3
 0.5    0.5            −1     −1
 1.5    1.5           −1.3   −1.3
  0      0             0.6    0.6
−1.5   −1.5            1.3    1.3

The standardized values are plotted below:

[Scatter plot of the standardized values of both populations]

Note that even though the original populations were on lines with different slopes, the standardized values are on a line with slope 1.
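A short Python sketch of the standardization step for Population 1, confirming that the standardized x and y values coincide (which is why the points fall on the 45-degree line):

```python
# Standardizing Population 1 from the exercise above.
from math import sqrt

def standardize(v):
    n = len(v)
    mu = sum(v) / n
    sd = sqrt(sum((vi - mu) ** 2 for vi in v) / n)  # population formula
    return [(vi - mu) / sd for vi in v]

x1, y1 = [2, 6, 10, 4, -2], [2, 3, 4, 2.5, 1]
zx, zy = standardize(x1), standardize(y1)
print([round(a, 1) for a in zx])  # [-0.5, 0.5, 1.5, 0.0, -1.5]
print(all(abs(a - b) < 1e-9 for a, b in zip(zx, zy)))  # True
```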

Picking the appropriate statistic or visual
∙  Nominal data: Mode, Bar chart, Column chart

∙  Ordinal data: Median, Mode, Bar chart, Column chart

∙  Interval or Ratio data:

   In a nutshell:

   – To describe center: Mean, Median, Mode, (Midrange), (Geometric Mean), (Midhinge), Histogram, Box plot

   – To describe variability: Range, IQR, Standard deviation, CV, (z-values), Histogram, Box plot

   – To describe shape: Mean vs Median, Skewness, (Kurtosis), Histogram, Box plot

Checkpoint No: 22

1.9 Chebyshev’s theorem (Chebyshev’s inequality)

For any data set with a mean of μ and a variance of σ², and any k > 1, at least

\[
\left(1 - \frac{1}{k^2}\right) \cdot 100\%
\]

of the observations will take a value in the interval [μ − kσ, μ + kσ].
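The theorem is easy to check empirically. The data set below is made up; for k = 2 the bound guarantees at least 75% of observations inside the interval:

```python
# Empirical check of Chebyshev's bound on an arbitrary data set.
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9, 1, 9]
N = len(data)
mu = sum(data) / N                                   # 5.0
sigma = sqrt(sum((x - mu) ** 2 for x in data) / N)   # population sd
k = 2.0
inside = sum(1 for x in data if mu - k * sigma <= x <= mu + k * sigma) / N
bound = 1 - 1 / k ** 2  # at least 75% for k = 2
print(inside >= bound)  # True
```

Here every observation happens to fall inside the interval, which is consistent with (but stronger than) Chebyshev's guarantee.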

1.3 EXERCISES

1. 

Consider a population with a mean of 4 and variance of 36. Using Chebyshev’s theorem find an interval that contains at least 70% of the observations.

Solution: μ = 4 and σ2 = 36, so σ = 6.

\[
\begin{aligned}
\left(1 - \frac{1}{k^2}\right) 100\% &= 70\% \\
1 - \frac{1}{k^2} &= 0.70 \\
\frac{1}{k^2} &= 0.30 \\
k^2 &= \frac{1}{0.30} \\
k &= 1.83
\end{aligned}
\]

So, the requested interval is:

\[
[\mu - k\sigma,\; \mu + k\sigma] = [4 - 6(1.83),\; 4 + 6(1.83)] \approx [-6.95,\; 14.95]
\]
2. 

The monthly charges for credit card holders at a department store have a mean of $250 and a standard deviation of $100. Use Chebyshev’s theorem to answer the following questions:

i. What can you say, for sure, about the percentage of card holders who will have monthly charges between $100 and $400?

ii. Provide a range for the credit card charges that will include at least 80% of all credit card customers.

Solution:

1.
Use the same approach. For the interval [100, 400], k = 1.5. Reveal why. Then, this interval contains at least
\[
\left(1 - \frac{1}{1.5^2}\right) \cdot 100\% = (100 - 44)\% = 56\%
\]

of all observations.

2.
[26.4, 473.6] is the answer.
3. 

In a stock exchange, the average return over a year turns out to be 1% with a standard deviation of 2%. Over the same year, the average exam grade in a university is 60 points with a standard deviation of 24 points. Which one has higher variability, the returns or the grades?

Solution: CV for returns is 2%/1% = 2 and CV for grades is 24 points/60 points = 0.4. So, returns have a higher variability.

4. 

When we replace a positive data value with its "additive inverse" in a data set, variance increases. Is this claim true or false? Either prove that it is true, or provide a counter example to show that it is false. Make sure you have used a formal mathematical notation.

Solution: Consider {5, 7} and {−5, 7}. Which pair of values has higher variance? Then, come up with a conclusion.

Checkpoint No: 23

1.10 Adding and multiplying terms over an index

If you have N numbers \(x_1, x_2, \ldots, x_N\), the sum of these numbers, S, is:

\[
S = x_1 + x_2 + \ldots + x_N
\]

In expressing sums like this, we always write the first two terms, then three periods, then the last term. A shorter way to write S is:

\[
S = \sum_{i=1}^{N} x_i
\]

where i is the index of x, running from 1 to N.

Unless otherwise specified, i increases by 1 each time, from 1 to N. So, \(S = \sum_{i=1}^{N} x_i\) is read as "sum of \(x_i\), i from 1 to N".

For example,

\[
S = \Sigma_{i=1}^{4} x_i = \sum_{i=1}^{4} x_i = \sum_{i=1}^{i=4} x_i = x_1 + x_2 + x_3 + x_4
\]

If N is a number well-known in a problem:

\[
S = \sum x_i
\]

is a valid expression and it means "sum over all \(x_i\)".

Notice that

\[
\sum_{i=1}^{N} x_i = \left(\sum_{i=1}^{N-1} x_i\right) + x_N = \left(\sum_{i=1}^{N-2} x_i\right) + x_{N-1} + x_N
\]

and so forth.

In case we want to write

\[
x_1 + x_3 + \ldots + x_{2k-1}
\]

using our summation operator, we can write it as:

\[
\sum_{i=1}^{k} x_{2i-1}
\]

As you have seen, using the index i wisely solves many problems.

Consider:

\[
S = 2 \cdot x_1 + 2 \cdot x_2 + \ldots + 2 \cdot x_N
\]

which is equivalent to

\[
\sum_{i=1}^{N} 2x_i
\]

S, then, can be written as

\[
S = 2 \sum_{i=1}^{N} x_i
\]

So, if each number in the sequence x1, x2, ..., xN is multiplied by the same value which is not a function of i, this value can be taken out of the summation sign Σ.

Consider:

\[
S = x_1 + y_1 + x_2 + y_2 + \ldots + x_N + y_N
\]

Notice that this is the same thing as:

\[
\sum_{i=1}^{N} (x_i + y_i) = \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i
\]

Consider:

\[
S = \sum_{i=1}^{N} x_i y_i = x_1 y_1 + x_2 y_2 + \ldots + x_N y_N
\]

Notice that,

\[
S = \sum_{i=1}^{N} x_i y_i \neq \left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)
\]

Sum of the products is not equal to product of the sums. Expand (write long) the expressions to see why not.
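A tiny numerical example makes the point; the numbers are arbitrary:

```python
# Sum of products vs product of sums (arbitrary numbers).
x = [1, 2, 3]
y = [4, 5, 6]
sum_of_products = sum(a * b for a, b in zip(x, y))  # 1*4 + 2*5 + 3*6 = 32
product_of_sums = sum(x) * sum(y)                   # 6 * 15 = 90
print(sum_of_products, product_of_sums)             # 32 90
```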

Consider:

\[
S = x_1 + 2x_2 + 3x_3 + \ldots + Nx_N
\]

How can we write this in short?

\[
S = \sum_{i=1}^{N} i\,x_i
\]

Notice that:

\[
S = \sum_{i=1}^{N} i\,x_i \neq \left(\sum_{i=1}^{N} i\right)\left(\sum_{i=1}^{N} x_i\right)
\]

Consider:

\[
S = \sum_{i=1}^{N} (x_i \bar{y} + y_i \bar{x})
\]

Since \(\bar{x}\) and \(\bar{y}\) are not indexed with i, the expression is equal to:

\[
\bar{y} \sum_{i=1}^{N} x_i + \bar{x} \sum_{i=1}^{N} y_i
\]

Consider:

\[
\sum_{i=1}^{N} x_i^2.
\]

Notice that

\[
\sum_{i=1}^{N} x_i^2 \neq \left(\sum_{i=1}^{N} x_i\right)^2
\]

Sum of the squares is not equal to square of the sum.

If you have N numbers \(x_1, x_2, \ldots, x_N\), the product of these numbers, P, is:

\[
P = x_1 \cdot x_2 \cdots x_N
\]

In expressing products like this, we always write the first two terms, then three periods, then the last term. A shorter way to write P is:

\[
P = \prod_{i=1}^{N} x_i
\]

where i is the index of x, running from 1 to N.

Consider:

\[
P = \prod_{i=1}^{N} i
\]

What’s this?

It is nothing but \(P = \prod_{i=1}^{N} x_i\) with \(x_i = i\). So P equals:

\[
1 \cdot 2 \cdot 3 \cdots N = N!
\]
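In Python, this is a one-line check (N = 6 is arbitrary):

```python
# Product over the index equals the factorial.
from math import factorial, prod

N = 6
P = prod(range(1, N + 1))  # 1*2*...*6
print(P, factorial(N))     # 720 720
```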

Regarding our future purposes, an important property to remember is:

\[
\ln \prod_{i=1}^{N} x_i = \sum_{i=1}^{N} \ln x_i
\]

as we’ll use while writing Likelihood functions in ECON 222.
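This identity is easy to verify numerically; the values below are arbitrary positive numbers:

```python
# log of a product equals the sum of logs -- the identity behind log-likelihoods.
from math import log, prod, isclose

x = [0.5, 2.0, 4.0, 0.25]
lhs = log(prod(x))             # log(1.0) = 0.0
rhs = sum(log(v) for v in x)   # sum of logs, equal up to rounding
print(isclose(lhs, rhs, abs_tol=1e-12))  # True
```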

Finally, consider:

\[
\sum_{i=0}^{N} \binom{N}{i} x^{N-i} y^{i}
\]

What’s this? Expand it to see:

\[
= \binom{N}{0} x^{N-0} y^0 + \binom{N}{1} x^{N-1} y^1 + \ldots + \binom{N}{N} x^{N-N} y^N
\]

which is nothing but the binomial expansion

\[
(x + y)^N
\]

as we’ll use while studying the Binomial and Poisson distributions in ECON 221.
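A numerical check of the expansion, with arbitrary x, y and N, using Python's `math.comb`:

```python
# Verifying the binomial expansion numerically.
from math import comb

x, y, N = 3, 2, 5
expansion = sum(comb(N, i) * x ** (N - i) * y ** i for i in range(N + 1))
print(expansion, (x + y) ** N)  # 3125 3125
```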

1.4 EXERCISES

1. 

Consider the expression for the population variance:

\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\]

and simplify it until you see:

\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2
\]

Solution:

\[
\begin{aligned}
\sigma^2 &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \\
&= \frac{1}{N} \sum \left(x_i^2 - 2\mu x_i + \mu^2\right) \\
&= \frac{1}{N} \sum x_i^2 - \frac{2\mu}{N} \sum x_i + \frac{1}{N} \sum \mu^2 \\
&= \frac{1}{N} \sum x_i^2 - 2\mu^2 + \frac{1}{N} N\mu^2 \\
&= \frac{1}{N} \sum x_i^2 - \mu^2.
\end{aligned}
\]

So, variance can be calculated by subtracting ’the square of the mean of observations’ from ’the mean of squared observations’.
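The shortcut can be verified on any small data set; the values below are arbitrary:

```python
# Checking sigma^2 = (mean of squares) - (square of the mean) on made-up data.
x = [2.0, 4.0, 6.0, 8.0]
N = len(x)
mu = sum(x) / N                                   # 5.0
direct = sum((xi - mu) ** 2 for xi in x) / N      # definition of variance
shortcut = sum(xi ** 2 for xi in x) / N - mu ** 2
print(direct, shortcut)  # 5.0 5.0
```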

Checkpoint No: 24