Conceptual Models

A number of graphical representations of data can be used.

I. Pie Chart

The data should be arranged in such a way that each observation can fall into one and only one category of the variable.

For example, if we are trying to categorize “Animals” according to the variable “Animal features”, appropriate categories of the variable might be:

________________________________________________________

Animal features

Locomotion

Sensory stimulation

Embryonic development

Adaptation

_________________________________________________________

Assuming that these categories are clearly defined and that the scientist is properly trained, every animal can be placed into one and only one category of the variable.

However, if these categories are overlapping, the data could not be organized according to our divisions.

Once the data have been organized according to the categories, there are several ways to display them graphically. The first and simplest is the pie chart. It displays the percentage of the total number of measurements falling into each category of the variable by partitioning a circle into slices, one per category.

The data in table 1 will be used as an example.

Table 1

Animal features	Number	Percentage
Locomotion	20	10
Sensory stimulation	40	20
Embryonic development	60	30
Adaptation	80	40

Figure 1 is the graphical representation of the data listed in table 1 in a pie chart format.

Guidelines for constructing pie charts

1) Choose a small number of categories for the variable, preferably around 5 or 6. Too many categories make the pie chart difficult to interpret.

2) Construct the pie chart so that percentages are in either ascending or descending order.
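The arithmetic behind a pie chart can be sketched in a few lines of Python. This is only an illustration, using the counts from table 1: the function name pie_slices is hypothetical, and the slice angle is obtained from the fact that a full circle is 360 degrees, so each 1% of the total spans 3.6 degrees.

```python
# Counts from table 1 (animal features and number of observations).
counts = {
    "Locomotion": 20,
    "Sensory stimulation": 40,
    "Embryonic development": 60,
    "Adaptation": 80,
}

def pie_slices(counts):
    """Return (category, percentage, angle in degrees), largest slice first."""
    total = sum(counts.values())
    slices = []
    for category, count in counts.items():
        percentage = 100 * count / total
        angle = 360 * count / total  # share of the full circle
        slices.append((category, percentage, angle))
    # Guideline 2: order the slices by percentage (descending here).
    return sorted(slices, key=lambda s: s[1], reverse=True)

slices = pie_slices(counts)
for category, percentage, angle in slices:
    print(f"{category:22s} {percentage:5.1f}%  {angle:6.1f} degrees")
```

The percentages reproduce the last column of table 1, and the angles always add up to 360 degrees.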

II. Bar Chart or Bar Graph

Figure 2 shows the data listed in table 1 in a bar chart format.

Guidelines for constructing bar charts

1) Label numbers or frequencies along the vertical axis and categories of the variable along the horizontal axis.

2) Construct a rectangle over each category of the variable with a height equal to the frequency (number of observations) in the category.

3) Leave a space between each category on the horizontal axis to imply distinct, separate categories and to clarify the presentation.
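As a rough illustration of these guidelines, the sketch below draws a text-only bar chart of the table 1 counts. It is an assumption-laden stand-in for a real plotting tool: the chart is rotated (categories run down the page rather than along the horizontal axis), the helper name text_bar_chart is hypothetical, and each '#' is chosen to represent 10 observations.

```python
# Counts from table 1.
counts = {
    "Locomotion": 20,
    "Sensory stimulation": 40,
    "Embryonic development": 60,
    "Adaptation": 80,
}

def text_bar_chart(counts, unit=10):
    """Return one line per category; each '#' stands for `unit` observations."""
    width = max(len(name) for name in counts)  # align the category labels
    lines = []
    for name, count in counts.items():
        bar = "#" * round(count / unit)  # bar length proportional to frequency
        lines.append(f"{name:<{width}} | {bar} {count}")
    return lines

chart = text_bar_chart(counts)
print("\n".join(chart))
```

Each category gets its own separate bar whose length is proportional to its frequency, which is the essential property of a bar chart.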

III. Line graph

Figure 3 shows the data listed in table 1 in a line graph format.

Guidelines for constructing line graphs

1) Label numbers or frequencies along the vertical axis and categories of the variable along the horizontal axis.

2) Place a point over each category of the variable with a height equal to the frequency or number of observations in the category.

3) Leave a space between each category on the horizontal axis to imply distinct, separate categories and to clarify the presentation.

4) Connect the points placed over each category of the variable.

IV. Frequency Histogram and the Relative Frequency Histogram

Table 2 lists the results of HDL measurements for 100 adult male humans.

37	42	44	44	43	42	44	48	49	44
42	38	42	44	46	39	43	45	48	39
47	42	42	48	45	36	41	43	39	42
40	42	40	45	44	41	40	40	38	46
49	38	43	43	39	38	47	39	40	42
43	47	41	40	46	44	46	44	49	44
40	39	45	43	38	41	43	42	45	44
42	47	38	45	40	42	41	40	47	41
47	41	48	41	43	47	42	41	44	48
41	49	43	44	44	43	46	45	46	40

Note that the largest value is 49 and the smallest is 36. Although we might examine the table very closely, it is difficult to describe how the measurements are situated along the interval from 36 to 49. Are most of the measurements near 36 or near 49, or are they evenly distributed along the interval? To answer this question, the data must be summarized in a frequency table.

To construct a frequency table, begin by dividing the range from 36 to 49 into an arbitrary number of subintervals called class intervals. The number of intervals chosen depends on the number of measurements in the set (between 5 and 20 class intervals is recommended).

Guidelines for constructing class intervals

1) Divide the range of the measurements (the difference between the largest and the smallest measurements) by the approximate number of class intervals desired.

2) After dividing the range by the desired number of intervals, round the resulting number to a convenient (easy to work with) unit. This unit represents a common width for the class intervals.

3) Choose the first class interval so that it contains the smallest measurement. It is also advisable to choose a starting point for the first interval so that no measurement falls on a point of division between two subintervals. This eliminates any ambiguity in placing measurements into the class intervals.

For the data in table 2, the range is

Range = 49 – 36 = 13

Suppose that we want to have 14 subintervals. Dividing the range by 14 gives 13/14 ≈ 0.93, which rounds to a convenient unit of 1. Thus the interval width is 1.

It is convenient to choose the first interval to be 35.5 – 36.5, the second to be 36.5 – 37.5, and so on. Note that the smallest measurement (36) falls in the first interval and that no measurement falls on the endpoint of a class interval.
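The three guidelines can be sketched as a small Python function. This is one possible reading, not a definitive rule: it assumes whole-number measurements, so that starting half a unit below the smallest value keeps every measurement off the class boundaries, and it takes "round to a convenient unit" to mean rounding to the nearest whole number. The function name class_intervals is hypothetical.

```python
def class_intervals(smallest, largest, desired_classes):
    """Return (width, list of (lower, upper) class intervals)."""
    value_range = largest - smallest
    # Guidelines 1 and 2: divide the range and round to a convenient unit
    # (assumed here to be the nearest whole number, and at least 1).
    width = max(1, round(value_range / desired_classes))
    # Guideline 3: start half a unit below the smallest measurement so that
    # no whole-number measurement falls on a boundary.
    lower = smallest - 0.5
    intervals = []
    while lower < largest:
        intervals.append((lower, lower + width))
        lower += width
    return width, intervals

width, intervals = class_intervals(36, 49, 14)
print(width, intervals[0], intervals[-1], len(intervals))
```

For the HDL data this gives a width of 1 and 14 intervals running from 35.5 – 36.5 to 48.5 – 49.5.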

Construct a frequency table for the data (table 3).

Table 3: Frequency table for the data in table 2

Class	Class interval	Frequency f	Relative frequency f/n
1	35.5-36.5	1	.01
2	36.5-37.5	1	.01
3	37.5-38.5	6	.06
4	38.5-39.5	6	.06
5	39.5-40.5	10	.10
6	40.5-41.5	10	.10
7	41.5-42.5	13	.13
8	42.5-43.5	11	.11
9	43.5-44.5	13	.13
10	44.5-45.5	7	.07
11	45.5-46.5	6	.06
12	46.5-47.5	7	.07
13	47.5-48.5	5	.05
14	48.5-49.5	4	.04
Totals		n = 100	1.00
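The tallying that produces a frequency table can be sketched in Python as follows. The data are the 100 HDL measurements of table 2; the function name frequency_table and its parameters are hypothetical names chosen for this illustration.

```python
from collections import Counter

# The 100 HDL measurements from table 2, row by row.
hdl = [
    37, 42, 44, 44, 43, 42, 44, 48, 49, 44,
    42, 38, 42, 44, 46, 39, 43, 45, 48, 39,
    47, 42, 42, 48, 45, 36, 41, 43, 39, 42,
    40, 42, 40, 45, 44, 41, 40, 40, 38, 46,
    49, 38, 43, 43, 39, 38, 47, 39, 40, 42,
    43, 47, 41, 40, 46, 44, 46, 44, 49, 44,
    40, 39, 45, 43, 38, 41, 43, 42, 45, 44,
    42, 47, 38, 45, 40, 42, 41, 40, 47, 41,
    47, 41, 48, 41, 43, 47, 42, 41, 44, 48,
    41, 49, 43, 44, 44, 43, 46, 45, 46, 40,
]

def frequency_table(data, first_lower, width, n_classes):
    """Return rows of (lower, upper, frequency f, relative frequency f/n)."""
    n = len(data)
    # Index of the class interval that each measurement falls into.
    counts = Counter(int((y - first_lower) // width) for y in data)
    rows = []
    for k in range(n_classes):
        lower = first_lower + k * width
        f = counts.get(k, 0)
        rows.append((lower, lower + width, f, f / n))
    return rows

table = frequency_table(hdl, first_lower=35.5, width=1, n_classes=14)
for lower, upper, f, rel in table:
    print(f"{lower:.1f}-{upper:.1f}  {f:3d}  {rel:.2f}")
```

The printed frequencies and relative frequencies agree with table 3, and the frequencies sum to n = 100.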

To construct a frequency histogram, draw two axes: a horizontal axis labeled with the class intervals and a vertical axis labeled with the frequencies. Then construct a rectangle over each class interval with a height equal to the number of measurements falling in a given subinterval.

Figure 4 is a frequency histogram of the data in table 3.

The relative frequency histogram is constructed in the same way as a frequency histogram. In the relative frequency histogram, the vertical axis is labeled as relative frequency.

The distinction between bar charts and histograms is based on the distinction between qualitative and quantitative variables. Bar charts are used to display frequency data from qualitative variables; histograms are appropriate for displaying frequency data for quantitative variables.

Numerical Methods for Describing Data

I. Measures of Central Tendency

1) The Mode

The mode of a set of measurements is defined to be the measurement that occurs most often, i.e., with the highest frequency.

For example, consider the data listed in the following table:

62	105	33
80	65	30
75	89	55
105	100	75
42	105	95

The mode is 105, which occurs three times.

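When dealing with grouped data listed in a frequency table, the mode is taken to be the midpoint of the class interval with the highest frequency (the modal interval).

Some data sets have more than one mode; their distributions are called bimodal, trimodal, and so on. In table 3, for example, classes 7 and 9 both have the highest frequency (13), so the grouped HDL data are bimodal, with modes 42 and 44.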
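Finding the mode amounts to counting how often each value occurs. A minimal Python sketch, assuming the 15 measurements of the small table above (read row by row) and a hypothetical helper named modes, returns every value that shares the highest frequency, so it also copes with data that have more than one mode.

```python
from collections import Counter

def modes(data):
    """Return all values that occur with the highest frequency, in ascending order."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, c in counts.items() if c == highest)

# The 15 measurements from the example table, read row by row.
scores = [62, 105, 33, 80, 65, 30, 75, 89, 55, 105, 100, 75, 42, 105, 95]

print(modes(scores))            # [105]
print(modes([1, 1, 2, 2, 3]))   # [1, 2], a data set with two modes
```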

2) The Median

The median of a set of measurements is the middle value when the measurements are arranged in order of magnitude.

The median for an even number of measurements will be the average of two middle values when the measurements are arranged in order of magnitude.

For example, for the data listed below

95	86	78	90	62
73	89	92	84	76

After arranging the data in order of magnitude (62, 73, 76, 78, 84, 86, 89, 90, 92, 95), the median is

(84 + 86) / 2 = 85
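The two cases (odd and even number of measurements) can be sketched in Python as follows; the function name median is chosen for this illustration only.

```python
def median(data):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

values = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]
print(median(values))  # 85.0
```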

To calculate the median for grouped data, use the equation

Median = L + (w / fm) (0.5n – cf)

where

L = lower class limit of the interval that contains the median

n = total frequency

cf = cumulative frequency of all classes before the median class (the sum of their frequencies)

fm = frequency of the class interval containing the median

w = interval width

For example, consider the data listed in the following table:

Class	Class interval	Frequency f	Cumulative frequency	Relative frequency f/n	Cumulative relative frequency
1	35.5-36.5	1	1	.01	0.01
2	36.5-37.5	1	2	.01	0.02
3	37.5-38.5	6	8	.06	0.08
4	38.5-39.5	6	14	.06	0.14
5	39.5-40.5	10	24	.10	0.24
6	40.5-41.5	10	34	.10	0.34
7	41.5-42.5	13	47	.13	0.47
8	42.5-43.5	11	58	.11	0.58
9	43.5-44.5	13	71	.13	0.71
10	44.5-45.5	7	78	.07	0.78
11	45.5-46.5	6	84	.06	0.84
12	46.5-47.5	7	91	.07	0.91
13	47.5-48.5	5	96	.05	0.96
14	48.5-49.5	4	100	.04	1.00
Totals		n = 100		1.00

The interval that contains the median is the first class interval in which the cumulative relative frequency exceeds 0.5.

For our data this is class interval number 8 (42.5-43.5), so

L = 42.5, fm = 11, n = 100, w = 1, cf = 47

Median = 42.5 + (1/11) (50 – 47) ≈ 42.8
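The same calculation can be sketched in Python, starting from the class frequencies of the table above. The function name grouped_median is hypothetical; the condition "cumulative relative frequency exceeds 0.5" is expressed equivalently as "cumulative frequency exceeds 0.5n".

```python
from itertools import accumulate

# Class frequencies from the frequency table (classes 1 to 14).
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]

def grouped_median(frequencies, first_lower, width):
    """Median = L + (w / fm) * (0.5 n - cf) for equal-width class intervals."""
    n = sum(frequencies)
    cumulative = list(accumulate(frequencies))
    # Median class: first class whose cumulative frequency exceeds n / 2.
    k = next(i for i, c in enumerate(cumulative) if c > 0.5 * n)
    lower = first_lower + k * width          # L
    cf = cumulative[k - 1] if k > 0 else 0   # cumulative frequency before the class
    fm = frequencies[k]                      # frequency of the median class
    return lower + (width / fm) * (0.5 * n - cf)

grouped = grouped_median(frequencies, first_lower=35.5, width=1)
print(round(grouped, 1))  # 42.8
```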

3) The arithmetic mean or mean

The mean is defined to be the sum of the measurements divided by the total number of measurements:

ȳ = Σ y_i / n

For grouped data, the following formula is used to approximate the mean:

ȳ = Σ f_i y_i / n

where

f_i = frequency associated with the ith class interval

y_i = midpoint of the ith class interval

The population mean is denoted by the Greek letter μ.
The sample mean is denoted by the symbol ȳ.

The sample mean is used to make inferences about the population parameter μ.

For our data, the mean is calculated from the following table:

Class	Class interval	Frequency f_i	Midpoint y_i	f_i y_i
1	35.5-36.5	1	36	36
2	36.5-37.5	1	37	37
3	37.5-38.5	6	38	228
4	38.5-39.5	6	39	234
5	39.5-40.5	10	40	400
6	40.5-41.5	10	41	410
7	41.5-42.5	13	42	546
8	42.5-43.5	11	43	473
9	43.5-44.5	13	44	572
10	44.5-45.5	7	45	315
11	45.5-46.5	6	46	276
12	46.5-47.5	7	47	329
13	47.5-48.5	5	48	240
14	48.5-49.5	4	49	196
Totals		n = 100		4292

With Σ f_i y_i = 4292 and n = 100,

ȳ = Σ f_i y_i / n = 4292 / 100 = 42.92 ≈ 42.9
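A short Python sketch of the grouped-data mean, using the class frequencies and midpoints from the table above (the function name grouped_mean is hypothetical):

```python
# Class frequencies and midpoints from the table above.
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]
midpoints = [36 + k for k in range(14)]  # 36, 37, ..., 49

def grouped_mean(frequencies, midpoints):
    """Sum of (frequency x midpoint) over all classes, divided by n."""
    n = sum(frequencies)
    return sum(f * y for f, y in zip(frequencies, midpoints)) / n

mean = grouped_mean(frequencies, midpoints)
print(mean)  # 42.92
```

Because every measurement in this data set is a whole number equal to the midpoint of its class, the grouped formula here gives the exact sample mean rather than an approximation.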

II. Measures of Variability

1) The range

* The range of a set of measurements is defined to be the difference between the largest and the smallest measurements of the set.

* For grouped data, when the individual measurements are not known, the range is taken to be the difference between the upper limit of the last class interval and the lower limit of the first class interval.

* The range gives very little information about how the data are distributed about the mean.

2) Percentiles

* The pth percentile of a set of n measurements arranged in order of magnitude is the value that has p% of the measurements below it and (100 – p)% above it. For example, the 60th percentile is the value that has 60% of the measurements below it and 40% of the measurements above it.

* Specific percentiles of interest are the 25th, 50th (median), and 75th percentiles, often called the lower quartile, the middle quartile, and the upper quartile, respectively.

For grouped data, the percentiles are calculated using the following quantities:

P = percentile of interest

L = lower limit of the class interval that includes the percentile of interest

n = total frequency

cf = cumulative frequency of all class intervals before the percentile class

fp = frequency of the class interval that includes the percentile of interest

w = interval width

For example, the 90th percentile is calculated as

P90 = L + (w / fp) (0.9n – cf)

To determine L, fp, and cf, find the first interval for which the cumulative relative frequency exceeds 0.9. This interval contains the 90th percentile.

For our example,

L = 46.5, n = 100, cf = 84, fp = 7, and w = 1

P90 = 46.5 + (1/7) (90 – 84) ≈ 47.4

This means that 90% of the measurements lie below this value and 10% lie above it.
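The percentile formula generalizes to any p by replacing 0.9n with (p/100)n. A minimal Python sketch under that assumption, with the hypothetical function name grouped_percentile and the class frequencies of our example:

```python
from itertools import accumulate

# Class frequencies from the frequency table (classes 1 to 14).
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]

def grouped_percentile(p, frequencies, first_lower, width):
    """P = L + (w / fp) * (p n / 100 - cf) for equal-width class intervals."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    n = sum(frequencies)
    target = p * n / 100
    cumulative = list(accumulate(frequencies))
    # Percentile class: first class whose cumulative frequency exceeds the target.
    k = next(i for i, c in enumerate(cumulative) if c > target)
    lower = first_lower + k * width          # L
    cf = cumulative[k - 1] if k > 0 else 0   # cumulative frequency before the class
    fp = frequencies[k]                      # frequency of the percentile class
    return lower + (width / fp) * (target - cf)

p90 = grouped_percentile(90, frequencies, first_lower=35.5, width=1)
print(round(p90, 1))  # 47.4
```

With p = 50 the same function reproduces the grouped median of about 42.8.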

3) The interquartile range

The interquartile range of a set of measurements is defined as the difference between the upper and lower quartiles.

Although the interquartile range is more sensitive than the range, it is still not sufficient for our purposes.
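For the grouped HDL data, the quartiles can be obtained with the grouped percentile formula (using 0.25n and 0.75n in place of 0.9n) and then subtracted. The sketch below is an illustration only; the function name grouped_quantile is hypothetical, and the defaults assume the class intervals of our example (first lower limit 35.5, width 1).

```python
from itertools import accumulate

# Class frequencies from the frequency table (classes 1 to 14).
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]

def grouped_quantile(fraction, frequencies, first_lower=35.5, width=1):
    """Value below which the given fraction (strictly between 0 and 1) of the grouped data lies."""
    n = sum(frequencies)
    target = fraction * n
    cumulative = list(accumulate(frequencies))
    # First class whose cumulative frequency exceeds the target.
    k = next(i for i, c in enumerate(cumulative) if c > target)
    cf = cumulative[k - 1] if k > 0 else 0
    return first_lower + k * width + (width / frequencies[k]) * (target - cf)

q1 = grouped_quantile(0.25, frequencies)  # lower quartile
q3 = grouped_quantile(0.75, frequencies)  # upper quartile
iqr = q3 - q1
print(round(q1, 1), round(q3, 1), round(iqr, 1))
```

For these data the lower quartile is about 40.6 and the upper quartile about 45.1, giving an interquartile range of about 4.5.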

4) The variance

The deviation of a measurement is how far it lies from the mean: (y_i – ȳ).

The variance of a set of n measurements with mean ȳ is the sum of the squared deviations divided by n – 1.

The sample variance is denoted by S²:

S² = Σ (y_i – ȳ)² / (n – 1)

A shortcut formula for the variance is

S² = (1 / (n – 1)) [ Σ y_i² – (Σ y_i)² / n ]

For grouped data,

S² = (1 / (n – 1)) [ Σ f_i y_i² – (Σ f_i y_i)² / n ]
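That the definition and the shortcut formula give the same result can be checked numerically. The Python sketch below does so for the ten measurements used earlier in the median example; both function names are hypothetical.

```python
def variance_definition(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((y - mean) ** 2 for y in data) / (n - 1)

def variance_shortcut(data):
    """Shortcut: [sum of y^2 - (sum of y)^2 / n] / (n - 1)."""
    n = len(data)
    total = sum(data)
    total_sq = sum(y * y for y in data)
    return (total_sq - total ** 2 / n) / (n - 1)

values = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]
s2_definition = variance_definition(values)
s2_shortcut = variance_shortcut(values)
print(s2_definition, s2_shortcut)
```

The shortcut form needs only the running totals Σ y_i and Σ y_i², so the mean does not have to be computed first.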

5) Standard deviation

The standard deviation of a set of measurements is defined to be the positive square root of the variance.

For example,

Class	Class interval	Frequency f_i	Midpoint y_i	f_i y_i
1	35.5-36.5	1	36	36
2	36.5-37.5	1	37	37
3	37.5-38.5	6	38	228
4	38.5-39.5	6	39	234
5	39.5-40.5	10	40	400
6	40.5-41.5	10	41	410
7	41.5-42.5	13	42	546
8	42.5-43.5	11	43	473
9	43.5-44.5	13	44	572
10	44.5-45.5	7	45	315
11	45.5-46.5	6	46	276
12	46.5-47.5	7	47	329
13	47.5-48.5	5	48	240
14	48.5-49.5	4	49	196
Totals		n = 100		4292

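Here Σ f_i y_i² = 185162, Σ f_i y_i = 4292, and n = 100, so

S² = (1/99) [ 185162 – (4292)² / 100 ] = 949.36 / 99 ≈ 9.59

S = √9.59 ≈ 3.10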
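The grouped-data variance formula and the standard deviation can be evaluated for the table above with a short Python sketch. The function name grouped_variance is hypothetical; the frequencies and midpoints are those of the table.

```python
import math

# Class frequencies and midpoints from the table above.
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]
midpoints = [36 + k for k in range(14)]  # 36, 37, ..., 49

def grouped_variance(frequencies, midpoints):
    """S^2 = [sum f*y^2 - (sum f*y)^2 / n] / (n - 1) for grouped data."""
    n = sum(frequencies)
    sum_fy = sum(f * y for f, y in zip(frequencies, midpoints))
    sum_fy2 = sum(f * y * y for f, y in zip(frequencies, midpoints))
    return (sum_fy2 - sum_fy ** 2 / n) / (n - 1)

s2 = grouped_variance(frequencies, midpoints)
s = math.sqrt(s2)  # standard deviation: positive square root of the variance
print(round(s2, 2), round(s, 2))
```

Running the sketch is a quick way to check a hand calculation of Σ f_i y_i² and of the final variance.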
