Conceptual Models
Several graphical representations can be used to display data.
I. Pie Chart
The data should
be arranged in such a way that each observation can fall into one and only one
category of the variable.
For example, if we are trying to categorize “Animals” according to the variable “Animal features”, appropriate categories of the variable might be:
Locomotion
Sensory stimulation
Embryonic development
Adaptation
Assuming that these categories are clearly defined and that the scientist is properly trained, every animal could be placed into one and only one category of the variable.
However, if the categories overlap, the data cannot be organized according to these divisions.
Once the data have been organized according to the categories, there are several ways to display them graphically. The first and simplest is the pie chart. It displays the percentage of the total number of measurements falling into each category of the variable by partitioning a circle into sectors.
The data in table 1 will be used as an example.
Table 1
Animal features | Number | Percentage
Locomotion | 20 | 10
Sensory stimulation | 40 | 20
Embryonic development | 60 | 30
Adaptation | 80 | 40
Figure 1 is the
graphical representation of the data listed in table 1 in a pie chart format.
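The partitioning behind a pie chart can be sketched in a few lines of Python. This is a minimal illustration, not a plotting recipe: the counts are those of table 1, while the function name `pie_slices` and the choice to report each sector's central angle (360 degrees times the category's share) are assumptions made for the example.

```python
# Counts from table 1 (number of observations per category).
counts = {
    "Locomotion": 20,
    "Sensory stimulation": 40,
    "Embryonic development": 60,
    "Adaptation": 80,
}

def pie_slices(counts):
    """Return {category: (percentage, central angle in degrees)}."""
    total = sum(counts.values())
    return {
        name: (100 * k / total, 360 * k / total)
        for name, k in counts.items()
    }

slices = pie_slices(counts)
for name, (pct, angle) in slices.items():
    print(f"{name:<22} {pct:5.1f}%  {angle:6.1f} degrees")
```

The percentages reproduce the last column of table 1, and the angles add up to a full circle.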
Guidelines for constructing pie charts
1) Choose a small number of categories for the variable, preferably around 5 or 6. Too many categories make the pie chart difficult to interpret.
2) Construct the pie chart so that percentages are in either ascending or descending order.
II. Bar Chart or Bar Graph
Figure 2 shows the data listed in table 1 in a bar chart format.
Guidelines for constructing bar charts
1) Label numbers or frequencies along the vertical axis and categories of the variable along the horizontal axis.
2) Construct a rectangle over each category of the variable with a height equal to the frequency (number of observations) in the category.
3) Leave a space between each category on the horizontal axis to imply distinct, separate categories and to clarify the presentation.
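The three guidelines can be tried out with a rough text rendering of the table 1 frequencies. This is only a sketch: the function name `text_bar_chart`, the `step` parameter (how many observations one text row stands for) and the numeric category keys are hypothetical choices for this example, and a real figure would normally come from a plotting package.

```python
# Frequencies from table 1.
counts = {
    "Locomotion": 20,
    "Sensory stimulation": 40,
    "Embryonic development": 60,
    "Adaptation": 80,
}

def text_bar_chart(counts, step=20):
    """Draw vertical bars as text; each text row stands for `step` observations."""
    labels = list(counts)
    top = max(counts.values())
    rows = []
    # Vertical axis: frequency levels, printed from the tallest bar downward.
    for level in range(top, 0, -step):
        cells = ["##" if counts[name] >= level else "  " for name in labels]
        # The two-space gap between cells keeps the categories visibly separate.
        rows.append(f"{level:>4} | " + "  ".join(cells))
    # Horizontal axis: a numeric key for each category, explained in the legend.
    rows.append(" " * 5 + "+" + "-" * (4 * len(labels)))
    rows.append(" " * 7 + "  ".join(f"{i:<2}" for i in range(1, len(labels) + 1)))
    rows += [f"{i} = {name}" for i, name in enumerate(labels, start=1)]
    return "\n".join(rows)

chart = text_bar_chart(counts)
print(chart)
```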
III. Line Graph
Figure 3 shows the data listed in table 1 in a line graph format.
Guidelines for constructing line graphs
1) Label numbers or frequencies along the vertical axis and categories of the variable along the horizontal axis.
2) Place a point over each category of the variable with a height equal to the frequency or number of observations in the category.
3) Leave a space between each category on the horizontal axis to imply distinct, separate categories and to clarify the presentation.
4) Connect the points placed over each category of the variable.
IV. Frequency Histogram and the Relative Frequency Histogram
Table 2 lists the results of HDL measurements for 100 adult male humans.
37 | 42 | 44 | 44 | 43 | 42 | 44 | 48 | 49 | 44
42 | 38 | 42 | 44 | 46 | 39 | 43 | 45 | 48 | 39
47 | 42 | 42 | 48 | 45 | 36 | 41 | 43 | 39 | 42
40 | 42 | 40 | 45 | 44 | 41 | 40 | 40 | 38 | 46
49 | 38 | 43 | 43 | 39 | 38 | 47 | 39 | 40 | 42
43 | 47 | 41 | 40 | 46 | 44 | 46 | 44 | 49 | 44
40 | 39 | 45 | 43 | 38 | 41 | 43 | 42 | 45 | 44
42 | 47 | 38 | 45 | 40 | 42 | 41 | 40 | 47 | 41
47 | 41 | 48 | 41 | 43 | 47 | 42 | 41 | 44 | 48
41 | 49 | 43 | 44 | 44 | 43 | 46 | 45 | 46 | 40
Note that the largest value is 49 and the smallest is 36. Even if we examine the table very closely, it is difficult to describe how the measurements are situated along the interval from 36 to 49. Are most of the measurements near 36 or near 49, or are they evenly distributed along the interval? To answer this question, the data must be summarized in a frequency table.
To construct a frequency table, begin by dividing the range from 36 to 49 into an arbitrary number of subintervals called class intervals. The number of intervals chosen depends on the number of measurements in the set (between 5 and 20 class intervals is recommended).
Guidelines for constructing class intervals
1) Divide the range of the measurements (the difference between the largest and the smallest measurements) by the approximate number of class intervals desired.
2) After dividing the range by the desired number of intervals, round the resulting number to a convenient (easy to work with) unit. This unit represents a common width for the class intervals.
3) Choose the first class interval so that it contains the smallest measurement. It is also advisable to choose a starting point for the first interval so that no measurement falls on a point of division between two subintervals. This eliminates any ambiguity in placing measurements into the class intervals.
For the data in table 2, the range is
$\text{Range} = 49 - 36 = 13$
Suppose that we want 14 subintervals. Dividing the range by 14 gives $13/14 \approx 0.93$, which rounds to the convenient unit 1. Thus the interval width is 1.
It is convenient to choose the first interval to be 35.5-36.5, the second to be 36.5-37.5, and so on. Note that the smallest measurement (36) falls in the first interval and that no measurement falls on the endpoint of a class interval.
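The guidelines for choosing class intervals can be written out as a short Python sketch. The function name `class_boundaries`, the rule "round to the nearest whole number, but never below 1" and the half-unit `offset` are assumptions suited to whole-number measurements like those in table 2; other data may call for a different convenient unit.

```python
import math

def class_boundaries(smallest, largest, desired_classes, offset=0.5):
    """Return (width, boundaries) for equal-width class intervals."""
    data_range = largest - smallest
    # Round the raw width to a convenient unit: the nearest whole number, at least 1.
    width = max(1, round(data_range / desired_classes))
    # Start half a unit below the smallest value so that whole-number
    # measurements never fall on a point of division.
    start = smallest - offset
    n_classes = math.ceil((largest - start) / width)
    boundaries = [start + i * width for i in range(n_classes + 1)]
    return width, boundaries

width, bounds = class_boundaries(36, 49, 14)
print(width)                           # 1
print(bounds[0], bounds[1], bounds[-1])  # 35.5 36.5 49.5
```

With these inputs the sketch yields 14 intervals of width 1, running from 35.5 to 49.5.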
Construct a
frequency table for the data (table 3).
Table 3: Frequency table for the data in table 2
Class | Class interval | Frequency f | Relative frequency f/n
1 | 35.5-36.5 | 1 | 0.01
2 | 36.5-37.5 | 1 | 0.01
3 | 37.5-38.5 | 6 | 0.06
4 | 38.5-39.5 | 6 | 0.06
5 | 39.5-40.5 | 10 | 0.10
6 | 40.5-41.5 | 10 | 0.10
7 | 41.5-42.5 | 13 | 0.13
8 | 42.5-43.5 | 11 | 0.11
9 | 43.5-44.5 | 13 | 0.13
10 | 44.5-45.5 | 7 | 0.07
11 | 45.5-46.5 | 6 | 0.06
12 | 46.5-47.5 | 7 | 0.07
13 | 47.5-48.5 | 5 | 0.05
14 | 48.5-49.5 | 4 | 0.04
Totals | | n = 100 | 1.00
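As a check on the tallying, the frequency table can be rebuilt from the raw measurements of table 2. The sketch below assumes the class intervals chosen above (starting at 35.5 with width 1); the function name `frequency_table` and its parameters are illustrative only.

```python
from collections import Counter

# The 100 HDL measurements of table 2, in reading order.
hdl = [
    37, 42, 44, 44, 43, 42, 44, 48, 49, 44,
    42, 38, 42, 44, 46, 39, 43, 45, 48, 39,
    47, 42, 42, 48, 45, 36, 41, 43, 39, 42,
    40, 42, 40, 45, 44, 41, 40, 40, 38, 46,
    49, 38, 43, 43, 39, 38, 47, 39, 40, 42,
    43, 47, 41, 40, 46, 44, 46, 44, 49, 44,
    40, 39, 45, 43, 38, 41, 43, 42, 45, 44,
    42, 47, 38, 45, 40, 42, 41, 40, 47, 41,
    47, 41, 48, 41, 43, 47, 42, 41, 44, 48,
    41, 49, 43, 44, 44, 43, 46, 45, 46, 40,
]

def frequency_table(data, start=35.5, width=1.0, n_classes=14):
    """Return rows of (lower limit, upper limit, frequency, relative frequency)."""
    n = len(data)
    # Index of the class interval that each measurement falls into.
    freq = Counter(int((y - start) // width) for y in data)
    return [
        (start + i * width, start + (i + 1) * width, freq[i], freq[i] / n)
        for i in range(n_classes)
    ]

table = frequency_table(hdl)
for lo, hi, f, rel in table:
    print(f"{lo:.1f}-{hi:.1f}  {f:>3}  {rel:.2f}")
```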
To construct a frequency histogram, draw two axes: a
horizontal axis labeled with the class intervals and a vertical axis labeled
with the frequencies. Then construct a rectangle over each class interval with
a height equal to the number of measurements falling in a given subinterval.
Figure 4 is a
frequency histogram of the data in table 3
The relative frequency histogram is constructed in the same
way as a frequency histogram. In the relative frequency histogram, the vertical
axis is labeled as relative frequency.
The distinction between bar charts and histograms is based on
the distinction between qualitative and quantitative variables. Bar charts are
used to display frequency data from qualitative variables; histograms are
appropriate for displaying frequency data for quantitative variables.
Numerical Methods for Describing Data
I Measures of Central Tendency
1) The Mode
The mode of a set of measurements is defined to be the
measurement that occurs most often i.e. with the highest frequency.
For example, consider the data listed in the following table
62 | 105 | 33 | 80 | 65
30 | 75 | 89 | 55 | 105
100 | 75 | 42 | 105 | 95
The mode is 105, which occurs three times.
When dealing with grouped data listed in a frequency table, the mode is taken to be the midpoint of the class interval with the highest frequency (the modal interval).
Some data sets have more than one mode; their distributions are described as bimodal, trimodal, and so on.
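The definition of the mode translates directly into a count of how often each value occurs. In the sketch below the function name `modes` is an assumption; it returns every value that reaches the highest frequency, so a bimodal data set gives two values.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [62, 105, 33, 80, 65, 30, 75, 89, 55, 105, 100, 75, 42, 105, 95]
print(modes(data))             # [105]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal example
```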
2) The Median
The median of a set of measurements is the middle
value when the measurements are arranged in order of magnitude.
The median for an even number of measurements will be
the average of two middle values when the measurements are arranged in order of
magnitude.
For example, for the data listed below
95 | 86 | 78 | 90 | 62
73 | 89 | 92 | 84 | 76
After arranging the data in order of magnitude (62, 73, 76, 78, 84, 86, 89, 90, 92, 95), the two middle values are 84 and 86, so the median is
$(84 + 86)/2 = 85$
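Both cases of the definition (odd and even numbers of measurements) fit in one small function. The name `median` is chosen here for illustration; Python's standard `statistics.median` behaves the same way.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]
print(median(scores))  # 85.0
```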
To calculate the median for grouped data, use the equation below
$\text{Median} = L + \frac{w}{f_m}(0.5n - cf)$
where
L = lower class limit of the interval that contains the median
n = total frequency
cf = cumulative frequency of all classes before the median class (the sum of their frequencies)
$f_m$ = frequency of the class interval containing the median
w = interval width
For example, consider the data listed in the following table
Class | Class interval | Frequency f | Cumulative frequency | Relative frequency f/n | Cumulative relative frequency
1 | 35.5-36.5 | 1 | 1 | 0.01 | 0.01
2 | 36.5-37.5 | 1 | 2 | 0.01 | 0.02
3 | 37.5-38.5 | 6 | 8 | 0.06 | 0.08
4 | 38.5-39.5 | 6 | 14 | 0.06 | 0.14
5 | 39.5-40.5 | 10 | 24 | 0.10 | 0.24
6 | 40.5-41.5 | 10 | 34 | 0.10 | 0.34
7 | 41.5-42.5 | 13 | 47 | 0.13 | 0.47
8 | 42.5-43.5 | 11 | 58 | 0.11 | 0.58
9 | 43.5-44.5 | 13 | 71 | 0.13 | 0.71
10 | 44.5-45.5 | 7 | 78 | 0.07 | 0.78
11 | 45.5-46.5 | 6 | 84 | 0.06 | 0.84
12 | 46.5-47.5 | 7 | 91 | 0.07 | 0.91
13 | 47.5-48.5 | 5 | 96 | 0.05 | 0.96
14 | 48.5-49.5 | 4 | 100 | 0.04 | 1.00
Totals | | n = 100 | | 1.00 |
The interval that contains the median is the first class interval in which the cumulative relative frequency exceeds 0.5. For our data this is class interval number 8 (42.5-43.5), so
$L = 42.5$, $f_m = 11$, $n = 100$, $w = 1$, $cf = 47$
$\text{Median} = 42.5 + \frac{1}{11}(50 - 47) \approx 42.8$
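The same steps can be followed in Python: build the cumulative columns, locate the median class, then apply the grouped-median formula. The class frequencies are those of the table; the variable names mirror the symbols L, fm, cf and w, and the starting limit 35.5 and width 1 are taken from the class intervals used throughout.

```python
from itertools import accumulate

freqs = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]  # class frequencies
start, width = 35.5, 1.0     # lower limit of class 1 and common interval width
n = sum(freqs)

cum = list(accumulate(freqs))       # cumulative frequencies
cum_rel = [c / n for c in cum]      # cumulative relative frequencies

# Median class: first class whose cumulative relative frequency exceeds 0.5.
m = next(i for i, r in enumerate(cum_rel) if r > 0.5)
L = start + m * width               # lower class limit of the median class
fm = freqs[m]                       # frequency of the median class
cf = cum[m - 1] if m > 0 else 0     # cumulative frequency before the median class

median = L + (width / fm) * (0.5 * n - cf)
print(m + 1, L, fm, cf, round(median, 1))  # 8 42.5 11 47 42.8
```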
3) The arithmetic mean or mean
The mean is defined to be the sum of the measurements divided by the total number of measurements:
$\bar{y} = \frac{\sum y_i}{n}$
For grouped data, the following formula is used to approximate the mean:
$\bar{y} = \frac{\sum f_i y_i}{n}$
where
$f_i$ = frequency associated with the ith class interval
$y_i$ = midpoint of the ith class interval
The sample mean is used to make inferences about the population mean $\mu$.
For our data the mean is calculated as follows
Class | Class interval | Frequency f | Midpoint $y_i$ | $f_i y_i$
1 | 35.5-36.5 | 1 | 36 | 36
2 | 36.5-37.5 | 1 | 37 | 37
3 | 37.5-38.5 | 6 | 38 | 228
4 | 38.5-39.5 | 6 | 39 | 234
5 | 39.5-40.5 | 10 | 40 | 400
6 | 40.5-41.5 | 10 | 41 | 410
7 | 41.5-42.5 | 13 | 42 | 546
8 | 42.5-43.5 | 11 | 43 | 473
9 | 43.5-44.5 | 13 | 44 | 572
10 | 44.5-45.5 | 7 | 45 | 315
11 | 45.5-46.5 | 6 | 46 | 276
12 | 46.5-47.5 | 7 | 47 | 329
13 | 47.5-48.5 | 5 | 48 | 240
14 | 48.5-49.5 | 4 | 49 | 196
Totals | | n = 100 | | 4292
Here $\sum f_i y_i = 4292$ and $n = 100$, so
$\bar{y} = \frac{\sum f_i y_i}{n} = \frac{4292}{100} = 42.92 \approx 42.9$
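The grouped-mean calculation is easy to reproduce. The sketch below assumes the same class intervals as the table (lower limits 35.5, 36.5, ... with width 1), so that each midpoint is the lower limit plus 0.5.

```python
freqs = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]   # class frequencies f_i
lower_limits = [35.5 + i for i in range(len(freqs))]       # 35.5, 36.5, ..., 48.5
midpoints = [lo + 0.5 for lo in lower_limits]              # y_i = 36, 37, ..., 49

n = sum(freqs)
sum_fy = sum(f * y for f, y in zip(freqs, midpoints))      # sum of f_i * y_i
mean = sum_fy / n

print(n, sum_fy, mean)  # 100 4292.0 42.92
```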
II Measures of Variability
1) The range
* The range of a set of measurements is defined to be
the difference between the largest and the smallest measurements of the set.
* For grouped data, when the individual measurements
are not known, the range is taken to be the difference between the upper limit
of the last class interval and the lower limit of the first class interval.
* The range gives very little information about how the data are distributed about the mean.
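The two definitions of the range can give slightly different answers for the same data, as this small sketch shows. The function names are illustrative; the observed values rely on the fact that every whole number from 36 to 49 occurs in table 2, and the class limits 35.5 and 49.5 are those of the frequency table.

```python
def data_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)

def grouped_range(first_lower_limit, last_upper_limit):
    """Range when only the class intervals are known."""
    return last_upper_limit - first_lower_limit

observed = list(range(36, 50))      # the distinct values occurring in table 2
print(data_range(observed))         # 13
print(grouped_range(35.5, 49.5))    # 14.0
```

The grouped version is one unit larger here because the outer class limits extend half a unit beyond the smallest and largest measurements.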
2) Percentiles
* The pth percentile of a set of n measurements arranged in order of magnitude is the value that has p% of the measurements below it and (100-p)% above it. For example, the 60th percentile is the value that has 60% of the measurements below it and 40% of the measurements above it.
* Specific percentiles of interest are the 25th, 50th (median) and 75th percentiles, often called the lower quartile, the middle quartile and the upper quartile respectively.
For grouped data the percentiles are calculated as follows, where
P = percentile of interest
L = lower limit of the class interval that includes the percentile of interest
n = total frequency
cf = cumulative frequency for all class intervals before the percentile class
$f_p$ = frequency of the class interval that includes the percentile of interest
w = interval width
For example, the 90th percentile is calculated as
$P_{90} = L + \frac{w}{f_p}(0.9n - cf)$
To determine L, $f_p$ and cf, find the first interval for which the cumulative relative frequency exceeds 0.9. This interval contains the 90th percentile.
For our example, this is class interval 12 (46.5-47.5), so
$L = 46.5$, $n = 100$, $cf = 84$, $f_p = 7$ and $w = 1$
$P_{90} = 46.5 + \frac{1}{7}(90 - 84) \approx 47.4$
This means that approximately 90% of the measurements lie below this value and 10% lie above it.
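The 90th-percentile formula extends to any percentile by replacing 0.9n with pn/100. The function below is a sketch under that assumption; the name `grouped_percentile` and its parameters are hypothetical, and the frequencies and class limits are those of the frequency table used in this section.

```python
freqs = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]  # class frequencies

def grouped_percentile(p, freqs, start=35.5, width=1.0):
    """Approximate the p-th percentile (0 < p < 100) from class frequencies."""
    n = sum(freqs)
    target = p * n / 100   # number of measurements that should lie below
    cf = 0                 # cumulative frequency of the classes already passed
    for i, f in enumerate(freqs):
        if cf + f > target:            # first class whose cumulative frequency exceeds target
            lower = start + i * width  # lower limit L of the percentile class
            return lower + (width / f) * (target - cf)
        cf += f
    raise ValueError("p must be below 100 and the table must not be empty")

p90 = grouped_percentile(90, freqs)
print(round(p90, 1))  # 47.4
```

Calling the same function with p = 50 gives the grouped median.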
3) The interquartile range
The interquartile range of a set of measurements is
defined as the difference between the upper and lower quartiles.
Although the interquartile range is more sensitive than the range, it is still not sufficient for our purposes.
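For the grouped HDL data, the quartiles and the interquartile range can be approximated with the grouped-percentile formula. This is a sketch, not a prescribed method: the helper name `quartile` is an assumption, and the use of `bisect_right` on the cumulative frequencies is simply one way to find the first class whose cumulative frequency exceeds the target.

```python
from bisect import bisect_right
from itertools import accumulate

freqs = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]  # class frequencies
cum = list(accumulate(freqs))                             # cumulative frequencies
n = cum[-1]

def quartile(k, start=35.5, width=1.0):
    """k-th quartile (k = 1, 2 or 3) from the grouped-percentile formula."""
    target = k * n / 4
    i = bisect_right(cum, target)     # first class with cumulative frequency > target
    cf = cum[i - 1] if i > 0 else 0   # cumulative frequency before that class
    return start + i * width + (width / freqs[i]) * (target - cf)

q1, q3 = quartile(1), quartile(3)
iqr = q3 - q1
print(round(q1, 2), round(q3, 2), round(iqr, 2))  # 40.6 45.07 4.47
```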
4) The variance
The deviation of a measurement is how far it lies from the mean, $(y - \bar{y})$.
The variance of a set of measurements with mean $\bar{y}$ is the sum of the squared deviations divided by $n - 1$.
The sample variance is denoted $S^2$:
$S^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$
A shortcut formula for the variance (and hence the standard deviation) is
$S^2 = \frac{1}{n - 1}\left[\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right]$
For grouped data
$S^2 = \frac{1}{n - 1}\left[\sum f_i y_i^2 - \frac{(\sum f_i y_i)^2}{n}\right]$
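That the definition and the shortcut formula agree can be checked numerically. The sketch below uses the ten measurements from the median example as sample data; the function names are illustrative, and `statistics.variance` from the Python standard library (which also divides by n - 1) serves as an independent check.

```python
import statistics

def variance_definition(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def variance_shortcut(values):
    """Shortcut form: [sum(y^2) - (sum(y))^2 / n] / (n - 1)."""
    n = len(values)
    return (sum(y * y for y in values) - sum(values) ** 2 / n) / (n - 1)

sample = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]
print(variance_definition(sample))
print(variance_shortcut(sample))
print(statistics.variance(sample))
```

All three printed values should agree up to floating-point rounding.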
5) Standard deviation
The standard deviation of a set of measurements is
defined to be the positive square root of the variance.
For example,
Class | Class interval | Frequency f | Midpoint $y_i$ | $f_i y_i$
1 | 35.5-36.5 | 1 | 36 | 36
2 | 36.5-37.5 | 1 | 37 | 37
3 | 37.5-38.5 | 6 | 38 | 228
4 | 38.5-39.5 | 6 | 39 | 234
5 | 39.5-40.5 | 10 | 40 | 400
6 | 40.5-41.5 | 10 | 41 | 410
7 | 41.5-42.5 | 13 | 42 | 546
8 | 42.5-43.5 | 11 | 43 | 473
9 | 43.5-44.5 | 13 | 44 | 572
10 | 44.5-45.5 | 7 | 45 | 315
11 | 45.5-46.5 | 6 | 46 | 276
12 | 46.5-47.5 | 7 | 47 | 329
13 | 47.5-48.5 | 5 | 48 | 240
14 | 48.5-49.5 | 4 | 49 | 196
Totals | | n = 100 | | 4292
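Multiplying each $f_i y_i$ by $y_i$ and summing gives $\sum f_i y_i^2 = 185162$. With $\sum f_i y_i = 4292$ and $n = 100$, the grouped-data formula gives
$S^2 = \frac{1}{99}\left[185162 - \frac{4292^2}{100}\right] = \frac{949.36}{99} \approx 9.59$
$S = \sqrt{9.59} \approx 3.10$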
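The grouped-data variance formula can be applied to the table above in a few lines of Python. This is a sketch that assumes the midpoints 36 to 49 and the frequencies listed in the table; running it lets the reader check each intermediate sum as well as the variance and standard deviation.

```python
import math

freqs = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]  # class frequencies f_i
midpoints = [36 + i for i in range(len(freqs))]           # class midpoints y_i

n = sum(freqs)
sum_fy = sum(f * y for f, y in zip(freqs, midpoints))       # sum of f_i * y_i
sum_fy2 = sum(f * y * y for f, y in zip(freqs, midpoints))  # sum of f_i * y_i^2

# Grouped shortcut formula: S^2 = [sum(f y^2) - (sum(f y))^2 / n] / (n - 1)
variance = (sum_fy2 - sum_fy ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)   # positive square root of the variance

print(n, sum_fy, sum_fy2)
print(round(variance, 2), round(std_dev, 2))
```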