Conceptual Models

Data can be represented graphically in several ways, including pie charts, bar charts, line graphs, and histograms.

 

I. Pie Chart

 

The data should be arranged in such a way that each observation can fall into one and only one category of the variable.

For example, if we are trying to categorize “Animals” according to the variable “Animal features”, appropriate categories of the variable might be:

 

________________________________________________________

Animal features

Locomotion
Sensory stimulation
Embryonic development
Adaptation

_________________________________________________________

 

 

Assuming that these categories are clearly defined and that the scientist is properly trained, every animal can be placed into one and only one category of the variable.

However, if the categories overlap, an observation can belong to more than one of them, and the data cannot be organized according to these divisions.

Once the data are organized by category, there are several ways to display them graphically. The first and simplest is the pie chart. It displays the percentage of the total number of measurements falling into each category of the variable by partitioning a circle into sectors, one per category.

The data in table 1 will be used as an example.

 

 

Table 1

Animal features          Number    Percentage
Locomotion               20        10
Sensory stimulation      40        20
Embryonic development    60        30
Adaptation               80        40

Figure 1 shows the data listed in table 1 as a pie chart.

Guidelines for constructing pie charts

 

1) Choose a small number of categories for the variable, preferably around 5 or 6. Too many categories make the pie chart difficult to interpret.

2) Construct the pie chart so that the percentages are in either ascending or descending order.
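
As a rough illustration of how the slices of a pie chart are sized, the sketch below converts the counts of table 1 into percentages and sector angles (each percentage taken as the same fraction of 360 degrees) and orders them as the second guideline suggests. The names `counts` and `pie_slices` are hypothetical and chosen only for this example.

```python
# Category counts taken from table 1.
counts = {
    "Locomotion": 20,
    "Sensory stimulation": 40,
    "Embryonic development": 60,
    "Adaptation": 80,
}


def pie_slices(counts):
    """Return (category, percentage, angle in degrees), largest slice first."""
    total = sum(counts.values())
    slices = [
        (name, 100 * n / total, 360 * n / total)
        for name, n in counts.items()
    ]
    # Guideline 2: keep the percentages in order (descending is assumed here).
    return sorted(slices, key=lambda s: s[1], reverse=True)


for name, pct, angle in pie_slices(counts):
    print(f"{name:<22} {pct:5.1f}%  {angle:6.1f} degrees")
```

The percentages produced this way should match the last column of table 1, and the angles should add up to a full circle.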

 

 

 

II. Bar Chart or Bar Graph

Figure 2 shows the data listed in table 1 as a bar chart.

Guidelines for constructing bar charts

 

1) Label numbers or frequencies along the vertical axis and categories of the variable along the horizontal axis.

2) Construct a rectangle over each category of the variable with a height equal to the frequency (number of observations) in the category.

3) Leave a space between adjacent categories on the horizontal axis to imply distinct, separate categories and to clarify the presentation.

 

 

III. Line graph

 

 

Figure 3 shows the data listed in table 1 as a line graph.

Guidelines for constructing line graphs

 

1) Label numbers or frequencies along the vertical axis and categories of the variable along the horizontal axis.

2) Place a point over each category of the variable at a height equal to the frequency (number of observations) in the category.

3) Leave a space between adjacent categories on the horizontal axis to imply distinct, separate categories and to clarify the presentation.

4) Connect the points placed over the categories of the variable.

 

 

 

 

IV. Frequency Histogram and Relative Frequency Histogram

Table 2 lists the results of HDL measurements for 100 adult male humans.

 

 

37  42  44  44  43  42  44  48  49  44
42  38  42  44  46  39  43  45  48  39
47  42  42  48  45  36  41  43  39  42
40  42  40  45  44  41  40  40  38  46
49  38  43  43  39  38  47  39  40  42
43  47  41  40  46  44  46  44  49  44
40  39  45  43  38  41  43  42  45  44
42  47  38  45  40  42  41  40  47  41
47  41  48  41  43  47  42  41  44  48
41  49  43  44  44  43  46  45  46  40

Note that the largest value is 49 and the smallest is 36. Even on close examination of the table, it is difficult to describe how the measurements are distributed along the interval from 36 to 49. Are most of the measurements near 36 or near 49, or are they evenly distributed along the interval? To answer this question, the data must be summarized in a frequency table.

To construct a frequency table, begin by dividing the range from 36 to 49 into an arbitrary number of subintervals called class intervals. The number of intervals chosen depends on the number of measurements in the set; between 5 and 20 class intervals are recommended.

 

Guidelines for constructing class intervals

 

1) Divide the range of the measurements (the difference between the largest and the smallest measurements) by the approximate number of class intervals desired.

2) After dividing the range by the desired number of intervals, round the resulting number to a convenient (easy to work with) unit. This unit represents a common width for the class intervals.

3) Choose the first class interval so that it contains the smallest measurement. It is also advisable to choose a starting point for the first interval so that no measurement falls on a point of division between two subintervals. This eliminates any ambiguity in placing measurements into the class intervals.

 

 

For the data in table 2, the range is

Range = 49 – 36 = 13

Suppose that we want 14 subintervals. Dividing the range by 14 and rounding to a convenient unit gives 13/14 ≈ 0.93, which rounds to 1. Thus the interval width is 1.

It is convenient to choose the first interval to be 35.5 – 36.5, the second to be 36.5 – 37.5, and so on. Note that the smallest measurement (36) falls in the first interval and that no measurement falls on the endpoint of a class interval.
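
The three guidelines can be sketched in a few lines of code. This is only one possible reading of them: the function name `class_intervals` is hypothetical, "a convenient unit" is assumed to mean a whole number of at least 1, and the starting point is assumed to lie half a unit below the smallest measurement, which works when the measurements are whole numbers.

```python
def class_intervals(smallest, largest, approx_classes):
    """Return (width, list of (lower, upper) class limits)."""
    value_range = largest - smallest
    # Guidelines 1 and 2: divide the range by the desired number of classes
    # and round to a convenient unit (assumed: a whole number, at least 1).
    width = max(1, round(value_range / approx_classes))
    # Guideline 3: start half a unit below the smallest measurement so that
    # no whole-number measurement falls on a class boundary.
    lower = smallest - 0.5
    intervals = []
    while lower < largest:
        intervals.append((lower, lower + width))
        lower += width
    return width, intervals


width, intervals = class_intervals(36, 49, 14)
print(width)           # 1
print(len(intervals))  # 14
print(intervals[0])    # (35.5, 36.5)
print(intervals[-1])   # (48.5, 49.5)
```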

Construct a frequency table for the data (table 3).

 

 

Table 3: Frequency table for the data in table 2

Class    Class interval    Frequency f    Relative frequency f/n
1        35.5-36.5         1              0.01
2        36.5-37.5         1              0.01
3        37.5-38.5         6              0.06
4        38.5-39.5         6              0.06
5        39.5-40.5         10             0.10
6        40.5-41.5         10             0.10
7        41.5-42.5         13             0.13
8        42.5-43.5         11             0.11
9        43.5-44.5         13             0.13
10       44.5-45.5         7              0.07
11       45.5-46.5         6              0.06
12       46.5-47.5         7              0.07
13       47.5-48.5         5              0.05
14       48.5-49.5         4              0.04
Totals                     n = 100        1.00
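
The tallying behind table 3 can be reproduced with a short sketch that counts how many of the 100 measurements of table 2 fall in each class interval. The names `hdl` and `frequency_table` are hypothetical, and the default arguments simply restate the choices made above (first interval starting at 35.5, width 1, 14 classes).

```python
# The 100 HDL measurements of table 2, in reading order.
hdl = [
    37, 42, 44, 44, 43, 42, 44, 48, 49, 44,
    42, 38, 42, 44, 46, 39, 43, 45, 48, 39,
    47, 42, 42, 48, 45, 36, 41, 43, 39, 42,
    40, 42, 40, 45, 44, 41, 40, 40, 38, 46,
    49, 38, 43, 43, 39, 38, 47, 39, 40, 42,
    43, 47, 41, 40, 46, 44, 46, 44, 49, 44,
    40, 39, 45, 43, 38, 41, 43, 42, 45, 44,
    42, 47, 38, 45, 40, 42, 41, 40, 47, 41,
    47, 41, 48, 41, 43, 47, 42, 41, 44, 48,
    41, 49, 43, 44, 44, 43, 46, 45, 46, 40,
]


def frequency_table(data, first_lower=35.5, width=1, n_classes=14):
    """Return rows of (lower, upper, frequency, relative frequency)."""
    n = len(data)
    rows = []
    for k in range(n_classes):
        lower = first_lower + k * width
        upper = lower + width
        # Strict inequalities are safe here because the class limits were
        # chosen so that no measurement falls on a boundary.
        f = sum(1 for y in data if lower < y < upper)
        rows.append((lower, upper, f, f / n))
    return rows


table3 = frequency_table(hdl)
for lower, upper, f, rel in table3:
    print(f"{lower}-{upper}  {f:3d}  {rel:.2f}")
```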

 

To construct a frequency histogram, draw two axes: a horizontal axis labeled with the class intervals and a vertical axis labeled with the frequencies. Then construct a rectangle over each class interval with a height equal to the number of measurements falling in a given subinterval.

 

Figure 4 is a frequency histogram of the data in table 3.

The relative frequency histogram is constructed in the same way as the frequency histogram, except that the vertical axis is labeled with relative frequencies.

The distinction between bar charts and histograms is based on the distinction between qualitative and quantitative variables. Bar charts are used to display frequency data from qualitative variables; histograms are appropriate for displaying frequency data for quantitative variables.

 

Numerical Methods for Describing Data

I Measures of Central Tendency

1) The Mode

 

The mode of a set of measurements is the measurement that occurs most often, i.e., with the highest frequency.

For example, consider the data listed in the following table:

62     105    33     80     65
30     75     89     55     105
100    75     42     105    95

The mode is 105, which occurs three times.

For grouped data listed in a frequency table, the mode is the midpoint of the class interval with the highest frequency (the modal interval).

Some data sets have more than one mode; their distributions are called bimodal, trimodal, and so on.
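
A minimal sketch of finding the mode, for both a list of measurements and a frequency table, is given below. The function name `modes` is hypothetical; it returns every value that ties for the highest frequency, so a bimodal data set yields two values. The grouped part uses the midpoints and frequencies of table 3.

```python
from collections import Counter


def modes(measurements):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(measurements)
    highest = max(counts.values())
    return sorted(value for value, f in counts.items() if f == highest)


data = [62, 105, 33, 80, 65, 30, 75, 89, 55, 105, 100, 75, 42, 105, 95]
print(modes(data))  # [105]

# Grouped data: class midpoints and frequencies from table 3.
midpoints = list(range(36, 50))
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]
highest = max(frequencies)
modal_midpoints = [y for y, f in zip(midpoints, frequencies) if f == highest]
print(modal_midpoints)  # [42, 44]
```

In table 3 two class intervals share the highest frequency (13), so the grouped HDL data are an example of a bimodal distribution.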

 

 

 

 

2) The Median

The median of a set of measurements is the middle value when the measurements are arranged in order of magnitude.

For an even number of measurements, the median is the average of the two middle values when the measurements are arranged in order of magnitude.

For example, for the data listed below

95   86   78   90   62   73   89   92   84   76

arranging the data in order of magnitude gives 62, 73, 76, 78, 84, 86, 89, 90, 92, 95, and the median is

(84 + 86) / 2 = 85
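
The two cases (odd and even number of measurements) can be sketched as follows. The function name `median` and the list name `scores` are hypothetical and used only for this illustration.

```python
def median(measurements):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(measurements)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


scores = [95, 86, 78, 90, 62, 73, 89, 92, 84, 76]
print(median(scores))  # 85.0
```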

To calculate the median for grouped data, use the equation below:

$\text{Median} = L + \frac{w}{f_m}\,(0.5n - cf)$

where

$L$ = lower class limit of the interval that contains the median

$n$ = total frequency

$cf$ = cumulative frequency of all classes before the median class (the sum of their frequencies)

$f_m$ = frequency of the class interval containing the median

$w$ = interval width

 

For example, consider the data listed in the following table:

Class    Class interval    Frequency f    Cumulative frequency    Relative frequency f/n    Cumulative relative frequency
1        35.5-36.5         1              1                       0.01                      0.01
2        36.5-37.5         1              2                       0.01                      0.02
3        37.5-38.5         6              8                       0.06                      0.08
4        38.5-39.5         6              14                      0.06                      0.14
5        39.5-40.5         10             24                      0.10                      0.24
6        40.5-41.5         10             34                      0.10                      0.34
7        41.5-42.5         13             47                      0.13                      0.47
8        42.5-43.5         11             58                      0.11                      0.58
9        43.5-44.5         13             71                      0.13                      0.71
10       44.5-45.5         7              78                      0.07                      0.78
11       45.5-46.5         6              84                      0.06                      0.84
12       46.5-47.5         7              91                      0.07                      0.91
13       47.5-48.5         5              96                      0.05                      0.96
14       48.5-49.5         4              100                     0.04                      1.00
Totals                     n = 100                                1.00

The interval that contains the median is the first class interval in which the cumulative relative frequency exceeds 0.5.

For our data this is class interval number 8 (42.5-43.5), so

$L = 42.5$, $f_m = 11$, $n = 100$, $w = 1$, $cf = 47$

$\text{Median} = 42.5 + \frac{1}{11}\,(50 - 47) \approx 42.8$
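
The same calculation can be written as a short sketch that walks through the classes, accumulating frequencies until the median class is reached, and then interpolates within it. The function name `grouped_median` is hypothetical; the lower limits and frequencies are those of the table above.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of grouped data: L + (w / f_m) * (0.5 * n - cf)."""
    n = sum(frequencies)
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        # The median class is the first one whose cumulative frequency
        # exceeds half of the total frequency.
        if cf + f > 0.5 * n:
            return lower + (width / f) * (0.5 * n - cf)
        cf += f
    raise ValueError("no classes given")


lower_limits = [35.5 + k for k in range(14)]
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]
print(round(grouped_median(lower_limits, frequencies, 1), 1))  # 42.8
```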

 

 

3) The arithmetic mean or mean

 

The mean is the sum of the measurements divided by the total number of measurements:

$\bar{y} = \frac{\sum y_i}{n}$

For grouped data, the following formula is used to approximate the mean:

$\bar{y} = \frac{\sum f_i y_i}{n}$

where

$f_i$ = frequency associated with the ith class interval

$y_i$ = midpoint of the ith class interval

The sample mean $\bar{y}$ is used to make inferences about the population mean $\mu$.

For our data, the mean is calculated from the following table:

Class    Class interval    Frequency f    Midpoint y_i    f_i y_i
1        35.5-36.5         1              36              36
2        36.5-37.5         1              37              37
3        37.5-38.5         6              38              228
4        38.5-39.5         6              39              234
5        39.5-40.5         10             40              400
6        40.5-41.5         10             41              410
7        41.5-42.5         13             42              546
8        42.5-43.5         11             43              473
9        43.5-44.5         13             44              572
10       44.5-45.5         7              45              315
11       45.5-46.5         6              46              276
12       46.5-47.5         7              47              329
13       47.5-48.5         5              48              240
14       48.5-49.5         4              49              196
Totals                     n = 100                        4292

 

With $\sum f_i y_i = 4292$ and $n = 100$,

$\bar{y} = \frac{\sum f_i y_i}{n} = \frac{4292}{100} = 42.92 \approx 42.9$
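
The grouped-mean formula amounts to a frequency-weighted average of the class midpoints, which can be sketched as below. The function name `grouped_mean` is hypothetical; the midpoints and frequencies are taken from the table above.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate the mean as sum(f_i * y_i) / n."""
    n = sum(frequencies)
    return sum(f * y for f, y in zip(frequencies, midpoints)) / n


midpoints = list(range(36, 50))  # 36, 37, ..., 49
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]
print(grouped_mean(midpoints, frequencies))  # 42.92
```

In this particular data set every class has width 1 and contains a single whole-number value, so the grouped result coincides with the mean of the individual measurements; with wider classes it would only be an approximation.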

 

 

 

 

II Measures of Variability

1) The range

* The range of a set of measurements is defined to be the difference between the largest and the smallest measurements of the set.

 

* For grouped data, when the individual measurements are not known, the range is taken to be the difference between the upper limit of the last class interval and the lower limit of the first class interval.

 

* The range gives very little information about how the data are distributed about the mean.

 

2) Percentiles

* The pth percentile of a set of n measurements arranged in order of magnitude is the value that has p% of the measurements below it and (100-p)% above it. For example, the 60th percentile is the value that has 60% of the measurements below it and 40% of the measurements above it.

* Specific percentiles of interest are the 25th, 50th (median), and 75th percentiles, often called the lower quartile, the middle quartile, and the upper quartile, respectively.

 

 

For grouped data, percentiles are calculated by interpolating within the class interval that contains the percentile of interest. For example, the 90th percentile is calculated as

$P_{90} = L + \frac{w}{f_p}\,(0.9n - cf)$

where

$P_{90}$ = percentile of interest

$L$ = lower limit of the class interval that includes the percentile of interest

$n$ = total frequency

$cf$ = cumulative frequency of all class intervals before the percentile class

$f_p$ = frequency of the class interval that includes the percentile of interest

$w$ = interval width

To determine $L$, $f_p$, and $cf$, find the first interval for which the cumulative relative frequency exceeds 0.9. This interval contains the 90th percentile.

For our example,

$L = 46.5$, $n = 100$, $cf = 84$, $f_p = 7$, and $w = 1$

$P_{90} = 46.5 + \frac{1}{7}\,(90 - 84) \approx 47.4$

This means that 90% of the measurements lie below this value and 10% lie above it.
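
The calculation generalizes to any percentile by replacing 0.9n with (p/100)n. The sketch below assumes that generalization; the function name `grouped_percentile` is hypothetical, and the lower limits and frequencies are those of the HDL frequency table.

```python
def grouped_percentile(p, lower_limits, frequencies, width):
    """p-th percentile (0 < p < 100) of grouped data by interpolation."""
    n = sum(frequencies)
    target = p * n / 100  # number of measurements that should lie below
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        # The percentile class is the first one whose cumulative
        # frequency exceeds the target.
        if cf + f > target:
            return lower + (width / f) * (target - cf)
        cf += f
    raise ValueError("p must be below 100")


lower_limits = [35.5 + k for k in range(14)]
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]
print(round(grouped_percentile(90, lower_limits, frequencies, 1), 1))  # 47.4
print(round(grouped_percentile(50, lower_limits, frequencies, 1), 1))  # 42.8
```

With p = 50 the function reproduces the grouped median, as expected.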

 

 

 

3) The interquartile range

The interquartile range of a set of measurements is the difference between the upper and lower quartiles.

Although the interquartile range is a more sensitive measure of variability than the range, it is still not sufficient for our purposes.
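
As a worked illustration, the grouped-data percentile formula can be applied to the HDL frequency table to obtain the quartiles. The symbols $Q_1$ and $Q_3$ for the lower and upper quartiles are notation introduced here for convenience. The cumulative frequency first exceeds 0.25n = 25 in class 6 (40.5-41.5, with 24 measurements before it and frequency 10) and first exceeds 0.75n = 75 in class 10 (44.5-45.5, with 71 measurements before it and frequency 7), which gives approximately:

```latex
\begin{align*}
Q_1 &= 40.5 + \frac{1}{10}\,(25 - 24) = 40.6 \\
Q_3 &= 44.5 + \frac{1}{7}\,(75 - 71) \approx 45.1 \\
\text{IQR} &= Q_3 - Q_1 \approx 45.1 - 40.6 = 4.5
\end{align*}
```

So roughly the middle half of the measurements spans about 4.5 units, compared with a range of 13 for the whole data set.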

 

4) The variance

The deviation of a measurement $y$ is the amount by which it differs from the mean, $y - \bar{y}$.

The variance of a set of n measurements with mean $\bar{y}$ is the sum of the squared deviations divided by n-1.

The sample variance is denoted $S^2$:

$S^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$

A shortcut formula for the variance (and hence for the standard deviation) is

$S^2 = \frac{1}{n - 1}\left[\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right]$

For grouped data,

$S^2 = \frac{1}{n - 1}\left[\sum f_i y_i^2 - \frac{(\sum f_i y_i)^2}{n}\right]$
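
The shortcut formula follows from the definition by expanding the squared deviations. The short derivation below writes the mean as $\bar{y} = \sum y_i / n$ and is included only to show why the two forms agree:

```latex
\begin{align*}
\sum (y_i - \bar{y})^2
  &= \sum y_i^2 - 2\bar{y}\sum y_i + n\bar{y}^2 \\
  &= \sum y_i^2 - \frac{2\left(\sum y_i\right)^2}{n}
     + \frac{\left(\sum y_i\right)^2}{n} \\
  &= \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}
\end{align*}
```

Dividing both sides by n-1 gives the shortcut form of the sample variance. The grouped version is obtained in the same way, with each midpoint counted as many times as its frequency.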

 

5) Standard deviation

The standard deviation of a set of measurements is defined to be the positive square root of the variance.

For example, for the grouped HDL data:

Class    Class interval    Frequency f    Midpoint y_i    f_i y_i
1        35.5-36.5         1              36              36
2        36.5-37.5         1              37              37
3        37.5-38.5         6              38              228
4        38.5-39.5         6              39              234
5        39.5-40.5         10             40              400
6        40.5-41.5         10             41              410
7        41.5-42.5         13             42              546
8        42.5-43.5         11             43              473
9        43.5-44.5         13             44              572
10       44.5-45.5         7              45              315
11       45.5-46.5         6              46              276
12       46.5-47.5         7              47              329
13       47.5-48.5         5              48              240
14       48.5-49.5         4              49              196
Totals                     n = 100                        4292

 

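Squaring each midpoint and weighting it by its frequency gives $\sum f_i y_i^2 = 185162$. With $\sum f_i y_i = 4292$ and $n = 100$, the grouped-data formula gives

$S^2 = \frac{1}{99}\left[185162 - \frac{4292^2}{100}\right] = \frac{949.36}{99} \approx 9.59$

$S = \sqrt{9.59} \approx 3.10$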
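
The grouped-data variance and standard deviation can be checked with a short sketch that applies the shortcut formula to the midpoints and frequencies of the table above. The function name `grouped_variance` is hypothetical.

```python
import math


def grouped_variance(midpoints, frequencies):
    """Sample variance of grouped data using the shortcut formula."""
    n = sum(frequencies)
    sum_fy = sum(f * y for f, y in zip(frequencies, midpoints))
    sum_fy2 = sum(f * y * y for f, y in zip(frequencies, midpoints))
    return (sum_fy2 - sum_fy ** 2 / n) / (n - 1)


midpoints = list(range(36, 50))  # 36, 37, ..., 49
frequencies = [1, 1, 6, 6, 10, 10, 13, 11, 13, 7, 6, 7, 5, 4]

s2 = grouped_variance(midpoints, frequencies)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))  # 9.59 3.1
```

A standard deviation of about 3.1 is plausible for measurements that range from 36 to 49.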