
... Q&A: Statistics (cont'd)
...
Ways to display data (questions 8–10)
8. What is a box and whisker plot (or box plot) graph?

A box and whisker plot is a way of summarizing a set
of data measured on an interval scale. It is often used in exploratory
data analysis to show the shape of the distribution, its central value,
and variability. The picture produced shows the most extreme values
in the data set (the minimum and maximum, at the ends of the whiskers),
the lower and upper quartiles (edges of the box), and the median (line
through the box). (NOTE: The whiskers extending from the box may be adjusted
to represent a certain fraction of the data: they could be set at the 5th
and 95th percentiles, or they could represent the minimum and maximum values.)
A box plot, as it is often called, is especially helpful for indicating whether a
distribution is skewed and whether there are any unusual observations
(outliers) in the data set. Box and whisker plots are also very useful
when large numbers of observations are involved and when two or more data
sets are being compared.
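
As a rough illustration of the values a box plot displays, the sketch below computes the minimum, quartiles, median, and maximum for a small data set. The data are hypothetical, the function name five_number_summary is made up for this example, and the "1.5 box lengths" outlier rule is one common convention assumed here, not something the text above prescribes.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # method="inclusive" treats the data as the whole set being described.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical data, invented for illustration only.
scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]
low, q1, median, q3, high = five_number_summary(scores)
print("whisker ends:", low, high)
print("box edges:", q1, q3)
print("median line:", median)

# Assumed convention: values more than 1.5 box lengths beyond the box
# are flagged as possible outliers.
box_length = q3 - q1
outliers = [v for v in scores
            if v < q1 - 1.5 * box_length or v > q3 + 1.5 * box_length]
print("possible outliers:", outliers)
```

Different software packages compute quartiles in slightly different ways, so the box edges may differ a little from one tool to another.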

9. What is a frequency table?
A frequency table is a way of summarizing a set of data. It is a record
of how often each value (or set of values) of the variable in question
occurs. It may be enhanced by adding the percentage of observations that
falls into each category. A frequency table is used to summarize categorical
(nominal or ordinal) data. It may also be used to summarize continuous
data once the data set has been divided up into sensible groups.
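
A minimal sketch of how such a table could be built is shown here. The survey answers are hypothetical and the function name frequency_table is made up for this example.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, count, percent), sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, 100.0 * count / total)
            for value, count in sorted(counts.items())]

# Hypothetical survey answers, invented for illustration only.
answers = ["yes", "no", "yes", "undecided", "yes",
           "no", "yes", "no", "yes", "yes"]

print(f"{'answer':<10} {'count':>5} {'percent':>8}")
for value, count, percent in frequency_table(answers):
    print(f"{value:<10} {count:>5} {percent:>7.1f}%")
```

For continuous data, the same function could be applied after each measurement has been replaced by the label of the group it falls into.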
Example: Suppose that in thirty shots at a target, a marksman makes
the following scores:
The frequencies of the different scores can be summarized as:

10. What is a normal distribution?
A normal distribution is a bell curve that extends to infinity in both
directions. The high point is located at the mean. Examples of normal
distributions are shown below. Notice that they differ in how spread out
they are but the area under each curve is the same. If the total area
under the curve is defined to be 1 (that is, 100%), then any value drawn
from the distribution is certain to fall somewhere under the curve.
Because half the area of the curve is below the mean and half is above
the mean, there is a 50% chance that a randomly chosen value will be
above the mean and the same chance that it will be below it. The area
under the normal curve over any range of values is equivalent to the
probability of randomly drawing a value in that range. The area is
greatest in the middle where the "hump" is and thins out toward the tails.
(Based on graphic from http://davidmlane.com/hyperstat/normal_distribution.html)
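
The link between area and probability can be checked numerically. The sketch below is only an illustration: the two curves (same mean, different spreads) are hypothetical, and it relies on the NormalDist class from Python's standard statistics module, whose cdf method gives the area under the curve to the left of a value.

```python
from statistics import NormalDist

# Two hypothetical normal curves with the same mean but different spreads.
narrow = NormalDist(mu=50, sigma=5)
wide = NormalDist(mu=50, sigma=15)

for curve in (narrow, wide):
    mean, sd = curve.mean, curve.stdev
    below_mean = curve.cdf(mean)        # area to the left of the mean
    above_mean = 1 - below_mean         # the rest of the total area of 1
    # Area within one standard deviation of the mean (the "hump").
    middle = curve.cdf(mean + sd) - curve.cdf(mean - sd)
    # Area of an equally wide slice out in the upper tail.
    tail = curve.cdf(mean + 4 * sd) - curve.cdf(mean + 2 * sd)
    print(f"sd={sd}: below={below_mean:.3f}, above={above_mean:.3f}, "
          f"middle={middle:.3f}, tail={tail:.3f}")
```

Both curves should report the same areas even though one is much more spread out, because the areas are measured in units of each curve's own standard deviation.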

Bias (questions 11–13)
Bias is a systematic error in sample statistics that
can occur from the use of poor sampling methods. Sample design results
may be biased for a number of reasons, such as frame error or selection
error.
The sampling frame is the list of population elements or members from
which the sample is selected. Frame error results when the sampling
frame does not represent a true cross-section of the target population.
For example, suppose you survey your neighborhood and talk only to the
people on the street. Any data collected in this manner are heavily
biased because not everybody in the neighborhood had a chance to respond
— what about the people who were inside at the time of the survey?
Any conclusions drawn about your neighborhood using this method of sampling
will not be representative of the population, i.e., the entire neighborhood.
Selection error involves a systematic bias in the manner in
which respondents are selected for participation in the survey. Even
if the sampling frame is defined properly to include the appropriate
population members, selection error can still occur. Incomplete or improper
procedures for selecting participants will lead to selection error.
If a sample list was sorted by zip code and interviewers selected survey
participants by contacting names in order from the beginning of the
list, selection error would occur because those members of the population
appearing at the end of the list (larger zip codes) would never be contacted.
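
The zip code scenario can be simulated in a few lines. This is a sketch under stated assumptions: the list of 1,000 zip codes is randomly generated for illustration, and the sample size of 100 is arbitrary.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample list: 1,000 entries, each represented by its zip code,
# sorted by zip code as in the scenario above.
zip_codes = sorted(random.randint(10000, 99999) for _ in range(1000))

sample_size = 100
# Contacting names in order from the beginning of the list.
first_in_list = zip_codes[:sample_size]
# Giving every entry on the list an equal chance of being contacted.
random_pick = random.sample(zip_codes, sample_size)

print("largest zip contacted, list order: ", max(first_in_list))
print("largest zip contacted, random pick:", max(random_pick))
print("largest zip on the whole list:     ", max(zip_codes))
```

Working from the top of the list never reaches the larger zip codes, while the random pick draws from the whole range.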

12. How can bias affect the accuracy of a sample?
When bias occurs, the results can be skewed away from the normal
distribution. A negatively skewed curve has a longer tail on the side
below the mean, while a positively skewed distribution has a longer tail
on the side above the mean. In either case, the accuracy of the results
will be compromised. Note: a skewed sample is not necessarily a
biased one.
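
The direction of skew can be expressed as a number. The sketch below uses one common measure, the average cubed distance from the mean in standard deviation units; the choice of this measure, the function name skewness, and the data sets are assumptions made for illustration.

```python
import statistics

def skewness(values):
    """Average cubed z-score: positive for a long upper tail, negative for a long lower tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical data sets, invented for illustration only.
long_upper_tail = [1, 2, 2, 3, 3, 3, 4, 4, 12]
long_lower_tail = [-v for v in long_upper_tail]   # mirror image of the first set

print("long upper tail:", round(skewness(long_upper_tail), 2))
print("long lower tail:", round(skewness(long_lower_tail), 2))
```

A value near zero would indicate a roughly symmetric data set.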
13. Is a computer always unbiased? Do computers always produce random samples?
The answer is no. A computer's random number generator
could be programmed in such a manner as to yield a biased sample. However,
for the purposes of the Amazing Space "Galaxy Hunter" online
exploration, computers are considered unbiased. 
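
To illustrate how a generator could be programmed to yield a biased sample, here is a minimal sketch. The four items and the weights are hypothetical, and this is not a description of how the "Galaxy Hunter" exploration draws its samples.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable
items = ["A", "B", "C", "D"]

# Fair draw: every item is equally likely on each pick.
fair = Counter(rng.choice(items) for _ in range(10_000))

# Deliberately biased draw: item "A" is given five times the weight of the others.
biased = Counter(rng.choices(items, weights=[5, 1, 1, 1], k=10_000))

print("fair:  ", sorted(fair.items()))
print("biased:", sorted(biased.items()))
```

In the fair draw each item should turn up about a quarter of the time, while in the biased draw "A" dominates.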
Sample vs. population (questions 14, 15)
14. How does a statistic differ from a parameter?
A statistic is a generalization concerning an entire
sample, such as the mean, mode, or median. A parameter is a generalization
for an entire population, such as the mean, mode, or median. Calculating
a parameter requires data from the entire population, whereas a statistic
is derived from a sample of that population.
15. How does one get from a sample statistic to an estimate of the population parameter?
A practically unlimited number of different samples can be taken from a large
population. One sample from a population might yield a slightly different
statistic than another sample taken from the same population, but the
statistics should be similar to each other. If more and more samples
of the same size were taken from the population, the sampling distribution
of the statistic would resemble a bell curve or normal distribution.
The average of the sampling distribution is essentially equivalent
to the parameter. The standard deviation of the sampling distribution,
called the standard error (a measure of sampling error), tells us how
widely the statistics from different samples are spread which, in turn,
tells how far a statistic is likely to be from the parameter. A low
standard error means that there is relatively little variability in the
sampling distribution, so a sample statistic is likely to be close to
the parameter.
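
The behavior of the sampling distribution can be explored with a small simulation. The sketch below is an illustration only: the population of 10,000 values is randomly generated, the function name sample_means is made up, and the comparison of two sample sizes (10 and 100) is an addition not discussed in the text above.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values, invented for illustration only.
population = [random.uniform(0, 100) for _ in range(10_000)]
parameter = statistics.fmean(population)   # the population mean (a parameter)

def sample_means(sample_size, num_samples=2_000):
    """Draw many samples of one size and return the mean of each."""
    return [statistics.fmean(random.sample(population, sample_size))
            for _ in range(num_samples)]

results = {}
for sample_size in (10, 100):
    means = sample_means(sample_size)
    center = statistics.fmean(means)   # average of the sampling distribution
    spread = statistics.stdev(means)   # its standard deviation
    results[sample_size] = (center, spread)
    print(f"n={sample_size}: parameter={parameter:.2f}, "
          f"center={center:.2f}, spread={spread:.2f}")
```

The center of each sampling distribution should land close to the parameter, and the larger sample size should show the smaller spread.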


