Astronomy basics > Q&A: Statistics

... Q&A: Statistics (cont'd) ...
 
Ways to display data (questions 8–10)
Question
8. What is a box and whisker plot (or box plot) graph?
 
Answer

A box and whisker plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis to show the shape of the distribution, its central value, and its variability. The plot shows the median (line through the box), the lower and upper quartiles (edges of the box), and the most extreme values in the data set (the minimum and maximum, at the ends of the lines, or whiskers). (NOTE: The whiskers may instead be drawn to enclose a chosen fraction of the data, for example from the 5th to the 95th percentile, rather than reaching the minimum and maximum values.)

[Image: box and whisker plot]

A box plot, as it is often called, is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set. Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.
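The five values that a box plot draws can be computed directly. The following is a minimal sketch in Python, assuming made-up measurements and the "inclusive" quartile convention from the standard `statistics` module; other conventions give slightly different quartiles, and the function name `five_number_summary` is only illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    # method="inclusive" interpolates between data points, treating the
    # smallest and largest values as the 0th and 100th percentiles.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical measurements, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(data)
print("min, Q1, median, Q3, max:", summary)
```

The box would span Q1 to Q3, with a line at the median and whiskers reaching to the minimum and maximum.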

 
 
 
Question
9. What is a frequency table?
 
Answer

A frequency table is a way of summarizing a set of data. It is a record of how often each value (or set of values) of the variable in question occurs. It may be enhanced by adding the percentage of observations that fall into each category. A frequency table is used to summarize categorical (nominal or ordinal) data. It may also be used to summarize continuous data once the data set has been divided up into sensible groups.

Example: Suppose that in thirty shots at a target, a marksman makes the following scores:
[Image: list of hypothetical scores]

The frequencies of the different scores can be summarized as:
[Image: frequency table]
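A frequency table like this can also be built in a few lines of code. The sketch below uses Python's `collections.Counter`; the thirty scores are invented for illustration and are not the scores from the original example.

```python
from collections import Counter

# Thirty hypothetical scores (0 to 5 points per shot), invented for this sketch.
scores = [5, 4, 3, 5, 2, 4, 5, 3, 4, 1,
          4, 5, 3, 4, 2, 5, 4, 3, 0, 4,
          3, 5, 4, 2, 3, 4, 5, 3, 4, 1]

counts = Counter(scores)   # how often each score occurs
total = len(scores)

# One row per score: (score, frequency, percentage of all shots).
table = [(score, counts[score], 100 * counts[score] / total)
         for score in sorted(counts)]

print("Score  Frequency  Percent")
for score, freq, pct in table:
    print(f"{score:>5}  {freq:>9}  {pct:>6.1f}%")
```

The percentage column is the optional enhancement mentioned above; the frequencies always add up to the number of observations.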

 
 
 
Question
10. What is a normal distribution?
 
Answer

A normal distribution is a symmetric, bell-shaped curve that extends to infinity in both directions. The high point is at the mean. Examples of normal distributions are shown below. Notice that they differ in how spread out they are, but the area under each curve is the same. That total area is defined to be 1, or 100%, so a value drawn from the distribution is certain to fall somewhere under the curve.

Because half the area under the curve is below the mean and half is above it, there is a 50% chance that a randomly chosen value will be above the mean and the same chance that it will be below it. More generally, the area under the normal curve over any range of values equals the probability of randomly drawing a value in that range. The area is greatest in the middle, where the "hump" is, and thins out toward the tails.
[Image: illustration of some normal distributions]

(Based on graphic from http://davidmlane.com/hyperstat/normal_distribution.html)
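The link between area and probability can be checked numerically. This is a small sketch using `NormalDist` from Python's standard `statistics` module; the two curves (standard deviations of 1 and 3) and the helper name `area_between` are assumptions chosen for illustration.

```python
from statistics import NormalDist

def area_between(dist, low, high):
    """Area under the normal curve between low and high (a probability)."""
    return dist.cdf(high) - dist.cdf(low)

# Two hypothetical normal curves with the same mean but different spreads.
narrow = NormalDist(mu=0, sigma=1)
wide = NormalDist(mu=0, sigma=3)

for dist in (narrow, wide):
    below_mean = dist.cdf(dist.mean)  # area to the left of the mean
    within_one_sd = area_between(dist,
                                 dist.mean - dist.stdev,
                                 dist.mean + dist.stdev)
    print(f"sigma={dist.stdev}: below mean {below_mean:.3f}, "
          f"within one standard deviation {within_one_sd:.3f}")
```

Both curves put half of their area below the mean and roughly 68% within one standard deviation of it, even though one is three times as spread out as the other.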

 
 
 
Bias (questions 11–13)
Question
11. What is bias?
 
Answer

Bias is a systematic error in sample statistics that can arise from poor sampling methods. Sample results may be biased for a number of reasons, such as frame error or selection error.

The sampling frame is the list of population elements or members from which the sample is selected. Frame error results when the sampling frame does not represent a true cross-section of the target population. For example, suppose you survey your neighborhood and talk only to the people who happen to be out on the street. Any data collected in this manner are heavily biased because not everybody in the neighborhood had a chance to respond. What about the people who were inside at the time of the survey? Any conclusions drawn about your neighborhood using this method of sampling will not be representative of the population, i.e., the entire neighborhood.

Selection error involves a systematic bias in the manner in which respondents are selected for participation in the survey. Even if the sampling frame is defined properly to include the appropriate population members, selection error can still occur. Incomplete or improper procedures for selecting participants will lead to selection error. If a sample list were sorted by zip code and interviewers selected survey participants by contacting names in order from the beginning of the list, selection error would occur because those members of the population appearing at the end of the list (higher zip codes) would never be contacted.
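The zip code example can be imitated with a short simulation. The sketch below is only illustrative: the "population" is a made-up sorted list of 1,000 numbers standing in for zip codes, and the sample size and random seed are arbitrary choices.

```python
import random
from statistics import mean

# Hypothetical sample list: 1,000 entries sorted by a zip-code-like value.
population = list(range(1000))
sample_size = 100

# Flawed procedure: contact names in order from the beginning of the list.
first_in_list = population[:sample_size]

# Better procedure: give every entry the same chance of being chosen.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(population, sample_size)

population_mean = mean(population)
biased_error = abs(mean(first_in_list) - population_mean)
random_error = abs(mean(random_sample) - population_mean)

print("Highest value reached by the in-order sample:", max(first_in_list))
print("Error of in-order sample mean:", biased_error)
print("Error of random sample mean:  ", random_error)
```

The in-order sample never reaches past the first tenth of the list, so its mean is far from the population mean, while the random sample's mean lands much closer.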

 
 
 
Question
12. How can bias affect the accuracy of a sample?
 
Answer

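When bias occurs, the sample results are pulled systematically away from the true population values, and the sample distribution can appear skewed relative to the distribution it should follow, such as a normal distribution. A negatively skewed distribution has a longer tail on the side below the mean, while a positively skewed distribution has a longer tail on the side above the mean. In either case, the accuracy of the results will be compromised. Note: skewness alone does not show that a sample is biased, because the population itself may be skewed.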
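The direction of skew can be measured with a single number. The following sketch computes one common skewness coefficient (the average cubed deviation from the mean, in standard-deviation units); the three data sets are invented for illustration, and other definitions of skewness exist.

```python
from statistics import mean, pstdev

def skewness(values):
    """Average cubed deviation from the mean, in standard-deviation units."""
    m = mean(values)
    s = pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

# Hypothetical data sets, invented for this sketch.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]    # long tail above the mean
left_tailed = [-x for x in right_tailed]    # mirror image: tail below the mean
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-tailed", right_tailed),
                   ("left-tailed", left_tailed),
                   ("symmetric", symmetric)]:
    print(f"{name}: skewness = {skewness(data):.3f}")
```

A positive result indicates a positively skewed distribution, a negative result a negatively skewed one, and a result near zero a roughly symmetric one.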

 
 
 
Question
13. Is a computer always unbiased? Do computers always produce random samples?
 
Answer

The answer to both questions is no. A computer's random number generator could be programmed in such a manner as to yield a biased sample. However, for the purposes of the Amazing Space "Galaxy Hunter" online exploration, computers are considered unbiased.
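As one hypothetical illustration of how a program could yield a biased sample, suppose a generator produces the digits 0 through 9 equally often, and a programmer folds them into three categories using the remainder after division by 3. The sketch below counts, without any randomness, how many digits land in each category.

```python
from collections import Counter

# Assumption for this sketch: the underlying generator is fair and
# produces each digit 0..9 equally often.
raw_outcomes = range(10)

# Flawed step: folding ten equally likely digits into three categories.
folded = Counter(value % 3 for value in raw_outcomes)

for category in sorted(folded):
    share = folded[category] / len(raw_outcomes)
    print(f"category {category}: {folded[category]} of 10 digits ({share:.0%})")
```

Category 0 is fed by four digits (0, 3, 6, 9) while the other two are fed by three each, so category 0 would be chosen 40% of the time instead of one third of the time, even though the underlying generator is fair.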

 
 
 
Sample vs. population (questions 14, 15)
Question
14. How does a statistic differ from a parameter?
 
Answer

A statistic is a summary value, such as the mean, mode, or median, calculated from a sample. A parameter is the corresponding summary value for an entire population. Calculating a parameter requires data from the entire population, whereas a statistic is derived from a sample of that population.
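The distinction can be made concrete with a tiny example. In this sketch the population is a made-up list of 20 values, and the sample is five of them chosen at random; the numbers, sample size, and seed are all assumptions for illustration.

```python
import random
from statistics import mean

# Hypothetical population of 20 measurements, invented for this sketch.
population = [12, 15, 15, 16, 18, 19, 21, 21, 21, 22,
              24, 25, 27, 28, 30, 31, 33, 35, 38, 40]

# Parameter: calculated from every member of the population.
parameter_mean = mean(population)

# Statistic: calculated from a sample of the population.
rng = random.Random(7)  # fixed seed so the sketch is repeatable
sample = rng.sample(population, 5)
statistic_mean = mean(sample)

print("Population mean (parameter):", parameter_mean)
print("Sample:", sample)
print("Sample mean (statistic):", statistic_mean)
```

The parameter is a single fixed number, while the statistic would change if a different sample were drawn.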

 
 
 
Question
15. How does one get from a sample statistic to an estimate of the population parameter?
 
Answer

A very large number of different samples can be taken from a large population. One sample from a population might yield a slightly different statistic than another sample taken from the same population, but the statistics should be similar to each other. If more and more samples of the same size were taken from the population, the sampling distribution of a statistic such as the mean would come to resemble a bell curve, or normal distribution, provided the samples are reasonably large.

For an unbiased statistic, the average of the sampling distribution is essentially equivalent to the parameter. The standard deviation of the sampling distribution, called the standard error, measures sampling error: it tells us how statistics from different samples are spread out which, in turn, indicates how far a single statistic is likely to be from the parameter. A low standard error means that the sampling distribution has relatively little variability, so a statistic from any one sample is likely to be close to the parameter.
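The behavior of a sampling distribution can be explored by simulation. The sketch below is a rough illustration under stated assumptions: a made-up population of 10,000 values spread evenly between 0 and 100, samples of 25 values each, 2,000 repeated samples, and a fixed random seed.

```python
import random
from statistics import mean, pstdev, stdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population, invented for this sketch.
population = [rng.uniform(0, 100) for _ in range(10_000)]
parameter = mean(population)

# Draw many samples of the same size and record each sample mean.
sample_size = 25
sample_means = [mean(rng.sample(population, sample_size))
                for _ in range(2_000)]

center = mean(sample_means)            # average of the sampling distribution
standard_error = stdev(sample_means)   # its standard deviation

# Textbook approximation for the mean: population spread / sqrt(sample size).
predicted = pstdev(population) / sample_size ** 0.5

print(f"Parameter (population mean): {parameter:.2f}")
print(f"Average of sample means:     {center:.2f}")
print(f"Spread of sample means:      {standard_error:.2f}")
print(f"Predicted spread:            {predicted:.2f}")
```

The average of the sample means should land very close to the population mean, and their spread should be close to the population's standard deviation divided by the square root of the sample size, which is why larger samples give statistics that sit closer to the parameter.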
