10. SAMPLING AND ESTIMATION
A. Define simple random sampling and a sampling distribution :
Simple random sampling is a method of selecting a sample in such a way that each item or person in the population being studied has the same likelihood of being included in the sample. As an example of simple random sampling, assume that you want to draw a sample of five items out of a group of 50 items. This can be accomplished by numbering each of the 50 items, placing them in a hat, and shaking the hat. Next, one number can be drawn randomly from the hat. Repeating this process (experiment) four more times results in a set of five numbers. The five drawn numbers (items) comprise a simple random sample from the population. In applications like this one, a random-number table or a computer random-number generator is often used to create the sample. Another way to form an approximately random sample is systematic sampling, which selects every nth member from the population.
A sampling distribution :
It is important to recognize that the sample statistic itself is a random variable and, therefore, has a probability distribution. The sampling distribution of the sample statistic is the probability distribution of all possible sample statistics computed from a set of equal-size samples that were randomly drawn from the same population. Think of it as the probability distribution of a statistic from many samples.
B. Explain sampling error :
Sampling error is the difference between a sample statistic (the mean, variance, or standard deviation of the sample) and its corresponding population parameter (the true mean, variance, or standard deviation of the population). For example, the sampling error for the mean is as follows :
sampling error of the mean = sample mean − population mean = x̄ − μ
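The hat-drawing procedure and the sampling-error calculation can be sketched in a few lines of Python; the population of 50 numbered items and the fixed seed are illustrative choices, not part of the original example.

```python
import random
import statistics

# Population of 50 numbered items (here simply the values 1 through 50).
population = list(range(1, 51))

random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.sample(population, k=5)  # each item equally likely, no repeats

# Sampling error of the mean = sample mean - population mean
sampling_error = statistics.mean(sample) - statistics.mean(population)
print(sample, sampling_error)
```

random.sample draws without replacement, matching the hat example in which a drawn item is not returned before the next draw.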
C. Distinguish between simple random and
stratified random sampling :
Stratified random sampling uses a classification system to separate the
population into smaller groups based on one or more distinguishing
characteristics. From each subgroup, or stratum, a random sample is taken and
the results are pooled. The size of the samples from each stratum is based on
the size of the stratum relative to the population.
Stratified sampling is often used in bond indexing because of the
difficulty and cost of completely replicating the entire population of bonds.
In this case, bonds in a population are categorized (stratified) according to
major bond risk factors such as duration, maturity, coupon rate, and the like.
Then, samples are drawn from each separate category and combined to form a
final sample.
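A minimal sketch of stratified sampling with proportional allocation follows; the bucket names, bond labels, and 60/30/10 split are purely illustrative, not taken from any real bond index.

```python
import random

# Hypothetical strata: bonds grouped by a risk factor such as maturity
# (the bucket names and bond labels below are purely illustrative).
strata = {
    "short":  [f"S{i}" for i in range(60)],
    "medium": [f"M{i}" for i in range(30)],
    "long":   [f"L{i}" for i in range(10)],
}

population_size = sum(len(bonds) for bonds in strata.values())  # 100 bonds
sample_size = 10

random.seed(0)
final_sample = []
for bucket, bonds in strata.items():
    # Proportional allocation: each stratum's sample size mirrors its share
    # of the population (60% short, 30% medium, 10% long here).
    k = round(sample_size * len(bonds) / population_size)
    final_sample.extend(random.sample(bonds, k))

print(final_sample)  # 6 short, 3 medium, and 1 long bond
```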
D. Distinguish between time-series and
cross-sectional data :
Time-series data consist of observations taken over a period of time at specific and
equally spaced time intervals.
Cross-sectional data are a sample of observations taken at a single point in time.
Time-series and cross-sectional data can be pooled in the same data set. Longitudinal data are observations over time of multiple characteristics on the same
entity, such as unemployment, inflation, and GDP growth rates for a country
over 10 years. Panel data contain observations over time of the same characteristic for multiple
entities, such as debt/equity ratios for 20 companies over the most recent 24
quarters. Panel and longitudinal data are typically presented in table or
spreadsheet form.
E. Explain the central limit theorem and its importance :
The central limit theorem states that for simple random samples of size n from a population with a mean μ and a finite variance σ², the sampling distribution of the sample mean approaches a normal probability distribution with mean μ and a variance equal to σ²/n as the sample size becomes larger.
The
central limit theorem is extremely useful because the normal distribution is
relatively easy to apply to hypothesis testing and to the construction of
confidence intervals. Specific inferences about the population mean can be made
from the sample mean, regardless of
the population’s distribution, as long as the sample size is “sufficiently
large,” which usually means n ≥ 30.
Important
properties of the central limit theorem include :
-
If the sample size is sufficiently large
(i.e. n ≥ 30), the sampling distribution of the sample means will be approximately
normal.
-
The mean of the population, μ, and the mean of the distribution of all possible sample means are equal.
-
The variance of the distribution of sample means is σ²/n, the population variance divided by the sample size.
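These properties can be checked by simulation, using only the standard library; the uniform population, the sample size of 30, and the 5,000 trials are arbitrary illustrative choices.

```python
import random
import statistics

# Draw many samples of size n from a clearly non-normal population
# (uniform on [0, 1]: mean 0.5, variance 1/12) and study the sample means.
random.seed(1)
n, trials = 30, 5000
sample_means = [statistics.mean(random.random() for _ in range(n))
                for _ in range(trials)]

# CLT prediction: mean of the sample means ≈ 0.5, variance ≈ (1/12)/n
print(statistics.mean(sample_means), statistics.variance(sample_means))
```

Despite the flat, non-normal shape of the underlying population, the distribution of the 5,000 sample means clusters tightly and symmetrically around 0.5 with variance close to σ²/n.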
F. Calculate and interpret the standard error of
the sample mean :
The standard
error of the sample mean is the standard deviation of the distribution of
the sample means.
When the standard deviation of the population is known, the standard error of the sample mean is calculated as :
σx̄ = σ/√n
where σ is the population standard deviation and n is the sample size. When the population standard deviation is unknown, the sample standard deviation, s, is used in its place : sx̄ = s/√n.
As the sample size increases, the sample mean gets closer, on average, to the true mean of the population.
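A one-line computation illustrates the formula; the values σ = 20 and n = 100 are made up for the example.

```python
import math

# Illustrative numbers: known population sigma = 20, sample size n = 100.
sigma, n = 20.0, 100
standard_error = sigma / math.sqrt(n)
print(standard_error)  # 2.0
```

Individual observations vary with standard deviation 20, but means of 100-observation samples vary with standard deviation only 2.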
G. Identify and describe desirable properties of
an estimator :
Regardless of whether we are concerned with point estimates or confidence intervals, there
are certain statistical properties that make some estimates more desirable than
others. These desirable properties of an estimator are unbiasedness,
efficiency, and consistency.
-
An unbiased estimator is one for which the
expected value of the estimator is equal to the parameter you are trying to
estimate. For example, because the expected value of the sample mean is equal
to the population mean, the sample mean is an unbiased estimator of the
population mean.
-
An unbiased estimator is also efficient if
the variance of its sampling distribution is smaller than all the other
unbiased estimators of the parameter you are trying to estimate. The sample
mean, for example, is an unbiased and efficient estimator of the population
mean.
-
A consistent estimator is one for which the accuracy of the parameter estimate increases as the sample size increases. As the sample
size increases, the standard error of the sample mean falls, and the sampling
distribution bunches more closely around the population mean. In fact, as the
sample size approaches infinity, the standard error approaches zero.
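Consistency of the sample mean can be made concrete by tabulating σ/√n for growing n; the value σ = 10 is illustrative.

```python
import math

# With an illustrative population sigma of 10, the standard error of the
# sample mean shrinks toward zero as n grows -- the mark of a consistent
# estimator.
sigma = 10.0
errors = {n: sigma / math.sqrt(n) for n in (25, 100, 400, 10_000)}
print(errors)  # {25: 2.0, 100: 1.0, 400: 0.5, 10000: 0.1}
```

Quadrupling the sample size halves the standard error, so precision improves steadily, if slowly, with n.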
H. Distinguish between a point estimate and
confidence interval estimate of a population parameter :
Point estimates are single (sample) values used to estimate population parameters. The formula used to compute the point estimate is called an estimator. For example, the sample mean, x̄, is an estimator of the population mean μ and is computed using the familiar formula :
x̄ = (Σ xi) / n
The value generated with this calculation for a given sample is called the point estimate of the mean.
A confidence interval is a range of values in
which the population parameter is expected to lie.
I. Describe properties of Student’s
t-distribution and calculate and interpret its degrees of freedom
:
Student’s t-distribution is a bell-shaped probability distribution that is symmetrical about its
mean. It is the appropriate distribution to use when constructing confidence
intervals based on small samples (n < 30) from populations with unknown
variance and a normal, or approximately normal, distribution. It may also be
appropriate to use the t-distribution when the population variance is unknown
and the sample size is large enough that the central limit theorem will assure
that the sampling distribution is approximately normal.
Student’s t-distribution has the following properties :
- It is symmetrical.
- It is defined by a single parameter, the degrees of freedom (df), where the degrees of freedom are equal to the number of sample observations minus 1, n − 1, for sample means.
- It has more probability in the tails (“fatter tails”) than the normal
distribution.
- As the degrees of freedom (the sample size) gets
larger, the shape of the t-distribution more closely approaches that of the
normal distribution.
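The fatter tails can be seen directly in standard t-table critical values; the three df values chosen below are arbitrary illustrations.

```python
# 95% two-tailed critical values from a standard t-table, illustrating the
# fatter tails: each t value exceeds the normal's 1.960 and approaches it
# as the degrees of freedom (df = n - 1) grow.
t_table = {4: 2.776, 29: 2.045, 120: 1.980}
z_95 = 1.960
for df, t in t_table.items():
    print(df, t, t > z_95)
```

With only 4 degrees of freedom a 95% interval must reach out 2.776 standard errors rather than 1.960, reflecting the extra tail probability.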
J. Calculate and interpret a confidence interval
for a population mean, given a normal distribution with 1) a known population
variance, 2) an unknown population variance, or 3) an unknown variance and a
large sample size :
Confidence interval estimates result in a range of values within which the actual value of a
parameter will lie, given the probability of 1 –
α. Here α is called the level of significance for the confidence
interval, and the probability 1 – α is referred to as the degree of
confidence.
Confidence intervals are usually constructed
by adding or subtracting an appropriate value from the point estimate. In general, confidence intervals take on the following form :
- point estimate ± (reliability factor * standard error)
where
point estimate = value of a sample statistic of the population parameter
reliability factor = number that depends on the sampling distribution of the point
estimate and the probability that the point estimate falls in the confidence
interval (1 – α)
standard error = standard error of the point estimate
If the population has a normal distribution with a known variance, a confidence interval for the population mean can be calculated as :
x̄ ± zα/2 × σ/√n
where
x̄ = point estimate of the population mean (sample mean)
zα/2 = reliability factor, the standard normal value for which the probability of α/2 lies in the right-hand (upper) tail of the distribution
σ/√n = the standard error of the sample mean, where σ is the known standard deviation of the population and n is the sample size
The most commonly used standard normal distribution reliability factors are :
zα/2 = 1.645 for 90% confidence intervals (the significance level is 10%, 5% in each tail)
zα/2 = 1.960 for 95% confidence intervals (the significance level is 5%, 2.5% in each tail)
zα/2 = 2.575 for 99% confidence intervals (the significance level is 1%, 0.5% in each tail)
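A worked z-interval, using the standard library's normal distribution; the sample mean of 80, σ of 15, and n of 36 are invented for illustration.

```python
import math
from statistics import NormalDist

# Illustrative numbers: sample mean 80, known population sigma 15, n = 36.
x_bar, sigma, n = 80.0, 15.0, 36
alpha = 0.05  # 95% degree of confidence

z = NormalDist().inv_cdf(1 - alpha / 2)   # reliability factor, ≈ 1.960
half_width = z * sigma / math.sqrt(n)     # reliability factor * standard error
print(x_bar - half_width, x_bar + half_width)  # ≈ 75.10 to 84.90
```

Interpretation: over many samples, intervals constructed this way would contain the true population mean about 95% of the time.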
Confidence
Intervals for Population Mean : Normal with Unknown
Variance
If the distribution of the population is normal with unknown variance, we can use the t-distribution to construct a confidence interval :
x̄ ± tα/2 × s/√n
where
x̄ = point estimate of the population mean
tα/2 = the t-reliability factor corresponding to a t-distributed random variable with n − 1 degrees of freedom, where n is the sample size. The area under the tail of the t-distribution to the right of tα/2 is α/2
s/√n = the standard error of the sample mean
s = sample standard deviation
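A worked t-interval follows; the sample statistics are invented, and the reliability factor is a standard t-table value rather than one the code computes (a statistics library such as SciPy could compute it instead).

```python
import math

# Illustrative numbers: sample mean 50, sample std dev 8, n = 25 (df = 24).
x_bar, s, n = 50.0, 8.0, 25

# t-reliability factor for 95% confidence with 24 df, taken from a t-table.
t_24 = 2.064

half_width = t_24 * s / math.sqrt(n)  # 2.064 * 8 / 5 = 3.3024
print(x_bar - half_width, x_bar + half_width)  # ≈ 46.70 to 53.30
```

Note that the interval is slightly wider than a z-interval built from the same numbers (which would use 1.960), reflecting the t-distribution's fatter tails.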
Confidence
Intervals for a Population Mean When the Population Variance is Unknown Given a
Large Sample From Any Type of Distribution
We now
know that the z-statistic should be used to construct confidence intervals when
the population distribution is normal and the variance is known, and the
t-statistic should be used when the distribution is normal but the variance is
unknown. But what do we do when the distribution is non-normal?
As it
turns out, the size of the sample influences whether or not we can construct
the appropriate confidence interval for the sample mean.
-
If the distribution is non-normal but the
population variance is known, the z-statistic can be used as long as the sample
size is large (n ≥ 30). We can do this because the central limit theorem assures
us that the distribution of the sample mean is approximately normal when the
sample is large.
-
If the distribution is non-normal and the population variance is unknown, the t-statistic can be used as long as the sample
size is large (n ≥ 30). It is also acceptable to use the z-statistic, although
use of t-statistic is more conservative.
This means
that if we are sampling from a non-normal distribution, we cannot create a
confidence interval if the sample size is less than 30.
When sampling from a:                            Test Statistic
                                                 Small Sample (n < 30)   Large Sample (n ≥ 30)
Normal distribution with known variance          z-statistic             z-statistic
Normal distribution with unknown variance        t-statistic             t-statistic
Nonnormal distribution with known variance       not available           z-statistic
Nonnormal distribution with unknown variance     not available           t-statistic
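The decision table can be encoded as a small function; the function name and its boolean parameters are illustrative, not standard terminology.

```python
def reliability_factor(normal_population: bool, known_variance: bool,
                       n: int) -> str:
    """Map the decision table above to the appropriate test statistic."""
    if not normal_population and n < 30:
        return "not available"   # small sample from a non-normal population
    # Normal population, or CLT applies (n >= 30): choice depends on variance.
    return "z-statistic" if known_variance else "t-statistic"

print(reliability_factor(True, True, 10))    # z-statistic
print(reliability_factor(False, False, 50))  # t-statistic
print(reliability_factor(False, True, 10))   # not available
```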
K. Describe the issues regarding selection of
the appropriate sample size, data-mining bias, sample selection bias,
survivorship bias, look-ahead bias, and time-period bias :
We have
seen that a larger sample reduces sampling error and the standard deviation of
the sample statistic around its true population value. Confidence intervals are
narrower when samples are larger and the standard errors of the point estimates
of population parameters are less.
There are
two limitations on this idea of large is better when it comes to selecting an
appropriate sample size. One is that larger samples may contain observations
from a different population. The other consideration is cost. The costs of
using a larger sample must be weighed against the value of the increase in
precision from the increase in sample size.
Data-mining occurs when analysts repeatedly use the same
database to search for patterns or trading rules until one that “works” is
discovered. Data-mining bias refers to results where the statistical
significance of the pattern is overestimated because the results were found
through data mining.
Sample selection bias occurs when some data is systematically excluded from the analysis because it is not available.
Survivorship bias is the most common form of sample selection bias. The solution to survivorship bias is to use a sample that includes all funds that existed at the start of the period, retaining those that were later dropped from the database (e.g., because they were liquidated or merged).
Look-ahead bias occurs when a study tests a relationship using sample data that was not
available on the test date.
Time period bias can result if the time period over which the data is gathered is either
too short or too long. If the time period is too short, research may reflect
phenomena specific to that time period, or perhaps even data-mining.
If the time period is too long, the fundamental economic relationships that underlie the results may have changed.