10. SAMPLING AND ESTIMATION
A. Define simple random sampling and a sampling distribution :
Simple random sampling is a method of selecting a sample in such a way that each item or person in the population being studied has the same likelihood of being included in the sample. As an example of simple random sampling, assume that you want to draw a sample of five items out of a group of 50 items. This can be accomplished by numbering each of the 50 items, placing them in a hat, and shaking the hat. Next, one number can be drawn randomly from the hat. Repeating this process (experiment) four more times results in a set of five numbers. The five drawn numbers (items) comprise a simple random sample from the population. In applications like this one, a random-number table or a computer random-number generator is often used to create the sample. Another way to form an approximately random sample is systematic sampling, which selects every nth member from the population.
A sampling distribution :
It is important to recognize that the sample statistic itself is a random variable and, therefore, has a probability distribution. The sampling distribution of the sample statistic is the probability distribution of all possible sample statistics computed from a set of equal-size samples that were randomly drawn from the same population. Think of it as the probability distribution of a statistic from many samples.
B. Explain sampling error :
Sampling error is the difference between a sample statistic (the mean, variance, or standard deviation of the sample) and its corresponding population parameter (the true mean, variance, or standard deviation of the population). For example, the sampling error for the mean is as follows :
sampling error of the mean = sample mean − population mean = x̄ − μ
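The hat-drawing procedure and the sampling-error calculation can be sketched in a few lines of Python; the population of 50 numbered items and the fixed seed are illustrative choices, not part of the original example.

```python
import random
import statistics

# Population of 50 numbered items (here simply the values 1 through 50).
population = list(range(1, 51))

random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.sample(population, k=5)  # each item equally likely, no repeats

# Sampling error of the mean = sample mean - population mean
sampling_error = statistics.mean(sample) - statistics.mean(population)
print(sample, sampling_error)
```

random.sample draws without replacement, matching the hat example in which a drawn item is not returned before the next draw.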
C. Distinguish between simple random and
stratified random sampling :
Stratified random sampling uses a classification system to separate the
population into smaller groups based on one or more distinguishing
characteristics. From each subgroup, or stratum, a random sample is taken and
the results are pooled. The size of the samples from each stratum is based on
the size of the stratum relative to the population.
Stratified sampling is often used in bond indexing because of the
difficulty and cost of completely replicating the entire population of bonds.
In this case, bonds in a population are categorized (stratified) according to
major bond risk factors such as duration, maturity, coupon rate, and the like.
Then, samples are drawn from each separate category and combined to form a
final sample.
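A minimal sketch of stratified sampling with proportional allocation follows; the bucket names, bond labels, and 60/30/10 split are purely illustrative, not taken from any real bond index.

```python
import random

# Hypothetical strata: bonds grouped by a risk factor such as maturity
# (the bucket names and bond labels below are purely illustrative).
strata = {
    "short":  [f"S{i}" for i in range(60)],
    "medium": [f"M{i}" for i in range(30)],
    "long":   [f"L{i}" for i in range(10)],
}

population_size = sum(len(bonds) for bonds in strata.values())  # 100 bonds
sample_size = 10

random.seed(0)
final_sample = []
for bucket, bonds in strata.items():
    # Proportional allocation: each stratum's sample size mirrors its share
    # of the population (60% short, 30% medium, 10% long here).
    k = round(sample_size * len(bonds) / population_size)
    final_sample.extend(random.sample(bonds, k))

print(final_sample)  # 6 short, 3 medium, and 1 long bond
```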
D. Distinguish between time-series and
cross-sectional data :
Time-series data consist of observations taken over a period of time at specific and
equally spaced time intervals.
Cross-sectional data are a sample of observations taken at a single point in time.
Time-series and cross-sectional data can be pooled in the same data set. Longitudinal data are observations over time of multiple characteristics on the same
entity, such as unemployment, inflation, and GDP growth rates for a country
over 10 years. Panel data contain observations over time of the same characteristic for multiple
entities, such as debt/equity ratios for 20 companies over the most recent 24
quarters. Panel and longitudinal data are typically presented in table or
spreadsheet form.
E. Explain the central limit theorem and its importance :
The central limit theorem states that for simple random samples of size n from a population with a mean μ and a finite variance σ², the sampling distribution of the sample mean approaches a normal probability distribution with mean μ and a variance equal to σ²/n as the sample size becomes larger.
The
central limit theorem is extremely useful because the normal distribution is
relatively easy to apply to hypothesis testing and to the construction of
confidence intervals. Specific inferences about the population mean can be made
from the sample mean, regardless of
the population’s distribution, as long as the sample size is “sufficiently
large,” which usually means n ≥ 30.
Important
properties of the central limit theorem include :
-
If the sample size is sufficiently large
(i.e. n ≥ 30), the sampling distribution of the sample means will be approximately
normal.
-
The mean of the population, μ, and the mean of the distribution of all possible sample means are equal.
-
The variance of the distribution of sample means is σ²/n, the population variance divided by the sample size.
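These properties can be checked by simulation, using only the standard library; the uniform population, the sample size of 30, and the 5,000 trials are arbitrary illustrative choices.

```python
import random
import statistics

# Draw many samples of size n from a clearly non-normal population
# (uniform on [0, 1]: mean 0.5, variance 1/12) and study the sample means.
random.seed(1)
n, trials = 30, 5000
sample_means = [statistics.mean(random.random() for _ in range(n))
                for _ in range(trials)]

# CLT prediction: mean of the sample means ≈ 0.5, variance ≈ (1/12)/n
print(statistics.mean(sample_means), statistics.variance(sample_means))
```

Despite the flat, non-normal shape of the underlying population, the distribution of the 5,000 sample means clusters tightly and symmetrically around 0.5 with variance close to σ²/n.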
F. Calculate and interpret the standard error of
the sample mean :
The standard
error of the sample mean is the standard deviation of the distribution of
the sample means.
When the standard deviation of the population is known, the standard error of the sample mean is calculated as :
σx̄ = σ/√n
where σ is the population standard deviation and n is the sample size. When the population standard deviation is unknown, the sample standard deviation, s, is used in its place : sx̄ = s/√n.
As the sample size increases, the sample mean gets closer, on average, to the true mean of the population.
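A one-line computation illustrates the formula; the values σ = 20 and n = 100 are made up for the example.

```python
import math

# Illustrative numbers: known population sigma = 20, sample size n = 100.
sigma, n = 20.0, 100
standard_error = sigma / math.sqrt(n)
print(standard_error)  # 2.0
```

Individual observations vary with standard deviation 20, but means of 100-observation samples vary with standard deviation only 2.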
G. Identify and describe desirable properties of
an estimator :
Regardless of whether we are concerned with point estimates or confidence intervals, there
are certain statistical properties that make some estimates more desirable than
others. These desirable properties of an estimator are unbiasedness,
efficiency, and consistency.
-
An unbiased estimator is one for which the
expected value of the estimator is equal to the parameter you are trying to
estimate. For example, because the expected value of the sample mean is equal
to the population mean, the sample mean is an unbiased estimator of the
population mean.
-
An unbiased estimator is also efficient if
the variance of its sampling distribution is smaller than all the other
unbiased estimators of the parameter you are trying to estimate. The sample
mean, for example, is an unbiased and efficient estimator of the population
mean.
-
A consistent estimator is one for which the accuracy of the parameter estimate increases as the sample size increases. As the sample
size increases, the standard error of the sample mean falls, and the sampling
distribution bunches more closely around the population mean. In fact, as the
sample size approaches infinity, the standard error approaches zero.
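Consistency of the sample mean can be made concrete by tabulating σ/√n for growing n; the value σ = 10 is illustrative.

```python
import math

# With an illustrative population sigma of 10, the standard error of the
# sample mean shrinks toward zero as n grows -- the mark of a consistent
# estimator.
sigma = 10.0
errors = {n: sigma / math.sqrt(n) for n in (25, 100, 400, 10_000)}
print(errors)  # {25: 2.0, 100: 1.0, 400: 0.5, 10000: 0.1}
```

Quadrupling the sample size halves the standard error, so precision improves steadily, if slowly, with n.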
H. Distinguish between a point estimate and
confidence interval estimate of a population parameter :
Point estimates are single (sample) values used to estimate population parameters. The formula used to compute the point estimate is called an estimator. For example, the sample mean, x̄, is an estimator of the population mean μ and is computed using the familiar formula :
x̄ = (Σ xi) / n
The value generated with this calculation for a given sample is called the point estimate of the mean.
A confidence interval is a range of values in
which the population parameter is expected to lie.
I. Describe properties of Student’s
t-distribution and calculate and interpret its degrees of freedom
:
Student’s t-distribution is a bell-shaped probability distribution that is symmetrical about its
mean. It is the appropriate distribution to use when constructing confidence
intervals based on small samples (n < 30) from populations with unknown
variance and a normal, or approximately normal, distribution. It may also be
appropriate to use the t-distribution when the population variance is unknown
and the sample size is large enough that the central limit theorem will assure
that the sampling distribution is approximately normal.
Student’s t-distribution has the following properties :
- It is symmetrical.
- It is defined by a single parameter, the degrees of freedom (df), where the degrees of freedom are equal to the number of sample observations minus 1, n − 1, for sample means.
- It has more probability in the tails (“fatter tails”) than the normal
distribution.
- As the degrees of freedom (the sample size) gets
larger, the shape of the t-distribution more closely approaches that of the
normal distribution.
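The fatter tails can be seen directly in standard t-table critical values; the three df values chosen below are arbitrary illustrations.

```python
# 95% two-tailed critical values from a standard t-table, illustrating the
# fatter tails: each t value exceeds the normal's 1.960 and approaches it
# as the degrees of freedom (df = n - 1) grow.
t_table = {4: 2.776, 29: 2.045, 120: 1.980}
z_95 = 1.960
for df, t in t_table.items():
    print(df, t, t > z_95)
```

With only 4 degrees of freedom a 95% interval must reach out 2.776 standard errors rather than 1.960, reflecting the extra tail probability.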
J. Calculate and interpret a confidence interval
for a population mean, given a normal distribution with 1) a known population
variance, 2) an unknown population variance, or 3) an unknown variance and a
large sample size :
Confidence interval estimates result in a range of values within which the actual value of a
parameter will lie, given the probability of 1 –
α. Here α is called the level of significance for the confidence
interval, and the probability 1 – α is referred to as the degree of
confidence.
Confidence intervals are usually constructed
by adding or subtracting an appropriate value from the point estimate. In general, confidence intervals take on the following form :
- point estimate ± (reliability factor * standard error)
where
point estimate = value of a sample statistic of the population parameter
reliability factor = number that depends on the sampling distribution of the point
estimate and the probability that the point estimate falls in the confidence
interval (1 – α)
standard error = standard error of the point estimate
If the population has a normal distribution with a known variance, a confidence interval for the population mean can be calculated as :
x̄ ± zα/2 × σ/√n
where
x̄ = point estimate of the population mean (sample mean)
zα/2 = reliability factor, the standard normal value for which the probability of α/2 lies in the right-hand (upper) tail of the distribution
σ/√n = the standard error of the sample mean, where σ is the known standard deviation of the population and n is the sample size
The most commonly used standard normal distribution reliability factors are :
zα/2 = 1.645 for 90% confidence intervals (the significance level is 10%, 5% in each tail)
zα/2 = 1.960 for 95% confidence intervals (the significance level is 5%, 2.5% in each tail)
zα/2 = 2.575 for 99% confidence intervals (the significance level is 1%, 0.5% in each tail)
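A worked z-interval, using the standard library's normal distribution; the sample mean of 80, σ of 15, and n of 36 are invented for illustration.

```python
import math
from statistics import NormalDist

# Illustrative numbers: sample mean 80, known population sigma 15, n = 36.
x_bar, sigma, n = 80.0, 15.0, 36
alpha = 0.05  # 95% degree of confidence

z = NormalDist().inv_cdf(1 - alpha / 2)   # reliability factor, ≈ 1.960
half_width = z * sigma / math.sqrt(n)     # reliability factor * standard error
print(x_bar - half_width, x_bar + half_width)  # ≈ 75.10 to 84.90
```

Interpretation: over many samples, intervals constructed this way would contain the true population mean about 95% of the time.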
Confidence
Intervals for Population Mean : Normal with Unknown
Variance
If the distribution of the population is normal with unknown variance, we can use the t-distribution to construct a confidence interval :
x̄ ± tα/2 × s/√n
where
x̄ = point estimate of the population mean
tα/2 = the t-reliability factor corresponding to a t-distributed random variable with n − 1 degrees of freedom, where n is the sample size. The area under the tail of the t-distribution to the right of tα/2 is α/2
s/√n = the standard error of the sample mean
s = sample standard deviation
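A worked t-interval follows; the sample statistics are invented, and the reliability factor is a standard t-table value rather than one the code computes (a statistics library such as SciPy could compute it instead).

```python
import math

# Illustrative numbers: sample mean 50, sample std dev 8, n = 25 (df = 24).
x_bar, s, n = 50.0, 8.0, 25

# t-reliability factor for 95% confidence with 24 df, taken from a t-table.
t_24 = 2.064

half_width = t_24 * s / math.sqrt(n)  # 2.064 * 8 / 5 = 3.3024
print(x_bar - half_width, x_bar + half_width)  # ≈ 46.70 to 53.30
```

Note that the interval is slightly wider than a z-interval built from the same numbers (which would use 1.960), reflecting the t-distribution's fatter tails.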
Confidence
Intervals for a Population Mean When the Population Variance is Unknown Given a
Large Sample From Any Type of Distribution
We now
know that the z-statistic should be used to construct confidence intervals when
the population distribution is normal and the variance is known, and the
t-statistic should be used when the distribution is normal but the variance is
unknown. But what do we do when the distribution is non-normal?
As it
turns out, the size of the sample influences whether or not we can construct
the appropriate confidence interval for the sample mean.
-
If the distribution is non-normal but the
population variance is known, the z-statistic can be used as long as the sample
size is large (n ≥ 30). We can do this because the central limit theorem assures
us that the distribution of the sample mean is approximately normal when the
sample is large.
-
If the distribution is non-normal and the population variance is unknown, the t-statistic can be used as long as the sample
size is large (n ≥ 30). It is also acceptable to use the z-statistic, although
use of t-statistic is more conservative.
This means
that if we are sampling from a non-normal distribution, we cannot create a
confidence interval if the sample size is less than 30.
When sampling from a:                            Test Statistic
                                                 Small Sample (n < 30)   Large Sample (n ≥ 30)
Normal distribution with known variance          z-statistic             z-statistic
Normal distribution with unknown variance        t-statistic             t-statistic
Nonnormal distribution with known variance       not available           z-statistic
Nonnormal distribution with unknown variance     not available           t-statistic
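The decision table can be encoded as a small function; the function name and its boolean parameters are illustrative, not standard terminology.

```python
def reliability_factor(normal_population: bool, known_variance: bool,
                       n: int) -> str:
    """Map the decision table above to the appropriate test statistic."""
    if not normal_population and n < 30:
        return "not available"   # small sample from a non-normal population
    # Normal population, or CLT applies (n >= 30): choice depends on variance.
    return "z-statistic" if known_variance else "t-statistic"

print(reliability_factor(True, True, 10))    # z-statistic
print(reliability_factor(False, False, 50))  # t-statistic
print(reliability_factor(False, True, 10))   # not available
```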
K. Describe the issues regarding selection of
the appropriate sample size, data-mining bias, sample selection bias,
survivorship bias, look-ahead bias, and time-period bias :
We have
seen that a larger sample reduces sampling error and the standard deviation of
the sample statistic around its true population value. Confidence intervals are
narrower when samples are larger and the standard errors of the point estimates
of population parameters are less.
There are
two limitations on this idea of large is better when it comes to selecting an
appropriate sample size. One is that larger samples may contain observations
from a different population. The other consideration is cost. The costs of
using a larger sample must be weighed against the value of the increase in
precision from the increase in sample size.
Data-mining occurs when analysts repeatedly use the same
database to search for patterns or trading rules until one that “works” is
discovered. Data-mining bias refers to results where the statistical
significance of the pattern is overestimated because the results were found
through data mining.
Sample selection bias occurs when some data is systematically excluded from the analysis because it is not available.
Survivorship bias is the most common form of sample selection bias. The solution to survivorship bias is to use a sample that includes all funds that existed at the start of the period, retaining those that were later dropped from the database (e.g., because they were liquidated or merged).
Look-ahead bias occurs when a study tests a relationship using sample data that was not
available on the test date.
Time period bias can result if the time period over which the data is gathered is either
too short or too long. If the time period is too short, research may reflect
phenomena specific to that time period, or perhaps even data-mining.
If the time period is too long, the fundamental economic relationships that underlie the results may have changed.