Introduction to basic statistics with R

Overview

This workshop will introduce you to basic statistics using R.

Introduction

It is useful to begin with some concrete examples of statistical questions to motivate the material that we’ll cover in the workshop. This will also help confirm that your R environment is working. These data files come from the fosdata package, which you can optionally install on your computer in order to get access to the associated metadata in its help files. Below, I load the data from the workshop website.

# optional: install the fosdata package using remotes::install_github
install.packages( "remotes" )
remotes::install_github( "speegled/fosdata" )
# load the mice_pot and barnacles datasets
mice_pot = read.csv( url( "https://raw.githubusercontent.com/wes-brooks/basic-stats/main/data/mice_pot.csv" ))
barnacles = read.csv( url( "https://raw.githubusercontent.com/wes-brooks/basic-stats/main/data/barnacles.csv" ))

mice_pot dataset

The mice_pot dataset contains data on an experiment that dosed four groups of mice with different levels of THC. There was a low, medium, and high dosage group, as well as a control group that got no THC. The mice were then observed for a while and their total movement was quantified as a percentage of the baseline group mean. Some statistical questions that might arise here are: were there differences in the typical amount of movement between mice of different groups? What was the average amount of movement by mice in the medium dose group?

Both of these questions can be answered for the particular sample, which gives an interesting description of the data. In fact, we can make those calculations right now with base R:

with( mice_pot, by(percent_of_act, group, mean))
## group: 0.3
## [1] 97.3225
## ------------------------------------------------------------ 
## group: 1
## [1] 99.05235
## ------------------------------------------------------------ 
## group: 3
## [1] 70.66787
## ------------------------------------------------------------ 
## group: VEH
## [1] 100

The means aren’t identical - so there are clearly differences between the groups? Yes, in terms of the sample. But if you want to generalize your conclusion to cover what would happen to other mice that weren’t in the study, then you need to think about the population. In this case, that’s the population of all the mice that could have been dosed with THC.

Because we don’t see data from mice that weren’t part of the study, we rely on statistical inference to reach conclusions about the population. How is that possible? Well, some theorems from mathematical statistics can tell us about the distribution of the sample, relative to the population.

barnacles dataset

This dataset was collected by counting the barnacles in 88 grid squares on the Flower Garden Banks coral reef in the Gulf of Mexico. The counts were normalized to barnacles per square meter. Some questions that you might approach with statistical methods are, what is the average number of barnacles per square meter, and is it greater than 300?

mean( barnacles$per_m )
## [1] 332.0186

From that calculation, we see that the mean is 332 barnacles per square meter, which is greater than 300. But again, the calculation has told us only about the mean of the particular locations that were sampled. Wouldn’t it be better to answer the questions by reference to the number of barnacles per square meter of reef, rather than per square meter of measurement? Here, the population is the entire area of the Flower Garden Banks reef. Again, we will be able to answer the questions relative to the entire reef by working out the sample mean’s distribution, relative to the population.

Sample and population

I’ve made reference to samples and populations, and they are fundamental concepts in statistics. A sample is a pretty easy thing to understand - it is the data, the hard numbers that go into your calculations. The population is trickier. It’s the units to which you are able to generalize your conclusions.

For the barnacle data, in order for the population to be the entire Flower Garden Banks reef, the sample must have been carefully selected to ensure it was representative of the entire reef (for instance, randomly sampling locations so that any location on the reef might have been selected).

For the mice on THC, the population is all the mice that might have been selected for use in the experiment. How big that population is depends on how the mice were selected for the experiment. Randomly selecting the experimental units from a group is a common way of ensuring that the results can generalize to that whole group.

A non-random sample tends to mean that the population to which you can generalize is quite limited. What sort of population do you think we could generalize about if we recorded the age of everyone in this workshop?

Probability distributions, quantiles, and densities

Probability distributions are functions that divvy up probability among the possible outcomes of a random variable. Each distribution has some important features, like its density and quantile functions.

The Normal distribution

The Normal distribution plays a very special role in statistics, which we will come to. For now, let’s introduce the distribution. I’ve pictured here the density of a standard normal distribution.

# draw the standard normal distribution function, with some annotations
t = seq(-4, 4, length.out=1001)
plot( x=t, y=dnorm(t), bty='n', type='l', main="standard normal density")
abline(v=0, lty=3)

The probability density is called that because if you select a sample from the distribution, most of the numbers in your sample will be from the part of the curve with the greatest density. The Normal distribution, like any named distribution, represents a population from which samples can be taken. Let’s try it:

# sample 20 numbers from a standard normal distribution:
x = rnorm(20)

# observe your sample:
print( round(x, 2) )
##  [1] -1.17 -0.93  1.87 -0.81 -0.75 -0.57 -0.40  0.62  0.50  0.75  1.09 -0.02
## [13]  0.06 -1.18  1.25  0.63 -1.20  0.47  0.83  1.16

Do the numbers seem to come from the high-density part of the Normal density curve? Are there any that don’t? Most of the samples you’ve just taken should be close to zero, but it isn’t surprising if some are not. On average, about one out of twenty (that’s five percent) samples from a standard Normal population is greater than two or less than negative two. That’s “on average” over the population. Your sample may be different.

# plot the standard normal density function with some annotations
plot( x=t, y=dnorm(t), bty='n', type='l', main="standard normal density")

# annotate the density function with the 5% probability mass tails
polygon(x=c(t[t<=qnorm(0.025)], qnorm(0.025), min(t)), y=c(dnorm(t[t<=qnorm(0.025)]), 0, 0), col=grey(0.8))

polygon(x=c(t[t>=qnorm(0.975)], max(t), qnorm(0.975)), y=c(dnorm(t[t>=qnorm(0.975)]), 0, 0), col=grey(0.8))

Histograms and subsamples

While the density curve represents the standard Normal population, your data’s histogram represents your sample:

# show the histogram of x
hist(x)

It might not look particularly like the Normal density, but that’s because it’s representing a particular sample, not the population. The curve might be more recognizable if the number of samples was greater.

# sample 200 numbers from a standard normal
y = rnorm( 200 )

# show the histogram of y
hist( y )

Now, while I’ve said that the histogram represents a sample rather than a population, that line can be blurred. You can subsample from your sample, using the sample function. There are some situations when you’re not working with a named distribution, but need to take samples from a list. This is how to do that.

# take a sample of 20 numbers from x (without replacement)
x2 = sample( x, 20 )

# take a sample of 20 numbers from x (with replacement)
x3 = sample( x, 20, replace=TRUE )

# calculate the means of x, x2, and x3
mean( x )
## [1] 0.1087935
mean( x2 )
## [1] 0.1087935
mean( x3 )
## [1] 0.1772695

Because x2 was sampled without replacement (and has the same size as x), it contains exactly the same values as x in a different order, so its mean is identical. But x3 was sampled with replacement, so it has a different mean.

Random variables and calculations:

A random variable is our name for an abstract observation from a population.

Mean and median

There are a few kinds of average that are commonly used in practical settings and we will use two of them extensively. The mean is the average you’re probably used to, calculated by summing up the values and then dividing by the number of values. The median is the midpoint of the data - it’s the number you arrive at if you write down all the values in order and then find the halfway point. There are special R functions for both of these.
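
To connect these definitions to the R functions, here is a sketch that computes both by hand for the x sample from above; the results should match mean(x) and median(x) below.

# compute the mean by hand: sum the values and divide by how many there are
sum( x ) / length( x )

# compute the median by hand: sort the values and, since there is an even
# number of them (20), average the two middle values
x_sorted = sort( x )
( x_sorted[ 10 ] + x_sorted[ 11 ] ) / 2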

# calculate the mean and median of x. They aren't exactly zero in the sample.
mean(x)
## [1] 0.1087935
median(x)
## [1] 0.2630617

Expectation

Expectation is the mean of a population. So, a random variable has an expectation but data has a mean. There’s no R function to calculate an expectation because R expects to be working on data rather than on a population. If you think of R as a calculator, you can calculate the mean of any numbers that you can type into the calculator, but when the object is this abstract thing we call a distribution, the calculator can’t proceed.

Remember earlier, when we decided to treat a sample as a population and then take samples from it? Then the mean of the original sample becomes the expectation of all the subsamples:

# subsample x 50 times with replacement, and store the mean of each
subx_mean = numeric( 50 )
for (i in 1:50) {
  subx_mean[[ i ]] = mean( sample(x, replace=TRUE) )
}

# print the results and the mean of the means
print( round( subx_mean, 2 ) )
##  [1] -0.15  0.08  0.19  0.35 -0.08  0.19  0.04  0.04  0.10  0.19 -0.15  0.21
## [13]  0.46  0.08  0.08  0.21 -0.23  0.35  0.46 -0.07  0.06  0.07  0.44 -0.07
## [25]  0.53  0.13 -0.11  0.14  0.20 -0.07 -0.06  0.18  0.25  0.29  0.06  0.39
## [37] -0.05 -0.01 -0.03  0.53  0.05  0.03  0.11  0.43  0.59  0.37  0.09 -0.23
## [49] -0.23  0.11
mean( subx_mean )
## [1] 0.1312283
# show the histogram of sample means, with an annotation at the expectation
hist( subx_mean )
abline( v=mean(x), lty=3 )

You might have noticed that I didn’t specify the size argument for the sample() function that time. By default, sample() will take a subsample of equal size to the input data. You can see from the above histogram that sample means are clustering around the population expectation, even though they don’t match exactly.

Denote the expectation of the random variable \(X\) by \(E(X)\). Typically, statisticians use the Greek letter \(\mu\) to mean the expectation of a Normal random variable.

Standard deviation and variance

The standard deviation comes about because of the need for a calculation that measures how far a typical sample is from the mean. The direct approach would be to calculate the average distance from the mean, but it turns out that this is always exactly zero. So instead, we’ve settled on the standard of calculating the average squared distance from the mean (almost). Here is the formula for the sample standard deviation, which we write as \(s\) for a particular sample or \(S\) for a random variable:

\[ s = \text{sd}(x) = \sqrt{ \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} }\]

The n-1 in the denominator is used in calculating the standard deviation of a sample. For the population standard deviation the denominator would be n, but you will rarely calculate a population standard deviation. In either case, the square of the standard deviation is called variance:

\[ \text{sd}(x) = \sqrt{ \text{var}(x) } \]
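
To make the formula concrete, here is the sample standard deviation of x computed by hand; it should match the result of the sd() function described below.

# compute the sample standard deviation of x by hand, following the formula
n = length( x )
sqrt( sum( (x - mean(x))^2 ) / (n - 1) )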

In R, you calculate the standard deviation with the sd() function. Let’s calculate some standard deviations and variances:

# calculate the standard deviation of the barnacles per unit area
sd( barnacles$per_m )
## [1] 273.2828
# calculate the variance of the body motion for mice who got the medium dose of THC:
var( mice_pot$percent_of_act[ mice_pot$group==1 ] )
## [1] 689.4729

For a Normal distribution, statisticians typically denote the population variance by the Greek letter \(\sigma^2\). In sum, then, a Normal distribution is characterized by its mean and variance parameters, and statisticians usually abbreviate the Normal distribution that has mean \(\mu\) and variance \(\sigma^2\) by N(\(\mu\), \(\sigma^2\)). The standard Normal distribution that I’ve referred to is the N(0, 1) distribution, and you will typically use Z to denote a random variable with that distribution. Let’s observe a few Normal distributions by changing the mean and sd arguments to dnorm():

# plot several Normal distributions with different means and sds:
yy = c(0, 1.5)
plot(x=t, y=dnorm(t, mean=0, sd=1), bty='n', type='l', main="some Normal distributions", ylab='density', ylim=yy)
par( new=TRUE )
plot(x=t, y=dnorm(t, mean=-1.5, sd=0.6), bty='n', type='l', xaxt='n', yaxt='n', xlab='', ylab='', lty=2, ylim=yy, col=2)
par( new=TRUE )
plot(x=t, y=dnorm(t, mean=2, sd=1.5), bty='n', type='l', xaxt='n', yaxt='n', xlab='', ylab='', lty=3, ylim=yy, col=3)
par( new=TRUE )
plot(x=t, y=dnorm(t, mean=0.5, sd=0.3), bty='n', type='l', xaxt='n', yaxt='n', xlab='', ylab='', lty=4, ylim=yy, col=4)

The basic shape persists, but the location and width change.

Density, sampling, quantile, and distribution functions

All of R’s built-in distributions are characterized by four functions, with a consistent pattern in their naming:

  • Density function begins with d (dnorm())
  • Sampling function begins with r (rnorm())
  • Quantile function begins with q (qnorm())
  • Distribution function begins with p (pnorm())

Of these, I’ve already demonstrated dnorm() and rnorm().

Quantile function

The quantile function maps from the 0-1 range back to the possible values of the distribution. This becomes useful for answering questions like, “what is the critical value that a standard normal random variable falls below 95% of the time?” If you denote that critical value by \(z_{0.95}\), then the quantile function solves for \(z_{0.95}\) in \[ Pr( Z \le z_{0.95} ) = 0.95 \] Remember when I said that about 5% of standard Normal samples are less than negative two or greater than two? Let’s check it (the exact cutoffs turn out to be ±1.96):

# demonstrate quantiles of the standard normal distribution
qnorm(0.95)
## [1] 1.644854
qnorm(0.5)
## [1] 0
# if you want an interval that contains the random variable 95% of the time, you may divide the remaining 5% in half, so the interval is in the center:
(1 - 0.95) / 2
## [1] 0.025
qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964

Thus, we can construct an interval from -1.96 to 1.96 that contains 95% of samples from N(0, 1).
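
As a quick sanity check (a sketch using a large simulated sample, not part of the workshop data), the proportion of standard Normal draws that land inside that interval should be close to 0.95.

# check the 95% interval by simulation
z = rnorm( 100000 )
mean( z >= -1.96 & z <= 1.96 )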

If you’re working with a vector of numbers instead of a named distribution, then the quantiles are calculated by the quantile() function:

# calculate the 0.025 and 0.975 quantiles of y
# specify the vector first, then the quantiles:
quantile(y, 0.025)
##      2.5% 
## -2.399738
quantile(y, 0.975)
##    97.5% 
## 2.173209
# get both quantiles in one line:
quantile(y, c(0.025, 0.975))
##      2.5%     97.5% 
## -2.399738  2.173209
# put the quantiles on the histogram
hist(y)
abline( v=quantile(y, c(0.025, 0.975)), lty=3 )

Distribution function

The distribution function is the inverse of the quantile function. It calculates the probability that a random variable is less than or equal to some number. That is, \(F_Z(t) = Pr(Z \le t)\). In an introductory statistics class, you’d typically spend a lot of time using paper tables to get comfortable going between the quantile and distribution functions. We can just do the calculations in R:

# Demonstrate the standard normal distribution function:
pnorm(0)
## [1] 0.5
pnorm(-1)
## [1] 0.1586553
pnorm(1)
## [1] 0.8413447
# draw the standard normal distribution function, with some annotations
t = seq(-4, 4, length.out=1001)
plot( x=t, y=pnorm(t), bty='n', type='l', main="standard normal distribution function")
# annotate the plot with lines indicating the quantiles for a 95% probability interval
lines(x=c(min(t), qnorm(0.025)), y=rep(0.025, 2), lty=2)
lines(x=c(min(t), qnorm(0.975)), y=rep(0.975, 2), lty=2)

lines(x=rep(qnorm(0.025), 2), y=c(0, 0.025), lty=3)
lines(x=rep(qnorm(0.975), 2), y=c(0, 0.975), lty=3)
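
Because the distribution function is the inverse of the quantile function, composing the two should return the original input. A quick check:

# pnorm() and qnorm() are inverses of each other
pnorm( qnorm(0.975) )
qnorm( pnorm(1) )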

If you’re working on a vector of numbers rather than a named distribution, then you can use the mean() function to calculate what proportion are less than some value:

# what proportion of the y sample is less than or equal to -2? What proportion is greater than 2?
mean( y <= -2 )
## [1] 0.06
1 - mean( y<= 2 )
## [1] 0.035

Other distributions

I have focused on the Normal distribution, but there are many others! You can take a little time to generate samples or quantiles or probabilities from some other distributions, like Poisson (rpois(), qpois(), etc.), binomial (rbinom(), qbinom(), etc), or t (pt(), rt(), etc). Alternatively, try answering some questions:

  1. What is the probability that a sample from the N(\(1\), \(3^2\)) distribution is less than 0?
  2. For the sample y, what is an interval that contains 80% of the values?

Mathematical statistics

Now, let’s look at some of the math that allows us to manipulate random variables in order to reason about their distributions. We begin with some important identities that are used constantly when doing algebra with random variables. In particular, we will see how these identities relate to the mean. That’s because the mean has some special properties: you’ve seen how we can calculate intervals based on known distributions that contain samples with a specified frequency. But we need to know the distribution. It turns out that the distribution of the sample mean approaches the Normal distribution as the sample size increases, for almost any independent data. That allows us to create intervals and reason about the distribution of real data, even though the data’s distribution is unknown.

Identities

The following hold for random variables \(X_1, X_2, \dots\) that are independent:

  • \(E(aX_1) = aE(X_1)\)

  • \(E(X_1 + X_2) = E(X_1) + E(X_2)\)

  • \(var(aX_1) = a^2 var(X_1)\)

  • \(var(X_1 + X_2) = var(X_1) + var(X_2)\)

In the case of calculating the mean, \(a\) is \(1/n\). So if the \(X_1, X_2, \dots\) all share the same mean \(\mu\) and the same variance \(\sigma^2\), then the expectation identities say that the expectation of the mean is \[E[ \frac{1}{n} (X_1 + X_2 + \cdots + X_n) ] = \frac{1}{n} [ E(X_1) + E(X_2) + \cdots + E(X_n) ] \\ = \frac{1}{n} [n \times \mu ] \\ =\mu\]

And the variance identities say that the variance of the mean is \[var[ \frac{1}{n} (X_1 + X_2 + \cdots + X_n) ] = \frac{1}{n^2} [ var(X_1) +var(X_2) + \cdots + var(X_n) ] \\ = \frac{1}{n^2} [n \times \sigma^2 ] \\ =\sigma^2 / n\]

So the mean of a sample of independent random variables has the same expectation as the population, but its variance is smaller by a factor of \(n\). As a result, increasing sample size leads to greater precision in estimating the population mean. Let’s take a look at some examples:

# create a random variable X with mean -3 and standard deviation 3
X = function(n=20, mean=-3, sd=3) { rnorm(n, mean=mean, sd=sd) }

# show the histogram of some samples of X as well as the 0.12 and 0.88 quantiles:
x = X()
hist( x )
abline( v=quantile( x, c(0.12, 0.88) ), lty=2 )

# generate 50 samples of X and calculate their means:
mean_x = numeric( 50 )
for (i in 1:50) {
  mean_x[[ i ]] = mean( X() )
}

# plot the histogram of mean_x as well as the 0.12 and 0.88 quantiles
hist( mean_x )
abline( v=quantile( mean_x, c(0.12, 0.88) ), lty=2 )
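
We can also check the variance identity numerically. With X as defined above (standard deviation 3 and n = 20 per sample), the variance of the sample means should be close to 9 / 20 = 0.45. This sketch uses 2000 repetitions (an arbitrary choice) to get a steadier estimate:

# numerical check: the variance of the sample mean should be close to 9 / 20 = 0.45
many_means = replicate( 2000, mean( X() ) )
var( many_means )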

Law of large numbers

The law of large numbers says that if the individual measurements are independent, then the mean of a sample tends toward the mean of the population as the sample size gets larger. This is what we’d expect, since we showed the rate at which the variance of the sample mean gets smaller is \(1/n\).

nn = c(1, 2, 4, 8, 12, 20, 33, 45, 66, 100)
means = sapply( nn, function(n) mean( rnorm(n) ) )
plot(nn, means, bty='n', ylab = "sample mean")
abline(h=0, lty=2)

Central limit theorem

The most important mathematical result in statistics, the Central Limit Theorem says that if you take (almost) any sample of random numbers and calculate its mean, the distribution of the mean tends toward a normal distribution. We illustrate the “tending toward” with an arrow and it indicates that the distribution of a sample mean is only approximately Normal. But if the original samples were from a Normal distribution then the sample mean has an exactly Normal distribution. From here, I’ll start writing the mean of a random variable \(X\) as \(\bar{X}\) and the mean of a sample \(x\) as \(\bar{x}\).

\[ \bar{X} \rightarrow N(\mu, \frac{\sigma^2}{n}) \] And because of the identities we learned before, you can write this as \[\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \rightarrow N(0, 1) \]
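
As a quick illustration of this convergence, here is a sketch using exponential samples, which are clearly not Normal; the sample size (40) and number of repetitions (1000) are arbitrary choices for illustration.

# illustrate the CLT with samples from an exponential distribution.
# the exponential(1) distribution has mean 1 and standard deviation 1.
clt_means = replicate( 1000, ( mean( rexp(40) ) - 1 ) / ( 1 / sqrt(40) ) )

# the histogram should look like the standard Normal density
hist( clt_means )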

This is significant because we can use the standard normal functions on the right, and the data on the left, to start answering questions like, “what is the 95% confidence interval for the population mean?”

But isn’t \(\sigma\) unknown, too?

I’ve presented calculations that use the parameter \(\sigma\). But in reality, \(\sigma\) is just as unknown as \(\mu\). Is it safe to replace \(\sigma\) in the calculations by the standard deviation that we calculate from the sample using the sd() function? Not quite. If \(S\), the sample standard deviation (this is what you get from R’s sd() function), is used in place of \(\sigma\) in the CLT, then you need to include the uncertainty of estimating \(\sigma\) by \(S\) in the uncertainty of the distribution. As a result, the distribution will be slightly less clustered at the central point and have “heavy tails”. This distribution is called the “t-distribution”, and its discovery was a triumph of early modern statistics.

\[\frac{\bar{X} - \mu}{S/\sqrt{n}} \rightarrow t_{n-1} \]

Here the \(n-1\) subscript denotes the “degrees of freedom” of the t-distribution.

# plot the densities of a standard normal distribution, and of a t-distribution with five degrees of freedom:
t = seq(-4, 4, length.out=1001)
  
plot( x=t, y=dnorm(t), ylim=c(0, 0.5), bty='n', lty=2, type='l', ylab="density" )
par( new=TRUE )
plot( x=t, y=dt(t, df=5), ylim=c(0, 0.5), xaxt='n', yaxt='n', xlab='', ylab='', bty='n', col='red', type='l' )

Once again, if the original samples were from a Normal distribution, then the distribution of the mean is exactly a \(t\) distribution, but otherwise the \(t\) distribution is an approximation that gets better as the sample size increases.

What’s more important than memorizing the precise formula is to remember that for (almost) any data where the samples are independent, the mean comes from some distribution that is more and more like the Normal distribution as the sample size increases. Let’s take a look at an example using the uniform distribution.

# generate 20 samples from a uniform distribution and plot their histogram
N = 20
u = runif( N )
hist( u )

# generate 100 repeated samples of the same size, calculate the mean of each one, and plot the histogram of the means.
B = 100
uu = numeric( B )
for ( i in 1:B ) {
  uu[[i]] = mean( runif(N) )
}

hist(uu)

# what happens as B and N get larger and smaller? Do they play different roles?

Events and intervals

Answering realistic statistical questions is often a matter of evaluating the probability of some event.

Given a t-distribution with eight degrees of freedom, the probability of sampling a number randomly from the distribution which is less than -1 is found from the distribution function:

pt(-1, df=8)
## [1] 0.1732968

For that same distribution, if you want to know the value which is exceeded by 12% of samples, then you would use the quantile function:

qt( 1 - 0.12, df=8 )
## [1] 1.26934

Again using the same distribution, you can find the interval that contains 90% of samples.

qt( c(0.05, 0.95), df=8 )
## [1] -1.859548  1.859548

But if that \(t_8\) random variable is the result of a calculation like \(\frac{\bar{X} - \mu}{S/\sqrt{n}}\), then the event that it is in the interval (-1.86, 1.86) is the event

\[ -1.86 < \frac{\bar{X} - \mu}{S/\sqrt{n}} \le 1.86 \]

Here, \(\bar{X}\) and \(S\) are random, while \(n\) is the known sample size and \(\mu\) is the unknown but fixed parameter that we want to put bounds around. Just rearranging terms we can get:

\[ \bar{X} - 1.86 \times S / \sqrt{n} \le \mu < \bar{X} + 1.86 \times S / \sqrt{n} \] Since this event occurs 90% of the time, we call the interval a 90% confidence interval. To change the confidence level, you would evaluate different quantiles of the t-distribution. If the degrees of freedom are not eight, then you would use the correct number for your data.
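
Here is a sketch of that calculation applied to a simulated sample of size nine, so the t quantiles have eight degrees of freedom as above; the population mean (5) and standard deviation (2) are made-up values for illustration.

# apply the 90% confidence interval formula to a simulated sample of size nine.
# the population mean (5) and sd (2) are made-up values for illustration.
w = rnorm( 9, mean=5, sd=2 )
mean( w ) + qt( c(0.05, 0.95), df=8 ) * sd( w ) / sqrt( 9 )

If you repeat this many times, the interval should contain the true mean of 5 in about 90% of repetitions.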

Questions to ponder

  • Can you see a way to make the confidence interval smaller other than changing the confidence level? Why would that be helpful?
  • In the formula for a confidence interval, what is random and what is fixed? What would change if you collected new data by repeating the experiment? Does this suggest something about what the 90% means for our 90% confidence interval?

Inference

Inference means making statements about the population based on a sample. In one simple case, inference can mean using a sample to estimate the mean of the population from which it arose. Since the law of large numbers says that the sample mean tends to the population mean, we’ll use the sample mean to approximate the population mean. You’ve already seen this in the plots where I showed the sample mean getting closer to the population mean as the sample size increased.

Revisiting the standard normal distribution:

The distribution, density, and quantile functions are all about the population. When you’re working with data, the graphs won’t be as clean. When you’re using methods that depend upon the data being approximately normal, you’ll have to check some diagnostics. One simple but effective approach is to look at the histogram and QQ plot, and judge whether there is obvious non-normality. Here’s an example from the standard normal distribution:

# sample 50 numbers from the standard normal distribution
y = rnorm(50)

# look at the histogram - it's like the density function
hist(y)

# look at the QQ plot - it demonstrates how closely the distribution matches the quantiles of a Normal distribution.
qqnorm(y)

# plot a couple of non-normal QQ plots:
qqnorm( rexp(50) )

qqnorm( rt(50, df=2) )

Looking at real data

Normal data: using the t-distribution

In the fosdata package there is a dataset called mice_pot, which contains data from an experiment where mice were dosed with THC and then measured for motor activity as a percentage of their baseline activity. We are going to look at the group that got a medium dose of THC.

# extract just the mice that got the medium dose of THC
mice_med = mice_pot[ mice_pot$group == 1, ]

# assess normality with histogram and QQ plot
hist( mice_med$percent_of_act )

qqnorm( mice_med$percent_of_act )

Find the 80% confidence interval for the population mean

Now we are using our sample to make some determination about the population, so this is statistical inference. Our best guess of the population mean is the sample mean, mean( mice_med$percent_of_act ), which is 99.1%. But to get a confidence interval, we need to use the formula \[ \bar{x} \pm t_{n-1,\, 0.9} \times s / \sqrt{n} \] where \(t_{n-1,\, 0.9}\) is the 0.9 quantile of the t-distribution with \(n-1\) degrees of freedom. Going piece-by-piece in this formula:

# calculate the sample mean, standard deviation, and sample size:
x_bar = mean( mice_med$percent_of_act )
s = sd( mice_med$percent_of_act )
n = nrow( mice_med )

# calculate the relevant quantiles of the t-distribution.
# use the 0.1 and 0.9 quantiles because we want the 80% CI
# and (1-0.8) / 2 = 0.1 and 1 - 0.1 = 0.9.
t_low = qt(0.1, df=n-1)
t_high = qt(0.9, df=n-1)

# calculate the confidence interval:
cat( "The 80% CI is (", x_bar + t_low * s / sqrt(n), ", ",
     x_bar + t_high * s / sqrt(n), ").")
## The 80% CI is ( 88.71757 ,  109.3871 ).
# check your calculation with the t.test function:
t.test( mice_med$percent_of_act, conf.level=0.8 )
## 
##  One Sample t-test
## 
## data:  mice_med$percent_of_act
## t = 13.068, df = 11, p-value = 4.822e-08
## alternative hypothesis: true mean is not equal to 0
## 80 percent confidence interval:
##   88.71757 109.38712
## sample estimates:
## mean of x 
##  99.05235

Non-normal data: resampling

In the fosdata package, there is a dataset called barnacles that is from a study where scientists counted the number of barnacles on rocks in the intertidal zone. We loaded this data at the start of the workshop; now let’s calculate the barnacle density and look at its distribution:

# calculate the number of barnacles per unit area for each sample:
barnacles$per_m = barnacles$count / barnacles$area_m

# examine the histogram and QQ plot of the barnacles per unit area:
hist( barnacles$per_m, main="barnacles per square meter" )

qqnorm( barnacles$per_m )

# does it look like the barnacles per unit area
# are distributed like a normal distribution?

We should conclude that the data are obviously non-normal.

Bootstrap-t distribution:

In this case, the observations themselves are not normally distributed, so the t-distribution derived from the CLT will be only an approximation. But there is another option: resampling. This means resampling our original data with replacement to generate new values of the t-statistic, and then working directly from this “bootstrap-t” distribution for inference. Here is the procedure for generating your bootstrap-t distribution:

  1. Compute the sample mean \(\bar{x}\).
  2. Repeat the following many times:
  • Resample from the barnacle data with replacement, calling the resample z.
  • Calculate the t-statistic (centered at \(\bar{x}\)) for this resample and call it t_boot[[i]]. The formula is ( mean(z) - mean(barnacles$per_m) ) / ( sd(z) / sqrt(n) ).
  3. Get the 0.025 and 0.975 quantiles of the bootstrap-t distribution, \(t_{boot, 0.025}\) and \(t_{boot, 0.975}\).
  4. Calculate the confidence interval as \[ (\bar{x} + t_{boot, 0.025} \times s / \sqrt{n},\ \bar{x} + t_{boot, 0.975} \times s / \sqrt{n}) \]
# define the sample size and a vector to hold results
B = 1000
t_boot = numeric( B )

# generate the bootstrap-t distribution for this data
for (i in 1:B) {
  z = sample( barnacles$per_m, replace=TRUE )
  t_boot[[i]] = ( mean(z) - mean(barnacles$per_m) ) /
    ( sd(z) / sqrt(length(z)) )
}

# plot the histogram of the bootstrap-t distribution
hist(t_boot)

# annotate the histogram with 0.025 and 0.975 quantiles
t_lower = quantile( t_boot, 0.025 )
t_upper = quantile( t_boot, 0.975 )
abline( v = c(t_lower, t_upper), lty=3)

# calculate the confidence interval
round( mean(barnacles$per_m) + 
        c(t_lower, t_upper) * sd(barnacles$per_m) / 
        sqrt(length(barnacles$per_m)), 2 )
##   2.5%  97.5% 
## 263.52 380.29
# compare to the result we'd see with a t confidence interval
t.test( barnacles$per_m )
## 
##  One Sample t-test
## 
## data:  barnacles$per_m
## t = 11.397, df = 87, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  274.1155 389.9216
## sample estimates:
## mean of x 
##  332.0186

Hypothesis testing

We’ve seen how to generate some confidence intervals for locating the population mean based on a sample. Besides confidence intervals, another main topic of statistical inference is hypothesis testing.

Null hypothesis and p-value

Recall that in statistical inference, we are using some sample to make conclusions about a population. The sample can’t tell us what the population is, since many different populations could generate the same samples. But if we have a sample and a particular idea for a population, then we can see whether the sample is unlikely, given the proposed population.

When testing a hypothesis in statistics, you propose some specific hypothesis, called the null hypothesis. Then, you must evaluate how unusual the data are, given the assumption that the null hypothesis is true. As an example, let’s test whether the mice that got a medium dose of THC have population mean activity that is equal to 100. Call this specific null mean \(\mu_0\). Then the t-statistic for the test is

\[t = \frac{\bar{X} - \mu_0}{S / \sqrt{n} } = \frac{99.05 - 100}{26.26 / \sqrt{12} } \approx -0.125\]

When working out the confidence interval, we already saw (from the histogram and QQ plot) that these data are approximately Normally distributed, and that the t-statistic for these data has 11 degrees of freedom. I will therefore draw the density of a t-distribution with 11 degrees of freedom, annotated with the observed t-statistic:

# plot the t density with 11 df
xx = seq(-4, 4, length.out=1000)
plot( xx, dt(xx, df=11), type='l', bty='n', ylab="density", xlab='t')

# annotate with the observed t-statistic
t = (mean(mice_med$percent_of_act) - 100) /
  (sd(mice_med$percent_of_act) / sqrt(n))
abline(v=t, lty=3, col='red', lwd=2)

The observed t-statistic doesn’t appear to be unusual under the assumed null, because it is from the fat part of the probability density. We can quantify that idea by shading the area under the density curve that is more rare than the observed t-statistic. The area of that shading is the probability that something equally or more unlikely than the observed data would occur if the null hypothesis is true. This gets called the p-value, and when it is small, we conclude that the null hypothesis should be rejected.

# plot the t density with 11 df
xx = seq(-4, 4, length.out=1000)
plot( xx, dt(xx, df=11), type='l', bty='n', ylab="density", xlab='t')

# annotate with the observed t-statistic
abline(v=t, lty=3, col='red', lwd=2)

# shade the tails
polygon(x=c(xx[xx<=t], t, min(xx)), y=c(dt(xx[xx<=t], df=11), 0, 0), col=grey(0.8))
polygon(x=c(xx[xx>=-t], max(xx), -t), y=c(dt(xx[xx>=-t], df=11), 0, 0), col=grey(0.8))
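
We can also compute the shaded area directly from the t distribution function; the result should match the p-value that t.test() reports below.

# the two-sided p-value: twice the tail area beyond the observed t-statistic
2 * pt( -abs(t), df=11 )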

Here, clearly, the shaded area is not small, so we don’t reject the null hypothesis. In the plot, I have shaded both the upper and lower tails. That’s because the alternative hypothesis is two-sided (\(\mu \ne 100\)) rather than one-sided (for example, \(\mu < 100\)). If you wish, you may formulate a one-sided alternative (this needs to be done before looking at the data). As you may imagine, the shaded area is reduced if you take away one of the two tails, and this makes it easier to reject the null hypothesis against a one-sided alternative. The test can be done more simply with the t.test function:

# test whether the population mean movement is equal to 100%
t.test( mice_med$percent_of_act, mu=100 )
## 
##  One Sample t-test
## 
## data:  mice_med$percent_of_act
## t = -0.12502, df = 11, p-value = 0.9028
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
##   82.36893 115.73577
## sample estimates:
## mean of x 
##  99.05235
# same test, but one-sided
t.test( mice_med$percent_of_act, mu=100, alternative='less')
## 
##  One Sample t-test
## 
## data:  mice_med$percent_of_act
## t = -0.12502, df = 11, p-value = 0.4514
## alternative hypothesis: true mean is less than 100
## 95 percent confidence interval:
##      -Inf 112.6651
## sample estimates:
## mean of x 
##  99.05235

Two-population test

The test of \(\mu_0 = 100\) is a one-population test because it seeks to compare a single population against a specified standard. On the other hand, you may wish to assess the null hypothesis that the movement of mice in the high-THC group is equal to the movement of mice in the medium-THC group. This is called a two-population test, since there are two populations to compare against each other. The null hypothesis is \(\mu_{0, med} = \mu_{0, high}\). Testing a two-population hypothesis requires first assessing normality and also checking whether the variances are equal. There are separate procedures when the variances are equal vs. unequal.

#extract the samples to be compared
a = mice_pot$percent_of_act[ mice_pot$group == 1]
b = mice_pot$percent_of_act[ mice_pot$group == 3]

# check for equal variances - these are close enough
var(a)
## [1] 689.4729
var(b)
## [1] 429.4551
# confirm equal variances with a boxplot
boxplot(a, b)

# check whether the high-THC mice movement is Normal
# (we already checked for the medium-dose mice)
qqnorm(b)

# two pop test
t.test(a, b, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  a and b
## t = 2.7707, df = 20, p-value = 0.0118
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.014608 49.754345
## sample estimates:
## mean of x mean of y 
##  99.05235  70.66787

As a side note, what counts as “equal” variance is a judgement call. I might say that variances within a factor of two are roughly equal. Here’s an example of unequal variances, which is clearly discernible from the boxplot:

# import the mtcars data
data(mtcars)

# make a boxplot of mpg by cylinder count
boxplot( mpg ~ cyl, data=mtcars)

The variances of mpg for the six and eight cylinder cars are approximately equal to each other, but they are different from the variance of the mpg of four-cylinder cars.

Hypothesis tests for non-normal data

Just as with the confidence intervals, there is a bootstrap hypothesis test that can be used where the data are not normal. There are other options, too, with clever derivations. The one I’ll show you is the Wilcoxon test, which is based on the ranks of the data.

Since we’ve already seen that the barnacles per square meter data are not normal, I will illustrate testing the null hypothesis that \(\mu_0\) = 300 barnacles per square meter. This is a one-population test, and a two-sided alternative.

# wilcoxon test of the null hypothesis that the barnacle density
# is centered at 300 per square meter
wilcox.test( barnacles$per_m, mu=300 )

The same function can also be used for two-population testing, like the example of the medium vs. high THC doses:

# wilcoxon test of equality between high-THC and medium-THC mice
wilcox.test( a, b )
## 
##  Wilcoxon rank sum exact test
## 
## data:  a and b
## W = 94, p-value = 0.02492
## alternative hypothesis: true location shift is not equal to 0

Analysis of variance

Finally, we will introduce the analysis of variance, called ANOVA. For our examples, we return to data that we first worked with at the start of the workshop - the amount of motor activity for mice that were dosed with different amounts of THC. ANOVA is specifically for testing hypotheses where there are several groups in the experiment and you want to show whether there is any difference between them. In our hypothesis testing framework, the null hypothesis here is that there is no difference between the groups. For example, the null hypothesis of our example is that there is no difference in the amount of motor activity among the mice that got the low, medium, high, or control doses of THC.

Eyeball test

Before doing any formal testing, it is best to plot the data and see what results would make sense.

# boxplot of activity by mice that got different THC doses
with( mice_pot, boxplot(percent_of_act ~ group) )

The eyeball test says that there is no clear difference between the VEH, 0.3, and 1 groups, but the 3 group (which got the high dose of THC) has a different amount of motor activity. This suggests that we will reject the null hypothesis, because that hypothesis said that there is no difference among any of the groups.

ANOVA

Now, let’s do a formal ANOVA:

# create the linear model
fit_mice = lm( percent_of_act ~ group, data=mice_pot )

# process the model through the anova function
anova( fit_mice )
## Analysis of Variance Table
## 
## Response: percent_of_act
##           Df Sum Sq Mean Sq F value Pr(>F)  
## group      3   6329 2109.65  3.1261 0.0357 *
## Residuals 42  28344  674.85                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table is telling us that we should reject the null hypothesis of no difference between the different THC doses (p=0.036), which accords with our intuitive sense from the eyeball test. Here is a brief explanation of the columns of the ANOVA table.

  1. Df is the number of degrees of freedom for each treatment. This example has only one treatment (group) and it uses three degrees of freedom because that is one fewer than the number of categories in the group treatment. Residual degrees of freedom are then whatever is left of the total \(n-1\) degrees of freedom. Here, that’s 42 because there are 46 rows in the data and 42 = 46-3-1.
  2. Sum Sq is the explained sum of squares for each treatment. This is beyond what we’ll get into today, but think of how there is some total variation in the height of humans, but because men are typically a bit taller than women, there is somewhat less variability left after accounting for each person’s sex. The amount of variability in height that is explained by sex is the difference between the variability of all human heights and the variability that remains after accounting for sex.
  3. Mean Sq is the mean sum of squares for each row. It is calculated by dividing the Sum Sq column by the Df column.
  4. F value is the row’s F-statistic, which is calculated by dividing the Mean Sq column for a treatment by the Mean Sq column for the Residuals. There is only one treatment in this example.
  5. Pr(>F) is the p-value for each treatment in the ANOVA.

An important note is that since the residual mean square is in the denominator when calculating the F-statistic, a smaller residual mean square means a larger F-statistic, and therefore greater power to detect differences between groups. Increasing the residual degrees of freedom - i.e., increasing n (the sample size) - also increases the power of the test.
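
As a quick arithmetic check of the table, using the rounded values it prints (so the results agree only approximately):

# check the arithmetic in the ANOVA table, using the rounded values shown above
6329 / 3          # Mean Sq for group
28344 / 42        # Mean Sq for residuals
2109.65 / 674.85  # F value
pf( 3.1261, df1=3, df2=42, lower.tail=FALSE )  # Pr(>F)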

Kruskal-Wallis test

The F-test in ANOVA depends on the normality of the group means, which could come because the raw data follows a normal distribution or because the sample size in each group is large enough to rely on the Central Limit Theorem. If neither is the case, then you can’t use an ANOVA F-test to test for differences between groups. The Kruskal-Wallis test is based on ranking the data, and can be applied when ANOVA cannot. When the ANOVA assumptions are valid, the Kruskal-Wallis test can be used too, but in that situation it brings less power than the ANOVA F-test. What happens if we use the Kruskal-Wallis test on the mice THC example?

# use the Kruskal-Wallis test on the mice_pot data
kruskal.test( percent_of_act ~ group, data=mice_pot )
## 
##  Kruskal-Wallis rank sum test
## 
## data:  percent_of_act by group
## Kruskal-Wallis chi-squared = 7.2035, df = 3, p-value = 0.06569

The p-value for this test is 0.066, which would cause you not to reject the null at the 0.05 level of significance.

Regression

Regression is a mathematical tool that allows you to estimate how some response variable is related to some predictor variable(s). There are methods that handle continuous or discrete responses of many different distributions, but we are going to focus on linear regression here.

Linear regression means that the relationship between the predictor variable(s) and the response is a linear one. To illustrate, we’ll create a plot of the relationship between the waist measurement and bmi of 81 adults:

# import the adipose body fat data file
adipose = read.csv( url("https://raw.githubusercontent.com/ucdavisdatalab/basic_stats_r_1/main/data/adipose.csv") )

# plot the relationship between the waist_cm and bmi variables
with( adipose, plot(waist_cm, bmi, bty='n') )

The relationship between the two is apparently linear (you can imagine drawing a straight line through the data). The general mathematical form of a linear regression line is

\[ y = a + \beta x + \epsilon \]

Here, the response variable (e.g., BMI) is called \(y\) and the predictor (e.g. waist measurement) is \(x\). The coefficient \(\beta\) indicates how much the response changes for a change in the predictor (e.g., the expected change in BMI with a 1cm change in waist measurement). Variable \(a\) denotes the intercept, a constant offset that shifts the regression line up or down so that it passes through the center of the data. Finally, \(\epsilon\) is the so-called residual error in the relationship. It represents the variation in the response that is not due to the predictor(s).
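
To make the pieces of the equation concrete, here is a minimal simulation from this model; the intercept, slope, and error standard deviation are made-up values for illustration.

# simulate data from the linear model y = a + beta * x + epsilon
# (a, beta, and the error sd are made-up values for this illustration)
a = 2
beta = 0.5
x_sim = runif( 100, min=0, max=10 )
y_sim = a + beta * x_sim + rnorm( 100, sd=1 )

# the simulated responses scatter around the straight line a + beta * x
plot( x_sim, y_sim, bty='n' )
abline( a, beta, col='red' )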

Fitting a regression line

The R function to fit the model is called lm(). Let’s take a look at an example:

# fit the linear regression BMI vs waist_cm
fit = lm( bmi ~ waist_cm, data=adipose )

# plot the fitted regression: begin with the raw data
with( adipose, plot(waist_cm, bmi, bty='n') )

#now plot the fitted regression line (in red)
abline( coef(fit)[[1]], coef(fit)[[2]], col='red' )

Assumptions and diagnostics

“Fitting” a linear regression model involves estimating \(a\) and \(\beta\) in the regression equation. You can do this fitting procedure using any data, but the results won’t be reliable unless some conditions are met. The conditions are:

  1. Observations are independent.
  2. The linear model is correct.
  3. The residual error is Normally distributed.
  4. The variance of the residual error is constant for all observations.

The first of these conditions can’t be checked - it has to do with the design of the experiment. The rest can be checked, though, and I’ll take them in order.

Checking that the linear model is correct

In the case of a simple linear regression model (one predictor variable), you can check this by plotting the predictor against the response and looking for a linear trend. If you have more than one predictor variable, then you need to plot the fitted values against the response to look for a linear trend. We’ll see an example by adding height as a predictor for BMI (in addition to waist measurement).

# linear model for BMI using waist size and height as predictors
fit2 = lm( bmi ~ waist_cm + stature_cm, data=adipose )

# plot the fitted versus the predicted values
# plot the observed BMI values against the fitted values
plot( fit2$fitted.values, adipose$bmi, bty='n' )

Checking that the residuals are normally distributed

We have already learned about the QQ plot, which shows visually whether some values are Normally distributed. In order to depend upon the fit from a linear regression model, we need to see that the residuals are Normally distributed, and we use the QQ plot to check.

Checking that the variance is constant

In an earlier part, we saw that the variance is (almost) the average squared distance from the mean. But that would be just a single number, when we want to see whether there is a trend. So, as with the QQ plot, you’ll plot the residuals and use your eyeball to discern whether their spread changes with the fitted values or is approximately constant - this is called the scale-location plot. The QQ plot and scale-location plot are both created by plotting the fitted model object.

# set up the pattern of the panels
layout( matrix(1:4, 2, 2) )

# make the diagnostic plots
plot( fit )

The “Residuals vs. Fitted” plot is checking whether the linear model is correct. There should be no obvious pattern if the data are linear (as is the case here). The Scale-Location plot will have no obvious pattern if the variance of the residuals is constant, as is the case here (you might see a slight pattern in the smoothed red line but it isn’t obvious). And the QQ plot will look like a straight line if the residuals are from a Normal distribution, as is the case here. So this model is good. The fourth diagnostic plot is the Residuals vs. Leverage plot, which is used to identify influential outliers. We won’t get into that here.

Functions for inspecting regression fits

When you fit a linear regression model, you are estimating the parameters of the regression equation. In order to see those estimates, use the summary() function on the fitted model object.

# get the model summary
summary( fit2 )
## 
## Call:
## lm(formula = bmi ~ waist_cm + stature_cm, data = adipose)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1290 -1.0484 -0.2603  1.2661  5.2572 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.38196    3.82700   3.758 0.000329 ***
## waist_cm     0.29928    0.01461  20.491  < 2e-16 ***
## stature_cm  -0.08140    0.02300  -3.539 0.000680 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.724 on 78 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:   0.84 
## F-statistic:   211 on 2 and 78 DF,  p-value: < 2.2e-16

Here you can see that the average marginal effect of one additional centimeter of waist measurement is to increase BMI by 0.3 and an additional centimeter of height is associated with a change to BMI of -0.08. You can get the coefficients from the fitted model object using the coef() function, and there are some other functions that allow you to generate the values shown in the summary table.

# get the coefficients of the fitted regression
beta = coef( fit2 )
round( beta, 2 )
## (Intercept)    waist_cm  stature_cm 
##       14.38        0.30       -0.08
# get the variance-covariance matrix
cat( "\nvariance-covariance matrix:\n" )
## 
## variance-covariance matrix:
round( vcov( fit2 ), 4)
##             (Intercept) waist_cm stature_cm
## (Intercept)     14.6459  -0.0039    -0.0836
## waist_cm        -0.0039   0.0002    -0.0001
## stature_cm      -0.0836  -0.0001     0.0005
# compare the square root of the diagonals of the variance-covariance matrix
# to the standard errors are reported in the summary table:
se = sqrt( diag( vcov(fit2) ))

cat( "\nstandard errors:\n")
## 
## standard errors:
round( se, 3 )
## (Intercept)    waist_cm  stature_cm 
##       3.827       0.015       0.023
# calculate the t-statistics for the regression coefficients
# (compare these to the t-statistics reorted in the summary table)
t_stats = beta / se

cat("\nt-statistics:\n")
## 
## t-statistics:
round( t_stats, 2 )
## (Intercept)    waist_cm  stature_cm 
##        3.76       20.49       -3.54
# calculate the p-values:
pval = 2 * pt( abs(t_stats), df=78, lower.tail=FALSE )
round(pval, 4)
## (Intercept)    waist_cm  stature_cm 
##       3e-04       0e+00       7e-04
# this is the residual standard error:
sd( fit2$residuals ) * sqrt(80 / 78)
## [1] 1.72357
# R-squared is the proportion of variance
# explained by the regression model
round( 1 - var(fit2$residuals) / var(adipose$bmi), 3 )
## [1] 0.844

A model that fails diagnostics

We’ve seen a model that has good diagnostics. Now let’s look at one that doesn’t. This time, we’ll use linear regression to make a model of the relationship between waist measurement and the visceral adipose tissue fat (measured in grams). The visceral adipose tissue fat is abbreviated vat in the data. First, since the model uses a single predictor variable, let’s look at the relationship with a scatterplot.

# plot the relationship between waist_cm and vat
with( adipose, plot( waist_cm, vat, bty='n' ))

The plot is obviously not showing a linear relationship, which violates one of the conditions for linear regression. Also, you can see that there is less variance of vat among the observations that have smaller waist measurements, which violates the assumption that the residual variance has no relationship to the fitted values. To see how these problems show up in the diagnostic plots, we need to fit the linear regression model.

# estimate the model for vat
fit_vat = lm( vat ~ waist_cm, data = adipose )

# there is no problem creating the summary table:
summary( fit_vat )
## 
## Call:
## lm(formula = vat ~ waist_cm, data = adipose)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -996.25 -265.96  -61.87  191.24 1903.46 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3604.196    334.241  -10.78   <2e-16 ***
## waist_cm       51.353      3.937   13.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 479 on 79 degrees of freedom
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6789 
## F-statistic: 170.2 on 1 and 79 DF,  p-value: < 2.2e-16
# show the diagnostic plots
layout( matrix(1:4, 2, 2) )
plot( fit_vat )

There is obviously a curved pattern in the Residuals vs. Fitted plot, and in the Scale vs. Location plot. Residuals vs. Fitted shows a fan-shaped pattern, too, which reflects the increasing variance among the greater fitted values. The QQ plot is not a straight line, although the difference is not as obvious. In particular, the upper tail of residuals is heavier than expected. Together, all of these are indications that we may need to do a log transformation of the response. A log transformation helps to exaggerate the differences between smaller numbers (make the lower tail heavier) and collapse some difference among larger numbers (make the upper tail less heavy).

# fit a regression model where the response is log-transformed
fit_log = lm( log(vat) ~ waist_cm, data=adipose )

# plot the diagnostics for the log-transformed model
plot( fit_log )

The diagnostics do not look good after the log transformation, but now the problem is the opposite: a too-heavy lower tail and residual variance decreases as the fitted value increases. Perhaps a better transformation is something in between the raw data and the log transform. Try a square-root transformation.

# fit a model where the vat is square root transformed
fit_sqrt = lm( sqrt(vat) ~ waist_cm, data=adipose )

# plot the diagnostics for the log-transformed model
plot( fit_sqrt )

These look acceptable for real-world data.

Predictions and variability

There are two scales of uncertainty for a regression model: uncertainty in the fitted relationship, and the uncertainty of a predicted outcome. The uncertainty of a prediction is always greater because it is calculated by adding the uncertainty of the fitted line to the uncertainty of a single data point around that fitted line. We can illustrate using the example of the model we just created to relate the waist measurement to the square root of vat. For this block of code, we’ll need the mvtnorm library to be loaded.

# import mvtnorm. install it if necessary.
library( mvtnorm )

# draw the data on the transformed scale
with( adipose, plot(waist_cm, sqrt(vat), bty='n') )

# plot the fitted regression line
abline( coef(fit_sqrt)[[1]], coef(fit_sqrt)[[2]], col='red' )

# plot 100 samples from the distribution of the regression line.
for (i in 1:100) {
  cc = rmvnorm( n=1, mean=coef(fit_sqrt), sigma=vcov(fit_sqrt) )
  abline( cc[[1]], cc[[2]], col=grey(0.8))
}

Clearly, the variability of the data points is greater than the variability of the fitted line (that’s why they lie outside the envelope of the fitted lines). We can extract a confidence interval for fitted values or predictions with the predict() function.

# draw the data on the transformed scale
with( adipose, plot(waist_cm, sqrt(vat), bty='n') )

# plot the fitted regression line
abline( coef(fit_sqrt)[[1]], coef(fit_sqrt)[[2]], col='red' )

# define some waist measurements where we'll construct confidence intervals
pred_pts = data.frame( waist_cm = c(70, 85, 110) )

# calculate the 90% CI at each of the pred_pts
ff = predict(fit_sqrt, pred_pts, interval="confidence", level=0.9)
pp = predict(fit_sqrt, pred_pts, interval="prediction", level=0.9)

# convert the confidence intervals to data.frames
ff = as.data.frame(ff)
pp = as.data.frame(pp)

# add the three confidence intervals to the plots
# (offset them a bit for clarity in the plot)
for (i in 1:3) {
  lines( x=rep( pred_pts$waist_cm[[i]] - 0.5, 2),
        y=c( ff$lwr[[i]], ff$upr[[i]] ), col='blue', lwd=2 )

  lines( x=rep( pred_pts$waist_cm[[i]] + 0.5, 2),
        y=c( pp$lwr[[i]], pp$upr[[i]] ), col='orange', lwd=2 )
}

# add a legend
legend(c("90% CI (fitted values)", "90% CI (predicted values)"),
       x="topleft", lwd=2, col=c("blue", "orange"), bty='n')

One thing to notice about the confidence intervals is that the interval is smallest (so the precision of the estimation is greatest) at the mean of the predictor variable. This is a general rule of fitting regression.
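
One way to see this, as a sketch that reuses fit_sqrt and the adipose data from above, is to compute the fitted-value interval over a grid of waist measurements and plot its width.

# width of the 90% CI for the fitted value across a grid of waist measurements
grid = data.frame( waist_cm = seq( min(adipose$waist_cm),
                                   max(adipose$waist_cm), length.out=100 ) )
ci = predict( fit_sqrt, grid, interval="confidence", level=0.9 )

# the interval should be narrowest near the mean waist measurement (dotted line)
plot( grid$waist_cm, ci[, "upr"] - ci[, "lwr"], type='l', bty='n',
      xlab="waist_cm", ylab="width of 90% CI" )
abline( v=mean(adipose$waist_cm), lty=3 )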