In a research laboratory setting, experiments often produce large data sets. To make sense of them, it is usually a good idea to perform a regression on the data points. Linear regression is a method for modeling the linear relationship between a dependent (response) variable $y$ and one or more independent (explanatory) variables $x$ by calculating the line of "best fit" through the data points. In this article, we look at simple linear regression, which involves only a single independent variable.
Consider the general equation for a line: $y=mx+b$, where $m$ denotes the slope of the line and $b$ the $y$-intercept. The general equation for a linear regression remains largely the same, albeit with different notation for the slope and intercept, as well as an additional term. $$y_i=\beta_1x_i+\beta_0+\epsilon_i \tag 1$$ Here, $\epsilon_i$ is the error of the linear regression; that is, $\epsilon_i$ is the difference between the actual response value, $y_i$, and the predicted response value, $\hat y_i$ (read "$y$ hat"). The overall goal of calculating a regression line is to find the slope and intercept values that minimize the total error. Because individual errors can be either negative or positive and would otherwise cancel, we minimize the sum of the squared $\epsilon_i$ values. For this reason, the method is also known as a least-squares regression model.
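To make the least-squares objective concrete, here is a short Python sketch that evaluates $\sum_i \epsilon_i^2$ for a candidate slope and intercept (the function name `sum_squared_errors` and the toy data are ours, chosen purely for illustration); fitting a regression amounts to finding the $(\beta_0, \beta_1)$ pair that makes this quantity as small as possible.

```python
def sum_squared_errors(beta0, beta1, xs, ys):
    """Sum of squared residuals for the candidate line y = beta1*x + beta0."""
    return sum((y - (beta1 * x + beta0)) ** 2 for x, y in zip(xs, ys))

# Toy data lying close to the line y = 2x + 1: a good candidate line
# yields a small squared-error total, a poor one a large total.
xs, ys = [1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8]
print(sum_squared_errors(1, 2, xs, ys))   # small: near the best fit
print(sum_squared_errors(0, -1, xs, ys))  # large: a bad candidate line
```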
Let $n$ be the total number of data points in a set, $\bar x$ be the mean of the $x$ values, and $\bar y$ be the mean of the $y$ values. The predicted slope of a simple linear regression is given by $$\hat\beta_1=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2}=\frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} =r\left(\frac{s_y}{s_x}\right)\tag 2$$ The predicted intercept is then given by $$\hat\beta_0=\bar y - \hat\beta_1\bar x \tag 3$$ Note that, as shown in equation $(2)$, the slope of the linear regression is the ratio of the covariance of the sample data to the variance of $x$. Also, $r$ is the correlation coefficient, while $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$, respectively.
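Equations $(2)$ and $(3)$ translate directly into code. Below is a minimal Python sketch (the function name `fit_line` is our own) that computes $\hat\beta_1$ and $\hat\beta_0$ from raw data:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line per equations (2) and (3)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Equation (2): slope = Cov(x, y) / Var(x)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    # Equation (3): intercept = y_bar - slope * x_bar
    intercept = y_bar - slope * x_bar
    return slope, intercept
```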
A positive value for $\hat\beta_1$ indicates a positive correlation between $x$ and $y$. That is, as $x$ increases, $y$ also increases. A negative $\hat\beta_1$ value, meanwhile, indicates a negative correlation: as $x$ increases, $y$ decreases.
The correlation coefficient, $r$, indicates the strength of the linear relationship between the variables and is bounded in absolute value by $|r|\leqslant 1$. The higher the absolute value of $r$, the stronger the linear correlation, with $r=1$ being a perfect positive correlation and $r=-1$ being a perfect negative correlation (anticorrelation). Values closer to $0$ suggest a weak linear correlation, with $r=0$ indicating no linear correlation at all. (Note that $r=0$ does not imply that $x$ and $y$ are independent; it only rules out a linear relationship.) Since equation $(2)$ gives $\hat\beta_1=r(s_y/s_x)$ and standard deviations are positive, $r$ is directly proportional to the slope and always shares its sign: $r\propto\hat\beta_1$.
For another measure of dependence, we can rearrange equation $(2)$ to solve for $r$. $$r=\hat\beta_1\left(\frac{s_x}{s_y}\right) \tag 4$$ Since variance is the square of the standard deviation, squaring both sides of equation $(4)$ gives the following. $$r^2=\hat\beta_1^2\left(\frac{s_x^2}{s_y^2}\right)=\hat\beta_1^2\left(\frac{\mathrm{Var}(x)}{\mathrm{Var}(y)}\right) \tag 5$$ The square of the correlation coefficient, $r^2$, is known as the coefficient of determination. This value represents the proportion of the variation in $y$ that can be explained by $x$.
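Continuing the sketch from above, $r$ follows from equation $(4)$ once the sample standard deviations are computed, and $r^2$ is just its square (the helper name `correlation` is again ours):

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r via equation (4): r = slope * (s_x / s_y)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (dividing by n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    slope, _ = fit_line(xs, ys)  # fit_line from the earlier sketch
    return slope * (s_x / s_y)

# The coefficient of determination is then correlation(xs, ys) ** 2.
```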
The number of colonies of partially resistant bacteria that grow on an agar plate is related to the concentration, within the plate, of an antibiotic to which the bacteria are vulnerable. Suppose you obtain the following data:
| Kanamycin Conc. (mg/mL) | No. of Bacteria Colonies |
| --- | --- |
| 10 | 53 |
| 20 | 41 |
| 30 | 37 |
| 40 | 21 |
| 50 | 8 |
Let $x$ be the kanamycin concentration and $y$ the number of bacteria colonies that grew on the plate. We begin by calculating the means of both $x$ and $y$. $$\bar x = \sum_{i=1}^{n}\frac{x_i}{n}=30 \qquad\quad \bar y = \sum_{i=1}^{n}\frac{y_i}{n}=32$$

We can then find $\hat\beta_1$ by equation $(2)$. $$\hat\beta_1=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2}=\frac{(10-30)(53-32)+(20-30)(41-32)+\cdots+(50-30)(8-32)}{(10-30)^2+(20-30)^2+\cdots+(50-30)^2}=\frac{-1100}{1000}=-1.1$$

The intercept is then given by equation $(3)$. $$\hat\beta_0=\bar y - \hat\beta_1 \bar x = 32-(-1.1)(30)=65$$

Altogether, this yields the following linear regression equation. $$\hat y = -1.1x + 65$$

Thus, according to this data set, every 1 mg/mL increase in kanamycin concentration decreases the predicted number of bacteria colonies by 1.1. To gauge the strength of this relationship, we can calculate the correlation coefficient. We will need the sample standard deviations of both $x$ and $y$. $$s_x=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}}\doteq 15.811 \qquad\quad s_y=\sqrt{\frac{\sum_{i=1}^{n}(y_i-\bar y)^2}{n-1}}\doteq 17.635$$

Equation $(4)$ then yields the desired correlation coefficient. $$r=\hat\beta_1\left(\frac{s_x}{s_y}\right)=(-1.1)\left(\frac{15.811}{17.635}\right)\doteq -0.986$$

Therefore, the data suggest a strong negative linear correlation between kanamycin concentration and the number of bacteria colonies. Finally, to quantify how much of the variation in one variable is explained by the other, we simply square the correlation coefficient to find the coefficient of determination. $$r^2=(-0.986)^2\doteq 0.973$$

Thus, approximately 97.3% of the variation in the number of bacteria colonies can be explained by the kanamycin concentration.
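As a sanity check, the worked example can be reproduced numerically. Assuming NumPy is available, its standard `np.polyfit` and `np.corrcoef` functions confirm the hand calculation:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])  # kanamycin concentration (mg/mL)
y = np.array([53, 41, 37, 21, 8])   # number of bacteria colonies

slope, intercept = np.polyfit(x, y, 1)  # degree-1 (linear) least-squares fit
r = np.corrcoef(x, y)[0, 1]             # correlation coefficient

print(slope, intercept)  # -1.1, 65.0
print(r, r ** 2)         # approximately -0.986 and 0.973
```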