In the last post, we examined an available genomic dataset from Brauer et al 2008 about yeast gene expression under nutrient starvation. We learned to tidy it with the dplyr and tidyr packages, and saw how useful this tidied form is for visualizing and understanding individual genes.
But we were just getting started with our tidy data analysis. We were able to look at a few genes at a time, but that approach doesn’t scale to a genome with 6,000 genes in it. In order to select interesting genes from the entire dataset (without guessing which genes we should look at in advance), we’ll need to do some kind of modeling.
Here I’ll show how to use dplyr to fit linear models to each gene-nutrient combination, while using my broom package to recombine these models so that we can keep working with the same tidy tools.
(Please note that I’ve added two steps of cleaning we didn’t do in the last post. First, I spelled out the full names of the nutrients: “Glucose” instead of just “G”, for example. Second, I filtered out missing values from the expression column, as well as genes that have no systematic ID.)
Tidying the data in this way lets us make graphs like this:
For starters, let’s wrap this useful graph into a function, so that we can make it easily in the rest of the post.
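A sketch of such a function, assuming the cleaned data has columns `name`, `systematic_name`, `nutrient`, `rate`, and `expression` (these column names are my assumption based on the dataset described above):

```r
library(ggplot2)

# Hypothetical helper: plot expression against growth rate, faceted by
# gene, colored by the limiting nutrient (column names are assumptions)
plot_expression_data <- function(expression_data) {
  ggplot(expression_data, aes(rate, expression, color = nutrient)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~ name + systematic_name, scales = "free_y")
}

# e.g. cleaned_data %>% filter(name == "LEU1") %>% plot_expression_data()
```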
OK, so it looks like this trend is significant (p-value 0.0035645).
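A p-value like that comes from fitting a single linear model; a sketch, assuming the cleaned data described above (the LEU1/leucine subset is chosen purely for illustration):

```r
library(dplyr)

# Sketch: fit one linear model for a single gene/nutrient combination
mod <- cleaned_data %>%
  filter(name == "LEU1", nutrient == "Leucine") %>%
  lm(expression ~ rate, data = .)

summary(mod)
```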
We could do this one at a time for each gene/nutrient combination. But really we want to apply it to every combination of a gene and a nutrient.¹
This is a problem, because that lm object isn’t designed for recombining. It contains many components of varying shapes: residuals, fitted values, an F-statistic, and so on. Having a list of those objects isn’t going to work with dplyr or with ggplot2: it would take us out of the tidy data framework.
This is where we bring in my broom package, which is designed for this very purpose: turning model objects into data frames. This lets us work with the outputs of models (graphing, sorting, and summarizing them) using the same tidy tools we used to process our data.
In particular, right now we want the tidy method, which extracts the coefficients of a model.
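For example, applied to the single-gene model above:

```r
library(broom)

# Turn the model's coefficients into a data frame with one row per term:
# columns term, estimate, std.error, statistic, p.value
tidy(mod)
```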
Notice that it has the same information as coef(summary(mod)): slope estimate, standard error, t-statistic, and p-value for each coefficient. But now it’s in a data frame rather than a matrix, and the row names have been moved into a column of the data frame. This lets us combine multiple models, which in turn lets us perform the modeling within a do statement.
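A sketch of that step, again assuming the column names above:

```r
library(dplyr)
library(broom)

# Sketch: one linear model per gene/nutrient combination, with the
# coefficients of all models recombined into a single tidy data frame
linear_models <- cleaned_data %>%
  group_by(name, systematic_name, nutrient) %>%
  do(tidy(lm(expression ~ rate, data = .)))
```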
Notice that there are two rows for each gene-nutrient combination: an intercept and a slope term. (You can read more about using broom and dplyr together in this vignette.) This simplifies each gene-nutrient combination into two values:

Intercept: How highly expressed the gene is when it’s starved of that nutrient.

rate: How much the gene’s expression responds to an increasing supply of that nutrient (and therefore an increasing growth rate).
The p-values aren’t actually interesting to us here: they’re testing whether the intercept is equal to 0, which is not a particularly special number in terms of these normalized gene expressions. (Confidence intervals and standard errors of these intercepts would be interesting, and I may discuss them in a future post.)
What we’re really interested in is the value of each intercept relative to the other nutrients in that gene. For example, let’s again consider our favorite gene, LEU1.
This gene has a low intercept term for all nutrients except leucine. I’ve marked the average intercept with a horizontal dashed line to demonstrate this. Suppose we want to look for other genes like this, with a single outlying intercept term. We could do this by centering the intercepts around the average for each gene, using a group_by and mutate:
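A sketch of the centering step, using the per-model coefficients from earlier (the name `linear_models` is my assumption):

```r
library(dplyr)

# Sketch: keep the intercept terms, then center each gene's intercepts
# around that gene's average across the six nutrients
intercept_terms <- linear_models %>%
  ungroup() %>%
  filter(term == "(Intercept)") %>%
  group_by(systematic_name) %>%
  mutate(centered_intercept = estimate - mean(estimate)) %>%
  ungroup()
```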
Note that here I’m looking for cases where a single nutrient was greatly overexpressed in starvation (to look for underexpressed nutrients, we could have used -centered_intercept instead). We can then pull these genes out of the original data with the useful semi_join, at which point we can graph them with our plot_expression_data:
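A sketch of those two steps (the cutoff of 20 genes and the helper `plot_expression_data` are assumptions for illustration):

```r
library(dplyr)

# Sketch: take the genes with the largest single centered intercept,
# then pull their raw expression data back out with semi_join
top_intercept <- intercept_terms %>%
  top_n(20, centered_intercept)

cleaned_data %>%
  semi_join(top_intercept, by = "systematic_name") %>%
  plot_expression_data()
```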
These certainly do look like interesting genes! We notice that in some genes, like PHO11, only one nutrient is highly expressed while the rest show low expression, while other genes, such as ADH2, show varying levels of expression for each nutrient. We also notice that in most cases the highly expressed nutrient moves back down towards the others as growth rate increases (that is, as the yeast is less starved). This makes sense, since it’s the starvation that is eliciting the unusual behavior.
What do these genes do? Beats me; I’m not a biologist, I just play one on my degree. But it certainly looks promising that PHO11, PHO12, and PHO5 are all much more highly expressed when phosphorus is the limiting nutrient, as well as SUL1 when sulfur is rare, and indeed each gene is involved in transport of that nutrient. (And we do see our Gene of the Week, LEU1.)
Looking up the others in yeastgenome.org, we see that a lot of them are involved in transport across membranes (e.g. DAL5, GAP1, QDR2). This makes sense: the cell notices that it is missing a nutrient, and puts more energy into importing it. Notice that this would be a great way to make inferences about genes whose function we don’t yet know. (This is the focus of functional genomics.)
Slope terms
Now let’s take a look at the slope terms, which show whether each gene’s expression increased or decreased with growth rate under a particular limiting nutrient.
Here, we’ll focus a bit more on statistical significance. First we can make a histogram of the p-values. These p-values are spread across six different nutrients, so we’ll facet our histogram by those nutrients:
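A sketch of that histogram, pulling the slope terms out of the per-model coefficients (the names `linear_models` and `slope_terms` are assumptions):

```r
library(dplyr)
library(ggplot2)

# Sketch: keep the slope ("rate") terms, then histogram their p-values,
# faceted by the limiting nutrient
slope_terms <- linear_models %>%
  ungroup() %>%
  filter(term == "rate")

ggplot(slope_terms, aes(p.value)) +
  geom_histogram(binwidth = .05) +
  facet_wrap(~ nutrient)
```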
See here for my guide on interpreting this kind of p-value histogram.
In this case, we can see that the tests are generally well-behaved, with a mix of nulls (genes that don’t respond to growth rate) and alternatives (genes that do). Thus, we can use p-value correction to identify significant genes. Since we have a lot of hypotheses, it’s a good idea to use the Storey q-value.²
```r
# see https://bioconductor.org/packages/release/bioc/html/qvalue.html
# for installation instructions
library(qvalue)

slope_terms <- slope_terms %>%
  mutate(q.value = qvalue(p.value)$qvalues)
```
(I talk a bit about FDR and q-values from a Bayesian perspective here.)
Now that we have a measure of significance, we can ask all sorts of questions. We could ask which nutrients have the most genes whose expression is significantly correlated with growth rate (at a 1% FDR):
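A sketch of that count, assuming the `slope_terms` data frame with the q-values computed above:

```r
library(dplyr)

# Sketch: count significant slope terms per limiting nutrient at a 1% FDR
slope_terms %>%
  filter(q.value < .01) %>%
  count(nutrient, sort = TRUE)
```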
Here’s an interesting question: are there any cases where the gene is significantly positively correlated with growth rate in one limiting nutrient, and significantly negatively correlated in another? We can discover this with a group_by and filter:
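A sketch of that filter, again assuming the `slope_terms` data frame above:

```r
library(dplyr)

# Sketch: among significant slopes, keep genes that have both a
# significantly positive slope in one nutrient and a significantly
# negative slope in another
diverging_genes <- slope_terms %>%
  filter(q.value < .01) %>%
  group_by(systematic_name) %>%
  filter(any(estimate > 0) & any(estimate < 0)) %>%
  ungroup()
```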
Those certainly look like genes worthy of further study. But you may be interested in different phenomena when you’re analyzing your data: you might want cases where expression responds to glucose, but no other conditions. This tidy setup of the models makes it easy to answer these questions in an interactive and exploratory way.
Conclusion: Surprise, surprise
These linear regressions are a great way to pull out the most interesting genes. Why didn’t we start with that step? Why bother visualizing individual genes at all?
Well, as Hadley Wickham has noted (by way of John D. Cook):

“Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you.”
When we chose to model gene expression with linear regression, we made a strong assumption: that we could represent each gene’s trend using a linear regression within each gene/nutrient combination. We couldn’t do that if we’d noticed some genes had a non-linear trend, something like this:
That’s why I always recommend looking at small subsets first, like a single gene or biological process.
If your data is huge, analyze a small subset first to check your approach, like navigating the route on a bike before driving your 18-wheeler truck.
If we’d seen any genes that had this kind of expression profile, we’d have known to choose a different approach. It’s still possible there are weird genes like this! If there are, our regressions would not find them, because our model will never surprise us.
There’s no free lunch!
Next time
Earlier in this post I showed the expression of four genes involved in “leucine biosynthesis”. Now that we have our per-gene-per-nutrient linear model, here’s another way we can look at them:
Notice how clear it is that these genes respond to leucine starvation in particular. Unlike the earlier visualization, this can be applied to gene sets containing dozens or even hundreds of genes while still making the general trend apparent. Furthermore, we could use these summaries to look at many gene sets at once, and even use statistical tests to discover new gene sets that respond to starvation.
Thus, in my next post in this series, we’ll apply our “tidy modeling” approach to a new problem. Instead of testing whether each gene responds to starvation in an interesting way, we’ll test functional groups of genes in order to find higher-level biological insights. And we’ll do it with the same set of tidy tools we’ve been using so far.
Footnotes
1. Most statisticians will note that we could have performed one model per gene and included an interaction term. I chose to perform one model per gene per nutrient only because it’s simpler: we don’t have to explain the concept of interaction terms or parse apart that column, and we get the same results. ↩
2. If we had used Bonferroni correction with more than 30,000 tests, basically none of them would have ended up significant. Even Benjamini-Hochberg FDR control may not be powerful enough. But the Storey q-value is well suited to control the false discovery rate when we expect that a substantial portion of our hypotheses are not null (say, 20%, rather than .02%). Based on our p-value histograms, this appears to be the case. ↩