SCC tutorial series

Setup
Data quality check
Subsetting Data
Normalisations
Alpha-diversity
Beta-diversity
Differential abundance
Session info
References

This tutorial will go over a basic analysis workflow using HMP data kindly provided in phyloseq format by Paul J. McMurdie
Download the data here

For pre-processing your own data, I recommend following this DADA2 tutorial.

If you need help getting your own data into phyloseq format, see phyloseq data import tutorials.

The R package “phyloseq” is designed specifically for analysing microbiome data. For more information, see the home page. A phyloseq object holds all of the data necessary for the analysis in a single place. It can contain: metadata for the samples, the abundance table, taxonomic assignments, a phylogenetic tree, and reference sequences. See the schematic below for how these data are linked together in a phyloseq object.

Setup

Let’s start by loading the packages we will be using.

pkgs <- c("tidyverse", "phyloseq", "ggpubr", "ggplot2", 
          "vegan", "reshape2", "DESeq2")
lapply(pkgs, require, character.only = TRUE)

We also need to define some settings for our plots. Use the color-blind-friendly palette from http://www.cookbook-r.com.

theme_set(theme_bw()) 
color_pal <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

The dataset we will be using contains 16S rRNA gene amplicon sequencing (16S profiling) data from the V3-V5 regions of the gene. It was acquired as part of the Human Microbiome Project (HMP). Details about the study design are here. It is important to note that this data was pre-processed using QIIME, a clustering approach that results in operational taxonomic units (OTUs), or groups of similar sequences. New methods have been developed to avoid needing to cluster sequences and to instead analyse amplicon sequence variants; these are generally recommended over OTU-based approaches. See this paper from Callahan et al., 2017 for more information.

Load the data.

load("HMPv35.RData")
HMPv35

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 45336 taxa and 4743 samples ]
## sample_data() Sample Data:       [ 4743 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 45336 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 45336 tips and 45099 internal nodes ]
## refseq()      DNAStringSet:      [ 45336 reference sequences ]

This dataset contains 16S profiling data from 4743 samples with 9 metadata variables to describe these samples. Let’s have a look at the metadata.

ps_meta <- sample_data(HMPv35) %>%
  data.frame() # convert from phyloseq format
summary(ps_meta)

##    X.SampleID             RSID              visitno         sex      
##  Min.   :700013549   Min.   :132902142   Min.   :1.00   female:2188  
##  1st Qu.:700033710   1st Qu.:159611913   1st Qu.:1.00   male  :2555  
##  Median :700097914   Median :161270782   Median :1.00                
##  Mean   :700074251   Mean   :392105561   Mean   :1.41                
##  3rd Qu.:700106406   3rd Qu.:763678604   3rd Qu.:2.00                
##  Max.   :700114750   Max.   :970836795   Max.   :3.00                
##                                                                      
##      RUNCENTER                         HMPbodysubsite Mislabeled     
##  WUGC     :1539   Stool                       : 319   Mode :logical  
##  BI       :1078   Tongue_dorsum               : 316   FALSE:3730     
##  JCVI     :1009   Attached_Keratinized_gingiva: 313   NA's :1013     
##  BCM      : 696   Supragingival_plaque        : 313                  
##  BCM,BI   : 103   Buccal_mucosa               : 312                  
##  JCVI,WUGC:  93   Palatine_Tonsils            : 312                  
##  (Other)  : 225   (Other)                     :2858                  
##  Contaminated   
##  Mode :logical  
##  FALSE:3730     
##  NA's :1013     
##                 
##                 
##                 
##                 
##                                                                     Description  
##  HMP_Human_metagenome_sample_700013549_from_subject_158013734__sex_female_:   1  
##  HMP_Human_metagenome_sample_700014386_from_subject_158398106__sex_male_  :   1  
##  HMP_Human_metagenome_sample_700014403_from_subject_158398106__sex_male_  :   1  
##  HMP_Human_metagenome_sample_700014409_from_subject_158398106__sex_male_  :   1  
##  HMP_Human_metagenome_sample_700014412_from_subject_158398106__sex_male_  :   1  
##  HMP_Human_metagenome_sample_700014415_from_subject_158398106__sex_male_  :   1  
##  (Other)                                                                  :4737

For each sample, we know the individual (RSID: random subject ID) that it corresponds to, which part of the body (HMPbodysubsite) it came from, which visit it was collected during (visitno), and the center (RUNCENTER) where the sequencing data was acquired. We also know whether the individual is a male or female (sex).

n_distinct(ps_meta$RSID)

## [1] 235

The samples originate from 235 subjects, so each individual was sampled from multiple body sites over multiple visits and samples were sequenced at various centers.

Data quality check

Sequencing depth (number of reads per sample) is a major confounder in 16S profiling data analysis. Let’s inspect our sequencing depths. First, we need to add this variable to our data frame. Then we can look at the distribution of sequencing depths by sample type (fill="HMPbodysubsite") or sequencing center (fill="RUNCENTER").

ps_meta$Sequencing_depth <- sample_sums(HMPv35) # sample_sums are sequencing depths

ggdensity(ps_meta,
          x = "Sequencing_depth",
          xlab = "Sequencing depth",
          y = "..count..",
          ylab = "Number of samples (%)",
          fill = "HMPbodysubsite",
          alpha = 0.4) +
  theme(legend.key.size = unit(3, "mm"), # resize legend so it fits in the figure
        legend.text = element_text(size = 6), legend.title = element_text(size = 7))

From this plot, we see that most of our samples have less than 10,000 reads but a few samples have 25,000-90,000 reads. It is also evident that two sample types (Left and Right Antecubital fossa) have much lower sequencing depths than the rest. Let’s change fill="HMPbodysubsite" to fill="RUNCENTER".

ggdensity(ps_meta,
          x = "Sequencing_depth",
          xlab = "Sequencing depth",
          y = "..count..",
          ylab = "Number of Samples (%)",
          fill = "RUNCENTER",
          alpha = 0.4)

From this plot, we see that most samples were sequenced at 4 centers and that sequencing depth can vary between the centers.

Subsetting Data

For the purposes of this tutorial, we will subset the data so that we are working with a smaller dataset. We will keep only saliva and stool samples sequenced at the top 4 centers and get rid of any samples with a low sequencing depth. We will use a threshold of 2,000 reads here.

to_keep <- table(ps_meta$RUNCENTER) %>% # count samples from each center
  sort(decreasing = T) # order them from most to least samples

to_keep <- to_keep[1:4] %>% # keep top 4
  names() # extract center names

ps_obj <- HMPv35 %>%
  subset_samples(HMPbodysubsite %in% c("Saliva", "Stool") & # keep only saliva/stool samples
                   sample_sums(HMPv35) > 2000 & # keep samples with more than 2000 reads
                   RUNCENTER %in% to_keep) # keep samples from the 4 centers
ps_obj

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 45336 taxa and 480 samples ]
## sample_data() Sample Data:       [ 480 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 45336 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 45336 tips and 45099 internal nodes ]
## refseq()      DNAStringSet:      [ 45336 reference sequences ]

We can check that our code worked.

table(sample_data(ps_obj)$HMPbodysubsite) # which sample types remain?

## 
## Saliva  Stool 
##    233    247

min(sample_sums(ps_obj)) # what is the minimum sequencing depth?

## [1] 2040

table(sample_data(ps_obj)$RUNCENTER) # which centers remain?

## 
##  BCM   BI JCVI WUGC 
##   90  136  106  148

n_distinct(sample_data(ps_obj)$RSID) # how many individuals are left?

## [1] 205

We are left with 480 stool and saliva samples from 4 centers representing 205 individuals. We need to make sure there is just one sample per subject for each sample type. Repeated measure analysis is for another time!

ps_meta <- sample_data(ps_obj) %>% # extract metadata from data subset
  data.frame() %>%
  arrange(RSID, visitno) %>% # sort by subject in order of visit number
  unite(col = to_filter, c("RSID", "HMPbodysubsite"), sep = "_", remove = F) %>% # new variable "RSID_subsite"
  filter(!duplicated(to_filter)) # only keep earliest visit for each subsite

ps_obj <- ps_obj %>%
  subset_samples(sample_names(ps_obj) %in% ps_meta$X.SampleID)
ps_obj

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 45336 taxa and 363 samples ]
## sample_data() Sample Data:       [ 363 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 45336 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 45336 tips and 45099 internal nodes ]
## refseq()      DNAStringSet:      [ 45336 reference sequences ]

363 samples remain. Just a note: some subjects have “matching” samples that were taken at different visits. Depending on our research question, we might want to make sure that samples from the same individual were taken at the same time point, but we will leave this as is for the tutorial. Now we need to remove taxa that are no longer present in any of the samples.

ps_obj <- ps_obj %>%
  filter_taxa(function(x) sum(x) > 0, prune = TRUE) # keep only taxa with more than 0 total counts
ps_obj

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 28357 taxa and 363 samples ]
## sample_data() Sample Data:       [ 363 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 28357 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 28357 tips and 28174 internal nodes ]
## refseq()      DNAStringSet:      [ 28357 reference sequences ]

Normalisations

Normalisation of 16S profiling data is widely discussed in the literature. The most commonly used methods are: total sum scaling (figure below, left panel), rarefying (figure below, right panel), and log transformation. However, since this data is compositional, alternative approaches such as centered log-ratio transformation have been suggested. See Gloor et al., 2017. Additional methods are discussed by McMurdie and Holmes, 2014. There are many ways to normalise 16S profiling data, but we will keep it simple and use the traditional methods. The type of normalisation you need will also depend on the analysis you want to run.

As you can see from the figure above neither of these methods are ideal, particularly when sequencing depths vary drastically between samples. In such a case, log transformation might be helpful.

The distribution of sequencing depths is still not exactly normal, but it looks better. The following functions can be used for normalisations:

Relative abundance - transform_sample_counts(ps_obj, function(x) { x/sum(x) })
- x/sum(x) counts for each taxon are divided by total counts in the sample
Rarefy - rarefy_even_depth(ps_obj)
Log transformation - transform_sample_counts(ps_obj, function(x) { log(x+1) })
- log(x+1) log of counts for each taxon
- A pseudo count (+1) is added to avoid log(0)

Alpha-diversity

Let’s look at the diversity within samples and compare this between the different body sites. First, we need to normalise the data because alpha-diversity usually increases with sequencing depth. Samples with many more reads may look like they have a higher number of different bacteria present than samples with a low sequencing depth. We can illustrate this with a rarefaction curve.

abund <- ps_obj %>%
  subset_samples(HMPbodysubsite == "Stool") %>% # only plot stool samples
  otu_table() %>% # extract abundance table
  as("matrix") %>% # convert from phyloseq format
  t() # transpose - rarecurve() wants samples in rows

rarecurve(abund, 
          step = 1000,
          label = F)

This plot shows that as the sequencing depth (Sample Size) gets higher, more species are observed. One way to overcome this is to normalise by subsampling to even sequencing depths. This is sort of like putting all of your counts for a sample in a bag and randomly pulling out a set number of counts so that all samples end up having the same total (refer to the “Rarefying | Subsampling” figure above). This approach is widely criticised because it can lead to throwing out huge amounts of data, but it remains one of the best (and most simple) ways to normalise before estimating alpha-diversity since some metrics rely on actual counts (integers).

Rarefy the data

ps_rar <- rarefy_even_depth(ps_obj, 
                            sample.size = min(sample_sums(ps_obj)), 
                            rngseed = 9375, # set this so that subsampling is reproducible
                            replace = F, 
                            trimOTUs = T)

## `set.seed(9375)` was used to initialize repeatable random subsampling.

## Please record this for your records so others can reproduce.

## Try `set.seed(9375); .Random.seed` for the full vector

## ...

## 3432OTUs were removed because they are no longer 
## present in any sample after random subsampling

## ...

Calculate alpha-diversities

alpha_div <- ps_rar %>%
  estimate_richness(split = T, 
                    measures = c("Observed", "Shannon", "Simpson", "Fisher")) %>%
  mutate(X.SampleID = as.numeric(sample_names(ps_rar))) # add sample IDs
  
alpha_div <- right_join(ps_meta, # add metadata
                        alpha_div, 
                        by = "X.SampleID")

Plot the results

alpha_div_long <- melt(alpha_div, 
                       measure.vars = c("Observed", "Shannon", "Simpson", "Fisher"))

ggboxplot(alpha_div_long,
          x = "variable",
          y = "value", 
          xlab = FALSE,
          ylab = FALSE,
          color = "HMPbodysubsite", 
          palette = color_pal,
          facet.by = "variable",
          scales = "free") +
  rremove("x.text") +
  rremove("x.ticks")

The trend is similar among the different metrics, so let’s choose one and plot with stats. We first need to format the data to do a paired test since saliva and stool samples can come from the same subjects.

df <- alpha_div_long %>%
  filter(variable == "Shannon") # consider only shannon index

to_keep <- df$RSID[duplicated(df$RSID)] # keep only subjects with matching samples
df <- subset(df, df$RSID %in% to_keep)

ggpaired(df,
         x = "HMPbodysubsite",
         xlab = FALSE,
         y = "value",
         ylab = "Shannon Index",
         id = "RSID", # for pairing 
         color = "HMPbodysubsite", 
         palette = color_pal,
         line.color = "grey", 
         line.size = 0.2) +
  stat_compare_means(comparisons = list(c("Saliva", "Stool")), 
                     method = "t.test", 
                     paired = T) +
  theme(legend.position = "none")

Saliva samples are more diverse than stool samples.

Beta-diversity

Next we will look at differences in microbial community structure between samples.

For this, rarefying is not a recommended normalisation method. We will instead try total sum scaling (relative abundance) and log transformation with the commonly used Bray-Curtis dissimilarity. There are many options for both distance metric and ordination method; you should explore which are best for your data. It is also important to compare both non-phylogenetic (like Bray-Curtis) and phylogenetic (like UniFrac) metrics. Phylogenetic beta-diversity metrics consider the evolutionary relationships between taxa in addition to their abundances. The same is true for alpha-diversity. Faith’s Phylogenetic Diversity (picante::pd()) is widely used for calculating phylogenetic alpha-diversity.

Let’s define a function to do the beta-diversity and plotting so we avoid repetition in our code.

plot_bdiv <- function(ps_norm, # normalised phyloseq object
                      bdiv, # distance metric for beta-diversity
                      ordination, # ordination method
                      color_var, # variable to color samples by
                      title) {

  bdiv_mat <- phyloseq::distance(ps_norm, bdiv) # calculate beta-diversity
  ord <- ordinate(ps_norm, # ordinate beta-diversity
                  method = ordination, 
                  distance = bdiv_mat) 
  p <- plot_ordination(ps_norm, # plot
                       ord, 
                       color = color_var, 
                       title = title) + 
    scale_color_manual(values = color_pal)
  return(p)
  
}

Let’s see how different normalisations affect Bray-Curtis dissimilarities.

ps_rel_abund <- ps_obj %>%
  transform_sample_counts(function(x) { x/sum(x) })
ps_log <- ps_obj %>%
  transform_sample_counts(function(x) { log(x+1) })

p_rel <- plot_bdiv(ps_rel_abund,
                   bdiv = "bray",
                   ordination = "PCoA",
                   color_var = "HMPbodysubsite",
                   title = "Relative Abundance")
p_log <- plot_bdiv(ps_log,
                   bdiv = "bray",
                   ordination = "PCoA",
                   color_var = "HMPbodysubsite",
                   title = "Log transformed")

ggarrange(p_rel, p_log, common.legend = T)

It looks like there is a lot of variation within the saliva samples. Let’s see if this can be attributed to the different sequencing centers. We just need to change the color of the points.

p_rel <- plot_bdiv(ps_rel_abund,
                   bdiv = "bray",
                   ordination = "PCoA",
                   color_var = "RUNCENTER",
                   title = "Relative Abundance")
p_log <- plot_bdiv(ps_log,
                   bdiv = "bray",
                   ordination = "PCoA",
                   color_var = "RUNCENTER",
                   title = "Log transformed")

ggarrange(p_rel, p_log, common.legend = T)

Saliva samples from “JCVI” seem to have distinct bacterial profiles from those sequenced at the other centers. Let’s take a closer look at the saliva samples.

ps_saliva <- ps_obj %>%
  subset_samples(HMPbodysubsite == "Saliva")

ps_saliva_log <- ps_saliva %>%
  transform_sample_counts(function(x) { log(x+1) })

We can also check another ordination method.

p_pcoa <- plot_bdiv(ps_saliva_log,
                   bdiv = "bray",
                   ordination = "PCoA",
                   color_var = "RUNCENTER",
                   title = "PCoA")
p_nmds <- plot_bdiv(ps_saliva_log,
                   bdiv = "bray",
                   ordination = "NMDS",
                   color_var = "RUNCENTER",
                   title = "NMDS")

ggarrange(p_pcoa, p_nmds, common.legend = T)

We can then test the significance of this difference in beta-diversity.

bdiv_mat <- phyloseq::distance(ps_saliva_log, "bray")

adonis(bdiv_mat ~ RUNCENTER, data.frame(sample_data(ps_saliva_log)))

## 
## Call:
## adonis(formula = bdiv_mat ~ RUNCENTER, data = data.frame(sample_data(ps_saliva_log))) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##            Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## RUNCENTER   3     5.465 1.82174  7.0542 0.10788  0.001 ***
## Residuals 175    45.193 0.25825         0.89212           
## Total     178    50.659                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P = 0.001 but we also need to check that the dispersions are similar between the groups. This should be non-significant for us to have confidence in the adonis() results.

disp <- betadisper(bdiv_mat, sample_data(ps_saliva_log)$RUNCENTER)
permutest(disp)

## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##            Df   Sum Sq   Mean Sq      F N.Perm Pr(>F)    
## Groups      3 0.035383 0.0117944 8.2937    999  0.001 ***
## Residuals 175 0.248865 0.0014221                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The dispersions are significantly different between groups which can have a large effect on the adonis() results. So let’s do univariate tests looking at individual taxa to confirm why samples from “JCVI” differ from the rest.

Differential abundance

First, we need to find out which taxa differ between our groups of interest. Let’s look at high-level differences by combining counts at the family level. We can also remove low abundance taxa (filtering criteria should be based on your data) because these probably won’t differ significantly between the groups. Finally, we will log transform the data and extract normalised counts as a data frame.

fam_abund <- ps_saliva %>%
  tax_glom("Family") %>% # combine counts at family level
  # only keep families with at least 5 counts in 10% of samples
  filter_taxa(function(x) { sum(x >= 5) > 0.1*nsamples(ps_saliva) }, prune = T) %>% 
  transform_sample_counts(function(x) { log(x+1)}) %>%
  psmelt() # convert phyloseq object to data frame

The function ps_melt() takes the metadata, abundance table, and taxonomy from the phyloseq object and combines them into long format. This means that for every taxon, there are as many rows as samples so all abundance values are in a single column rather than a matrix. The counts are in the column Abundance and we are interested in the taxonomic rank Family. Let’s plot the abundances.

ggviolin(fam_abund, 
         x = "RUNCENTER", 
         y = "Abundance",
         ylab = "Log(Abundance)",
         facet.by = "Family", 
         fill = "RUNCENTER", 
         palette = color_pal, 
         panel.labs.font = list(size = 6)) +
  theme(legend.position = "none", 
        axis.text = element_text(size = 5.5))

One obvious difference in the “JCVI” samples is lower abundance of Campylobacteraceae. Let’s test this hypothesis.

df <- fam_abund %>%
  filter(Family == "Campylobacteraceae")

ggviolin(df, 
         x = "RUNCENTER", 
         y = "Abundance", 
         ylab = "Log(Abundance)", 
         fill = "RUNCENTER", 
         palette = color_pal) +
  theme(legend.position = "none") +
  stat_compare_means(comparisons = list(c("JCVI", "BCM"),
                                        c("JCVI", "BI"),
                                        c("JCVI", "WUGC")), 
                     method = "wilcox.test") # non-parametric test!

DESeq2

Another way to calculate differential abundance is using the package “DESeq2”. The method was developed for differential analysis of RNA-seq data (which is also count data) but has been applied to 16S profiling data too. DESeq2 has its own normalisation method, so we need to provide raw counts.

We also need to add a new metadata variable since we only want to compare “JCVI” samples with “non-JCVI” samples.

sample_data(ps_saliva)$JCVI <- ifelse(sample_data(ps_saliva)$RUNCENTER == "JCVI", "JCVI", "other")
sample_data(ps_saliva)$JCVI <- factor(sample_data(ps_saliva)$JCVI, levels = c("other", "JCVI")) # other = ref

Next, we want to apply a prevalence filter and run DESeq2.

ds <- ps_saliva %>%
  # only keep taxa present in 10% of samples
  filter_taxa(function(x) { sum(x > 0) > 0.10*nsamples(ps_saliva) }, prune = T) %>% 
  phyloseq_to_deseq2(~ JCVI) %>% # convert to deseq object
  DESeq(sfType = "poscounts") # run deseq

## converting counts to integer mode

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## -- replacing outliers and refitting for 139 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)

## estimating dispersions

## fitting model and testing

We can extract the differential abundance results for plotting and add taxonomic assignments since the OTU names are not very informative.

ds_res <- results(ds) %>%
  data.frame() %>%
  mutate(OTU = row.names(.)) # add otu names

tax <- ps_saliva %>%
  tax_table() %>%
  as("matrix") %>%
  data.frame() %>%
  mutate(OTU = row.names(.))

ds_res <- left_join(ds_res, tax, by = "OTU")

The output is a data frame with one row per taxon and all the stats for that taxon. Let’s plot the top significant results.

ds_res <- ds_res %>% 
  mutate(Significance = ifelse(padj < 0.05, "SIG", "NS"), # label significant taxa
         SEmin = log2FoldChange-lfcSE, # add values for error bars
         SEmax = log2FoldChange+lfcSE,
         Genus = as.character(Genus), # remove factor
         Genus = replace_na(Genus, " Unclassified")) # change NA to unclassified

to_plot <- ds_res %>%
  arrange(padj) # lowest p values on top

to_plot <- to_plot[1:25,] %>% # keep top 25
  arrange(desc(log2FoldChange)) %>% # order by effect size
  mutate(OTU = factor(OTU, levels = OTU)) # factor so that order stays same in plot

ggplot(to_plot,
       aes(y = log2FoldChange, 
           x = OTU, 
           color = Genus)) +
  geom_errorbar(aes(ymax = SEmax, 
                    ymin = SEmin), 
                color = "black", 
                width = 0.3) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, 
             linetype = 2, 
             alpha = 0.7) +
  coord_flip() +
  scale_color_manual(values = c("grey", "black", color_pal)) + # make unclassified grey and Campylobacter black
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

We can see that differential abundance at the OTU level also shows significant reductions in Campylobacter spp. (Campylobacteraceae family). Now let’s go back to the counts and look at the distributions between the centers.

to_keep <- to_plot$OTU[to_plot$Genus == "Campylobacter"] %>% as.character() # otu names for Campylobacter spp.

df <- counts(ds, normalized = TRUE) # get normalised counts from deseq object
df <- df[row.names(df) %in% to_keep,]
df <- melt(df) %>%
  left_join(data.frame(sample_data(ps_saliva)), by = c("Var2" = "X.SampleID")) # add metadata

ggplot(df, 
       aes(JCVI, value)) +
  geom_jitter(aes(color = JCVI), size = 1) +
  scale_color_manual(values = color_pal) + 
  theme(legend.position = "none",
        axis.title.x = element_blank()) +
  ylab("Normalised counts") +
  facet_wrap(~Var1)

Group discussion: What are potential explanations for what we are seeing?

Session info

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DESeq2_1.22.2               SummarizedExperiment_1.12.0
##  [3] DelayedArray_0.8.0          BiocParallel_1.16.6        
##  [5] matrixStats_0.54.0          Biobase_2.42.0             
##  [7] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
##  [9] IRanges_2.16.0              S4Vectors_0.20.1           
## [11] BiocGenerics_0.28.0         reshape2_1.4.3             
## [13] vegan_2.5-4                 lattice_0.20-38            
## [15] permute_0.9-5               ggpubr_0.2                 
## [17] magrittr_1.5                phyloseq_1.26.1            
## [19] forcats_0.4.0               stringr_1.4.0              
## [21] dplyr_0.8.0.1               purrr_0.3.2                
## [23] readr_1.3.1                 tidyr_0.8.3                
## [25] tibble_2.1.1                ggplot2_3.1.1              
## [27] tidyverse_1.2.1            
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1       ggsignif_0.5.0         htmlTable_1.13.1      
##  [4] XVector_0.22.0         base64enc_0.1-3        rstudioapi_0.10       
##  [7] bit64_0.9-7            AnnotationDbi_1.44.0   lubridate_1.7.4       
## [10] xml2_1.2.0             codetools_0.2-16       splines_3.5.0         
## [13] geneplotter_1.60.0     knitr_1.22             ade4_1.7-13           
## [16] Formula_1.2-3          jsonlite_1.6           broom_0.5.2           
## [19] annotate_1.60.1        cluster_2.0.8          compiler_3.5.0        
## [22] httr_1.4.0             backports_1.1.3        assertthat_0.2.1      
## [25] Matrix_1.2-17          lazyeval_0.2.2         cli_1.1.0             
## [28] acepack_1.4.1          htmltools_0.3.6        tools_3.5.0           
## [31] igraph_1.2.4           gtable_0.3.0           glue_1.3.1            
## [34] GenomeInfoDbData_1.2.0 Rcpp_1.0.1             cellranger_1.1.0      
## [37] Biostrings_2.50.2      multtest_2.38.0        ape_5.3               
## [40] nlme_3.1-138           iterators_1.0.10       xfun_0.6              
## [43] rvest_0.3.2            XML_3.98-1.19          zlibbioc_1.28.0       
## [46] MASS_7.3-51.4          scales_1.0.0           hms_0.4.2             
## [49] biomformat_1.10.1      rhdf5_2.26.2           RColorBrewer_1.1-2    
## [52] yaml_2.2.0             memoise_1.1.0          gridExtra_2.3         
## [55] rpart_4.1-13           RSQLite_2.1.1          latticeExtra_0.6-28   
## [58] stringi_1.4.3          genefilter_1.64.0      foreach_1.4.4         
## [61] checkmate_1.9.1        rlang_0.3.4            pkgconfig_2.0.2       
## [64] bitops_1.0-6           evaluate_0.13          Rhdf5lib_1.4.3        
## [67] labeling_0.3           htmlwidgets_1.3        cowplot_0.9.4         
## [70] bit_1.1-14             tidyselect_0.2.5       plyr_1.8.4            
## [73] R6_2.4.0               generics_0.0.2         Hmisc_4.2-0           
## [76] DBI_1.0.0              pillar_1.3.1           haven_2.1.0           
## [79] foreign_0.8-71         withr_2.1.2            mgcv_1.8-28           
## [82] survival_2.44-1.1      RCurl_1.95-4.12        nnet_7.3-12           
## [85] modelr_0.1.4           crayon_1.3.4           rmarkdown_1.12        
## [88] locfit_1.5-9.1         grid_3.5.0             readxl_1.3.1          
## [91] data.table_1.12.2      blob_1.1.1             digest_0.6.18         
## [94] xtable_1.8-3           munsell_0.5.0

References

Callahan, B. J., McMurdie, P. J., & Holmes, S. P. (2017). Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The Isme Journal, 11, 2639. doi:10.1038/ismej.2017.119
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome Datasets Are Compositional: And This Is Not Optional. Frontiers in Microbiology, 8, 2224. doi:10.3389/fmicb.2017.02224
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. doi:10.1186/s13059-014-0550-8
McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE, 8(4), e61217. doi:10.1371/journal.pone.0061217
McMurdie, P. J., & Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4), e1003531-e1003531. doi:10.1371/journal.pcbi.1003531
The Human Microbiome Project Consortium. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402), 207-214. doi:10.1038/nature11234