FTICR-MS data analysis with ftmsRanalysis

Background

Fourier-transform ion cyclotron resonance mass spectrometry (FTICR-MS) is an exciting technique that enables scientists to acquire detailed information of soil organic matter. Processing FTICR-MS data, however, is not straightforward. Luckily, folks at the Environmental Molecular Sciences Laboratory (EMSL) have developed a R package ftmsRanalysis that makes FTICR-MS data processing and visualization more feasible than ever before.

A tutorial of the ftmsRanalysis package offers very valuable information on how to get started. Several vignettes on data visualization can also be found on here. The package also enables other cool analyses that weren’t discussed by these vignettes, such as ordination and group comparison.

In this blog, I will show how to perform a non-metric multidimensional scaling (NMDS) and plot the results. I will also show how to compare the nominal oxidation state of carbon (NOSC) between treatments.

Most of the scripts in this demo were provided to me by the amazing Allison Thompson at EMSL with some modifications.

Installation

The package may be installed from github using the command:

devtools::install_github("EMSL-Computing/ftmsRanalysis")

Getting the data

For this demo, I will use the example dataset included with the ftmsRanalysis package. This dataset compares the composition of water-extractable soil organic matter between two locations (M and W) and two crop flora (C and S).

There are three data tables that are essential to the analysis:

  1. Expression Data, the values here correspond to the peak intensity measured by the MS. Columns represent individual samples (i.e., extractant). Rows indicate the chemical compounds (or peaks) detected by the MS. This example dataset includes 20 samples, and over 24,400 types of peaks.

  2. Sample Data include the meta-data characterizing the experimental design. For example, there are columns indicating the location (M and W), crop flora (C and S), and experimental blocks (1-5).

  3. Molecular Identification Data include stoichiometry information of the detected compounds. Columns correspond to the elements (e.g., carbon, hydrogen, nitrogen, etc.). In some cases, these data columns may be stored together with Expression Data.

e_data (Expression Data)

library(ftmsRanalysis)
## 
## Attaching package: 'ftmsRanalysis'
## The following object is masked from 'package:stats':
## 
##     heatmap
data("ftms12T_edata")
str(ftms12T_edata)
## 'data.frame':    24442 obs. of  21 variables:
##  $ Mass         : num  98.5 98.8 98.8 101.7 103.3 ...
##  $ EM0011_sample: num  0 0 5524739 0 0 ...
##  $ EM0013_sample: num  0 13070372 0 0 0 ...
##  $ EM0015_sample: num  0.0 0.0 2.4e+07 0.0 0.0 ...
##  $ EM0017_sample: num  0 16120890 0 0 0 ...
##  $ EM0019_sample: num  0 21228496 0 0 0 ...
##  $ EM0061_sample: num  1197974 0 30656158 0 0 ...
##  $ EM0063_sample: num  0 12305626 0 0 0 ...
##  $ EM0065_sample: num  0.0 1.1e+07 0.0 0.0 0.0 ...
##  $ EM0067_sample: num  0 0 12664590 0 0 ...
##  $ EM0069_sample: num  2535836 38329628 0 0 0 ...
##  $ EW0111_sample: num  0 0 21416774 0 0 ...
##  $ EW0113_sample: num  0 8070915 0 0 0 ...
##  $ EW0115_sample: num  3636046 0 38608164 0 0 ...
##  $ EW0117_sample: num  0 3965230 0 0 0 ...
##  $ EW0119_sample: num  0 0 2439325 0 1153547 ...
##  $ EW0161_sample: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EW0163_sample: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EW0165_sample: num  0 0 0 16443347 0 ...
##  $ EW0167_sample: num  0 1598119 0 0 0 ...
##  $ EW0169_sample: num  0 0 0 0 0 0 0 0 0 0 ...

f_data (Sample Data)

data("ftms12T_fdata")
str(ftms12T_fdata)
## 'data.frame':    20 obs. of  4 variables:
##  $ SampleID  : Factor w/ 20 levels "EM0011_sample",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Location  : Factor w/ 2 levels "M","W": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Block     : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ Crop.Flora: Factor w/ 2 levels "C","S": 2 2 2 2 2 1 1 1 1 1 ...

e_meta (Molecular Identification Data)

data("ftms12T_emeta")
str(ftms12T_emeta)
## 'data.frame':    24442 obs. of  10 variables:
##  $ Mass       : num  98.5 98.8 98.8 101.7 103.3 ...
##  $ C          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ O          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ N          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ C13        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ S          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ P          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Error      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NeutralMass: num  99.5 99.8 99.8 102.7 104.4 ...

Constructing a peakData object

Here, I merge the three data tables together to form a peakData object. I need to provide the names of the three data tables and the column names of the following: peak name in the e_data table (edata_cname), sample name in the f_data table (fdata_cname), peak name in the e_meta table (mass_cname), and names of the elemental counts (e.g., c_cname for carbon).

peakObj <- as.peakData(ftms12T_edata, ftms12T_fdata, ftms12T_emeta, 
                       edata_cname="Mass", fdata_cname="SampleID", 
                       mass_cname="Mass", c_cname="C", h_cname="H", 
                       o_cname="O", n_cname="N", s_cname="S", 
                       p_cname="P", isotopic_cname = "C13", 
                       isotopic_notation = "1")
peakObj
## peakData object
## # Peaks: 23060
## # Samples: 20
## Meta data columns: [Mass, C, H, O, N, C13, S, P, Error, NeutralMass, MolForm]

Not surprisingly, the ‘peakData’ object contains three elements, i.e., e_data, f_data, and e_meta:

names(peakObj)
## [1] "e_data" "f_data" "e_meta"

Data preprocessing

I apply three filters here. The first removes any peak outside of the range 200-900 m/z. The second removes any peak that is only seen in one sample. The third removes any peak that isn’t assigned to a formula (remaining analyses all require a formula).

# filter data to have mass between 200 and 900 #
peakObj.massFilt <- mass_filter(peakObj)
peakObj <- applyFilt(filter_object = peakObj.massFilt, peakObj, min_mass = 200, max_mass = 900)

# count the number of samples each peak was observed in #
peakObj.molFilt <- molecule_filter(peakObj)
# filter out peaks/masses observed in less than 1 samples #
peakObj <- applyFilt(filter_object = peakObj.molFilt, peakObj, min_num = 1)

# filter to only those peaks that have a formula assigned #
peakObj.peakFilt <- formula_filter(peakObj)

A simple peaks-per-sample plot can be obtained as follows.

# Plot number of peaks per sample
library(ggplot2)
numPeaksPlot(peakObj)

A sample may be considered as an outlier, if it contains a very small number of peaks, say < 100. An outlier can be removed as follows.

peakObj.1 <- peakObj
# include the names of outliers here
outlierlist <- c("EM0011_sample","EM0065_sample")
peakObj.1$e_data <- peakObj.1$e_data[,-which(colnames(peakObj.1$e_data) %in% outlierlist)]

NMDS ordination

I start this exercise by converting peak intensity to presence/absence and then run NMDS ordination using metaMDS function of the vegan package.

# transform from abundance to presence/absence
peakObj.pa <- edata_transform(peakObj, data_scale = "pres")
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
# calculate nmds values
peak.nmds <- metaMDS(t(as.matrix(peakObj.pa$e_data[,-1])), distance="jaccard", autotransform=FALSE)
## Run 0 stress 0.1291971 
## Run 1 stress 0.1751683 
## Run 2 stress 0.1478652 
## Run 3 stress 0.1478664 
## Run 4 stress 0.1862746 
## Run 5 stress 0.1478644 
## Run 6 stress 0.1478643 
## Run 7 stress 0.1478642 
## Run 8 stress 0.1478664 
## Run 9 stress 0.1291969 
## ... New best solution
## ... Procrustes: rmse 0.0001130679  max resid 0.0002568671 
## ... Similar to previous best
## Run 10 stress 0.1291969 
## ... Procrustes: rmse 0.0002186617  max resid 0.0006155436 
## ... Similar to previous best
## Run 11 stress 0.1291975 
## ... Procrustes: rmse 0.0005765247  max resid 0.001554315 
## ... Similar to previous best
## Run 12 stress 0.1969809 
## Run 13 stress 0.1737838 
## Run 14 stress 0.1291969 
## ... New best solution
## ... Procrustes: rmse 6.894749e-05  max resid 0.0001873194 
## ... Similar to previous best
## Run 15 stress 0.1780075 
## Run 16 stress 0.1751535 
## Run 17 stress 0.1291981 
## ... Procrustes: rmse 0.0005657277  max resid 0.001349988 
## ... Similar to previous best
## Run 18 stress 0.129197 
## ... Procrustes: rmse 0.0001690448  max resid 0.0004141371 
## ... Similar to previous best
## Run 19 stress 0.1751531 
## Run 20 stress 0.1478649 
## *** Solution reached

I create a new column in the f_data table called ‘treatment’ by merging the two treatments: location and crop flora. It will help us examine their interactive effects.

peakObj$f_data$treatment <- paste(peakObj$f_data$Location,peakObj$f_data$Crop.Flora)
peakObj.pa$f_data$treatment <- paste(peakObj.pa$f_data$Location,peakObj.pa$f_data$Crop.Flora)

NMDS scores will extracted and then merged with f_data for plotting.

#format results
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
peak_plot <- data.frame(NMD1=peak.nmds$points[,1], NMD2=peak.nmds$points[,2], Water=names(peakObj.pa$e_data)[-1])
peak_plot <- merge(peak_plot, peakObj.pa$f_data, by.x="Water", by.y="SampleID")
peak_p <- ggplot(peak_plot, aes(x=NMD1, y=NMD2, colour=treatment, label=Water))+
  geom_point(size=4)+
  scale_color_manual(values=c("#e0115f","#fabbd3","#5a4fcf","#dcdaf5"))+
  theme_bw()+
  theme(panel.grid = element_blank(),
        plot.background = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(colour="black"))+
  labs(x = "NMD1", y = "NMD2", title = "NMDS")
plotly::ggplotly(peak_p)

NOSC comparison

For each peak, its NOSC value needs to be calculated first. Other calculations can be done simultaneously, including aromaticity, double bond equivalent, gibbs free energy, Kendrick mass and Kendrick mass defect, and elemental ratios.

peakObj <- compound_calcs(peakObj)
# a table showing the quantiles of the indices 
apply(peakObj$e_meta[,-c(1:11)], 2, function(x) quantile(x, na.rm=TRUE))
##             AI    AI_Mod  DBE DBE_O DBE_AI       GFE    kmass      kdefect
## 0%   0.0000000 0.0000000  0.5   -23  -24.0 -47.36667 199.8228 4.556335e-05
## 25%  0.0000000 0.0000000  5.0    -4   -5.0  48.42500 353.6150 2.802880e-01
## 50%  0.0000000 0.2336879  9.0     0   -1.5  60.30000 458.5230 3.799112e-01
## 75%  0.2222222 0.5000000 13.0     4    2.0  76.58571 584.5270 5.035388e-01
## 100% 2.0000000 1.5000000 38.0    35   33.0 115.51875 898.3329 9.999253e-01
##            NOSC OtoC_ratio HtoC_ratio NtoC_ratio PtoC_ratio NtoP_ratio
## 0%   -1.9375000 0.01612903  0.2000000 0.00000000  0.0000000          0
## 25%  -0.5714286 0.33333333  0.8461538 0.00000000  0.0000000          1
## 50%   0.0000000 0.50000000  1.2000000 0.04347826  0.0000000          1
## 75%   0.4166667 0.64705882  1.6000000 0.09090909  0.0000000          2
## 100%  3.7777778 1.20000000  2.7142857 0.60000000  0.2857143          3

For now, NOSC values are stored in the e_meta table. A new data table called nosc is created by merging e_meta with f_data. I am also interested in extracting the compound classes of peaks. I am using the class boundary definitions as in Kim et al. (2003).

# extract chemical classes
cats<- assign_class(peakObj, "bs1")$e_meta$bs1_class
cats <- gsub(";.*","",cats)

# format data
library(reshape2)
nosc <- merge(melt(data.frame(CLASS=cats,peakObj$e_data)), peakObj$e_meta, by="Mass")
## Using CLASS, Mass as id variables
nosc <- merge(nosc, peakObj$f_data, by.x="variable", by.y="SampleID")
# remove those peaks that haven't been assigned a molecular formula 
nosc <- subset(nosc, !is.na(MolForm) & value != 0)

Then, mean NOSC value is calculated for each sample and plotted.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
nosc.mean <- nosc %>% group_by(variable, Location, Crop.Flora, treatment) %>% summarise(Mean=mean(NOSC, na.rm=TRUE),SE=sd(NOSC)/sqrt(length(NOSC)),SD=sd(NOSC))

nosc_p <- ggplot(data = nosc.mean, aes(x=treatment, y=Mean, colour=treatment)) +
  geom_boxplot(outlier.size = 1)+
  scale_color_manual(values=c("#e0115f","#fabbd3","#5a4fcf","#dcdaf5"))+
  labs(title = '', y = 'Mean NOSC')+   
  theme_bw()
plotly::ggplotly(nosc_p)

Analysis of variance (ANOVA) can be used to evaluate treatment effects on mean NOSC. Results suggest that Crop Flora had a significant effect on NOSC, while Location did not.

lm <- lm(Mean ~ Location*Crop.Flora, data=nosc.mean)
summary(lm)
## 
## Call:
## lm(formula = Mean ~ Location * Crop.Flora, data = nosc.mean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.136152 -0.036811 -0.001419  0.054024  0.080682 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            0.04272    0.02974   1.437   0.1701  
## LocationW              0.04675    0.04206   1.112   0.2827  
## Crop.FloraS           -0.10815    0.04206  -2.571   0.0205 *
## LocationW:Crop.FloraS  0.01929    0.05948   0.324   0.7499  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0665 on 16 degrees of freedom
## Multiple R-squared:  0.4784, Adjusted R-squared:  0.3805 
## F-statistic: 4.891 on 3 and 16 DF,  p-value: 0.01339

Mean NOSC values can also be compared per compound class. The effects of Crop Flora are apparent in the classes of condensed hydrocarbon, lignin, and tannin.

nosc.class.mean <- nosc %>% group_by(variable, CLASS, Location, Crop.Flora, treatment) %>% summarise(Mean=mean(NOSC, na.rm=TRUE),SE=sd(NOSC)/sqrt(length(NOSC)),SD=sd(NOSC))

nosc_class_p <- ggplot(data = nosc.class.mean, aes(x=treatment, y=Mean, colour=treatment)) +
  geom_boxplot(outlier.size = 1)+
  scale_color_manual(values=c("#e0115f","#fabbd3","#5a4fcf","#dcdaf5"))+
  labs(title = '', y = 'Mean NOSC')+  
  facet_wrap(CLASS~., scales="free_y")+
  theme_bw()
plotly::ggplotly(nosc_class_p)
Avatar
Yang Lin
Assistant Professor

My research interests include soil biogeochemistry and ecosystem ecology.