Documentation

Version 1.3 / 04.05.2018

Overview

Genocrunch is a web-based data analysis platform dedicated to metagenomics and metataxonomics. It is tailored to process datasets derived from high-throughput nucleic acid sequencing of microbial communities, such as gene counts or counts of operational taxonomic units (OTUs). In addition to such a primary dataset, it also allows the integration of a secondary dataset (e.g. metabolite levels).
Genocrunch provides tools covering data pre-processing and transformation, diversity analysis, multivariate statistics, dimensionality reduction, differential analysis and clustering, as well as similarity network analysis and associative analysis. The results of clustering are automatically inferred as additional models in other analyses.
Interactive visualization is offered for all figures.

[Figure: Genocrunch pipeline]

New users and registration

Running an analysis session does not require any registration. However, managing multiple analyses (recovering or editing previous work) is done through a personal index that requires users to sign in. Before signing in, new users need to register.

Registering

Registering is done via the Register button of the top-bar. The registration process requires new users to choose a password. The email address is only used to recover forgotten passwords.

Signing in

Signing in is done from the welcome page, which is accessible via the Sign In button of the top-bar. Before signing in, new users need to complete the registration process.

Trying Genocrunch

The best way to try Genocrunch is to load an example and edit it.

Examples

Examples of analyses are provided via the Examples link of the top-bar. Loading example data into a new analysis session is possible via the Load button. This allows users to visualize and edit examples without restriction. For signed-in users, it will also copy the example into their personal analysis index.

Analyzing data

Running a new analysis

New analyses can be submitted via the New Analysis button of the top-bar. If the user is signed out, the analysis session will be associated with the browser session and remain available as long as the browser is not closed. If the user is signed in, the analysis will be stored in their personal index. Analysis parameters are set by filling a single form comprising four parts: Inputs, Pre-processing, Transformation and Analysis. Analyses are then submitted via the Start Analysis button located at the bottom of the form. See the Inputs and Tools section below for details.

Editing an analysis

Analyses can be edited via the associated Edit Analysis button present on the analysis page and via the edit icons of the analysis index (for signed-in users only). Editing is performed by updating the analysis form and submitting the modifications via the Restart Analysis button located at the bottom of the form.

Cloning an analysis

Analyses can be copied via the copy icons of the analysis index (for signed-in users only).

Deleting an analysis

Analyses can be deleted via the delete icons of the analysis index (for signed-in users only). This process is irreversible.

Inputs and Tools

The analyses are set from a single form composed of four sections: Inputs, Pre-processing, Transformation, Analysis. These sections and their associated parameters are described below.

Inputs

The Inputs section of the analysis creation form is used to provide general information and upload data files.

  • General information
    • Name (mandatory)

      The name will appear in the analysis index. It should be informative enough to differentiate the analysis from others.

    • Description

      The description can be used to provide additional information about the analysis.

  • Data files
    • Primary dataset (mandatory)

      The primary dataset is the principal data file. It must contain a data table in tab-delimited text format, with columns representing samples and rows representing observations. The first row must contain the sample names. The first column must contain the observation names. Both column and row names must be unique. A column containing a description of each observation, in the form of semicolon-delimited categories, can be added at the end. This format can be obtained from the BIOM format using the biom convert command.

      Format example:

      #OTU ID	Spl1	Spl2	Spl3	Spl4	taxonomy
      1602	1	2	4	4	k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Propionibacteriaceae; g__Propionibacterium; s__acnes
      1603	0	0	24	0	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium; s__difficile
      1604	0	12	23	0	k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Enterococcaceae; g__Enterococcus; s__casseliflavus
      1605	23	4	2	14	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae; g__Veillonella; s__parvula
      1606	45	10	42	12	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__fragilis
      1607	8	15	20	13	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella; s__copri
      
    • Category column (mandatory)

      To increase compatibility, the category column specifies which column in the primary dataset contains a categorical description of the observations. This must be either the first or the last column.

    • Map (mandatory)

      The map contains information about the experimental design. It must contain a table in tab-delimited text format, with columns representing experimental variables and rows representing samples. The first column must contain the sample names as they appear in the primary dataset. The first row must contain the names of the experimental variables. Both column and row names must be unique. This format is compatible with the mapping file used by the QIIME pipeline.

      Format example:

      ID	Sex	Treatment
      Spl1	M	Treated
      Spl2	F	Treated
      Spl3	M	Control
      Spl4	F	Control
      
    • Secondary dataset

      The secondary dataset is an optional data file containing additional observations. It must contain a data table in tab-delimited text format, with columns representing samples and rows representing observations. The first row must contain the sample names as they appear in the primary dataset and the map. The first column must contain the observation names. Both column and row names must be unique.

      Format example:

      Metabolites	Spl1	Spl2	Spl3	Spl4
      metabolite1	0.24	0.41	1.02	0.92
      metabolite2	0.98	0.82	1.12	0.99
      metabolite3	0.05	0.11	0.03	0.02
      

Pre-processing

The Pre-processing section of the analysis creation form specifies modifications that will be applied to the primary dataset prior to analysis.

  • Filtering

    Data in the primary dataset can be filtered based on Relative or Absolute abundance and presence.

    • Abundance threshold

      Minimal abundance an observation must reach within a sample.

    • Presence threshold

      Minimal number of samples in which the abundance threshold must be reached for an observation to be retained.

    Example: Filtering data with an absolute abundance threshold of 5 and an absolute presence threshold of 2.

    #Before filtering
    #OTU ID	Spl1	Spl2	Spl3	Spl4	taxonomy
    1602	1	2	4	4	k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Propionibacteriaceae; g__Propionibacterium; s__acnes
    1603	0	0	24	0	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium; s__difficile
    1604	0	12	23	0	k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Enterococcaceae; g__Enterococcus; s__casseliflavus
    1605	23	4	2	14	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae; g__Veillonella; s__parvula
    1606	45	10	42	12	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__fragilis
    1607	8	15	20	13	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella; s__copri
    
    #After filtering
    #OTU ID	Spl1	Spl2	Spl3	Spl4	taxonomy
    1604	0	12	23	0	k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Enterococcaceae; g__Enterococcus; s__casseliflavus
    1605	23	4	2	14	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae; g__Veillonella; s__parvula
    1606	45	10	42	12	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__fragilis
    1607	8	15	20	13	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella; s__copri
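
The filtering step above can be sketched as follows. This is a minimal Python illustration of absolute abundance/presence filtering using the example table, not the actual Genocrunch implementation:

```python
# Minimal sketch of absolute abundance/presence filtering.
# Each row maps an OTU ID to its counts in samples Spl1..Spl4.
counts = {
    "1602": [1, 2, 4, 4],
    "1603": [0, 0, 24, 0],
    "1604": [0, 12, 23, 0],
    "1605": [23, 4, 2, 14],
    "1606": [45, 10, 42, 12],
    "1607": [8, 15, 20, 13],
}

def filter_counts(counts, abundance=5, presence=2):
    """Keep rows reaching `abundance` in at least `presence` samples."""
    return {otu: row for otu, row in counts.items()
            if sum(v >= abundance for v in row) >= presence}

print(sorted(filter_counts(counts)))  # ['1604', '1605', '1606', '1607']
```

OTUs 1602 and 1603 are dropped because fewer than 2 samples reach an abundance of 5, matching the "After filtering" table above.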
    
  • Category binning

    Data in the primary dataset can be binned by category using the category column.

    • Binning levels

      Category levels at which data should be binned.

    • Category binning function

      Choose whether binning should be performed by summing or averaging values.

    Example: Binning data at level 2 with sum.

    #Before binning
    #OTU ID	Spl1	Spl2	Spl3	Spl4	taxonomy
    1604	0	12	23	0	k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Enterococcaceae; g__Enterococcus; s__casseliflavus
    1605	23	4	2	14	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae; g__Veillonella; s__parvula
    1606	45	10	42	12	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__fragilis
    1607	8	15	20	13	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Prevotellaceae; g__Prevotella; s__copri
    
    #After binning
    ID	Spl1	Spl2	Spl3	Spl4	taxonomy
    1	23	16	25	14	k__Bacteria; p__Firmicutes
    2	53	25	62	25	k__Bacteria; p__Bacteroidetes
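
The binning example above can be sketched as follows. This minimal Python illustration assumes that "level 2" means grouping on the first two semicolon-delimited categories; it is not the actual Genocrunch code:

```python
# Minimal sketch of category binning at level 2 with the sum function.
# Each entry pairs per-sample counts with a taxonomy string whose
# semicolon-delimited fields are the category levels.
rows = [
    ([0, 12, 23, 0], "k__Bacteria; p__Firmicutes; c__Bacilli"),
    ([23, 4, 2, 14], "k__Bacteria; p__Firmicutes; c__Clostridia"),
    ([45, 10, 42, 12], "k__Bacteria; p__Bacteroidetes; c__Bacteroidia"),
    ([8, 15, 20, 13], "k__Bacteria; p__Bacteroidetes; c__Bacteroidia"),
]

def bin_by_level(rows, level=2):
    """Sum count rows that share their first `level` categories."""
    bins = {}
    for counts, taxonomy in rows:
        key = "; ".join(t.strip() for t in taxonomy.split(";")[:level])
        acc = bins.setdefault(key, [0] * len(counts))
        for i, v in enumerate(counts):
            acc[i] += v
    return bins

binned = bin_by_level(rows)
print(binned["k__Bacteria; p__Firmicutes"])     # [23, 16, 25, 14]
print(binned["k__Bacteria; p__Bacteroidetes"])  # [53, 25, 62, 25]
```

The two output rows match the "After binning" table above.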
    

Transformation

The Transformation section of the analysis creation form proposes additional modifications for both the primary and secondary datasets.

  • Rarefaction

    Although controversial (McMurdie and Holmes, 2014), rarefaction is a popular method among microbial ecologists to correct for the variations in sequencing depth inherent to high-throughput sequencing methods. It consists in randomly sampling (without replacement) a fixed number of counts from each sample to be compared. Because the same number of counts is drawn from each sample, this corrects for differences in sequencing depth while conserving the original count distribution among observations. Note that diversity analysis is particularly sensitive to variations in sequencing depth.

    • Sampling depth

      This specifies the number of counts to be drawn from each sample.

      Tip: The maximal sampling depth corresponds to the minimal sequencing depth. First check the sequencing depth of your data by summing the counts in each sample, then choose a sampling depth accordingly.

    • N samplings

      This specifies how many times the random drawing should be repeated. If more than one random sampling is performed, the result is the average of all random samplings.
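
The rarefaction procedure can be sketched as follows. This is a minimal Python illustration (the sample content is made up, and the real implementation differs), showing both the sequencing-depth check from the tip above and the averaging over repeated drawings:

```python
import random

# Minimal sketch of rarefaction: draw `depth` counts without replacement
# from each sample, repeat `n` times, then average the drawings.
def rarefy(sample_counts, depth, n=1, seed=0):
    rng = random.Random(seed)
    # Expand the count table into a pool of individual observations.
    pool = [otu for otu, c in sample_counts.items() for _ in range(c)]
    draws = [dict.fromkeys(sample_counts, 0) for _ in range(n)]
    for draw in draws:
        for otu in rng.sample(pool, depth):  # sampling without replacement
            draw[otu] += 1
    # Average the n random samplings.
    return {otu: sum(d[otu] for d in draws) / n for otu in sample_counts}

sample = {"1605": 23, "1606": 45, "1607": 8}
print(sum(sample.values()))  # sequencing depth of this sample: 76
rarefied = rarefy(sample, depth=40, n=10)
print(round(sum(rarefied.values())))  # each drawing totals 40 counts
```

Every drawing totals exactly the sampling depth, so the averaged result does too, while the relative distribution of counts among OTUs is approximately conserved.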

  • Transformation

    Whether to correct for sequencing depth, to stabilize the variance or simply to improve the visualization of skewed data, a transformation step may be needed. Common methods include transforming counts into proportions (percent or counts per million) and applying a log transformation. We are working to also propose more advanced transformations, including those developed for RNA sequencing (RNA-seq) as part of the DESeq2 (Love et al., 2014) and voom (Law et al., 2014) pipelines, as well as methods to limit the effect of known experimental biases (or batch effects), such as ComBat.

    • Transformation

      Specifies which transformation method should be applied.
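
The two common transformations mentioned above can be sketched as follows. This is a minimal Python illustration with made-up counts; the pseudocount of 1 in the log transformation is an assumption to avoid taking the log of zero:

```python
import math

# Minimal sketch of two common transformations: counts per million
# (corrects for sequencing depth) and log with a pseudocount of 1
# (compresses skewed data).
def to_cpm(counts):
    depth = sum(counts)
    return [c * 1_000_000 / depth for c in counts]

def log1p_transform(values):
    return [math.log(v + 1) for v in values]

spl1 = [1, 0, 0, 23, 45, 8]
cpm = to_cpm(spl1)
print(round(cpm[4]))  # 45 of 77 counts -> 584416 counts per million
print([round(x, 2) for x in log1p_transform(spl1)[:3]])  # [0.69, 0.0, 0.0]
```

After the counts-per-million step, every sample totals one million, making samples of different sequencing depths directly comparable.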

Analysis

The Analysis section of the analysis creation form sets which statistics to apply and which figures to generate.

  • Experimental design

    This sets the default statistical method to apply for the analysis. Two options are available: Basic and Advanced.

    • Statistics (Advanced only)

      Choose a statistical method to compare groups of samples. If Basic is chosen, this defaults to ANOVA.

    • Model

      Choose a model defining groups of samples. If Basic is chosen, a model can be picked among the headers of the map. If Advanced is chosen, the model must be typed by the user in the form of an R formula. The formula must be compatible with the specified statistics. All terms of the formula must refer to column headers of the map.

    Examples:
    1. Simple design:
      ID	Subject	Site
      Spl1	subject1	Hand
      Spl2	subject2	Hand
      Spl3	subject3	Hand
      Spl4	subject4	Foot
      Spl5	subject5	Foot
      Spl6	subject6	Foot
      
      Comparison	Statistics	Model
      Comparing sites (parametric)	T-test	Site
      Comparing sites (non-parametric)	Wilcoxon rank sum test	Site
    2. Simple paired design: The order of paired samples in the map is important!
      ID	Subject	Site
      Spl1	subject1	Hand
      Spl2	subject2	Hand
      Spl3	subject3	Hand
      Spl4	subject1	Foot
      Spl5	subject2	Foot
      Spl6	subject3	Foot
      
      Comparison	Statistics	Model
      Comparing sites (parametric)	Paired t-test	Site
      Comparing sites (non-parametric)	Wilcoxon signed rank test	Site
    3. Multiple comparisons:
      ID	Subject	Site
      Spl1	subject1	Hand
      Spl2	subject2	Hand
      Spl3	subject3	Hand
      Spl4	subject4	Foot
      Spl5	subject5	Foot
      Spl6	subject6	Foot
      Spl7	subject7	Mouth
      Spl8	subject8	Mouth
      Spl9	subject9	Mouth
      
      Comparison	Statistics	Model
      Comparing sites (parametric)	ANOVA	Site
      Comparing sites (non-parametric)	Kruskal-Wallis rank sum test	Site
    4. Multiple comparisons with nesting:
      ID	Subject	Site
      Spl1	subject1	Hand
      Spl2	subject2	Hand
      Spl3	subject3	Hand
      Spl4	subject1	Foot
      Spl5	subject2	Foot
      Spl6	subject3	Foot
      Spl7	subject1	Mouth
      Spl8	subject2	Mouth
      Spl9	subject3	Mouth
      
      Comparison	Statistics	Model
      Comparing sites	Friedman test	Site | Subject
    5. Additive model:
      ID	Treatment	Gender
      Spl1	Treated	M
      Spl2	Treated	M
      Spl3	Treated	M
      Spl4	Treated	F
      Spl5	Treated	F
      Spl6	Treated	F
      Spl7	Control	M
      Spl8	Control	M
      Spl9	Control	M
      Spl10	Control	F
      Spl11	Control	F
      Spl12	Control	F
      
      Comparison	Statistics	Model
      Assessing effects of treatment and gender	ANOVA	Treatment+Gender
  • Proportions

    For each sample, display the proportions of each observation (as percent) in the form of a stacked bar chart. This gives an overview of the primary dataset. This analysis is performed before applying any selected transformation.

  • Diversity

    Perform a diversity analysis of the primary dataset. This will display rarefaction curves for the selected diversity metric(s). Rarefaction curves are used to estimate the diversity as a function of the sampling depth. A rarefaction curve showing an asymptotic behavior is considered an indicator that the sampling depth is sufficient to observe the sample diversity. This analysis is performed before applying any selected transformation.

    • Metric

      Choose a diversity metric. Choices include the richness as well as metrics included in the R vegan, ineq and fossil packages. The richness simply represents the number of distinct observations seen within a particular sample.

    • Compare groups

      If selected, diversity between groups will be assessed using the statistics and model specified in the experimental design.
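
The richness metric mentioned above can be sketched as follows. This is a minimal Python illustration on a subset of the example table, not the Genocrunch implementation:

```python
# Minimal sketch of the richness metric: the number of distinct
# observations with a nonzero count in a sample.
table = {
    "1602": [1, 2, 4, 4],
    "1603": [0, 0, 24, 0],
    "1604": [0, 12, 23, 0],
}

def richness(table, sample_index):
    return sum(1 for row in table.values() if row[sample_index] > 0)

print([richness(table, i) for i in range(4)])  # [1, 2, 3, 1]
```

Because richness counts observations rather than reads, it is especially sensitive to sequencing depth, which is why rarefaction curves are drawn per metric.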

  • perMANOVA

    Perform a permutational multivariate analysis of variance using the adonis method from the R vegan package. This method uses a distance matrix based on the primary dataset.

    • Distance metric

      Choose a distance metric. Choices include metrics available in the R vegan package as well as distances based on correlation coefficients.

    • Model

      For compatibility purposes, this model may differ from the model specified in the experimental design section. It must be in the form of an R formula compatible with the adonis function of the R vegan package. All terms of the formula must refer to column headers of the map.

    • Strata

      This is used to specify any nesting in the model. The strata must be compatible with the adonis function of the R vegan package.

  • PCA

    Perform a principal component analysis (PCA) on the primary dataset. PCA is a commonly used dimensionality reduction method that projects a set of observations onto a smaller set of components that capture most of the variance between samples.

  • PCoA

    Perform a principal coordinate analysis (PCoA) on the primary dataset. PCoA is a form of dimensionality reduction based on a distance matrix.

    • Distance metric

      Choose a distance metric. Choices include metrics available in the R vegan package as well as distances based on correlation coefficients.
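
A distance based on a correlation coefficient can be sketched as follows. This minimal Python illustration uses one common construction, d(x, y) = 1 - r(x, y); the exact formula used by Genocrunch is not specified here, so treat this as an assumption:

```python
# Minimal sketch of a correlation-based distance: d(x, y) = 1 - r(x, y).
# Perfectly correlated profiles sit at distance 0, anti-correlated at 2.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def correlation_distance(x, y):
    return 1 - pearson(x, y)

print(round(correlation_distance([1, 2, 3], [2, 4, 6]), 6))  # 0.0
print(round(correlation_distance([1, 2, 3], [3, 2, 1]), 6))  # 2.0
```

Unlike geometric distances such as Euclidean, a correlation distance compares the shapes of two abundance profiles rather than their magnitudes.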

  • Heatmap

    This displays the primary dataset on a heatmap of proportions. The columns (samples) and rows (observations) of the heatmap are re-ordered using a hierarchical clustering. Each observation will be individually compared between groups of samples based on the specified experimental design and associated p-values will be displayed on the side of the heatmap. Individual correlations between observations from the primary dataset and observations from the secondary dataset will also be displayed on the side of the heatmap when available. Samples will be color-coded according to the experimental design.

  • Changes

    This performs a differential analysis. Each observation will be individually compared between groups of samples based on the specified experimental design. Fold-changes between groups, as well as the associated p-values and mean abundances, will be displayed on MA plots and volcano plots.
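
The values behind an MA plot can be sketched as follows. This is a minimal Python illustration for a single observation; the pseudocount of 1 (to avoid taking the log of zero) is an assumption, not necessarily what Genocrunch uses:

```python
import math

# Minimal sketch of the values behind an MA plot for one observation:
# M is the log2 fold-change between two groups, A the mean abundance.
def ma_values(group_a, group_b, pseudo=1):
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    m = math.log2((mean_b + pseudo) / (mean_a + pseudo))  # fold-change
    a = (mean_a + mean_b) / 2                             # mean abundance
    return m, a

m, a = ma_values([1, 2, 4, 4], [23, 4, 2, 14])
print(round(m, 2), a)  # 1.65 6.75
```

A volcano plot pairs the same fold-change (M) with the p-value from the group comparison instead of the mean abundance.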

  • Correlation network

    This builds a network with nodes representing observations and edges representing correlations. The network will be based on the primary dataset and will include the secondary dataset if available. Observations belonging to the primary and secondary datasets will be represented using different node shapes. Nodes will be colored to convey additional information.

    • Correlation method

      Choose a correlation method. Choices include the Pearson correlation and the Spearman correlation.

  • Clustering

    This applies a clustering algorithm to separate samples into categories. The categories will be added as a new column to the map. Categories will automatically be compared with ANOVA in relevant analyses.

    • Algorithm

      Choose a clustering algorithm. Choices include the k-means algorithm, the k-medoids algorithm (also known as the partitioning around medoids or PAM algorithm) and versions of k-medoids based on distance matrices.
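
The k-means algorithm mentioned above can be sketched as follows. This minimal one-dimensional Python illustration (with made-up data points) shows the alternating assignment and update steps; the Genocrunch implementation relies on R libraries instead:

```python
import random

# Minimal one-dimensional sketch of the k-means algorithm used to
# assign samples to categories.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda j: abs(p - centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
labels, centers = kmeans(points, k=2)
print(sorted(round(c, 2) for c in centers))  # [1.0, 9.5]
```

k-medoids follows the same alternation but restricts each center to an actual data point (a medoid), which is what allows it to operate on a distance matrix alone.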

  • Similarity network

    This builds a network with nodes representing samples and edges representing a similarity score. If a secondary dataset is available, three networks will be built: one based on the primary dataset, one based on the secondary dataset and a third based on the fusion of these two networks. The similarity network fusion (SNF) is based on the R SNFtool package. A categorical clustering based on each network is applied.

    • Metric for primary dataset

      Choose the similarity metric to apply to the primary dataset.

    • Metric for secondary dataset

      Choose the similarity metric to apply to the secondary dataset.

    • Clustering algorithm

      Choose the clustering algorithm used to assign samples to categories based on each similarity network.

Versions

Genocrunch data analysis tools are based on R libraries. The libraries used are mentioned in the analysis report, and their versions are available on the Versions page via the Infos menu of the top-bar.

Analysis report

A detailed analysis report is generated for each analysis. It is available directly after creating or updating an analysis via the My Analysis button of the top-bar, or via the corresponding icon in the analysis index (for signed-in users only). The analysis report includes detailed descriptions of the tools and methods used, possible warning and error logs, as well as intermediate files, final files and interactive figures as they become available. Information on the pre-processing and data transformation is available for the primary dataset and the optional secondary dataset in their respective sections. Figures and descriptions for each analysis are available in separate sections. If multiple binning levels were specified, figures for each level can be accessed via the associated Level button.

Exporting figures and data

Figures and their corresponding legends and data can be exported via the associated Export menu.

  • Standard export

    Includes a vector image in the SVG format as well as a figure legend in HTML format. Data in a tab-delimited text format is also proposed for some figures.

  • Advanced export

    Includes raw data in JSON format. R-generated PDFs are also proposed for some figures.

Archive

Files containing initial, intermediate and final data, as well as the analysis log and logs for standard output and standard error, are automatically included in a .zip archive. This archive can be downloaded via the Download Archive button of the analysis report page or via the corresponding icon in the analysis index (for signed-in users only). Although some figures may be automatically included in the archive in the form of a rough PDF file, the archive generally does not contain figure images. These should be exported separately via the Export menu.

Bugs report

During computation (while the analysis is running), the standard error (stderr) stream is redirected into a bug report that can be downloaded using the Bugs Report button located at the bottom-right of the details page in the analysis report.

Usage limits

Time limits

Analyses are by default kept for 2 days following their last update. However, analyses owned by registered users are kept for 365 days following their last update. Old analyses are automatically deleted from the server after they reach these limits.

Storage limits

The storage quota is set to 0 Bytes per user.