Glutton Tutorial

Listing species in Ensembl databases
Building reference databases
Adding and removing samples from a project
Aligning contigs to a reference
Scaffolding and consensus alignments

Listing species in Ensembl databases

Glutton supports all Ensembl databases that include gene tree information. Using Glutton, we can list all species available in Ensembl main:

glutton list | less

Species from Ensembl-genomes can be listed as well. For example, we can change the database to Ensembl Metazoa:

glutton list --database metazoa | less

By default, Glutton queries Ensembl’s BioMart interface programmatically, but this only provides access to the current release. Glutton can additionally access Ensembl’s MySQL databases directly to download older releases. Note that this feature should be considered experimental and has not been tested on all versions of the Ensembl database schema: it is known to work for at least the last few years of releases from Ensembl-main, but it does not work for the Ensembl-genomes databases. To list species using the SQL method:

glutton list --method sql | less

Building reference databases

Glutton’s “build” command creates a database of evolutionary multiple sequence alignments by extracting CDS sequences and orthology information for a specified species. The Ensembl database in which the species is found must be specified explicitly:

glutton build --species drosophila_melanogaster --database metazoa

Glutton databases are called GLT files and, by default, are named SPECIES_RELEASE.glt. A reference can also be written to a user-specified filename:

glutton build --species drosophila_melanogaster --database metazoa --output reference.glt
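
Because the default name follows the SPECIES_RELEASE.glt pattern, a script can work out in advance which file a build should produce. The following is only a minimal sketch: the helper name glt_name is hypothetical and not part of Glutton, and the release number 31 is taken from the filenames used later in this tutorial.

```shell
#!/bin/sh
# glt_name is a hypothetical helper (not part of Glutton). It composes the
# default reference filename, SPECIES_RELEASE.glt, from its two parts.
glt_name() {
    species=$1
    release=$2
    printf '%s_%s.glt\n' "$species" "$release"
}

# prints drosophila_melanogaster_31.glt
glt_name drosophila_melanogaster 31
```

The result can then be passed to the later subcommands through their --reference option.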

Sometimes your computer cluster does not have internet access. Glutton allows all the data to be downloaded first, without any additional processing:

glutton build --species drosophila_melanogaster --download

The build can then be resumed later from the downloaded GLT file. Here we also specify the number of CPUs Glutton can use; by default, Glutton uses all CPUs:

glutton build --threads 10 drosophila_melanogaster_31.glt
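
The two steps can be wrapped in a small script. This is a sketch under stated assumptions: the function names and the DRY_RUN switch are hypothetical (they belong to this script, not to Glutton), and the downloaded GLT file is assumed to have been copied to the offline machine before the second step. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the download-then-resume workflow.
# DRY_RUN is a switch of this script, not a Glutton option: by default the
# commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1, where internet access is available: download only.
download_reference() {
    run glutton build --species "$1" --download
}

# Step 2, later: resume the build from the downloaded GLT file,
# using the given number of threads.
resume_build() {
    run glutton build --threads "$2" "$1"
}

download_reference drosophila_melanogaster
resume_build drosophila_melanogaster_31.glt 10
```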

Old releases can be built by specifying the release number and setting the download method to “sql” (this only works for species in Ensembl-main):

glutton build --species gallus_gallus --release 80 --method sql

Adding and removing samples from a project

Glutton requires all samples to be added to a project (essentially a named directory) and that project is then aligned to a reference database. Samples can be added and removed from projects using the “setup” subcommand. Here a new project called “my_project” is created when a sample (composed of a sample id, a FASTA file of assembled contigs, a species name and, optionally, a BAM file) is added to that project:

glutton setup --add --project ./my_project --sample SAMPLE_ID --contigs SAMPLE.FASTA --species SPECIES_NAME --bam SAMPLE.bam

If two samples with the same sample id are added to a project, the second sample will overwrite the first one. We can remove a sample from the project by using the sample id:

glutton setup --remove --project ./my_project --sample SAMPLE_ID

We can also list the samples contained in the project:

glutton setup --list --project ./my_project | less -S
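
When a project has many samples, the “setup --add” command can be driven from a sample sheet. The sketch below assumes a tab-separated file with one sample per line (sample id, contigs FASTA, species name and, optionally, a BAM file). That layout, the function name add_samples, the DRY_RUN switch and the sample values are all hypothetical; only the glutton options are taken from the command shown above.

```shell
#!/bin/sh
# Sketch: add several samples to a project from a tab-separated sample sheet.
# The sheet layout and the DRY_RUN switch are assumptions of this example,
# not Glutton features. By default the commands are only printed.
DRY_RUN=${DRY_RUN:-1}
TAB=$(printf '\t')

add_samples() {
    project=$1
    sheet=$2
    while IFS=$TAB read -r id contigs species bam; do
        # Skip blank lines and lines starting with '#'.
        case $id in ''|'#'*) continue ;; esac
        set -- glutton setup --add --project "$project" \
            --sample "$id" --contigs "$contigs" --species "$species"
        # The BAM file is optional: pass --bam only when the sheet lists one.
        if [ -n "$bam" ]; then
            set -- "$@" --bam "$bam"
        fi
        if [ "$DRY_RUN" = "1" ]; then
            echo "$*"            # print the command instead of running it
        else
            "$@" < /dev/null     # keep the command from reading the sheet
        fi
    done < "$sheet"
}

# Example with a hypothetical two-sample sheet; the second sample has no BAM.
example_sheet=$(mktemp)
printf 'S1\tS1.fasta\tSPECIES_A\tS1.bam\n' > "$example_sheet"
printf 'S2\tS2.fasta\tSPECIES_B\n' >> "$example_sheet"
add_samples ./my_project "$example_sheet"
rm -f "$example_sheet"
```

Since a sample added with an existing sample id overwrites the earlier one, it is worth checking the sheet for duplicate ids before running it for real.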

Aligning contigs to a reference

Glutton’s “align” subcommand aligns the contigs from each sample to a given reference database. The current implementation uses BLASTX to map each contig to a gene. PAGAN is then used to extend each gene family alignment in the reference database that contains a gene to which contigs were assigned:

glutton align --project ./my_project --reference drosophila_melanogaster_31.glt --threads 10

The “align” subcommand can be time-consuming, so we have made it restartable. If the user kills Glutton with Ctrl-C and then reruns the same command, the alignments will continue from where they left off.
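
Because rerunning the same command resumes the work, an unattended job script can simply retry the command if it stops early. The following is a sketch only: rerun_until_done is a hypothetical helper, not part of Glutton, and it assumes that glutton exits with a non-zero status when it does not finish, which this tutorial does not confirm. Ctrl-C will normally interrupt the wrapper script too, so the retry only helps with other kinds of early exit.

```shell
#!/bin/sh
# Sketch: rerun a command until it succeeds, up to a fixed number of attempts.
# rerun_until_done is a hypothetical helper, not part of Glutton. It assumes
# that the command exits with a non-zero status when it does not finish.
rerun_until_done() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Only call glutton if it is installed on this machine.
if command -v glutton >/dev/null 2>&1; then
    rerun_until_done 3 glutton align --project ./my_project \
        --reference drosophila_melanogaster_31.glt --threads 10
fi
```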

Scaffolding and consensus alignments

The “scaffold” subcommand postprocesses all sequence alignments generated by the “align” subcommand and generates scaffolds and multiple sequence alignments per gene family and per reference gene. The user must state which assembler was used to generate the contigs so that Glutton can respect inferred splicing isoforms; otherwise, all contigs are assumed to be independent:

glutton scaffold --project ./my_project --reference drosophila_melanogaster_31.glt --assembler soapdenovotrans

If the project was called “my_project”, then this directory will contain two subdirectories called “alignments” and “postprocessing”. The “postprocessing” directory is of most interest to the end-user and includes further directories for scaffolds, gene family alignments and alignments per reference gene:

ls -dl my_project/postprocessing/*
drwxr-xr-x 2 ajmedlar 1160867 249856 Apr  2 23:01 my_project/postprocessing/gene_msa
drwxr-xr-x 2 ajmedlar 1160867 135168 Apr  2 22:49 my_project/postprocessing/genefamily_msa
drwxr-xr-x 2 ajmedlar 1160867   4096 Apr  2 22:45 my_project/postprocessing/scaffolds
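
As a quick sanity check after scaffolding, we can count the files in each of these directories. This is a minimal sketch: summarise_outputs is a hypothetical helper, not a Glutton subcommand, and it only relies on the three directory names shown in the listing above.

```shell
#!/bin/sh
# Sketch: count the files in each postprocessing subdirectory of a project.
# summarise_outputs is a hypothetical helper, not a Glutton subcommand.
summarise_outputs() {
    project=$1
    for sub in scaffolds genefamily_msa gene_msa; do
        dir=$project/postprocessing/$sub
        if [ -d "$dir" ]; then
            # Count regular files directly inside the directory.
            n=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
            printf '%s\t%s\n' "$sub" "$n"
        else
            printf '%s\tmissing\n' "$sub"
        fi
    done
}

summarise_outputs ./my_project
```

A directory reported as missing or empty suggests that the “scaffold” step did not complete for that output type.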