Listing species in Ensembl databases
Building reference databases
Adding and removing samples from a project
Aligning contigs to a reference
Scaffolding and consensus alignments
Listing species in Ensembl databases
Glutton supports all Ensembl databases that include gene tree information. Using Glutton, we can list all species available in Ensembl main:
glutton list | less
We can also list species from Ensembl-genomes. For example, we can change the database to Ensembl Metazoa:
glutton list --database metazoa | less
By default, Glutton queries Ensembl’s BioMart interface programmatically, but this only provides access to the current release. Glutton can additionally access Ensembl’s MySQL databases directly to download older releases. This feature should be considered experimental: it has not been tested on all versions of the Ensembl database schema and, while it is known to work for at least the last few years of releases from Ensembl-main, it does not work for the Ensembl-genomes databases. To list species using the MySQL method:
glutton list --method sql | less
Building reference databases
Glutton’s “build” command creates a database of evolutionary multiple sequence alignments by extracting CDS sequences and orthology information for a specified species. You have to state explicitly which Ensembl project the species is found in:
glutton build --species drosophila_melanogaster --database metazoa
Glutton databases are called GLT files and, by default, are named SPECIES_RELEASE.glt. A reference can also be written to a user-specified filename:
glutton build --species drosophila_melanogaster --database metazoa --output reference.glt
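When scripting around Glutton it can help to compute the default reference filename rather than typing it by hand. The following is a minimal sketch that only assumes the SPECIES_RELEASE.glt naming pattern described above; the helper name default_glt_name is hypothetical and not part of Glutton:

```shell
#!/bin/sh
# Compose the default GLT filename from a species name and a release number.
# default_glt_name is a hypothetical helper for illustration, not a Glutton command.
default_glt_name() {
    printf '%s_%s.glt\n' "$1" "$2"
}

# Example: species drosophila_melanogaster, release 31.
default_glt_name drosophila_melanogaster 31
```

The result can be stored in a variable, for example ref=$(default_glt_name drosophila_melanogaster 31), and passed to later commands that expect a reference file.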
Sometimes your computer cluster does not have internet access. Glutton allows for all the data to be downloaded first without any additional processing:
glutton build --species drosophila_melanogaster --download
The build can then be resumed later. Here we also specify the number of CPUs Glutton can use; by default, Glutton uses all CPUs:
glutton build --threads 10 drosophila_melanogaster_31.glt
Old releases can be built by specifying the release number and setting the download method to “sql” (this only works for species in Ensembl-main):
glutton build --species gallus_gallus --release 80 --method sql
Adding and removing samples from a project
Glutton requires all samples to be added to a project (essentially a named directory) and that project is then aligned to a reference database. Samples can be added and removed from projects using the “setup” subcommand. Here a new project called “my_project” is created when a sample (composed of a sample id, a FASTA file of assembled contigs, a species name and, optionally, a BAM file) is added to that project:
glutton setup --add --project ./my_project --sample SAMPLE_ID --contigs SAMPLE.FASTA --species SPECIES_NAME --bam SAMPLE.bam
If two samples with the same sample id are added to a project, the second sample will overwrite the first one. We can remove a sample from the project by using the sample id:
glutton setup --remove --project ./my_project --sample SAMPLE_ID
We can also list the samples contained in the project:
glutton setup --list --project ./my_project | less -S
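Projects with many samples are tedious to set up one command at a time. The sketch below generates one "glutton setup --add" command per sample from a tab-separated manifest. The manifest file and its three-column layout (sample id, contigs file, species name) are assumptions made for this example and are not a Glutton file format; the function name add_samples is likewise hypothetical. The commands are only printed, so they can be reviewed before anything is changed:

```shell
#!/bin/sh
# Print one "glutton setup --add" command per line of a manifest.
# Assumed manifest layout (hypothetical): sample_id<TAB>contigs.fasta<TAB>species
add_samples() {
    project=$1
    manifest=$2
    while IFS="$(printf '\t')" read -r sample contigs species; do
        # Skip blank lines and comment lines in the manifest.
        case $sample in ''|'#'*) continue ;; esac
        # Print the command instead of running it; remove "echo" to execute.
        echo glutton setup --add --project "$project" \
            --sample "$sample" --contigs "$contigs" --species "$species"
    done < "$manifest"
}

# Usage: add_samples ./my_project samples.tsv
```

Because a second sample with the same id overwrites the first, it is worth checking the manifest for duplicate sample ids before running the generated commands.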
Aligning contigs to a reference
Glutton’s “align” subcommand aligns the contigs from each sample to a given reference database. The current implementation uses BLASTX to map each contig to a gene. PAGAN is then used to extend each gene family alignment in the reference database that contains a gene to which contigs were assigned:
glutton align --project ./my_project --reference drosophila_melanogaster_31.glt --threads 10
The “align” subcommand can be time-consuming, so we have made it restartable. If the user kills Glutton with Ctrl-C and then reruns the same command, alignment continues from where it left off.
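Since rerunning the same command resumes the work, an unattended run can be wrapped in a small retry loop. This is a generic shell sketch, not a Glutton feature. It assumes that the command exits with a non-zero status when it fails or is stopped early, which has not been verified for Glutton, and the function name retry is hypothetical. Pressing Ctrl-C in an interactive terminal will normally stop the wrapper script as well, so the loop is mainly useful for non-interactive jobs:

```shell
#!/bin/sh
# Rerun a command until it succeeds, at most $1 times.
# Assumption: the command exits non-zero when it fails or is stopped early.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Stop as soon as the command reports success.
        "$@" && return 0
        echo "attempt $attempt of $max failed: $*" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage (the align command from above, tried up to 3 times):
# retry 3 glutton align --project ./my_project --reference drosophila_melanogaster_31.glt --threads 10
```

The attempt limit keeps a command that fails for a permanent reason, such as a missing reference file, from looping forever.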
Scaffolding and consensus alignments
The “scaffold” subcommand postprocesses all sequence alignments generated by the “align” subcommand and generates scaffolds and multiple sequence alignments per gene family and per reference gene. The user must state which assembler was used to generate the contigs so that Glutton can respect inferred splicing isoforms; otherwise, all contigs are assumed to be independent:
glutton scaffold --project ./my_project --reference drosophila_melanogaster_31.glt --assembler soapdenovotrans
If the project was called “my_project”, then the project directory will contain two subdirectories called “alignments” and “postprocessing”. The “postprocessing” directory is of most interest to the end-user and includes further directories for scaffolds, gene family alignments and alignments per reference gene:
ls -dl my_project/postprocessing/*
drwxr-xr-x 2 ajmedlar 1160867 249856 Apr 2 23:01 my_project/postprocessing/gene_msa
drwxr-xr-x 2 ajmedlar 1160867 135168 Apr 2 22:49 my_project/postprocessing/genefamily_msa
drwxr-xr-x 2 ajmedlar 1160867 4096 Apr 2 22:45 my_project/postprocessing/scaffolds
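The subcommands described in this document can be chained into a single script. The sketch below uses only the commands and options shown above, with the same placeholder sample values (SAMPLE_ID, SAMPLE.FASTA, SPECIES_NAME). The DRY_RUN variable and the run and glutton_pipeline helpers are assumptions introduced for this example and are not part of Glutton. By default the script only prints the commands so they can be checked first:

```shell
#!/bin/sh
# End-to-end sketch: build a reference, add a sample, align, scaffold.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
SPECIES=drosophila_melanogaster
DATABASE=metazoa
REFERENCE=reference.glt
PROJECT=./my_project
THREADS=10

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
glutton_pipeline() {
    run glutton build --species "$SPECIES" --database "$DATABASE" --output "$REFERENCE" &&
    run glutton setup --add --project "$PROJECT" --sample SAMPLE_ID \
        --contigs SAMPLE.FASTA --species SPECIES_NAME &&
    run glutton align --project "$PROJECT" --reference "$REFERENCE" --threads "$THREADS" &&
    run glutton scaffold --project "$PROJECT" --reference "$REFERENCE" --assembler soapdenovotrans
}

glutton_pipeline
```

Setting DRY_RUN=0 in the environment executes the commands instead of printing them. Chaining the steps with && means a failed build or alignment stops the script rather than scaffolding incomplete results.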