Because G-OnRamp is based on the Galaxy platform, the first step in learning how to use G-OnRamp is to become familiar with Galaxy. The “Overview of Galaxy” presentation in the Learning Materials section covers the basics you will need. The two screencasts from Galaxy linked here provide an introduction to getting data and comparing genomic features. If you want to learn more about Galaxy, visit the Learn Galaxy page on the Galaxy Wiki.
We have developed a comprehensive Galaxy workflow that produces multiple complementary datasets to facilitate the annotation of any eukaryotic genome. The entire workflow is shown below.
The G-OnRamp workflow is divided into four sub-workflows: sequence similarity, repeat regions, RNA-Seq, and gene predictions. These sub-workflows will produce the input datasets for the Hub Archive Creator, which will create the UCSC Genome Browser Assembly Hub.
- BLAST alignment: the genome assembly (in FASTA format) is the input dataset for the NCBI BLAST+ tool makeblastdb, which creates a nucleotide database for BLAST searches. The NCBI BLAST+ tblastn tool then searches this nucleotide database using a collection of protein query sequences from an informant species. The blastXmlToPsl and pslToBigPsl tools are used to convert the tblastn search results to the BigPsl format required by the Hub Archive Creator.
- BLAT alignment: RNA GenBank records are the input dataset for the gbToFasta tool, which converts the RNA records to FASTA format. The genome assembly and the RNA records (both in FASTA format) are the input datasets for the UCSC BLAT alignment tool, which searches the genome assembly using a collection of RNA query sequences from an informant species. The UCSC pslCDnaFilter tool filters the alignments using comparative criteria (selecting the near-best in-genome alignments for each cDNA) and non-comparative criteria (based only on the quality of an individual alignment). The UCSC pslCheck tool validates the PSL output. The UCSC pslPosTarget tool flips PSL strands so that the target strand is positive and implicit. The pslToBigPsl tool converts the BLAT search results to the BigPsl format required by the Hub Archive Creator.
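The BLAST alignment steps above can be sketched as an ordered list of command lines. The Python sketch below only builds the argument lists; it does not run the tools. The file names and the `prefix` parameter are hypothetical, and the options that the G-OnRamp Galaxy tool wrappers actually pass may differ. The makeblastdb and tblastn options shown are standard BLAST+ options (`-outfmt 5` requests XML output); the positional argument order for blastXmlToPsl is an assumption based on its usage message. The final pslToBigPsl conversion is omitted.

```python
def blast_alignment_commands(genome_fasta, protein_fasta, prefix):
    """Return the command lines for the BLAST alignment steps, in order.

    All file names derived from `prefix` are hypothetical placeholders.
    """
    db_name = f"{prefix}_db"
    xml_out = f"{prefix}.tblastn.xml"
    psl_out = f"{prefix}.tblastn.psl"
    return [
        # Step 1: build a nucleotide BLAST database from the genome assembly.
        ["makeblastdb", "-in", genome_fasta, "-dbtype", "nucl", "-out", db_name],
        # Step 2: search the database with protein queries, writing XML output.
        ["tblastn", "-query", protein_fasta, "-db", db_name,
         "-outfmt", "5", "-out", xml_out],
        # Step 3: convert the XML results to PSL (argument order assumed).
        ["blastXmlToPsl", xml_out, psl_out],
    ]


commands = blast_alignment_commands("genome.fa", "proteins.fa", "sample")
for argv in commands:
    print(" ".join(argv))
```

Building the commands as lists keeps each step explicit and makes the dependency between steps visible: the database name produced in step 1 is the `-db` argument in step 2, and the XML file written in step 2 is the input to step 3.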
TrfBig partitions the genome assembly into smaller chunks and then runs Tandem Repeats Finder (TRF) on each chunk to identify tandem repeats within each genomic region. Note that the output of TRF is in BED4+12 format.
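The partition-then-search idea can be illustrated with a short Python sketch. This is not the TrfBig implementation: the chunk size, the overlap, and the function names are hypothetical and chosen only to show how a sequence can be split into windows and how a hit found inside a window maps back to genome coordinates.

```python
def partition(seq_length, chunk_size, overlap):
    """Yield 0-based half-open (start, end) windows covering [0, seq_length)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    start = 0
    while start < seq_length:
        end = min(start + chunk_size, seq_length)
        yield (start, end)
        if end == seq_length:
            break  # the last window reached the end of the sequence
        start += step


def to_genome_coords(chunk_start, local_start, local_end):
    """Shift a hit reported relative to a chunk back to genome coordinates."""
    return (chunk_start + local_start, chunk_start + local_end)


chunks = list(partition(seq_length=25, chunk_size=10, overlap=2))
print(chunks)
print(to_genome_coords(chunk_start=8, local_start=3, local_end=7))
```

Overlapping windows are one way to avoid missing a repeat that straddles a chunk boundary; whether and how TrfBig overlaps its chunks is not stated here.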
RNA-Seq reads are mapped against the genome assembly by HISAT2, and StringTie assembles the mapped RNA-Seq reads into potential transcripts. The “junctions extract” subprogram in Regtools reports the locations of putative introns based on the spliced RNA-Seq reads in the BAM file. The RNA-Seq read coverage track is created by the “Convert BAM to BigWig” tool.
Gene models from three gene predictors (Augustus, GlimmerHMM, and SNAP) are produced using species-specific parameters if they are available. The gene prediction results are converted into the bigGenePred format by the Hub Archive Creator.
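To show how putative introns can be read off spliced alignments, the sketch below walks the CIGAR string of a single alignment and records each `N` (skipped region) operation as an intron interval. It is a simplified illustration, not the Regtools implementation: the function name is hypothetical, coordinates are 0-based and half-open, and any additional filtering that Regtools applies (for example on anchor length or intron size) is not modelled.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(ref_start, cigar):
    """Return (start, end) reference intervals for each N operation."""
    introns = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((pos, pos + length))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            pos += length
    return introns


print(introns_from_cigar(100, "50M200N30M1I20M300N10M"))
# → [(150, 350), (400, 700)]
```

Insertions (`I`) and soft clips (`S`) consume read bases but not reference bases, which is why they do not shift the reported intron coordinates.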
Below is a glossary of the tools that we use in the Homology, RNA-Seq, Repeat Regions, and Gene Predictions sub-workflows:
- BLAST alignment
NCBI BLAST+ makeblastdb: creates BLAST database from one or more FASTA files and/or BLAST databases.
NCBI BLAST+ tblastn: searches a translated nucleotide database using a protein query. Note that one should use the makeblastdb tool to convert the genome assembly into a BLAST database prior to performing a tblastn search.
blastXmlToPsl: converts BLAST output in XML format to the PSL format.
pslToBigPsl: transforms a file in PSL format to the BigPsl format.
- BLAT alignment
gbToFasta: converts RNA records to FASTA format.
UCSC BLAT alignment tool: searches the genome assembly using a collection of RNA query sequences from an informant species.
UCSC pslCDnaFilter: filters cDNA alignments using comparative criteria (selecting the near-best in-genome alignments for each cDNA) and non-comparative criteria (based only on the quality of an individual alignment).
UCSC pslCheck: validates the PSL output.
UCSC pslPosTarget: flips PSL strands so that the target strand is positive and implicit.
pslToBigPsl: transforms a file in PSL format to the BigPsl format.
HISAT2: a fast and sensitive spliced alignment program for mapping RNA-Seq reads. See the HISAT2 website for more information.
StringTie: a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. See the StringTie website for more information.
Convert BAM to BigWig: this tool calculates the alignment coverage from a BAM alignment file and converts the result into a BigWig file.
Regtools: extracts splice junctions from an RNA-Seq BAM file. For more information, check the link: https://regtools.readthedocs.io/en/latest/
TrfBig: partitions a genome assembly into smaller chunks and then uses Tandem Repeats Finder (TRF) to identify tandem repeats within each chunk.
Augustus: a gene prediction program for eukaryotes written by Mario Stanke and Oliver Keller. For more information check the link: http://bioinf.uni-greifswald.de/augustus/
Multi_Fasta_GlimmerHmm: a gene finder based on a Generalized Hidden Markov Model (GHMM). For more information check the link: https://ccb.jhu.edu/software/glimmerhmm/
SNAP: a general-purpose gene finding program suitable for both eukaryotic and prokaryotic genomes. SNAP is an acronym for Semi-HMM-based Nucleic Acid Parser. For more information, check the link: http://korflab.ucdavis.edu/software.html
Hub Archive Creator: this Galaxy tool converts a genome assembly and the results produced by different bioinformatics tools into an Assembly Hub so that the assembly and its evidence tracks can be visualized on the UCSC Genome Browser. For more information, check the links below:
JBrowse Archive Creator: this Galaxy tool converts a genome assembly and the results produced by different bioinformatics tools into an Assembly Hub so that the assembly and its evidence tracks can be visualized in JBrowse. For more information, check the links below:
For detailed G-OnRamp tutorials, see training materials from the previous workshops.