Version 1.0: Ultrafast KEGG Ortholog Detection Directly from Short Reads

Motivations

The biggest challenges in a conventional RNA-seq workflow are de novo transcriptome assembly and annotation, which are time-consuming and computationally expensive. Running an assembler and an annotation tool usually requires a high-performance server with more than 100 GB of RAM and dozens of CPUs, which is not always available to labs that focus on non-model organisms, and the analysis can take several days or even weeks to finish.

In addition, most research questions in non-model organisms focus only on protein-coding genes and their underlying pathways. Protein-coding genes are only a small proportion of the whole transcriptome. Furthermore, only a proportion of these protein-coding genes is assigned to KEGG pathways: 33.67% (8438), 29.48% (4933) and 23.86% (6050) of protein-coding genes are involved in KEGG pathways for mouse, chicken and zebrafish, respectively. Therefore, many genes in annotation databases can be considered unrelated or uninformative for many non-model organisms.

Furthermore, reconstructing transcripts first and then searching a protein database to identify their homologs is not straightforward. The many intermediate steps and the programming skills required throughout the complex conventional pipeline pose additional challenges for many researchers.

Solutions

The need for multiple software packages, the high computational cost and the long run times motivated us to design a straightforward, assembly-free, all-in-one tool to tackle these challenges. Given that the majority of RNA-seq reads come from mRNA and are intron-free, we propose translating RNA-seq reads directly into all possible amino acid (AA) sequences in all six reading frames, and searching them against a protein database consisting only of protein-coding genes (orthologs) to identify their possible functional homologs.
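The six-frame translation step described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and reports ambiguous codons (for example, those containing N) as X. The function names are hypothetical and are not taken from the Seq2Fun source code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (illustrative assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(read):
    """Translate a read in three forward (+) and three reverse (-) frames."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translate("ATGGCCAAG").items():
        print(frame, peptide)
```

Each of the six peptides would then be a candidate query against the protein database; a real implementation would typically also split peptides at stop codons and discard fragments that are too short to match reliably.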

Therefore, we developed Seq2Fun, an ultra-fast, assembly-free, all-in-one tool based on a modern data structure, the full-text index in minute space (FM-index), and the Burrows-Wheeler transform (BWT), for functional quantification of RNA-seq reads from non-model organisms without transcriptome assembly or a reference genome.
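To give a feel for how a BWT-based FM-index supports fast exact matching, here is a small Python sketch of the textbook construction and backward search. It is a teaching example only: it builds the suffix array by naive sorting and stores full occurrence tables, which would not scale to a real protein database, and it is not a description of how Seq2Fun implements its index. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text += "$"
    # Naive suffix array: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    return "".join(text[i - 1] for i in suffix_array)


def count_occurrences(pattern, bwt_str):
    """Count exact occurrences of pattern using FM-index backward search."""
    alphabet = sorted(set(bwt_str))

    # first[c]: number of characters in the text that sort before c.
    first = {}
    total = 0
    for ch in alphabet:
        first[ch] = total
        total += bwt_str.count(ch)

    # occ[c][i]: number of occurrences of c in bwt_str[:i].
    occ = {ch: [0] for ch in alphabet}
    for b in bwt_str:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))

    # Narrow the suffix-array interval [lo, hi) from the last character backward.
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("MAKMAKW")  # toy amino acid "database"
    print(count_occurrences("MAK", index))
```

The search cost depends on the length of the query rather than the size of the indexed text, which is the property that makes this family of indexes attractive for matching millions of translated reads.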

Seq2Fun takes raw RNA-seq reads directly as input, performs a quality check, runs a translated search, and finally generates an ortholog abundance table.
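As a rough illustration of the final step, the sketch below tallies per-read ortholog assignments into a samples-by-orthologs count table. The input layout, the function name and the KEGG ortholog identifiers are made up for the example, and Seq2Fun's actual output format may differ.

```python
from collections import Counter


def build_abundance_table(assignments_by_sample):
    """Build a count table from per-read ortholog assignments.

    assignments_by_sample maps a sample name to a list with one entry per
    read: an ortholog ID, or None when the read had no hit (hypothetical
    input layout chosen for this example).
    """
    samples = list(assignments_by_sample)
    counts = {
        sample: Counter(ko for ko in hits if ko is not None)
        for sample, hits in assignments_by_sample.items()
    }
    orthologs = sorted(set().union(*counts.values()))

    # Header row, then one row per ortholog with a count for each sample.
    table = [["ortholog"] + samples]
    for ko in orthologs:
        table.append([ko] + [counts[sample][ko] for sample in samples])
    return table


if __name__ == "__main__":
    demo = {
        "sample_A": ["K00001", "K00002", "K00001", None],
        "sample_B": ["K00002"],
    }
    for row in build_abundance_table(demo):
        print("\t".join(str(cell) for cell in row))
```

Reads without a hit are simply left out of the counts here; whether and how such reads are reported is a design choice for the real tool.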

Figure 2: The workflow of directly translated search. It skips the transcriptome de novo assembly and directly conducts a translated search for each read.