(*contributed equally to this work) In Press, Published: Dec 3, 2025 DOI: 10.21769/BioProtoc.5555
Reviewed by: Kif Liakath-Ali, Suresh Panthee, Anonymous reviewer(s)
Abstract
Functional enrichment analysis is essential for understanding the biological significance of differentially expressed genes. Commonly used tools such as g:Profiler, DAVID, and GOrilla are effective when applied to well-annotated model organisms. However, for non-model organisms, particularly bacteria and other microorganisms, curated functional annotations are often scarce. In such cases, researchers typically rely on homology-based approaches, using tools like BLAST to transfer annotations from closely related species. Although this strategy can yield some insights, it often introduces annotation errors and overlooks species-specific functions. To address this limitation, we present a user-friendly and adaptable method for creating custom annotation R packages using genomic data retrieved from NCBI. These packages can be loaded directly as libraries in the R environment and are compatible with the clusterProfiler package, enabling gene ontology and pathway enrichment analysis. We demonstrate this approach by constructing an R annotation package for Mycobacterium tuberculosis H37Rv. The annotation package is then used to analyze differentially expressed genes from a subset of the RNA-seq dataset GSE292409, which investigates the transcriptional response of M. tuberculosis H37Rv to rifampicin treatment. The chosen subset includes six samples: three untreated controls and three exposed to rifampicin for 1 h. Enrichment analysis was then performed on the differentially expressed genes to characterize the functional changes in response to treatment. This workflow provides a reliable and scalable solution for functional enrichment analysis in organisms with limited annotation resources. It also enhances the accuracy and biological relevance of gene expression interpretation in microbial genomics research.
Key features
• Comprehensive SQLite database with gene information and detailed annotation of all organisms in NCBI.
• Customized R annotation package built for Mycobacterium tuberculosis H37Rv by extracting species-specific records from the SQLite database using the taxonomic identifier.
• Gene ontology and KEGG enrichment analysis on significantly expressed genes from the RNA-seq dataset GSE292409 by importing the customized annotation package as an R library.
Keywords: Gene enrichment analysis
Background
Gene enrichment analysis is a method for understanding the functional patterns of a group of differentially expressed genes. While primary transcriptome analysis gives an overview of gene expression by quantifying RNA-seq reads mapping to different genomic regions, gene enrichment analysis adds functional significance [1]. Because individual genes alone do not adequately reflect the functional dynamics of gene expression, aggregating signals across groups of genes is critically important [2]. To date, a plethora of free resources is available for assigning ontology and function based on statistical thresholds. These tools let users provide a list of genes, select the appropriate organism and statistical parameters, and generate an ontology and pathway enrichment map. However, they often lack support for comprehensive, broad-spectrum analyses. Moreover, most do not allow seamless integration or reuse of results across different analytical workflows [3]. For larger datasets, issues with computational time and result accuracy also persist. Additionally, the need to rely on external tools for gene name conversion and result customization presents a further limitation.
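To make the idea of aggregating signal across a gene group concrete, the following is a minimal sketch of an over-representation test: a one-sided hypergeometric test per term, followed by Benjamini-Hochberg adjustment. It is written in Python with the standard library only and is not the R workflow used in this protocol. The gene names, term names, and function names (`hypergeom_upper_tail`, `benjamini_hochberg`, `enrich`) are hypothetical and chosen purely for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(de_genes, term_to_genes, background):
    """Test each term for over-representation among the DE genes."""
    background = set(background)
    de = set(de_genes) & background
    rows = []
    for term, genes in term_to_genes.items():
        members = set(genes) & background
        overlap = len(de & members)
        p = hypergeom_upper_tail(overlap, len(members), len(de), len(background))
        rows.append((term, overlap, p))
    adjusted = benjamini_hochberg([row[2] for row in rows])
    results = [(t, k, p, q) for (t, k, p), q in zip(rows, adjusted)]
    return sorted(results, key=lambda row: row[2])


# Toy data (hypothetical): 20 background genes, two terms, five DE genes.
background = [f"gene{i:02d}" for i in range(1, 21)]
term_to_genes = {
    "term_A": background[0:5],    # gene01..gene05
    "term_B": background[5:15],   # gene06..gene15
}
de_genes = ["gene01", "gene02", "gene03", "gene04", "gene16"]

results = enrich(de_genes, term_to_genes, background)
for term, overlap, p, q in results:
    print(f"{term}\toverlap={overlap}\tp={p:.4g}\tadjusted={q:.4g}")
```

In this toy set, four of the five differentially expressed genes fall in `term_A`, so that term receives a small p-value while `term_B` does not. The background set matters: restricting it to the genes that actually carry annotations changes `N` and therefore every p-value.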
Most gene ontology and pathway enrichment tools, such as g:Profiler, DAVID, GOrilla, BINGO, Enrichr, GOnet, ShinyGO, and KOBAS, provide enrichment analysis for well-annotated model organisms [4–11]. However, these tools offer limited support for organisms that are not commonly used as models, primarily because comprehensive functional annotations are absent. As a result, conducting enrichment analysis for such organisms remains a significant challenge. In many cases, when gene lists from these organisms are submitted, the tools either reject the input or fail to return meaningful results. To overcome this limitation, researchers often follow a multi-step approach: extracting the sequences of differentially expressed genes, identifying homologous sequences through similarity searches against reference databases of closely related species, and assigning gene ontology terms based on the best-matching hits. The assigned ontology terms are then subjected to enrichment analysis using tools that accept ontology term input, since many platforms depend on gene identifiers from well-characterized species. This has become a common strategy for functional profiling in organisms for which detailed annotations are not available in standard gene ontology resources. Despite its practical use, there is no widely accepted protocol or standardized method for functional annotation and enrichment analysis in organisms that lack sufficient reference data. To address this gap, we explored open-source platforms, community discussions, and public repositories to identify reliable and accessible methods capable of producing biologically relevant results consistent with those obtained for model organisms. This search led us to develop a custom annotation strategy in-house. We constructed organism-specific annotation packages using publicly available genomic data from NCBI and Expasy.
These annotation packages can be imported as libraries into the R environment and are fully compatible with other established packages from Bioconductor. They allow researchers to perform gene ontology and pathway enrichment analysis with ease and flexibility [12,13]. We found this approach to enhance reproducibility and analytical depth, offering a dependable framework for functional analysis using well-established tools and customizable workflows.
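As a rough illustration of how species-specific records might be pulled from a multi-organism SQLite database by taxonomic identifier, the sketch below uses Python's built-in `sqlite3` module on an in-memory toy database. It is not the authors' implementation. The table and column names (`gene_info`, `gene2go`, `gene_id`, `tax_id`, `symbol`, `go_id`, `evidence`) are assumptions loosely modelled on NCBI's gene_info and gene2go files, and the gene symbols and GO strings are placeholders rather than real annotations. The taxonomy IDs are real: 83332 is M. tuberculosis H37Rv and 9606 is Homo sapiens.

```python
import sqlite3

MTB_H37RV_TAX_ID = 83332  # NCBI Taxonomy ID for M. tuberculosis H37Rv


def build_toy_database():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene_info (
            gene_id INTEGER PRIMARY KEY,
            tax_id  INTEGER NOT NULL,
            symbol  TEXT
        );
        CREATE TABLE gene2go (
            gene_id  INTEGER NOT NULL,
            go_id    TEXT NOT NULL,
            evidence TEXT
        );
        """
    )
    # Placeholder rows: two genes for tax ID 83332, one for 9606.
    conn.executemany(
        "INSERT INTO gene_info VALUES (?, ?, ?)",
        [(1, 83332, "geneA"), (2, 83332, "geneB"), (3, 9606, "geneC")],
    )
    conn.executemany(
        "INSERT INTO gene2go VALUES (?, ?, ?)",
        [
            (1, "GO:placeholder1", "IEA"),
            (1, "GO:placeholder2", "IEA"),
            (2, "GO:placeholder1", "IEA"),
            (3, "GO:placeholder3", "IEA"),
        ],
    )
    conn.commit()
    return conn


def extract_species_annotations(conn, tax_id):
    """Return (gene_id, symbol, go_id, evidence) rows for one organism."""
    query = """
        SELECT g.gene_id, g.symbol, a.go_id, a.evidence
        FROM gene_info AS g
        JOIN gene2go AS a ON a.gene_id = g.gene_id
        WHERE g.tax_id = ?
        ORDER BY g.gene_id, a.go_id
    """
    # The tax ID is passed as a bound parameter, not formatted into the SQL.
    return conn.execute(query, (tax_id,)).fetchall()


conn = build_toy_database()
rows = extract_species_annotations(conn, MTB_H37RV_TAX_ID)
for row in rows:
    print(row)
```

A gene-to-GO table of this shape (gene identifier, GO term, evidence code) is the kind of input an organism annotation package is typically built from, so filtering by taxonomic identifier first keeps the downstream package limited to the organism of interest.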
Software and datasets
| Type | Software/dataset/resource | Version | Date | License | Access (free or paid) |
| --- | --- | --- | --- | --- | --- |
| Data | RNA sequencing dataset (GEO ID: GSE292409), which captures the transcriptional response of various Mycobacterium tuberculosis strains following drug treatment at multiple time points. Of the 83 samples, 6 were chosen for demonstration. Three samples that did not receive any drug treatment were designated as the control group. The remaining three samples were exposed to rifampicin for 1 h and were designated as the treatment group. | - | 19-03-2025 | - | Free |
| Software | SRA-Toolkit | 3.2.1 | 18-03-2025 | Public domain (United States Copyright Act) | Free |
| Software | FastQC | 0.12.1 | Released after 01-03-2023 | GNU General Public License v3.0 (GPL-3.0) | Free |
| Software | MultiQC | 1.28 | 21-03-2025 | GNU General Public License v3 (GPL-3.0-or-later) | Free |
| Software | TrimGalore | 0.6.10 | 02-02-2023 | GNU General Public License v3.0 (GPL-3.0) | Free |
| Software | Cutadapt | 5.0 | 13-12-2024 | MIT License | Free |
| Software | BWA | 0.7.19-r1273 | 23-03-2025 | GNU General Public License v3.0 | Free |
| Software | samtools | 1.21 | 12-09-2024 | MIT License | Free |
| Software | featureCounts from Subread package | 2.0.8 | 04-11-2024 | MIT License | Free |
| Software | R | 4.3.3 | 29-02-2024 | GNU General Public License v3.0 (GPL-3.0) | Free |
| Software | AnnotationForge | 1.44.0 | 24-10-2023 | Artistic-2.0 license | Free |
| Software | clusterProfiler | 4.10.1 | 08-03-2024 | Artistic-2.0 license | Free |
| Software | ggplot2 | 3.5.1 | 22-04-2024 | MIT + file LICENSE | Free |
| Operating system | Linux, Ubuntu (64-bit) | 20.04.6 | 23-03-2023 | GNU General Public License (GPL) | Free |
| Hardware | CPU: 24-core dual CPU (Intel® Xeon® Gold 6240R); RAM: 512 GB; Storage: 2 TB SSD recommended for generating the SQLite database and intermediate files used in the construction of the R annotation package; Network: high-speed internet recommended for downloading large datasets | - | - | - | - |
Procedure
Article information
Manuscript history
Submitted: Sep 10, 2025
Accepted: Nov 23, 2025
Published online: Dec 3, 2025
Copyright
© 2026 The Author(s); This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).
How to cite
Eden M., I. S. T. and Vetrivel, U. (2026). Enhanced RNA-Seq Expression Profiling and Functional Enrichment in Non-model Organisms Using Custom Annotations. Bio-protocol 16(9): e5555. DOI: 10.21769/BioProtoc.5555.
Category
Bioinformatics and computational biology
Bioinformatics
