Sequence Alignment Using Machine Learning for Accurate Template-based Protein Structure Prediction

Shuichiro Makigaki; Takashi Ishida

doi:10.21769/BioProtoc.3600


Peer-reviewed


Published: May 5, 2020, Vol 10, Iss 9. DOI: 10.21769/BioProtoc.3600

Reviewed by: Prashanth N Suravajhala, Jayaraman Valadi, L N Chavali


The authors used this protocol in: Bioinformatics (Jan 2020)


Abstract

Template-based modeling, the process of predicting the tertiary structure of a protein by using homologous protein structures, is useful when good templates are available. Indeed, modern homology detection methods can find remote homologs with high sensitivity. However, the accuracy of template-based models generated from homology-detection-based alignments is often lower than that of models generated from ideal alignments. In this study, we propose a new method that generates pairwise sequence alignments for more accurate template-based modeling. Our method trains a machine learning model using the structural alignments of known homologs. When calculating sequence alignments, the method dynamically predicts a substitution score from the trained model instead of using a fixed substitution matrix.

Keywords: Template-based modeling, Homology modeling, Sequence alignment, Machine learning, k-Nearest Neighbor

Background

Proteins are key molecules in biology, biochemistry, and pharmaceutical sciences. To reveal the functions of proteins, it is essential to understand the relationship between protein structure and function. Protein structures can be determined experimentally; these structures are often deposited in and accessible through the Protein Data Bank (PDB) (wwPDB consortium, 2018). However, despite improvements in experimental methods for determining protein structures, the speed at which amino acid sequences can be revealed has overtaken our ability to ascertain the corresponding proteins' structures (Muhammed et al., 2019). Therefore, protein structure prediction remains essential.

Among the various methods for protein structure prediction, template-based modeling, also called homology modeling, predicts structures based on templates and their sequence alignments to a target protein. Template structures are the structures of homologous proteins, often found by homology detection methods. Currently, template-based modeling methods are the most practical because the predicted models are often accurate if good templates and protein sequence alignments can be found. Such accurate models can be used for computer-aided drug design (CADD).

Indeed, recent homology search methods have been able to detect remote homologs (Boratyn et al., 2012; Zimmermann et al., 2018). However, sufficiently accurate structure models sometimes cannot be obtained because the quality of the sequence alignment generated by the homology detection program is poor. If a more accurate model is required, researchers must manually edit alignments to improve their quality before modeling. In structural alignment, the structural difference between a target protein structure and a template protein structure is minimized; thus, sequence alignments generated by structural alignment are almost ideal for template-based modeling. The sequence alignments generated by homology detection methods are often dissimilar to those generated by structural alignment, especially for remote homologs. Thus far, a method's ability to detect remote homologs has been prioritized because models cannot be generated without a template. However, to achieve higher-accuracy template-based modeling, improving sequence alignment generation is a critical open problem. This problem has been mentioned in several studies (Kopp et al., 2007) in which researchers tried to improve alignments manually based on their knowledge of biology; fully automated methods are still required.

Recently, machine learning methods have proven powerful in various fields (Lyons et al., 2014; Cao et al., 2016; Wang, Peng, et al., 2016; Wei and Zou, 2016; Manavalan and Lee, 2017; Wang, Sun, et al., 2017). Machine learning also seems promising for tackling the problem of alignment generation for homology modeling. However, this topic has not been studied because it is challenging to frame alignment generation as a classification or regression problem.

To address this problem, we proposed a new sequence alignment generation protocol based on a machine learning model that learns the structural alignments of known homologs (Makigaki and Ishida, 2019). While aligning sequences with a dynamic programming algorithm, we dynamically predict each substitution score from a k-Nearest Neighbor (k-NN) model instead of taking it from a fixed substitution matrix or profile comparison. Machine learning is used only in this substitution score prediction step.
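The idea of replacing a fixed substitution matrix with a predicted score can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `knn_score` and `align`, the one-dimensional toy feature, the training labels, the linear gap penalty, and the use of global alignment are all assumptions made for the example. The k-NN search is brute force here, standing in for a dedicated k-NN library.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def knn_score(x: Vector, train_x: Sequence[Vector],
              train_y: Sequence[int], k: int = 3) -> float:
    """Return the fraction of the k nearest training vectors labeled 1."""
    # Brute-force search by squared Euclidean distance (hypothetical stand-in
    # for a k-NN library).
    order = sorted(
        range(len(train_x)),
        key=lambda t: sum((a - b) ** 2 for a, b in zip(x, train_x[t])),
    )
    nearest = order[:k]
    return sum(train_y[t] for t in nearest) / len(nearest)


def align(query: str, template: str, score: Callable[[int, int], float],
          gap: float = -1.0) -> Tuple[str, str, float]:
    """Global alignment where score(i, j) is predicted, not looked up."""
    n, m = len(query), len(template)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0], ptr[i][0] = dp[i - 1][0] + gap, "U"
    for j in range(1, m + 1):
        dp[0][j], ptr[0][j] = dp[0][j - 1] + gap, "L"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (dp[i - 1][j - 1] + score(i - 1, j - 1), "D"),  # align pair
                (dp[i - 1][j] + gap, "U"),                      # gap in template
                (dp[i][j - 1] + gap, "L"),                      # gap in query
            ]
            dp[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the alignment.
    a: List[str] = []
    b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "D":
            a.append(query[i - 1]); b.append(template[j - 1]); i -= 1; j -= 1
        elif move == "U":
            a.append(query[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(template[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b)), dp[n][m]


# Toy training set (assumed): one feature (1.0 = identical residues),
# label 1 = the pair was aligned in a structural alignment.
train_x = [[1.0], [1.0], [0.0], [0.0]]
train_y = [1, 1, 0, 0]
query, template = "ACGT", "AGT"


def predicted(i: int, j: int) -> float:
    feature = [1.0 if query[i] == template[j] else 0.0]
    # Subtract 0.5 so that unlikely pairs receive a negative score.
    return knn_score(feature, train_x, train_y, k=1) - 0.5


aligned_query, aligned_template, total = align(query, template, predicted)
print(aligned_query)
print(aligned_template)
```

Passing the scorer as a callable keeps the dynamic programming code independent of how scores are produced, so a fixed matrix and a learned model can be swapped without touching the alignment routine.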

The proposed method is valuable for researchers who use template-based modeling with remote homologs whose sequence identity is not high. In this paper, we present an overview of our method as a procedure; more detailed usage of our tool and some examples are available in the source code repository (https://github.com/shuichiro-makigaki/exmachina).

Equipment

1. Computer
   a. > 128 GiB RAM and > 150 GiB free storage are recommended
   b. Linux (> 3.10) or SUSE Linux Enterprise Server 12

Software

1. PSI-BLAST (> v2.9)
   Used to generate the Position Specific Scoring Matrix (PSSM) of an amino acid sequence
   Download URL: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download (Last access date: 2020-02-22)
   Installation document URL: https://www.ncbi.nlm.nih.gov/books/NBK279690/
2. TM-align (> v20190822) (Zhang and Skolnick, 2005)
   Used to generate structural alignments of homologs
   Download and installation document URL: https://zhanglab.ccmb.med.umich.edu/TM-align/ (Last access date: 2020-02-22)
3. Implementation: Source code and installation document are available in the source code repository.
   Download URL: https://github.com/shuichiro-makigaki/exmachina/archive/master.zip
   Installation procedure: https://github.com/shuichiro-makigaki/exmachina#how-to-use
   (Last access date: 2020-02-22)
   a. Python 3.6: Required Python packages are listed in the repository.
   b. FLANN (Muja and Lowe, 2009): k-Nearest Neighbor implementation. The installation procedure also covers FLANN installation.
4. Structural Classification of Proteins (SCOP) database
   The SCOP database classifies proteins by class, fold, superfamily (SF), family, and domain based on manually curated function/structure classifications. Because it contains redundant sequences, we used the SCOP40 database instead, which contains only domains whose sequence identity is < 40%, to avoid overfitting and reduce execution time.
   Download URL: https://scop.berkeley.edu/astral/pdbstyle/ver=1.75 (Last access date: 2020-02-22)
5. UniRef (The UniProt Consortium, 2016) database
   For PSSM generation, we used PSI-BLAST (Altschul et al., 1997) with three iterations against the UniRef90 database.
   Download URL: https://www.uniprot.org/downloads#unireflink (Last access date: 2020-02-22)
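As a rough illustration of how the two external tools listed above might be invoked, the sketch below builds command lines for a three-iteration PSI-BLAST run that writes an ASCII PSSM and for a pairwise TM-align run. The helper names, the file names (`query.fasta`, `query.pssm`, `domain_a.ent`, `domain_b.ent`), and the database name `uniref90` are hypothetical placeholders; the repository's own scripts may call these tools differently.

```python
from typing import List


def psiblast_pssm_command(query_fasta: str, database: str, pssm_out: str,
                          iterations: int = 3, threads: int = 1) -> List[str]:
    """Build a psiblast command that writes an ASCII PSSM for one sequence."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_out,
        "-num_threads", str(threads),
    ]


def tmalign_command(structure_a: str, structure_b: str) -> List[str]:
    """Build a TM-align command that structurally aligns two structure files."""
    return ["TMalign", structure_a, structure_b]


# Placeholder inputs; replace with real paths and a formatted BLAST database.
psiblast_cmd = psiblast_pssm_command("query.fasta", "uniref90", "query.pssm")
tmalign_cmd = tmalign_command("domain_a.ent", "domain_b.ent")

print(" ".join(psiblast_cmd))
print(" ".join(tmalign_cmd))
```

Once both tools are installed and on the `PATH`, each list can be passed to `subprocess.run(cmd, check=True)`; building the arguments as a list avoids shell quoting problems with file paths.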

Procedure


How to cite

Makigaki, S. and Ishida, T. (2020). Sequence Alignment Using Machine Learning for Accurate Template-based Protein Structure Prediction. Bio-protocol 10(9): e3600. DOI: 10.21769/BioProtoc.3600.
