Gene Networks Based on the Graphical Gaussian Model
Author: Shisong Ma
Affiliation: Department of Plant Biology & Genome Center, University of California, Davis, USA
For correspondence: sma@ucdavis.edu
Vol 2, Iss 4, 2/20/2012
DOI: https://doi.org/10.21769/BioProtoc.119

[Abstract] This protocol describes how to build a gene network based on the graphical Gaussian model (GGM) from large-scale microarray data. GGM uses the partial correlation coefficient (pcor) to infer co-expression relationships between genes. Compared with the traditional Pearson's correlation coefficient, partial correlation is a better measure of direct dependency between genes. However, calculating pcor normally requires a number of observations (microarray slides) that greatly exceeds the number of variables (genes). This protocol uses a regularized method to circumvent this obstacle and can build a network for ~20,000 genes from ~2,000 microarray slides. For more details, see Ma et al. (2007). For help regarding the script, please contact the author.
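To make the difference between Pearson's correlation and partial correlation concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not the protocol's method: it uses the textbook first-order formula (conditioning on a single third gene), whereas the GGM conditions on all other genes at once. The toy values and the names `pearson` and `partial_corr` are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical toy data: gene z drives both x and y. The noise terms are
# constructed to be uncorrelated with each other, so x and y have no
# direct dependency once z is accounted for.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise_x = [0.3, -0.3, -0.3, 0.3, 0.3, -0.3, -0.3, 0.3]
noise_y = [0.2, 0.2, -0.2, -0.2, 0.2, 0.2, -0.2, -0.2]
x = [2 * v + e for v, e in zip(z, noise_x)]
y = [-v + e for v, e in zip(z, noise_y)]

print("Pearson r(x, y):      ", round(pearson(x, y), 3))
print("Partial r(x, y | z):  ", round(partial_corr(x, y, z), 3))
```

In this toy case the Pearson coefficient is close to -1 because both genes follow z, while the partial correlation is close to 0, which is the behavior that makes pcor a better indicator of direct dependency.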


Data and Software

  1. Data
    Large-scale microarray data:
    The microarray data should be derived from a single platform, preferably Affymetrix slides. Good examples are the Affymetrix Arabidopsis ATH1 Genome Array, the Affymetrix Human Genome U133 Plus 2.0 Array, and the Affymetrix Mouse Genome 430 2.0 Array. A recommended source for this type of data is the Gene Expression Omnibus (GEO) at NCBI (http://www.ncbi.nlm.nih.gov/geo/). The number of slides should be larger than 1,000.
  2. Software
    1. R (http://www.r-project.org/)
    2. The GeneNet package for R:
      (http://www.uni-leipzig.de/~strimmer/lab/software/genenet/index.html)
    3. Cytoscape (http://www.cytoscape.org/)
    4. Perl and C++ software environment

Equipment

  1. Personal computer: Intel Core2 E6420 processor (or similar processing capability)

Procedure

  1. Preparation of the microarray data
    1. Download the microarray data from your preferred database and format it into a single table of expression intensities, with each row representing a gene and each column representing a microarray experiment. A good example for Arabidopsis transcriptomes can be found at http://affy.arabidopsis.info/narrays/help/usefulfiles.html; use the file titled "super bulk gene download".
    2. Remove any columns (experiments) containing a large number of ‘null’ measurements, and then remove any remaining rows (genes) containing ‘null’ measurements.
    3. Normalize the expression intensities between experiments using the quantile normalization method.
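The null filtering and quantile normalization steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's script: the table is held as a dict of gene id to list of values with `None` standing for ‘null’, the threshold `max_null_fraction` is a hypothetical parameter (the protocol does not specify what counts as a "large number"), and ties within a column are ranked in an arbitrary but stable order rather than averaged.

```python
def drop_nulls(table, max_null_fraction=0.1):
    """Filter a table {gene_id: [value or None, ...]} as in steps 1.2.

    Returns the cleaned table and the indices of the retained columns.
    max_null_fraction is an assumed threshold, not given by the protocol.
    """
    genes = list(table)
    n_cols = len(table[genes[0]])
    # First drop experiments (columns) with too many nulls
    keep = [j for j in range(n_cols)
            if sum(table[g][j] is None for g in genes) / len(genes)
            <= max_null_fraction]
    # Then drop genes (rows) that still contain any null
    cleaned = {}
    for g in genes:
        row = [table[g][j] for j in keep]
        if all(v is not None for v in row):
            cleaned[g] = row
    return cleaned, keep

def quantile_normalize(table):
    """Give every experiment (column) the same distribution of intensities."""
    genes = list(table)
    n_cols = len(table[genes[0]])
    columns = [[table[g][j] for g in genes] for j in range(n_cols)]
    # For each column, gene indices ordered from smallest to largest value
    orders = [sorted(range(len(genes)), key=col.__getitem__) for col in columns]
    # Reference distribution: mean of the k-th smallest value across columns
    reference = [sum(columns[j][orders[j][k]] for j in range(n_cols)) / n_cols
                 for k in range(len(genes))]
    normalized = {g: [0.0] * n_cols for g in genes}
    for j, order in enumerate(orders):
        for rank, i in enumerate(order):
            normalized[genes[i]][j] = reference[rank]
    return normalized

# Hypothetical example: 4 genes x 4 experiments, None marks a 'null'
raw = {
    "g1": [5.0, None, 4.0, 3.0],
    "g2": [2.0, None, 1.0, 4.5],
    "g3": [3.0, 1.0, None, 6.0],
    "g4": [4.0, None, 2.0, 8.0],
}
clean, kept_columns = drop_nulls(raw, max_null_fraction=0.5)
norm = quantile_normalize(clean)
```

For a real table of ~20,000 genes by >1,000 slides, an array library or a dedicated normalization tool would be a more practical choice; the sketch only shows the logic of the two steps.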

  2. Random sampling and partial correlation calculation
    1. Randomly pick 2,000 genes from the large expression table and make a small expression table for these 2,000 genes. A Perl script can be written for this step.
    2. Use the GeneNet package to calculate the partial correlations between these 2,000 randomly selected genes. The GeneNet package should be launched within the R environment; the specific function to use is ‘ggm.estimate.pcor’ with the default settings.
    3. Save the resulting partial correlation matrix, together with the gene ids for the 2,000 genes.
    4. Repeat steps 1 to 3 at least 1,999 more times; the more the better. After these calculations, most gene pairs should have been sampled >10 times, each time with a calculated pcor.
    5. Determine the final pcor value for every gene pair by consolidating the resulting pcor matrices: for each pair, keep the pcor with the smallest absolute value among all the iterations in which the pair was sampled. This can be done with a C++ script.
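The sampling and consolidation logic of this section can be sketched as below. The protocol suggests Perl and C++ for these steps; Python is used here only as a compact stand-in, and the function names are hypothetical. The pcor matrices themselves would come from ‘ggm.estimate.pcor’ in R, so the small matrices in the example are made-up placeholders. The first function also checks the ">10 times" claim: a given pair is drawn together in one iteration with probability k(k-1)/(n(n-1)) for k sampled genes out of n.

```python
import random
from itertools import combinations

def expected_pair_samples(n_genes, subset_size, iterations):
    """Expected number of iterations in which a given gene pair is co-sampled."""
    p = (subset_size * (subset_size - 1)) / (n_genes * (n_genes - 1))
    return iterations * p

def sample_genes(gene_ids, subset_size, rng):
    """Draw one random subset of genes without replacement (step 1)."""
    return rng.sample(gene_ids, subset_size)

def consolidate(final, genes, pcor_matrix):
    """Merge one iteration's pcor matrix into `final` (step 5).

    `final` maps a sorted gene pair to the pcor with the smallest absolute
    value seen so far; `pcor_matrix[i][j]` is the pcor of genes[i], genes[j].
    """
    for i, j in combinations(range(len(genes)), 2):
        pair = tuple(sorted((genes[i], genes[j])))
        value = pcor_matrix[i][j]
        if pair not in final or abs(value) < abs(final[pair]):
            final[pair] = value
    return final

# With ~20,000 genes, 2,000 genes per draw and 2,000 draws in total,
# each pair is expected to be sampled about 20 times.
coverage = expected_pair_samples(20000, 2000, 2000)

# One random draw from a hypothetical list of gene ids
all_genes = [f"gene{i}" for i in range(100)]
subset = sample_genes(all_genes, 10, random.Random(0))

# Two placeholder iterations that share the pair (a, c)
final = {}
consolidate(final, ["a", "b", "c"],
            [[1.0, 0.5, -0.2], [0.5, 1.0, 0.1], [-0.2, 0.1, 1.0]])
consolidate(final, ["c", "a", "d"],
            [[1.0, 0.05, 0.3], [0.05, 1.0, -0.4], [0.3, -0.4, 1.0]])
```

After the two placeholder iterations, the pair (a, c) holds 0.05 rather than -0.2, since 0.05 has the smaller absolute value. Keeping the minimum is a conservative choice: a pair is only reported as strongly linked if it looks linked in every sampled gene background.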

  3. Network building and analysis
    1. To test the significance of the resulting pcors, the function ‘ggm.test.edges’ in GeneNet can be used. From all the pcors, ~2,000,000 can be randomly selected and fed into the function to calculate a p-value for significance.
    2. Depending on the p-values, a cutoff for the pcors can be set. Reasonable choices are 0.1, 0.08, or 0.05. Any pcor with an absolute value larger than the chosen cutoff is retained.
    3. A Pearson's correlation coefficient filter should then be applied: gene pairs with a Pearson's correlation coefficient between -0.3 and 0.3 should be removed.
    4. After the pcor selection and the Pearson's correlation coefficient filter, the remaining gene pairs are considered to interact with each other and can be used to build a gene network in Cytoscape. The network analysis can also be done within Cytoscape.
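The two filters and the hand-off to Cytoscape can be sketched as follows. This is an illustrative Python sketch, not the author's script. It assumes the consolidated pcors and the Pearson coefficients are available as dicts keyed by gene pair, treats a Pearson value of exactly ±0.3 as retained (the protocol does not say how the boundary is handled), and writes Cytoscape's simple interaction format (SIF), in which each line is "source, interaction type, target". The interaction label "pcor" and the gene ids are hypothetical.

```python
def select_edges(pcor, pearson, pcor_cutoff=0.05, pearson_cutoff=0.3):
    """Return gene pairs that pass both filters.

    pcor, pearson: dicts mapping (gene_a, gene_b) -> coefficient.
    A pair is kept when |pcor| > pcor_cutoff and |Pearson r| >= pearson_cutoff.
    """
    edges = []
    for pair, value in sorted(pcor.items()):
        if abs(value) <= pcor_cutoff:
            continue  # pcor too weak
        if abs(pearson[pair]) < pearson_cutoff:
            continue  # Pearson coefficient between -0.3 and 0.3
        edges.append(pair)
    return edges

def sif_lines(edges, interaction="pcor"):
    """Format edges as tab-delimited SIF lines for import into Cytoscape."""
    return [f"{a}\t{interaction}\t{b}" for a, b in edges]

# Hypothetical coefficients for three gene pairs
pcor = {("g1", "g2"): 0.12, ("g1", "g3"): 0.02, ("g2", "g3"): -0.09}
pearson = {("g1", "g2"): 0.8, ("g1", "g3"): 0.9, ("g2", "g3"): -0.1}

edges = select_edges(pcor, pearson)
lines = sif_lines(edges)
```

To produce a file for Cytoscape, the lines can be written out with something like `open("network.sif", "w").write("\n".join(lines) + "\n")`, where the file name is arbitrary.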

Acknowledgments

This protocol was developed by the author in Hans Bohnert’s lab, Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. The work was supported by grants from the National Science Foundation Plant Genome Project (DBI-0223905) and University of Illinois at Urbana-Champaign institutional grants.

References

  1. Ma, S., Gong, Q. and Bohnert, H. J. (2007). An Arabidopsis gene network based on the graphical Gaussian model. Genome Res 17(11): 1614-1625.
  2. Schafer, J. and Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 4: Article32.




How to cite this protocol: Ma, S. (2012). Gene Networks Based on the Graphical Gaussian Model. Bio-protocol 2(4): e119. DOI: 10.21769/BioProtoc.119




Questions & Answers:

5/12/2015 2:07:30 AM  

Prashanth Suravajhala
Bioclues.org; Bioinformatics.Org

This was a very useful protocol indeed. While proposing a six-point classification scoring schema for predicting the function of hypothetical proteins, we wondered whether two interacting proteins shown in our proposed hypothome (interactOME of HYPOTHetical proteins) could be co-expressed. We checked the transcriptomic profiles, although we used GUI-based web models to draw inferences from this protocol.
