Category archive: biodatabase

Google to Open Access to 10,000 Autism Genomes

http://www.autismspeaks.org/

 

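Google and the autism research organization Autism Speaks plan to work together: Google will use cloud technology to store the complete genomes of 10,000 children with autism and their siblings and parents, in order to speed up research into the disorder.

This will be the largest genome collection effort to date. The database is expected to open within a year. All qualified researchers will be able to access it remotely through the cloud, and Google will provide the tools needed for analysis.

Studying genomes is an important route to tackling conditions such as autism, Alzheimer's disease and cancer. By studying the genes of people with autism, scientists have already found that autism is not one condition but many. Studying whole genomes can give a better understanding of what triggers the disorder, which patients are at higher risk, and what treatments might be possible.

Genomes take an enormous amount of disk space to store. A single genome requires about 100 GB, whereas an ordinary computer hard drive holds at most the equivalent of about 10 optical discs. That puts storage beyond the reach of many universities and research institutions.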
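As a rough check on the scale involved, the two figures quoted above (10,000 genomes at about 100 GB each) can simply be multiplied out. The snippet below is back-of-the-envelope arithmetic only; the decimal unit convention (1 TB = 1,000 GB) is an assumption, and real storage needs would also depend on compression and redundancy.

```python
# Back-of-the-envelope storage estimate using the figures quoted in the article.
GENOMES = 10_000        # number of genomes to be stored (from the article)
GB_PER_GENOME = 100     # approximate size of one genome in GB (from the article)


def total_storage_gb(genomes: int, gb_per_genome: float) -> float:
    """Return the total storage needed, in gigabytes."""
    return genomes * gb_per_genome


total_gb = total_storage_gb(GENOMES, GB_PER_GENOME)
total_tb = total_gb / 1_000   # assumption: decimal units, 1 TB = 1,000 GB
total_pb = total_tb / 1_000   # assumption: decimal units, 1 PB = 1,000 TB

print(f"{total_gb:,.0f} GB = {total_tb:,.0f} TB = {total_pb:g} PB")
```

Under these assumptions the collection comes to roughly one petabyte, which is consistent with the article's point that most universities could not host it locally.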

Genome Sequences of 3,000 Rice Accessions Published

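The paper describing the sequencing of 3,000 rice genomes was recently published in GigaScience, and the full set of rice genome sequences has been released in citable form in GigaDB, the journal's affiliated open-access database.
GigaScience, co-founded by BGI and BioMed Central, is an open-access, open-data journal for big data. The large volume of genetic information produced and released by the 3,000 Rice Genomes Project will ultimately be applied to smart breeding. Natural variation exists between plants, and understanding the genetic mechanisms behind these differing traits will help breeders develop hybrid varieties that adapt well to different environments.
Breeding practice to date has generally relied on observable traits to guide the crossing of candidate plants, in the hope that the offspring will show a combined and improved set of desired traits, such as drought tolerance, disease and pest resistance, higher yield and greater nutritional value. When two plants are crossed, however, the resulting genetic make-up often disappoints breeders, because unknown gene interactions can limit, modify or alter the expression of the selected traits. The result is often trial and error over several successive rounds of breeding.
This process demands a great deal of effort and labor. Data made freely available to plant breeders and scientists worldwide will therefore greatly advance research into genotype-phenotype associations in rice, and provide a rich resource for deepening our understanding of plant biology.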

 

Rice, Oryza sativa L., is the staple food for half the world’s population. By 2030, the production of rice must increase by at least 25% in order to keep up with global population growth and demand. Accelerated genetic gains in rice improvement are needed to mitigate the effects of climate change and loss of arable land, as well as to ensure a stable global food supply.

Findings

We resequenced a core collection of 3,000 rice accessions from 89 countries. All 3,000 genomes had an average sequencing depth of 14×, with average genome coverages and mapping rates of 94.0% and 92.5%, respectively. From our sequencing efforts, approximately 18.9 million single nucleotide polymorphisms (SNPs) in rice were discovered when aligned to the reference genome of the temperate japonica variety, Nipponbare. Phylogenetic analyses based on SNP data confirmed differentiation of the O. sativa gene pool into 5 varietal groups – indica, aus/boro, basmati/sadri, tropical japonica and temperate japonica.

Conclusions

Here, we report an international resequencing effort of 3,000 rice genomes. This data serves as a foundation for large-scale discovery of novel alleles for important rice phenotypes using various bioinformatics and/or genetic approaches. It also serves to understand the genomic diversity within O. sativa at a higher level of detail. With the release of the sequencing data, the project calls for the global rice community to take advantage of this data as a foundation for establishing a global, public rice genetic/genomic database and information platform for advancing rice breeding technology for future rice improvement.

Keywords: Oryza sativa; Genetic resources; Genome diversity; Sequence variants; Next generation sequencing

Data description

Purpose of data acquisition

For much of the world’s poor, rice (O. sativa L.) is the cereal that provides the majority of daily calories in their staple diet. Rice is also known for its tremendous within-species genetic diversity and varietal group differentiation [1,2]. Rice productivity has more than doubled in recent decades, resulting primarily from the Green Revolution and continued breeding efforts since the 1960s. However, in order to meet the demands imposed by the projected increase in global population, the world’s rice production has to increase by 25% or more by 2030 [3]. This increase has to be achieved with less land, less water and under more severe environmental stresses due to climate change. Thus, accelerated genetic gains are needed in the next few decades to improve yield potential and stability, and grain quality of rice. This requires more complete knowledge of the genetic diversity in the O. sativa gene pool, associations of diverse alleles with important rice traits, and systematic exploitation of this rich genetic diversity by integrating knowledge-based tools into rice improvement using innovative breeding strategies [4-6].
To date, a few studies on rice have been undertaken to discover allelic variants through next generation sequencing (NGS) [7-9]. Unfortunately, these studies have been unable to provide a complete picture of the total genetic diversity within the O. sativa gene pool, due to either the small sample size of sequenced accessions [7], or the low-coverage sequencing depth of the genomes [8,9]. Here, we report an international effort to extend significantly our understanding of the total genetic diversity within the O. sativa gene pool by re-sequencing 3,000 O. sativa genomes using Illumina-based NGS. Our ultimate goal is to establish, through collective efforts by the international scientific community, a public rice database containing genetic and genomic information suitable for advancing rice breeding technology.

Selection of germplasm

A total of 3,000 germplasm accessions were chosen for sequencing, including 2,466         accessions from the International Rice Genebank Collection (IRGC) at the International         Rice Research Institute (IRRI), and 534 accessions from the China National Crop Gene         Bank (CNCGB) in the Institute of Crop Sciences, Chinese Academy of Agricultural Sciences         (CAAS). The 2,466 accessions (in Additional file 1: Table S1A ) contributed by IRRI represent a panel that was randomly selected from         a core collection of 12,000 O. sativa accessions that was established by a semi-stratified selection scheme from more than         101,000 rice accessions in the IRGC; taking into account factors, such as the country         of origin, eco-cultural type and varietal grouping with even coverage of the name         space while limiting potential duplicates from each country, and complemented by specific,         nominated entries from IRRI and the Centre de Coopération Internationale en Recherche         Agronomique pour le Développement (Cirad). The 534 accessions (in Additional file         1: Table S1B) contributed by CAAS included a mini-core collection of 246 accessions         selected from a core collection of 932 accessions established in the same way from         the 61,470 O. sativa accessions preserved in the CNCGB [10], plus 288 accessions selected based on their isozyme diversity [1], and used as parental lines in the international rice molecular breeding network         [2]. Together, the sampled 3,000 rice accessions came from 89 different countries/regions,         77.1% of which are from the centers of rice genetic diversity -Southeast Asia (33.9%),         South Asia (25.6%) and China (17.6%) (Figure 1).

Additional file 1: Table S1A. Information for the 2,466 rice accessions from the International Rice Genebank Collection         at the International Rice Research Institute. Table S1B. Information for the 534 rice accessions from the China National Crop Genebank and         the CAAS working collections.

Figure 1. Geographical distribution of the 3,000 sampled rice accessions from 89 countries (see Additional file 1: Tables S1A and S1B). The numbers in parentheses after each region are the numbers of countries in the region.

Genetic stocks derived from the O. sativa accessions were generated for each of the sampled 3,000 rice accessions by one or         more cycles of single-seed descent purification under field or screen-house conditions.         New accession numbers were assigned to seeds derived from one or more rounds of multiplication         starting from a single plant of each source accession. As of March 2013, new accession         numbers have been assigned to 1,958 of the IRRI accessions. Purified seeds of the         sequenced accessions are (or will be available) from the IRGC or CNCGB as genetic         stocks. Information on obtaining seeds from the IRGC can be found at [11] and from the CNCGB at [12].

Sequencing

Genomic DNA was prepared from bulk harvested leaves of a single young plant for each         sampled accession by a modified CTAB method either at IRRI or at CAAS. Genomic DNA         samples were then shipped to BGI-Shenzhen and were used to construct Illumina index         libraries following the manufacturer’s protocol. Following quality control, at least         3 μg genomic DNA of each sample was randomly fragmented by sonication and size-fractionated         by electrophoresis, and DNA fragments of approximately 500 bp were purified. Purified         500 bp DNA fragments from each of the 24 accessions were labeled independently using         distinct 6 bp nucleotide multiplex identifiers, followed by pooling prior to library         construction for NGS. Each sequencing library was sequenced in six or more lanes on         the HiSeq2000 platform and 90 bp paired-end reads were generated. Subsequently, the         reads from each sample were extracted based on their unique nucleotide multiplex identifiers         as 83 bp reads (90 – 6 – 1, where 1 is the ligation base “T”). To ensure high quality,         raw data was filtered by deleting reads having adapter contamination or containing         more than 50% low quality bases (quality value ≤ 5).
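The read-length arithmetic and the quality filter described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual pipeline: it assumes the 6 bp identifier and the ligation base sit at the start of each read, that qualities use a Phred+33 ASCII encoding (the `offset` parameter), and that adapter contamination can be detected by a plain substring match against an `adapter` sequence supplied by the caller. All function names are hypothetical.

```python
INDEX_LENGTH = 6                 # multiplex identifier length in bp (from the text)
LIGATION_BASES = 1               # the ligation base "T" (from the text)
LOW_QUALITY_MAX = 5              # quality values <= 5 count as low quality (from the text)
MAX_LOW_QUALITY_FRACTION = 0.5   # reads with more than 50% low-quality bases are dropped


def strip_identifier(read: str) -> str:
    """Remove the identifier and ligation base (assumed to lead the read): 90 bp -> 83 bp."""
    return read[INDEX_LENGTH + LIGATION_BASES:]


def low_quality_fraction(quality: str, offset: int = 33) -> float:
    """Fraction of bases whose quality value is at or below LOW_QUALITY_MAX."""
    if not quality:
        return 0.0
    low = sum(1 for ch in quality if ord(ch) - offset <= LOW_QUALITY_MAX)
    return low / len(quality)


def keep_read(sequence: str, quality: str, adapter: str, offset: int = 33) -> bool:
    """Keep a read unless it contains the adapter or is mostly low quality."""
    if adapter and adapter in sequence:   # simplistic stand-in for adapter detection
        return False
    return low_quality_fraction(quality, offset) <= MAX_LOW_QUALITY_FRACTION


raw_read = "A" * 90
print(len(strip_identifier(raw_read)))   # 90 - 6 - 1
```

A production filter would normally tolerate mismatches when searching for adapters and would read the quality encoding from the sequencing run rather than assume it.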

Data generation and analyses

Read alignment and variant identification

The clean reads were mapped to the temperate japonica Nipponbare reference genome – the unified-build release Os-Nipponbare-Reference-IRGSP-1.0         (IRGSP-1.0) [13], using the BWA software with default parameters except for “aln -m 10000 -o 1 -e         10 -t 4”. The alignment results were then merged and indexed as BAM files [14,15]. SNP calling was based on alignment using the Genome Analysis Toolkit 2.0-35 (GATK)         and Picard package V1.71 [16]. To minimize the number of mismatched bases for SNP and InDel calling, all reads         from each accession were further cleaned by:
(1) deleting the reads that are unmapped to the reference in the alignment result;
(2) deleting duplicate reads;
(3) conducting alignment by the IndelRealigner package in GATK; and
(4) recalibrating realignments using the BaseRecalibrator package in GATK.
SNP and InDel calling for each sample were performed independently using the UnifiedGenotyper         package in GATK with a minimum phred-scaled confidence threshold of 50, and a minimum         phred-scaled confidence threshold for emitting variants at 10. To ensure the quality         of variant calling, the conditions for every site in a genome were set at >20 for         mapping quality, >50 for variant quality and >2 for the number of supporting reads         for every base.
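The per-site quality conditions quoted above (mapping quality above 20, variant quality above 50, more than 2 supporting reads) amount to a simple conjunction of thresholds. The sketch below shows that logic on a hypothetical `Site` record; it does not parse real GATK output, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; each comparison is strict ("greater than").
MIN_MAPPING_QUALITY = 20
MIN_VARIANT_QUALITY = 50
MIN_SUPPORTING_READS = 2


@dataclass
class Site:
    """Hypothetical summary of one candidate variant site in one sample."""
    mapping_quality: float
    variant_quality: float
    supporting_reads: int


def keep_site(site: Site) -> bool:
    """Return True only if the site passes all three thresholds."""
    return (
        site.mapping_quality > MIN_MAPPING_QUALITY
        and site.variant_quality > MIN_VARIANT_QUALITY
        and site.supporting_reads > MIN_SUPPORTING_READS
    )


def filter_sites(sites):
    """Keep the sites that pass every threshold, preserving input order."""
    return [site for site in sites if keep_site(site)]


candidates = [Site(60, 120, 8), Site(20, 120, 8), Site(60, 45, 8), Site(60, 120, 2)]
print(len(filter_sites(candidates)))   # only the first candidate passes
```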
SNP and InDel calling at the population level (i.e., for all sequenced genomes concurrently)         was performed using the UnifiedGenotyper package in the GATK pipeline with 50 for         the minimum phred-scaled confidence threshold for variant calling, 30 for the minimum         phred-scaled confidence threshold for variant emitting, >20 for the mapping quality,         MAF >0.001 for every SNP, and >2 sequence depth for genotypes in every sample. Five         independent, randomly selected sets of 200,000 SNPs with minimum missing data were         then selected for phylogenetic analysis.
For each of these five sets, distance matrices using the p-distances model were calculated,         and Neighbor Joining trees were constructed with 1,000 bootstraps using the TreeBeST         software [17]. Consensus trees were exported as Newick format and imported into DarWIN v5.0.158         for topology visualization [18]. For each of the five consensus trees, prior information on variety group designation         (based on SSR or isozyme classification) was used to define assignment to one of the         five groups – indica, aus/boro, basmati/sadri, japonica (tropical or temperate). Groupings assigned for each of the five trees were compared         using a majority rule criterion (i.e., a minimum of three trees to support the assignment).         Those accessions that failed this test were labeled as intermediate types.
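Two ideas in the paragraph above lend themselves to a short sketch: the p-distance between two accessions (the proportion of compared sites at which they differ) and the majority-rule assignment across the five trees. The code below illustrates both on toy data. It does not reproduce what TreeBeST or DarWIN do internally; skipping sites with missing data (marked `N` here) and the function names are assumptions.

```python
from collections import Counter


def p_distance(a: str, b: str, missing: str = "N") -> float:
    """Proportion of comparable sites at which two equal-length genotype strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Assumption: sites with missing data in either sequence are skipped.
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def majority_group(assignments, min_support: int = 3) -> str:
    """Return the group supported by at least min_support trees, else 'intermediate'."""
    group, count = Counter(assignments).most_common(1)[0]
    return group if count >= min_support else "intermediate"


print(p_distance("ACGTACGT", "ACGTACGA"))   # 1 difference over 8 sites
print(majority_group(["indica", "indica", "indica", "aus/boro", "aus/boro"]))
print(majority_group(["indica", "indica", "aus/boro", "aus/boro", "japonica"]))
```

With five trees and a threshold of three, at most one group can reach the threshold, so the rule never has to break a tie.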

Findings

Using IRGSP-1.0 as the reference, the 3,000 sequenced genomes had an average depth         of ~14×, ranging from ~4× to greater than 60×, and yielded a combined total of approximately         17 TB of high quality sequence data. Of the 3,000 entries, 2,322 accessions had >10×         sequence depths. When aligned with IRGSP-1.0 using the BWA software, the average genome         coverage and mapping rate were 94.0% and 92.5%, respectively. BWA alignment followed         by variant calling using GATK identified approximately 18.9 million single nucleotide         polymorphisms (SNPs) (Table 1). The distribution of the identified SNPs across different chromosomes varies considerably,         with chromosomes 4, 1 and 11 having the highest numbers of SNPs and chromosomes 9,         10 and 5 having the lowest. Most SNPs were detected in intergenic regions and introns,         based on comparison with gene annotations provided by MSU v7 [13,19]. Only 18.24% of the detected SNPs occur in exons, of which ~40% are synonymous.

Table 1. Characteristics of the single nucleotide polymorphisms (SNPs) identified in the 3,000 rice genomes when aligned to the reference japonica Nipponbare genome IRGSP1.0

The phylogenetic analyses revealed clear differentiation of the 3,000 accessions into         two major groups – indica and japonica, two small varietal groups – the aus/boro and basmati/sadri types, plus a small group         (134) of intermediate (admixed) types (Figure 2). The indica group represented the largest and most diverse group comprising 1,760 (58.2%) accessions         in five major subgroups of diverse origins. The japonica group contains 843 (27.9%) accessions, which had two well-differentiated subgroups         – 388 temperate japonicas and 455 tropical japonicas. The aus/boro group is composed of 215 accessions and is more closely related to         indica, while the aromatic basmati/sadri group is more closely related to japonica and consists of 68 accessions primarily from South Asia.

Figure 2. Classification of 3,000 rice accessions into five distinct varietal groups based on five sets of 200,000 randomly selected SNPs from the 18.9 million discovered SNP variants.

Availability and requirements

Data availability

The sequencing data of the 3,000 rice genomes project (3K RGP) is now deposited in the GigaScience database (GigaDB) and has a citable digital object identifier (DOI) [20]. The dataset consists of separate directories for sequences from each of the 3,000 rice genomes. These directories are named by the DNA_UNIQUE_IDs given in Additional file 1: Tables S1A and S1B. If the DNA_UNIQUE_ID contains a space, the space is replaced by an underscore. Each directory contains from 12 to 40 Fastq (fq) files of trimmed, filtered reads that are compressed using GNU zip (gzip, .gz). The dataset consists of about 15.4 terabytes (TB) of files. Individual data files can be downloaded using tools such as File Transfer Protocol (FTP). Obtaining the complete dataset by FTP is not practical because of the transfer time and bandwidth required; other tools will be needed.
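The directory layout described above can be navigated with a few helper functions. The sketch below assumes a local copy of part of the dataset under some `root` directory and assumes the files end in `.fq.gz`; the exact file names in GigaDB are not given in the text, and the sample ID used in the example is made up.

```python
import gzip
from pathlib import Path


def directory_name(dna_unique_id: str) -> str:
    """Map a DNA_UNIQUE_ID to its directory name (spaces become underscores)."""
    return dna_unique_id.replace(" ", "_")


def list_fastq_files(root: Path, dna_unique_id: str):
    """List the gzip-compressed FASTQ files for one accession (assumed suffix: .fq.gz)."""
    sample_dir = root / directory_name(dna_unique_id)
    return sorted(sample_dir.glob("*.fq.gz"))


def count_reads(fastq_gz: Path) -> int:
    """Count records in a gzip-compressed FASTQ file (four lines per record)."""
    with gzip.open(fastq_gz, "rt") as handle:
        lines = sum(1 for _ in handle)
    return lines // 4


print(directory_name("SAMPLE 001"))   # hypothetical ID, shown only to illustrate the rule
```

Reading the files through `gzip.open` avoids decompressing them to disk, which matters for a dataset of this size.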
Dataset name: The 3,000 rice genomes project data
Operating system: Platform-independent, UNIX/Linux preferred
License: Creative Commons 0 (CC0) public domain dedication (https://creativecommons.org/publicdomain/zero/1.0)

Data requirements

After the data are downloaded or otherwise acquired, requirements depend on the task: from 8 GB (reference-guided alignment and variant calling) to 16 GB (de novo genome assembly) or more of main memory, and from 16 to 64 GB or more of swap space allocated for each pipeline. Computation will require from 7 hours (alignment and calling) to 3 days (assembly) per core per pipeline.

Discussion

This 3,000 rice genomes dataset provides an unprecedented resource for rice genomic         research. With access to the genome sequences of the 3,000 accessions representing         various varietal types of diverse origins and availability of additional high-quality         rice reference genomes, further comparisons can be made among the 3,000 genomes and         reference genomes of different rice types. These analyses are expected to uncover         the within-species diversity and genome-level population structure of O. sativa in great detail. Thus, we hope that this data note will be the beginning of a new         round of accelerated discoveries in rice science. Here, we would like to call for         an international effort to analyze and mine the dataset. The expected information         explosion from follow-up studies of the project will provide a foundation to revolutionize         rice genetics and breeding research. Ultimately, this could lead to a more thorough         understanding of the molecular, cellular and physiological machineries/networks responsible         for the growth and development of rice plants and their responses to various abiotic         and biotic stresses.
This data note is accompanied by a ‘Commentary’ article, where the intent and plans         for the projected uses of the 3,000 rice genomes dataset are further expanded [21]. Through the public release of this dataset, we encourage the global science community         to analyze the data and to contribute in building a public rice genetic/genomic database         and information platform that will accelerate rice breeding.

Availability of supporting data

The data set supporting the results of this article is available in the GigaScience GigaDB Database [20]. Information on SNP variants will be available on analysis of the population-level         genome diversity of the 3,000 rice genomes. Raw sequence data is also available from         the SRA at PRJEB6180.

The 3,000 rice genomes project: participants and affiliations

Participants by institute

CAAS¹

Zhikang Li* Email: zhkli1953@126.com or lizhikang@caas.cn
Bin-Ying Fu Email: fubinying@caas.cn
Yong-Ming Gao Email: gaoyongming@caas.cn
Wen-Sheng Wang Email: wangwensheng02@caas.cn
Jian-Long Xu Email: xujianlong@caas.cn
Fan Zhang Email: zhangfan03@caas.cn
Xiu-Qing Zhao Email: zhaoxiuqing@caas.cn
Tian-Qing Zheng Email: zhentainaqing@caas.cn
Yong-Li Zhou Email: zhouyongli@caas.cn

BGI²

Gengyun Zhang* Email: zhanggengyun@genomics.cn
Shuaishuai Tai Email: taishuaishuai@genomics.org.cn
Jiabao Xu Email: xujiabao@genomics.org.cn
Wushu Hu Email: huwushu@genomics.org.cn
Ming Yang Email: yangming@genomics.org.cn
Yongchao Niu Email: niuyongchao@genomics.org.cn
Miao Wang Email: wangmiao@genomics.org.cn
Yanhong Li Email: liyanhong@genomics.org.cn
Lianle Bian Email: bianlianle@genomics.org.cn
Xuelian Han Email: hanxuelian@genomics.org.cn
Xin Liu Email: liuxin@genomics.org.cn
Bo Wang Email: wangbo@genomics.org.cn

IRRI³

Kenneth L. McNally* Email: k.mcnally@irri.org
Ma. Elizabeth B. Naredo Email: e.naredo@irri.org
Sheila Mae Q. Mercado Email: s.mercado@irri.org
Myla Christy Rellosa Email: m.rellosa@irri.org
Renato A. Reaño Email: r.reano@irri.org
Grace Lee S. Capilit Email: g.capilit@irri.org
Flora C. de Guzman Email: f.deguzman@irri.org
Jauhar Ali Email: j.ali@irri.org
N. Ruaraidh Sackville Hamilton Email: r.hamilton@irri.org
Ramil P. Mauleon Email: r.mauleon@irri.org
Nickolai N. Alexandrov Email: n.alexandrov@irri.org
Hei Leung Email: h.leung@irri.org

Abbreviations

3K RGP: 3,000 rice genomes project; BGI: Beijing Genomics Institute Shenzhen; CAAS:         Chinese Academy of Agricultural Sciences; Cirad: Centre de Coopération Internationale         en Recherche Agronomique pour le Développement; CNCGB: China National Crop Gene Bank;         GATK: Genome Analysis Toolkit; IRGC: International Rice Genebank Collection; IRRI:         International Rice Research Institute; NGS: Next generation sequencing.

Exome Sequencing Uncovers Genes Behind Rare Diseases

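Researchers at the Finding of Rare Disease Genes (FORGE) project performed whole-exome sequencing on 246 patients with rare diseases and identified 146 disease-causing mutation sites and 67 abnormal genes. They published the findings in The American Journal of Human Genetics.

Four institutions took part in the whole-exome sequencing work for the project: The Centre for Applied Genomics in Toronto, a genome research centre in Vancouver, McGill University, and a genome discovery centre in Quebec. The researchers first captured the exons in the genome with Agilent's SureSelect target enrichment system, then sequenced them on the Illumina HiSeq 2000 to obtain the exome sequence.

“Most of the disease-causing mutations are associated with the clinically diagnosed disease, and these mutations can explain some of the clinical features in depth,” said Professor Kym Boycott, lead scientist of FORGE at the Children's Hospital of Eastern Ontario, who was very surprised by the finding.

FORGE has now begun collaborative research with CARE for RARE, another international rare-disease research effort. The project will continue to use whole-exome sequencing to find the genes that cause rare diseases and to develop corresponding treatments.

Professor Boycott said in a statement: “When we set up the project, we predicted that on completion it would explain or solve 50 rare disorders, but we can now explain 150.”

In 2011 the Canadian government provided $4.6 million to fund FORGE and the Canadian Pediatric Cancer Genome Consortium to carry out the project jointly. The project had two aims: first, to study and understand the causes of rare diseases; second, to build a publicly shared database of rare-disease genes, and to use it to improve the analysis of such diseases and help develop targeted treatment plans.

Chandree Beaulieu, FORGE's manager and first author of the paper, said in a statement: “For the people taking part in this project, the rewards are many. We return all of these sequencing results to the participating families and never lock the information away in laboratories and databases. That approach has been a great encouragement to the whole research team.”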

Finding of Rare Disease Genes in Canada (FORGE Canada)

Lead Investigator(s):

Kym Boycott, Jacques Michaud & Jan Friedman

Funding:

$2.9 Million

Institution:

Children’s Hospital of Eastern Ontario Research Institute

Start Date:

April 1, 2011

End Date:

September 30, 2012

Genetic diseases in children, while often rare, have, in aggregate, an enormous impact on the well-being of Canadian families. Surprisingly, the majority of genes causing these conditions are still unknown. FORGE Canada (Finding of Rare Disease Genes) is a national consortium of clinicians and scientists using next-generation sequencing technology to identify genes responsible for a wide spectrum of rare pediatric-onset disorders present in the Canadian population.
The Consortium brings together clinicians from all 21 Clinical Genetics Centres representing every province and internationally-recognized Canadian scientists with expertise in gene identification, with the infrastructure of the Genome Canada Science and Technology (GC S&T) Innovation Centres. International collaborations have been established with clinicians in 16 countries. Two nation-wide requests for proposals have resulted in 175 disorders that met FORGE criteria; 70 of these rare disorders have been selected for study over the 18 months of this project. These disorders range from those affecting single families, to disorders with 20+ patients from across Canada and internationally recruited through the FORGE network. Twenty of these disorders were prioritized for analysis in the first quarter; 9 genes have been identified and analysis is still underway for the remaining 11 disorders. We are establishing a national data coordination centre to streamline and improve existing large-scale sequence analysis tools and our GE3LS team is working toward national ethical guidelines for analyzing sequence data from entire genomes and for sharing results with families.
Gene discoveries made by the FORGE Canada Consortium will have immediate and long-term benefits for the health of Canadians through translation to diagnostic tests, including the development of new methodologies and algorithms for the use of this technology. Within the first three months, we have identified 9 genes; 6 of these are novel genes that were previously not linked to human disease thereby providing insight into the molecular pathogenesis of these disorders. Successful completion of the activities of the FORGE Canada project will yield a coordinated and sustainable Consortium focused on the investigation of the genetic basis of human disease.

OGI supports the development and maintenance of high-impact, publicly available resources emerging from genomics research projects in Ontario. These resources include technology platforms, databases, software, reagents and libraries. Our aim is to provide Ontario researchers with access to leading-edge, enabling technologies and to maintain domestic resources that can aid genomics research around the world.
Technology Platforms
The Centre for Applied Genomics (TCAG)
TCAG provides genomics services to researchers in academic, government, and private sectors all over the world.
Databases
Autism Chromosome Rearrangement Database
This resource consists of hand-curated breakpoints and other genomic features relating to autism that derive from publicly available literature, databases, and unpublished data. It undergoes continuous updating with data from in-house experiments and published research. It welcomes data and feedback from the research community.
Barcode of Life Data Systems (BOLD)
BOLD is an accessible database that aids in collection, management, analysis, dissemination, and searching of DNA barcodes. It is the definitive global DNA barcode database – created and maintained in Ontario, with researchers from over 25 countries contributing DNA samples. It already contains barcode sequences for over 50,000 species. Approximately three quarters of those have been added by Ontario researchers.
BOLD consists of three components: BOLD-MAS (a repository for DNA barcode records and analytical tools), BOLD-IDS (a species-identification tool that determines taxonomic assignment when possible based on submitted DNA sequences), and BOLD-ECS (for web developers and bioinformaticians to build tools and workflows that can become part of the BOLD framework).
Chromosome 7 Annotation Project
This resource comprises a collection of sequence, gene, and other annotations from all databases (e.g., Celera published, Ensembl, NCBI, RIKEN, and UCSC) as well as unpublished data.
Cystic Fibrosis Mutation Database
This database is a collection of mutations in the cystic fibrosis transmembrane conductance regulator (CFTR) gene that acts as a resource for CF research everywhere. It currently contains more than 1,500 mutations and provides information about individual CFTR mutations and their related phenotypes. This database is augmented and maintained by a research team funded by Genome Canada through OGI.
Database of Genomic Variants (DGV)
DGV provides a comprehensive summary of structural variation resulting from alterations in the human genome. These changes involve segments of DNA larger than 1kb and insertions and deletions in the range 100bp-1kb. The DGV welcomes data on structural variation in the genome from scientific manuscripts.
Human Genome Segmental Duplication Database
This website contains information about segmental duplications in the human genome. The data come from analysis of the May 2004 Assembly of the Human Genome (also known as NCBI Build 35, or UCSC hg17).
Interologous Interaction Database (I2D)
I2D (formerly OPHID) is an on-line resource for exploring known and predicted mammalian and eukaryotic protein–protein interactions. It contains data for more than 430,000 protein interactions in humans and model organisms (fly, mouse, rat, worm, and yeast).
The Dynactome project has contributed more than 4,100 protein interactions to I2D, which has led to 8,880 more interactions through mapping to other organisms in the database. Further, by exploring texts, reviewing literature, and incorporating other high-throughput data sets, the project has given I2D a further 26,210 interactions and 56,810 interlogs.
Non-Human Segmental Duplication Database
This site contains information about segmental duplications in the genomes of chimpanzee, mouse, and rat.
StemBase
This is a publicly available database of Affymetrix DNA microarray and serial analysis of gene expression (SAGE) expression data from samples of human and mouse stem cells and their derivatives.
Structural Genomics Consortium (SGC) Materials and Methods
SGC is a not-for-profit organization that analyses the three-dimensional structure of proteins. It deposits structures (on average, 200 per year) in the Protein Data Bank (PDB), which releases them into the public domain and makes them freely accessible.
To access the PDB from the SGC homepage click on the “Structures” tab. From the “Structure Gallery” either search or scroll down for your protein of interest. The SGC entry will indicate the “PDB Code,” which will transfer you to the PDB entry, and at the bottom of the page there will be a link to the SGC structure file with the same PDB code. Following this link provides basic background information on each target and the analysis of its structure. There is also a link to reagents for the structure, as well as one to a detailed description of the experimental materials and methods that generated the structure. For some protein structures, an associated iSee data pack provides an animated interpretation of the structure and tabs that include protocols. Alternatively, access the PDB and select the tabs for “Materials & Methods” and “Biology & Chemistry.” There you will find purification and crystallization protocols, diffraction data, and other details about your selected structure.
Toronto Yeast Interaction Database and Toronto Yeast Pathway Database
These resources consolidate publicly available data and feature web-services interfaces that the Yeast Integrative Biology project maintains.
Software
GeneMANIA
GeneMANIA is a comprehensive web-based genomic and data analysis tool intended for simple gene function prediction. The GeneMANIA data warehouse includes over 160 million interactions, from more than 130,000 genes, from six different organisms. GeneMANIA is freely available as open-source software. Additionally, a free online tutorial can be found on the OpenHelix website (http://www.openhelix.com/genemania). GeneMANIA was developed and is being maintained by a research team funded by Genome Canada through OGI.
Cytoscape Web
Cytoscape Web is an online interface that can be used with GeneMANIA to visualize the composite networks that are associated with a set of input genes. This tool is freely available as open-source software and was developed by a research team funded by Genome Canada through OGI. Cytoscape Web is now actively developed as part of the open source Cytoscape project.
Automated Splice Site Analyses
A web-based software tool for the prediction of the effects of sequence changes that alter mRNA splicing in human disease. This tool is used by researchers across Canada and worldwide, resulting in more than 130 citations to date.
eFISH (electronic fluorescence in situ hybridization)
eFISH is a BLAST-based program that facilitates the choice of appropriate clones for FISH and CGH experiments, as well as interpretation of results in which genomic DNA probes are used in hybridization-based experiments.
Network Analysis, Visualization & Graphing TORonto (NAViGaTOR)
NAViGaTOR is a software package for visualization and analysis of protein-protein interaction networks in two or three dimensions (2D or 3D, respectively). It is downloadable free of charge for academic and not-for-profit institutions.
WaterEngage
This on-line global community connects businesses, organizations, scientists, water activists, and young people. It informs and engages youths and the broader public on global water issues and their effects on health. It addresses many issues, including ways in which emerging nanotechnology and biotechnology applications can address waterborne and water-related diseases.
Reagents and Libraries
North American Conditional Mouse Mutagenesis (NorCOMM)
NorCOMM develops and distributes a library of lines of mouse embryonic stem (ES) cells that carry single conditional-knockout mutations across the mouse genome. ES cells that it develops become publicly available on a cost-recovery basis. NorCOMM also provides services across Canada in archiving, derivation, genotyping, and phenotyping of mouse ES cells.

Nature重大成果:首张人体蛋白详细图谱

说起第一张人类基因组图谱,那已经是十几年前的事了,当年科学家们破解了24对染色体之后,满以为就此人体的奥秘可以破解了,然而时间过去了许多年,我们对于人体的功能,发育的过程,疾病的发生依然一知半解,这其中的原因也许就在于蛋白的奥秘还蒙着一层厚厚的面纱。

5月底,来自慕尼黑工业大学的一组研究人员编撰了18000多个蛋白,绘制了一张几乎完整的人类蛋白质组图谱,目前这一图谱相关信息可免费获取(ProteomicsDB database,https://www.proteomicsdb.org/)——由慕尼黑工业大学与软件公司SAP共同开发。其中包含了不同细胞,组织,以及体液中多种蛋白类型,蛋白分布和蛋白丰度的相关信息。

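The study shows that, on the one hand, upwards of ten thousand proteins take part in basic functional processes across many different parts of the body; on the other hand, the protein profile of each organ is unique and functionally specific.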

The study also relied on two high-throughput technologies to make the work possible: mass spectrometry and in-memory computing.

From DNA blueprint to protein: RNA determines the copy number

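How does a protein arise from a gene? The answer lies in RNA. In several steps, the DNA blueprint is transcribed into RNA copies; these messenger RNA (mRNA) molecules then serve as templates for protein production, directing the assembly of amino acids into proteins.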

In this study, the researchers found that each mRNA determines the amount of protein the cell produces.

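This "copying key" is specific to each protein. "It seems that every mRNA molecule knows how many units of its protein to produce: whether 10, 100, or 1,000 copies," explained Professor Bernhard Küster, Chair of Proteomics and Bioanalytics at the Technical University of Munich.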

"Now that we know this ratio for many proteins, we can infer protein abundance in almost every tissue from mRNA abundance, and vice versa."
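
The ratio idea in the quote can be sketched in a few lines. This is only an illustration, assuming that each gene has a roughly constant protein-to-mRNA ratio and that the ratio is estimated as the median across tissues; the function names and all abundance values are hypothetical and are not taken from the paper or from ProteomicsDB.

```python
from statistics import median

def estimate_ratio(mrna, protein):
    """Median protein/mRNA ratio for one gene over tissues measured in both datasets."""
    ratios = [protein[t] / mrna[t] for t in mrna if t in protein and mrna[t] > 0]
    if not ratios:
        raise ValueError("no tissue with both mRNA and protein measurements")
    return median(ratios)

def predict_protein(ratio, mrna_level):
    """Infer protein abundance from mRNA abundance using the gene's ratio."""
    return ratio * mrna_level

def predict_mrna(ratio, protein_level):
    """The reverse direction: infer mRNA abundance from protein abundance."""
    return protein_level / ratio

# Hypothetical abundances (arbitrary units) for a single gene in three tissues.
mrna = {"liver": 10.0, "lung": 20.0, "brain": 5.0}
protein = {"liver": 1000.0, "lung": 2200.0, "brain": 450.0}

ratio = estimate_ratio(mrna, protein)
print(ratio, predict_protein(ratio, 8.0), predict_mrna(ratio, 800.0))
```

Using the median rather than the mean is a design choice made here so that a single outlier tissue does not dominate the estimate.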

New genes, old genes

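To the researchers' surprise, they found hundreds of protein fragments encoded by DNA outside known genes. These newly discovered proteins may have novel biological properties and functions, but their relevance is not yet clear.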

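In addition, the researchers have so far been unable to find about 2,000 proteins that, according to the genome map, should exist. Some of these may appear only during embryonic development, but many known genes seem to have lost their function. This particularly affects olfactory receptors, which may mean that modern humans no longer depend on a sophisticated sense of smell to survive.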

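"Here we may be watching evolution at work," said Küster. "The human body discards some redundant genes while bringing some new ones into use." If so, it may never be possible to determine exactly how many proteins the human body contains.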

Protein maps can be used to predict drug sensitivity

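Earlier research showed that specific protein patterns can be used to predict the efficacy of a drug. In this latest study, the scientists analyzed 24 cancer drugs whose effects on 35 tumor cell lines correlated strongly with the cell lines' protein profiles.

Küster said: "This brings us a step closer to individualized treatment for patients. If we know the protein profile of a tumor in detail, drugs can be administered in a more targeted way. The new insights from this study will also help medical researchers develop combination drugs for individualized therapy."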
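
As a rough illustration of what "correlating" a protein profile with drug response means, the sketch below computes a Pearson correlation between the abundance of one protein and the sensitivity to one drug across several cell lines. The choice of Pearson correlation, the function name, and all numbers are assumptions made for this example; the study's actual statistical method is not described in this post.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)

# Hypothetical values: one protein's abundance and one drug's sensitivity
# score measured in the same five cell lines (arbitrary units).
protein_abundance = [1.0, 2.0, 3.0, 4.0, 5.0]
drug_sensitivity = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(protein_abundance, drug_sensitivity)
print(f"r = {r:.3f}")
```

A protein whose abundance tracks sensitivity this closely would be a candidate marker, although a strong correlation in a few cell lines does not by itself establish a predictive relationship.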

 

Original papers:

A draft map of the human proteome

The availability of human genome sequence has transformed biomedical research over the past decade. However, an equivalent map for the human proteome with direct measurements of proteins and peptides does not exist yet. Here we present a draft map of the human proteome using high-resolution Fourier-transform mass spectrometry. In-depth proteomic profiling of 30 histologically normal human samples, including 17 adult tissues, 7 fetal tissues and 6 purified primary haematopoietic cells, resulted in identification of proteins encoded by 17,294 genes accounting for approximately 84% of the total annotated protein-coding genes in humans. A unique and comprehensive strategy for proteogenomic analysis enabled us to discover a number of novel protein-coding regions, which includes translated pseudogenes, non-coding RNAs and upstream open reading frames. This large human proteome catalogue (available as an interactive web-based resource at http://www.humanproteomemap.org) will complement available human genome and transcriptome data to accelerate biomedical research in health and disease.

Mass-spectrometry-based draft of the human proteome
Proteomes are characterized by large protein-abundance differences, cell-type- and time-dependent expression patterns and post-translational modifications, all of which carry biological information that is not accessible by genomics or transcriptomics. Here we present a mass-spectrometry-based draft of the human proteome and a public, high-performance, in-memory database for real-time analysis of terabytes of big data, called ProteomicsDB. The information assembled from human tissues, cell lines and body fluids enabled estimation of the size of the protein-coding genome, and identified organ-specific proteins and a large number of translated lincRNAs (long intergenic non-coding RNAs). Analysis of messenger RNA and protein-expression profiles of human tissues revealed conserved control of protein abundance, and integration of drug-sensitivity data enabled the identification of proteins predicting resistance or sensitivity. The proteome profiles also hold considerable promise for analysing the composition and stoichiometry of protein complexes. ProteomicsDB thus enables navigation of proteomes, provides biological insight and fosters the development of proteomic technology.

NA12891 and NA12878 vcf database

NA12878 (child), NA12891 (father), and NA12892 (mother)

########### NA12891 and NA12878 vcf database

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120117_ceu_trio_b37_decoy/

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140123_NA12878_Illumina_Platinum/

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/NA12891/exome_alignment/

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/trio/snps
##########

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/trio/snps/
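
One common use of trio call sets like these is a Mendelian consistency check: at each site the child should carry one allele from each parent. The sketch below works on VCF `GT` strings only and assumes diploid, autosomal genotypes; the function names are hypothetical, and reading the files and picking out the sample columns is left out.

```python
def parse_gt(gt):
    """Turn a VCF GT string such as '0/1' or '1|0' into a tuple of allele indices.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return tuple(int(a) for a in alleles)

def mendelian_consistent(child, father, mother):
    """True/False for a diploid site, or None if any genotype is missing."""
    c, f, m = parse_gt(child), parse_gt(father), parse_gt(mother)
    if c is None or f is None or m is None:
        return None
    a, b = c  # assumes exactly two alleles per genotype (diploid)
    # One child allele must come from the father and the other from the mother.
    return (a in f and b in m) or (a in m and b in f)

# Genotypes in the order child (NA12878), father (NA12891), mother (NA12892).
print(mendelian_consistent("0/1", "0/0", "1/1"))  # consistent
print(mendelian_consistent("1/1", "0/0", "0/1"))  # violation: father has no alt allele
```

Sites flagged as violations are usually genotyping errors rather than de novo mutations, so in practice they are often used as a quality metric for a call set.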

 

 

ChIPBase: an extremely handy epigenetics database

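If you want to study how lncRNAs, miRNAs, ChIP-Seq data and the like relate to one another, ChIPBase is an excellent database. It was built by QuLab, the lab of a leading researcher at Sun Yat-sen University, and it collects nearly all mainstream high-throughput sequencing data. The figure below shows what it contains.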

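The most practical use case: you know a transcription factor and want to see the downstream target genes it regulates. That is hard to find in the UCSC Genome Browser, but easy here, because ChIPBase has already done the mapping.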

Here is the database's own introduction:

MicroRNAs (miRNAs) and long non-coding RNAs (lncRNAs) represent two classes of important non-coding RNAs in eukaryotes. Although these non-coding RNAs have been implicated in organismal development and in various human diseases, surprisingly little is known about their transcriptional regulation. Recent advances in chromatin immunoprecipitation with next-generation DNA sequencing (ChIP-Seq) have provided methods of detecting transcription factor binding sites (TFBSs) with unprecedented sensitivity. In this study, we describe ChIPBase (http://deepbase.sysu.edu.cn/chipbase/), a novel database that we have developed to facilitate the comprehensive annotation and discovery of transcription factor binding maps and transcriptional regulatory relationships of miRNAs and lncRNAs from ChIP-Seq data.

 

The current release of ChIPBase includes high-throughput sequencing data that were generated by 543 ChIP-Seq experiments in diverse tissues and cell lines from six organisms. By analysing millions of TFBSs, we identified tens of thousands of TF-lncRNA and TF-miRNA regulatory relationships. Furthermore, we constructed TF->miRNA->mRNAs regulatory networks by integrating CLIP-Seq data and ChIP-Seq data. In addition, we constructed expression profiles of human lncRNAs and mRNAs from RNA-Seq data from 22 normal tissues.
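
The TF->miRNA->mRNA networks mentioned above amount to chaining two edge sets. The sketch below assumes that ChIP-Seq supplies the TF-to-miRNA edges and CLIP-Seq supplies the miRNA-to-mRNA edges; the function name and every identifier in the sample data are placeholders, not records from ChIPBase.

```python
from collections import defaultdict

def build_cascade(tf_to_mirna, mirna_to_mrna):
    """Chain TF->miRNA edges with miRNA->mRNA edges.

    tf_to_mirna: iterable of (tf, mirna) pairs.
    mirna_to_mrna: dict mapping each miRNA to a set of target mRNAs.
    Returns a dict: tf -> set of (mirna, mrna) paths.
    """
    cascade = defaultdict(set)
    for tf, mirna in tf_to_mirna:
        # miRNAs without any known target contribute no path.
        for mrna in mirna_to_mrna.get(mirna, ()):
            cascade[tf].add((mirna, mrna))
    return dict(cascade)

# Placeholder edges for illustration only.
tf_to_mirna = [("TF_A", "miR-x"), ("TF_A", "miR-y"), ("TF_B", "miR-y"), ("TF_C", "miR-z")]
mirna_to_mrna = {"miR-x": {"gene1"}, "miR-y": {"gene2", "gene3"}}

cascade = build_cascade(tf_to_mirna, mirna_to_mrna)
for tf in sorted(cascade):
    print(tf, sorted(cascade[tf]))
```

Keeping the intermediate miRNA in each path, rather than collapsing to TF-to-mRNA pairs, preserves which regulator mediates each indirect effect.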

selleckchem metabolic pathway and kinase inhibitor database

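selleckchem: kinase inhibitors, tyrosine kinase inhibitors, enzyme inhibitors, protein inhibitors, protein kinase inhibitors, small molecules, phosphatase inhibitors
Download the metabolic pathway and inhibitor database: http://www.gene-seq.com/biodownload/inhibitor_pathway/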

Inhibitor categories