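C#, Bioinformatics Software in Practice (06): A Detailed Introduction to the GenBank DNA Database File Format, with Complete C# Source Code for a Parser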

news/2024/10/22 15:33:38/

1 GenBank 

1.1 NCBI: the National Center for Biotechnology Information


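        NCBI (the National Center for Biotechnology Information) is a division of the National Library of Medicine (NLM) at the NIH. Its mission comprises four tasks: 1. creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; 2. researching advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules and compounds; 3. facilitating the use of databases and software by biotechnology researchers and medical care personnel; 4. coordinating efforts to gather biotechnology information worldwide. The NCBI databases include Nucleotide (nucleotide sequences), Genome (genomes), Structure (the structure database, also called the molecular modeling database), Taxonomy (biological classification), and PopSet.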

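        The National Center for Biotechnology Information was established in 1988 at the US National Institutes of Health (NIH). The original aim was to give molecular biologists a system for storing and processing information. Besides maintaining the GenBank nucleic acid sequence database (whose data also come from the other major DNA databases worldwide, including the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory database (EMBL), and several other well-known research institutions), NCBI provides many powerful data retrieval and analysis tools. NCBI currently offers 36 resources in total, including Entrez, Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), and Electronic PCR. All of them are linked from the NCBI home page, www.ncbi.nlm.nih.gov, and more than half of them grew out of BLAST.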

1.2 The GenBank DNA Database


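        GenBank is the DNA sequence database built by the National Center for Biotechnology Information (NCBI). It obtains sequence data from public sources, mainly direct submissions from researchers and large-scale genome sequencing projects (Benson et al., 1998). To keep the data as complete as possible, GenBank exchanges data with EMBL (the European EMBL DNA database) and DDBJ (the DNA Data Bank of Japan).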


        The GenBank file is the main bioinformatics format supported by NCBI. Once you can read GenBank files, the EMBL format is straightforward.

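        The GenBank format is one of the oldest bioinformatics data formats. It was originally devised to bridge the gap between representations that humans can read and representations that computers can process efficiently. It is optimized for human reading and is poorly suited to large-scale data processing. The format is fixed-width: the first ten characters of a line form a column that holds an identifier (the keyword), and the rest of the line holds the information that belongs to that identifier. In practice the content starts at column 13, which is why the parser in section 3 splits lines at offset 12.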
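The fixed-width layout described above can be sketched in a few lines of C#. This is a minimal illustration, not part of the original parser: the class name `FlatFileLineSketch` and the method `Split` are hypothetical, the default width of 12 is taken from the offset used by the parser in section 3, and the sample line uses U49845 purely as an illustrative accession.

```csharp
using System;

// Hypothetical helper: splits one flat-file line into keyword and content.
public static class FlatFileLineSketch
{
    // keywordWidth = 12 is an assumption borrowed from the parser in section 3:
    // the keyword sits at the start of the line and content begins at column 13.
    public static (string Keyword, string Content) Split(string line, int keywordWidth = 12)
    {
        // Short lines (for example "ORIGIN" or "//") carry a keyword only.
        if (line.Length <= keywordWidth)
        {
            return (line.Trim(), "");
        }
        return (line.Substring(0, keywordWidth).Trim(),
                line.Substring(keywordWidth).Trim());
    }
}
```

A line such as `ACCESSION   U49845` would split into the keyword `ACCESSION` and the content `U49845`, while a continuation line that starts with twelve spaces yields an empty keyword, which is how a reader can tell that it continues the previous field.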

2 GenBank Overview

What is GenBank?
GenBank ® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.

A GenBank release occurs every two months and is available from the ftp site. The release notes for the current version of GenBank provide detailed information about the release and notifications of upcoming changes to GenBank. Release notes for previous GenBank releases are also available. GenBank growth statistics for both the traditional GenBank divisions and the WGS division are available from each release.

An annotated sample GenBank record for a Saccharomyces cerevisiae gene demonstrates many of the features of the GenBank flat file format.

Access to GenBank
There are several ways to search and retrieve data from GenBank.

Search GenBank for sequence identifiers and annotations with Entrez Nucleotide.
Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool). See BLAST info for more information about the numerous BLAST databases.
Search, link, and download sequences programmatically using NCBI e-utilities.
The ASN.1 and flatfile formats are available at NCBI's anonymous FTP server: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1 and ftp://ftp.ncbi.nlm.nih.gov/genbank.
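As a rough sketch of the programmatic route, the snippet below only builds an E-utilities `efetch` URL that asks for a nucleotide record in GenBank flat-file form; it does not perform the download. The class and method names are hypothetical, and the choice of `db=nuccore`, `rettype=gb`, and `retmode=text` is an assumption about what a typical request for this article's parser would look like.

```csharp
using System;

// Hypothetical helper: builds an efetch request URL for a GenBank flat file.
public static class EntrezUrlSketch
{
    private const string EFetchBase =
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

    // accession: a nucleotide accession chosen by the caller.
    public static string BuildGenBankUrl(string accession)
    {
        // Escape the accession so unusual characters cannot break the query string.
        return EFetchBase
            + "?db=nuccore"
            + "&id=" + Uri.EscapeDataString(accession)
            + "&rettype=gb"
            + "&retmode=text";
    }
}
```

The text returned from such a URL could then be passed to the `GENBANK_File` constructor in section 3, which takes the whole file content as a string.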

GenBank Data Usage
The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank.

Data Processing, Status and Release
The most important source of new data for GenBank is direct submissions from a variety of individuals, including researchers, using one of our submission tools. Following submission, data are subject to automated and manual processing to ensure data integrity and quality and are subsequently made available to the public. On rare occasions, data may be removed from public view. More details about this process can be found on the NLM GenBank and SRA Data Processing page.

Confidentiality
Some authors are concerned that the appearance of their data in GenBank prior to publication will compromise their work. GenBank will, upon request, withhold release of new submissions for a specified period of time. However, if the accession number or sequence data appears in print or online prior to the specified date, your sequence will be released. In order to prevent the delay in the appearance of published sequence data, we urge authors to inform us of the appearance of the published data. As soon as it is available, please send the full publication data--all authors, title, journal, volume, pages and date--to the following address: update@ncbi.nlm.nih.gov

Privacy
If you are submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source. GenBank assumes that the submitter has received any necessary informed consent authorizations required prior to submitting sequences.

3 GenBank Parser: C# Source Code

using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Drawing;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.Serialization;

namespace Legal.BIOG
{
    [DataContract]
    public class GENBANK_ELEMENT
    {
        [DataMember(Order = 1)]
        public string Name { get; set; } = "";
        [DataMember(Order = 2)]
        public string Content { get; set; } = "";

        public GENBANK_ELEMENT(int position, string buf)
        {
            // Guard against lines shorter than the keyword column.
            int cut = Math.Min(position, buf.Length);
            Name = buf.Substring(0, cut).Trim();
            Content = buf.Substring(cut).Trim();
        }

        public GENBANK_ELEMENT(string name, string content)
        {
            Name = name;
            Content = content;
        }
    }

    [DataContract]
    public class GENBANK_REFERENCE
    {
        [DataMember(Order = 1)]
        public string Name { get; set; } = "";
        [DataMember(Order = 2)]
        public List<GENBANK_ELEMENT> Items { get; set; } = new List<GENBANK_ELEMENT>();

        public void Append(int position, string buf)
        {
            Items.Add(new GENBANK_ELEMENT(position, buf));
        }

        public void Append(string name, string content)
        {
            Items.Add(new GENBANK_ELEMENT(name, content));
        }

        public GENBANK_ELEMENT Find(string name)
        {
            return Items.Find(t => t.Name == name);
        }
    }

    [DataContract]
    public class GENBANK_FEATURE
    {
        [DataMember(Order = 1)]
        public string Name { get; set; } = "";
        [DataMember(Order = 2)]
        public string Lines { get; set; } = "";

        public GENBANK_FEATURE(string name, string lines)
        {
            Name = name;
            Lines = lines;
        }

        public List<string> FeatureList
        {
            get
            {
                string[] ra = B.S2L(Lines);
                return ra.ToList();
            }
        }

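        /// <summary>
        /// Searches this feature for a qualifier,
        /// for example a line starting with /db_xref=
        /// </summary>
        /// <param name="name">Feature key to match, e.g. CDS</param>
        /// <param name="branch">Qualifier name without the leading slash, e.g. db_xref</param>
        /// <returns>The qualifier value, or an empty string if nothing matches</returns>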
        public string FindBranch(string name, string branch)
        {
            List<string> list = FeatureList;
            if (Name == name)
            {
                foreach (string s in list)
                {
                    if (s.StartsWith("/" + branch + "="))
                    {
                        return s.Substring(branch.Length + 2);
                    }
                }
            }
            return "";
        }

        public string Position
        {
            get
            {
                List<string> list = FeatureList;
                // A feature without any lines has no location.
                return (list.Count > 0 && list[0].Contains("..")) ? list[0] : "";
            }
        }

        public List<Point> PositionList
        {
            get
            {
                return Utility.PositionList(Position);
            }
        }
    }

    [DataContract]
    public class GENBANK_Item
    {
        [DataMember(Order = 1)]
        public List<GENBANK_ELEMENT> Descriptions { get; set; } = new List<GENBANK_ELEMENT>();
        [DataMember(Order = 2)]
        public List<GENBANK_REFERENCE> References { get; set; } = new List<GENBANK_REFERENCE>();
        [DataMember(Order = 3)]
        public List<GENBANK_REFERENCE> Source { get; set; } = new List<GENBANK_REFERENCE>();
        [DataMember(Order = 4)]
        public List<GENBANK_FEATURE> Features { get; set; } = new List<GENBANK_FEATURE>();

        public string Find(string name)
        {
            // List<T>.Find returns null when no element matches.
            GENBANK_ELEMENT de = Descriptions.Find(t => t.Name == name);
            return (de != null) ? de.Content : "";
        }

        public string Sequence
        {
            get
            {
                // The sequence is stored as the ORIGIN description.
                return Find("ORIGIN");
            }
        }
    }

    public class GENBANK_File
    {
        public List<GENBANK_Item> Items { get; set; } = new List<GENBANK_Item>();

        public GENBANK_File(string buf)
        {
            try
            {
                string[] xlines = B.S2L(buf);
                GENBANK_Item item = null;
                for (int i = 0; i < xlines.Length; i++)
                {
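                    if (xlines[i].StartsWith("LOCUS"))
                    {
                        if (item != null) { Items.Add(item); item = null; }
                        item = new GENBANK_Item();
                        item.Descriptions.Add(new GENBANK_ELEMENT(12, xlines[i]));
                        continue;
                    }
                    // Skip lines outside a record (before the first LOCUS or after "//"),
                    // such as a trailing blank line; item is null there.
                    if (item == null) { continue; }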
                    if (xlines[i].StartsWith("DEFINITION") ||
                        xlines[i].StartsWith("ACCESSION") ||
                        xlines[i].StartsWith("VERSION") ||
                        xlines[i].StartsWith("KEYWORDS") ||
                        xlines[i].StartsWith("COMMENT"))
                    {
                        string rs = Utility.ReadMultiLines(ref i, xlines, out string kw, 12, " ");
                        item.Descriptions.Add(new GENBANK_ELEMENT(kw, rs));
                        continue;
                    }
                    else if (xlines[i].StartsWith("SOURCE"))
                    {
                        GENBANK_REFERENCE src = new GENBANK_REFERENCE();
                        src.Name = xlines[i].Substring(12).Trim(); i++;
                        while (true)
                        {
                            string rs = Utility.ReadMultiLines(ref i, xlines, out string kw, 12);
                            src.Append(kw, rs);
                            // Stop at end of input or at the next top-level keyword.
                            if (i + 1 >= xlines.Length || !xlines[i + 1].StartsWith(" ")) { break; }
                            i++;
                        }
                        item.Source.Add(src);
                        continue;
                    }
                    else if (xlines[i].StartsWith("REFERENCE"))
                    {
                        GENBANK_REFERENCE rfx = new GENBANK_REFERENCE();
                        rfx.Name = xlines[i].Substring(12).Trim(); i++;
                        while (true)
                        {
                            string rs = Utility.ReadMultiLines(ref i, xlines, out string kw, 12);
                            rfx.Append(kw, rs);
                            // Stop at end of input or at the next top-level keyword.
                            if (i + 1 >= xlines.Length || !xlines[i + 1].StartsWith(" ")) { break; }
                            i++;
                        }
                        item.References.Add(rfx);
                        continue;
                    }
                    else if (xlines[i].StartsWith("FEATURES"))
                    {
                        item.Descriptions.Add(new GENBANK_ELEMENT("FEATURES", xlines[i].Substring(12).Trim())); i++;
                        while (true)
                        {
                            string rs = Utility.ReadFeatureLines(ref i, xlines, out string kw, 0, 21);
                            GENBANK_FEATURE ef = new GENBANK_FEATURE(kw, rs);
                            item.Features.Add(ef);
                            // Stop at end of input or at the next top-level keyword.
                            if (i + 1 >= xlines.Length || !xlines[i + 1].StartsWith(" ")) { break; }
                            i++;
                        }
                        continue;
                    }
                    else if (xlines[i].StartsWith("//"))
                    {
                        if (item != null) { Items.Add(item); item = null; }
                        continue;
                    }
                    else if (xlines[i].StartsWith("ORIGIN"))
                    {
                        i++;
                        string rs = Utility.ReadSequenceLines(ref i, xlines, 10);
                        item.Descriptions.Add(new GENBANK_ELEMENT("ORIGIN", rs));
                        continue;
                    }
                    else
                    {
                        item.Descriptions.Add(new GENBANK_ELEMENT("UNKNOW", xlines[i]));
                        continue;
                    }
                }
                if (item != null)
                {
                    Items.Add(item);
                }
            }
            catch (Exception ex)
            {
                // Keep the original exception as InnerException for debugging.
                throw new Exception("GENBANK_File() ERROR: " + ex.Message, ex);
            }
        }

        public static GENBANK_File FromFile(string filename)
        {
            try
            {
                string buf = File.ReadAllText(filename);
                return new GENBANK_File(buf);
            }
            catch (Exception ex)
            {
                // Keep the original exception as InnerException for debugging.
                throw new Exception("GENBANK_File.FromFile ERROR: " + ex.Message, ex);
            }
        }

        public void Write_Json(string filename)
        {
            try
            {
                File.WriteAllText(filename, SimpleJson.SerializeObject(Items));
            }
            catch (Exception ex)
            {
                // Keep the original exception as InnerException for debugging.
                throw new Exception("GENBANK_File.Write_Json ERROR: " + ex.Message, ex);
            }
        }

        public string Fasta_Sequences()
        {
            StringBuilder sb = new StringBuilder();
            foreach (GENBANK_Item item in Items)
            {
                sb.AppendLine(">" + item.Find("DEFINITION"));
                sb.AppendLine(B.BreakTo(item.Sequence));
                sb.AppendLine("");
            }
            return sb.ToString();
        }

        public string Print_Features()
        {
            StringBuilder sb = new StringBuilder();
            foreach (GENBANK_Item item in Items)
            {
                foreach (GENBANK_FEATURE feature in item.Features)
                {
                    if (feature.FeatureList.Count > 1)
                    {
                        sb.AppendLine(">" + feature.Name + " " + feature.FeatureList[1]);
                        sb.AppendLine(B.BreakTo(Utility.SequenceByPosition(item.Sequence, feature.PositionList)));
                        sb.AppendLine("");
                    }
                }
            }
            return sb.ToString();
        }

        public string Protein()
        {
            StringBuilder sb = new StringBuilder();
            foreach (GENBANK_Item item in Items)
            {
                foreach (GENBANK_FEATURE feature in item.Features)
                {
                    string tr = feature.FindBranch("CDS", "translation");
                    if (tr.Length > 0)
                    {
                        sb.AppendLine(">" + feature.Name + " " + feature.FeatureList[1]);
                        sb.AppendLine(B.BreakTo(tr.Replace(" ", "").Replace("\"", "")));
                        sb.AppendLine("");
                    }
                }
            }
            return sb.ToString();
        }

    }
}
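The listing above calls several helpers that the article does not show (`B.S2L`, `B.BreakTo`, `Utility.ReadSequenceLines`, `Utility.PositionList`, and others). The sketch below is one possible reading of what some of them do, inferred only from how they are called. Every name in it is hypothetical, the 60-character wrap width is an assumption, and tuples are used in place of `System.Drawing.Point` to keep the example free of extra dependencies.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stand-ins for helpers the parser uses but does not define.
public static class GenBankHelperSketch
{
    // Assumed role of B.S2L: split a text buffer into lines (CRLF or LF).
    public static string[] SplitLines(string buf)
    {
        return buf.Replace("\r\n", "\n").Split('\n');
    }

    // Assumed role of B.BreakTo: wrap a sequence at a fixed width.
    public static string BreakTo(string seq, int width = 60)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < seq.Length; i += width)
        {
            if (i > 0) { sb.Append('\n'); }
            sb.Append(seq.Substring(i, Math.Min(width, seq.Length - i)));
        }
        return sb.ToString();
    }

    // ORIGIN lines mix a position counter, spaces, and bases;
    // keeping only the letters leaves the bases.
    public static string CleanSequenceLine(string line)
    {
        var sb = new StringBuilder();
        foreach (char c in line)
        {
            if (char.IsLetter(c)) { sb.Append(c); }
        }
        return sb.ToString();
    }

    // Extracts 1-based inclusive (start, end) pairs from a feature location
    // such as "1..206", "join(1..10,20..30)" or "complement(<5..>9)".
    // Strand and partial markers are ignored in this sketch.
    public static List<(int Start, int End)> ParseLocation(string location)
    {
        var result = new List<(int Start, int End)>();
        foreach (Match m in Regex.Matches(location, @"[<>]?(\d+)\.\.[<>]?(\d+)"))
        {
            result.Add((int.Parse(m.Groups[1].Value), int.Parse(m.Groups[2].Value)));
        }
        return result;
    }
}
```

With helpers along these lines in place, a caller would typically load a file with `GENBANK_File.FromFile` and then use `Fasta_Sequences`, `Print_Features`, or `Protein` to export the parsed records.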
 


http://www.ppmy.cn/news/152346.html
