Introduction to the GATK ReadLikelihoodCalculationEngine Interface


ReadLikelihoodCalculationEngine is an interface in GATK (the Genome Analysis Toolkit) for computing the likelihood of sequencing reads under different candidate alleles (haplotypes or alleles). These likelihoods play a key role in variant calling: they help determine which alleles are most likely to represent real genetic variation.

Main Functionality

Given sequencing reads, a reference sequence, and a set of candidate alleles, ReadLikelihoodCalculationEngine computes the likelihood of each read under each candidate allele. These likelihoods are then used to identify real variants and filter out artifacts.
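Concretely, for a single sample with reads r_1, …, r_n and candidate haplotypes h_1, …, h_m, the engine fills an m × n matrix of log10-scaled likelihoods (a minimal formalization of what the source below computes):

\ell_{ij} = \log_{10} P(r_j \mid h_i), \qquad i = 1,\dots,m,\ j = 1,\dots,n

Each read's column is then normalized so that no single read can contribute more evidence than the global read-mismapping rate allows (the normalizeLikelihoods call in the source below).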

Main Methods

computeReadLikelihoods: computes the likelihood of each read under each candidate allele. Output: an AlleleLikelihoods<GATKRead, Haplotype> object holding the read-by-haplotype likelihood matrix.
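A minimal sketch of how a caller might drive the flow-based implementation listed later in this post. The wrapper method and its parameter names are hypothetical scaffolding; the constructor and computeReadLikelihoods signatures follow the FlowBasedAlignmentLikelihoodEngine source below, the no-arg FlowBasedAlignmentArgumentCollection constructor is an assumption, and the numeric arguments are illustrative values, not GATK defaults.

// Hypothetical helper; assumes the same imports as the listing below.
static AlleleLikelihoods<GATKRead, Haplotype> scoreReads(
        final List<Haplotype> assembledHaplotypes,            // from local assembly
        final SAMFileHeader header,
        final SampleList sampleList,
        final Map<String, List<GATKRead>> readsBySample) {
    final FlowBasedAlignmentArgumentCollection fbargs = new FlowBasedAlignmentArgumentCollection();
    final FlowBasedAlignmentLikelihoodEngine engine = new FlowBasedAlignmentLikelihoodEngine(
            fbargs,
            -3.0,    // log10 global read mismapping rate
            0.001,   // expected per-base sequencing error rate
            false,   // dynamicReadDisqualification
            1.0);    // readDisqualificationScale
    // Compute the per-sample read-by-haplotype likelihood matrix,
    // dropping reads that fit no haplotype well (filterPoorly = true).
    return engine.computeReadLikelihoods(assembledHaplotypes, header, sampleList, readsBySample, true);
}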

Usage Scenarios

ReadLikelihoodCalculationEngine plays an important role in many GATK variant-calling pipelines, particularly somatic mutation calling (e.g., Mutect2) and standard germline SNP/indel calling.
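Downstream genotyping essentially compares columns of the returned matrix. The sketch below reports each read's best-supported haplotype; it uses the accessors that appear in the source later in this post (numberOfSamples, sampleMatrix, evidenceCount, numberOfAlleles, evidence) and assumes LikelihoodMatrix also exposes a get(alleleIndex, evidenceIndex) accessor mirroring the set(...) call used there.

// Hypothetical inspection helper: find the best haplotype per read.
static void printBestHaplotypes(final AlleleLikelihoods<GATKRead, Haplotype> likelihoods) {
    for (int s = 0; s < likelihoods.numberOfSamples(); s++) {
        final LikelihoodMatrix<GATKRead, Haplotype> matrix = likelihoods.sampleMatrix(s);
        for (int r = 0; r < matrix.evidenceCount(); r++) {
            int best = 0;
            for (int a = 1; a < matrix.numberOfAlleles(); a++) {
                if (matrix.get(a, r) > matrix.get(best, r)) {   // log10 scale: larger is better
                    best = a;
                }
            }
            System.out.printf("read %s -> haplotype #%d (log10 likelihood %.3f)%n",
                    matrix.evidence().get(r).getName(), best, matrix.get(best, r));
        }
    }
}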

Implementation Classes

FlowBasedAlignmentLikelihoodEngine source code (the listing is truncated at the end):

package org.broadinstitute.hellbender.tools.walkers.haplotypecaller;

import htsjdk.samtools.SAMFileHeader;
import org.apache.commons.math3.util.Precision;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.broadinstitute.hellbender.exceptions.GATKException;
import org.broadinstitute.hellbender.utils.Utils;
import org.broadinstitute.hellbender.utils.genotyper.*;
import org.broadinstitute.hellbender.utils.haplotype.Haplotype;
import org.broadinstitute.hellbender.utils.read.*;
import org.broadinstitute.hellbender.utils.haplotype.FlowBasedHaplotype;

import java.util.*;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.function.ToDoubleFunction;
import java.util.stream.IntStream;

/**
 * Flow based replacement for PairHMM likelihood calculation. Likelihood calculation is all-vs-all
 * between flow based reads and haplotypes.
 *
 * see {@link #haplotypeReadMatching(FlowBasedHaplotype, FlowBasedRead)}
 */
public class FlowBasedAlignmentLikelihoodEngine implements ReadLikelihoodCalculationEngine {
    public static final double MAX_ERRORS_FOR_READ_CAP = 3.0;
    public static final double MAX_CATASTROPHIC_ERRORS_FOR_READ_CAP = 2.0;

    // for the two most common probability values, we pre-compute log10 to save runtime.
    public static final double COMMON_PROB_VALUE_1 = 0.988;
    public static final double COMMON_PROB_VALUE_2 = 0.001;
    public static final double COMMON_PROB_VALUE_1_LOG10 = Math.log10(COMMON_PROB_VALUE_1);
    public static final double COMMON_PROB_VALUE_2_LOG10 = Math.log10(COMMON_PROB_VALUE_2);

    private double log10globalReadMismappingRate;
    private final double expectedErrorRatePerBase;
    private final boolean symmetricallyNormalizeAllelesToReference;

    private static final int ALIGNMENT_UNCERTAINTY = 4;
    final FlowBasedAlignmentArgumentCollection fbargs;

    private final Logger logger = LogManager.getLogger(this.getClass());

    private final boolean dynamicReadDisqualification;
    private final double readDisqualificationScale;

    private ForkJoinPool threadPool;

    /**
     * Default constructor
     * @param fbargs - arguments
     * @param log10globalReadMismappingRate - probability for wrong mapping (maximal contribution of the read to data likelihood)
     * @param expectedErrorRatePerBase - the expected rate of random sequencing errors for a read originating from its true haplotype.
     */
    public FlowBasedAlignmentLikelihoodEngine(final FlowBasedAlignmentArgumentCollection fbargs,
                                              final double log10globalReadMismappingRate,
                                              final double expectedErrorRatePerBase,
                                              final boolean dynamicReadDisqualification,
                                              final double readDisqualificationScale) {
        this.fbargs = fbargs;
        this.log10globalReadMismappingRate = log10globalReadMismappingRate;
        this.expectedErrorRatePerBase = expectedErrorRatePerBase;
        this.readDisqualificationScale = readDisqualificationScale;
        this.dynamicReadDisqualification = dynamicReadDisqualification;
        this.symmetricallyNormalizeAllelesToReference = true;

        if (this.fbargs.flowLikelihoodParallelThreads > 0) {
            threadPool = new ForkJoinPool(this.fbargs.flowLikelihoodParallelThreads);
        }
    }

    /**
     * Read/haplotype likelihood calculation for all samples
     * @param samples the list of targeted samples.
     * @param perSampleReadList the input read sets stratified per sample.
     * @param filterPoorly - if the poorly modeled reads should be removed
     * @return per-sample read-by-haplotype likelihoods
     */
    public AlleleLikelihoods<GATKRead, Haplotype> computeReadLikelihoods(final List<Haplotype> haplotypeList,
                                                                         final SAMFileHeader hdr,
                                                                         final SampleList samples,
                                                                         final Map<String, List<GATKRead>> perSampleReadList,
                                                                         final boolean filterPoorly) {
        return computeReadLikelihoods(haplotypeList, hdr, samples, perSampleReadList, filterPoorly, true);
    }

    /**
     * Read/haplotype likelihood calculation for all samples
     * @param samples the list of targeted samples.
     * @param perSampleReadList the input read sets stratified per sample.
     * @param filterPoorly - if the poorly modeled reads should be removed
     * @param normalizeLikelihoods - if the likelihood of each read should be normalized to the highest
     * @return per-sample read-by-haplotype likelihoods
     */
    public AlleleLikelihoods<GATKRead, Haplotype> computeReadLikelihoods(final List<Haplotype> haplotypeList,
                                                                         final SAMFileHeader hdr,
                                                                         final SampleList samples,
                                                                         final Map<String, List<GATKRead>> perSampleReadList,
                                                                         final boolean filterPoorly,
                                                                         final boolean normalizeLikelihoods) {
        Utils.nonNull(samples, "samples is null");
        Utils.nonNull(perSampleReadList, "perSampleReadList is null");
        Utils.nonNull(haplotypeList, "haplotypeList is null");

        final AlleleList<Haplotype> haplotypes = new IndexedAlleleList<>(haplotypeList);

        // Add likelihoods for each sample's reads to our result
        final AlleleLikelihoods<GATKRead, Haplotype> result = new AlleleLikelihoods<>(samples, haplotypes, perSampleReadList);
        final int sampleCount = result.numberOfSamples();
        for (int i = 0; i < sampleCount; i++) {
            computeReadLikelihoods(result.sampleMatrix(i), hdr);
        }
        if (normalizeLikelihoods) {
            result.normalizeLikelihoods(log10globalReadMismappingRate, symmetricallyNormalizeAllelesToReference);
        }
        if (filterPoorly) {
            filterPoorlyModeledEvidence(result, dynamicReadDisqualification, expectedErrorRatePerBase, readDisqualificationScale);
        }
        return result;
    }

    /**
     * Calculate minimal likelihood that reasonably matching read can get. We divide errors into "expected", e.g.
     * hmer indels without 0->1/1->0 errors. Those on average occur with expectedErrorRate and "catastrophic", e.g.
     * 1->0 and 0->1 errors (or large hmer indels). Those occur with probability fbargs.fillingValue.
     * If the read has more than 3 expected and more than 2 "catastrophic" errors it will probably be deemed unfit to the
     * haplotype
     *
     * @param expectedErrorRate error rate for expected errors.
     * @return minimal likelihood for the read to be considered not poorly modeled
     */
    @Override
    public ToDoubleFunction<GATKRead> log10MinTrueLikelihood(final double expectedErrorRate, final boolean capLikelihoods) {
        final double log10ErrorRate = Math.log10(expectedErrorRate);
        final double largeEventErrorRate = Math.max(fbargs.fillingValue, 0.000001); // error rate for non-hmer/snv errors that are not seq. errors.
        final double log10catastrophicErrorRate = Math.log10(largeEventErrorRate);

        return read -> {
            final double maxErrorsForRead = capLikelihoods
                    ? Math.max(MAX_ERRORS_FOR_READ_CAP, Math.ceil(read.getLength() * expectedErrorRate))
                    : Math.ceil(read.getLength() * expectedErrorRate);
            final double maxCatastrophicErrorsForRead = capLikelihoods
                    ? Math.max(MAX_CATASTROPHIC_ERRORS_FOR_READ_CAP, Math.ceil(read.getLength() * largeEventErrorRate))
                    : Math.ceil(read.getLength() * largeEventErrorRate);
            return maxErrorsForRead * log10ErrorRate + maxCatastrophicErrorsForRead * log10catastrophicErrorRate;
        };
    }

    /**
     * Compute read likelihoods for a single sample
     * @param likelihoods Single sample likelihood matrix
     * @param hdr SAM header that corresponds to the sample
     */
    protected void computeReadLikelihoods(final LikelihoodMatrix<GATKRead, Haplotype> likelihoods,
                                          final SAMFileHeader hdr) {
        final List<FlowBasedRead> processedReads = new ArrayList<>(likelihoods.evidenceCount());
        final List<FlowBasedHaplotype> processedHaplotypes = new ArrayList<>(likelihoods.numberOfAlleles());

        // establish flow order based on the first evidence. Note that all reads belong to the same sample (group)
        final FlowBasedReadUtils.ReadGroupInfo rgInfo = (likelihoods.evidenceCount() != 0)
                ? FlowBasedReadUtils.getReadGroupInfo(hdr, likelihoods.evidence().get(0))
                : null;
        final String flowOrder = (rgInfo != null)
                ? rgInfo.flowOrder.substring(0, fbargs.flowOrderCycleLength)
                : FlowBasedReadUtils.findFirstUsableFlowOrder(hdr, fbargs);

        // convert all reads to FlowBasedReads (i.e. parse the matrix of P(call | haplotype) for each read from the BAM)
        for (int i = 0; i < likelihoods.evidenceCount(); i++) {
            final GATKRead rd = likelihoods.evidence().get(i);
            final FlowBasedRead fbRead = new FlowBasedRead(rd, flowOrder, rgInfo.maxClass, fbargs);
            fbRead.applyAlignment();
            processedReads.add(fbRead);
        }

        for (int i = 0; i < likelihoods.numberOfAlleles(); i++) {
            final FlowBasedHaplotype fbh = new FlowBasedHaplotype(likelihoods.alleles().get(i), flowOrder);
            processedHaplotypes.add(fbh);
        }

        if (fbargs.trimToHaplotype) {
            // NOTE: we assume all haplotypes start and end on the same place!
            final int haplotypeStart = processedHaplotypes.get(0).getStart();
            final int haplotypeEnd = processedHaplotypes.get(0).getEnd();
            for (int i = 0; i < processedReads.size(); i++) {
                final FlowBasedRead fbr = processedReads.get(i);
                final int readStart = fbr.getStart();
                final int readEnd = fbr.getEnd();
                final int diffLeft = haplotypeStart - readStart;
                final int diffRight = readEnd - haplotypeEnd;
                // It is rare that this function is applied, maybe just some boundary cases
                // in general reads are already trimmed to the haplotype starts and ends so diff_left <= 0 and diff_right <= 0
                fbr.applyBaseClipping(Math.max(0, diffLeft), Math.max(diffRight, 0), true);
            }
        }

        for (int i = 0; i < likelihoods.numberOfAlleles(); i++) {
            final FlowBasedHaplotype fbh = processedHaplotypes.get(i);
            if (threadPool != null) {
                final int haplotypeIndex = i;
                Callable<Void> callable = () -> {
                    IntStream.range(0, likelihoods.evidenceCount()).parallel().forEach(j -> {
                        final double likelihood = fbargs.exactMatching
                                ? haplotypeReadMatchingExactLength(fbh, processedReads.get(j))
                                : haplotypeReadMatching(fbh, processedReads.get(j));
                        likelihoods.set(haplotypeIndex, j, likelihood);
                    });
                    return null;
                };
                try {
                    threadPool.submit(callable).get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            } else {
                for (int j = 0; j < likelihoods.evidenceCount(); j++) {
                    final double likelihood = fbargs.exactMatching
                            ? haplotypeReadMatchingExactLength(fbh, processedReads.get(j))
                            : haplotypeReadMatching(fbh, processedReads.get(j));
                    likelihoods.set(i, j, likelihood);
                }
            }
        }
    }

    /**
     * Aligns single read to a single haplotype. The alignment process is based on the assumption that the errors are only
     * changes in the flow calls. Thus, the length of the haplotype in flow space should be equal to the length of the read
     * in the flow space (on the region where the read aligns).
     *
     * Example: align read ACGGT
     * to haplotype       GACG-TA. In flow space we see that the overlapping sequences ACGGT and ACGT are of the same length
     * and we just need to calculate P(R|H) = \prod P(Ri | Hi) for ith flow. Note that the flow matrix of the flow based
     * read {@link FlowBasedRead} allows for trivial calculation of this product.
     *
     * As an additional optimization we note that almost always the interval to be aligned corresponds to the overlap
     * of the read and the haplotype to the reference. We allow for some uncertainty in this, especially when the read
     * falls inside the deletion in the haplotype relative to the reference.
     *
     * @param haplotype FlowBasedHaplotype single haplotype
     * @param read FlowBasedRead single read (trimmed to the haplotype)
     * @return a score for the alignment of the read to the haplotype. The higher the score the "better" the alignment;
     *         normally, scores will be negative.
     * @throws GATKException
     */
    public double haplotypeReadMatching(final FlowBasedHaplotype haplotype, final FlowBasedRead read) throws GATKException {
        if (read.getDirection() != FlowBasedRead.Direction.REFERENCE) {
            throw new GATKException.ShouldNeverReachHereException("Read should be aligned with the reference");
        }
        if (!read.isBaseClipped() && fbargs.trimToHaplotype) {
            throw new GATKException.ShouldNeverReachHereException("Reads should be trimmed to the haplotype");
        }
        if (!read.isValid()) {
            return Double.NEGATIVE_INFINITY;
        }

        // the read is assumed to be trimmed to the haplotype by ReadThreadingAssembler.finalizeRegion, the region of the
        // haplotype to be aligned is estimated by finding the points on the haplotypes that align to the start and the end
        // of the read on the reference
        final int haplotypeStart = ReadUtils.getReadIndexForReferenceCoordinate(haplotype.getStart(), haplotype.getCigar(),
                read.getTrimmedStart()).getLeft();
        final int haplotypeEnd = ReadUtils.getReadIndexForReferenceCoordinate(haplotype.getStart(), haplotype.getCigar(),
                read.getTrimmedEnd()).getLeft();
        final int haplotypeLength = haplotypeEnd - haplotypeStart;
        final int readLength = read.seqLength();

        // in case there is a deletion on the haplotype and the read falls inside the deletion (thus length of the read is
        // longer than the length of the trimmed haplotype):
        // Reference:  GACACACACACT
        // Read:              ACACT
        // Haplotype   GACAC------T
        // In this case the read (that is not informative) is aligned inside the deletion and thus if we clip the haplotype
        // to the start position of the read it will not support it (haplotype: T, read: ACACT), so in cases where we see that
        // the resulting haplotype length is shorter than the resulting read length we extend the "trimmed haplotype" by this
        // length and also allow more uncertainty for the actual starting position of the alignment
        final int uncertainty = Math.max(readLength - haplotypeLength, 0);
        final int leftClip = Math.max(haplotypeStart - uncertainty, 0);
        final int rightClip = Math.max(haplotype.length() - haplotypeEnd - 1 - uncertainty, 0);
        if ((leftClip >= haplotype.length()) || (rightClip >= haplotype.length())) {
            return Double.NEGATIVE_INFINITY;
        }

        int[] leftClipping = haplotype.findLeftClipping(leftClip);
        int clipLeft = leftClipping[0];
        final int leftHmerClip = leftClipping[1];

        leftClipping = haplotype.findRightClipping(rightClip);
        int clipRight = leftClipping[0];
        // ... (the listing is truncated here in the original post)
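To make the read-disqualification floor in log10MinTrueLikelihood concrete: for a read of length L, expected per-base error rate ε, and catastrophic-event rate ε_c = max(fillingValue, 10⁻⁶), the returned lambda evaluates (with capLikelihoods enabled) to

\log_{10} \mathcal{L}_{\min} = \max\left(3,\ \lceil L\varepsilon \rceil\right) \log_{10}\varepsilon + \max\left(2,\ \lceil L\varepsilon_c \rceil\right) \log_{10}\varepsilon_c

where 3 and 2 are MAX_ERRORS_FOR_READ_CAP and MAX_CATASTROPHIC_ERRORS_FOR_READ_CAP. For example, a 150 bp read with ε = ε_c = 0.001 gives ⌈150 · 0.001⌉ = 1, so both caps bind and the floor is 3·(−3) + 2·(−3) = −15; a read whose best per-haplotype log10 likelihood falls below this floor is a candidate for removal by filterPoorlyModeledEvidence.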

