Shah, N., Nute, M.G., Warnow, T., and Pop, M. (2018). Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics.
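The journal Bioinformatics published, as a letter to the editor, an article by Nidhi Shah et al. from the Department of Computer Science at the University of Maryland, reporting a "bug" in the -max_target_seqs parameter of BLAST (Shah et al., 2018).
The "bug" is this: -max_target_seqs merely returns the first N hits that satisfy the given E-value threshold, and these are not guaranteed to be the N hits with the lowest E-values.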
https://ibook.antpedia.com/x/146750.html
https://www.jianshu.com/p/7eb530bc1a9c
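Until now, most people set this parameter to 1 on the assumption that BLAST would then output the single best match. After checking, the authors found that this usage is wrong: the output is not the best match but the first sufficiently good match. Worse, the output depends on the order in which the sequences appear in the database. For the same search, different versions of the database can give different results even when every version contains the same best hit. Sorting the database in a different way likewise causes BLAST to return a different "top hit" when max_target_seqs is set to 1. The original text reads: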
To enable the efficient processing of large data sets, researchers frequently rely on shortcuts aimed at reducing the number of BLAST results that need to be processed. A common strategy involves using the ‘-max_target_seqs’ parameter of the NCBI BLAST+ suite. According to the BLAST documentation itself (2008), this parameter represents the ‘number of aligned sequences to keep’. This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a sequence query if the value of max_target_seqs is set to N. For example, in a recent article (Wang et al., 2016) the authors explicitly state ‘Setting “max target seqs” as “1,” only the best match result was considered.’
To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1.
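In my tests, no BLAST parameter was found that returns the best match directly; the most reliable approach is to filter the results with a script!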
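A minimal sketch of such a filtering script, offered as one possible approach rather than a definitive one. It assumes the search was run with the default tabular output (-outfmt 6, whose standard columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and with -max_target_seqs left large enough that the true best hit is present in the output at all. The function name best_hits and the demo rows are hypothetical and used only for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {query_id: row_dict}, keeping one best hit per query.

    "Best" here means highest bit score, with ties broken by lowest E-value.
    This ranking rule is an assumption of the sketch; adapt it as needed.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines (as in -outfmt 7)
        row = dict(zip(COLUMNS, fields))
        # Larger key is better: high bit score first, then low E-value.
        key = (float(row["bitscore"]), -float(row["evalue"]))
        current = best.get(row["qseqid"])
        if current is None or key > current[0]:
            best[row["qseqid"]] = (key, row)
    return {query: row for query, (_, row) in best.items()}


if __name__ == "__main__":
    # Made-up rows: for q1 the better hit (sB) appears second in the file.
    demo = (
        "q1\tsA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t120\n"
        "q1\tsB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n"
        "q2\tsC\t95.0\t80\t4\t0\t1\t80\t1\t80\t1e-20\t95.5\n"
    )
    for query, row in best_hits(io.StringIO(demo)).items():
        print(query, row["sseqid"], row["evalue"], row["bitscore"])
```

To use it on a real result file, open the file and pass the handle to best_hits in place of the io.StringIO demo. Because the selection is done on scores rather than on file order, the result does not depend on the order in which BLAST reported the hits.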
Can DIAMOND return the best hit?
Its developer, Benjamin Buchfink, has just given the answer on his personal Twitter account: