Building Bowtie index failure (tophat2, bowtie2)

2025-07-14 05:30:05

(Note: the tags should be tophat2 and bowtie2, but I do not have the points to create new tags.)

Greetings: I am using TopHat2 (command line) to analyze RNA-seq data and I am encountering some errors.

Here is the call:

tophat2 -o tophat2_results/ -G ref_data/BA000007.2.gtf --transcriptome-index=transcriptome_data/RA_LBG01b_241_filteredQ indices/BA000007.2 data_files/RA_LBG01b_241_filteredQ.fastq

Here is the error:

[2015-12-29 12:58:] Checking for Bowtie
    Bowtie version: 2.2.4.0
[2015-12-29 12:58:] Checking for Bowtie index files (genome)..
[2015-12-29 12:58:] Checking for reference FASTA file
[2015-12-29 12:58:] Generating SAM header for indices/BA000007.2
[2015-12-29 12:58:] Reading known junctions from GTF file
    Warning: TopHat did not find any junctions in GTF file
[2015-12-29 12:58:] Preparing reads
    left reads: min. length=12, max. length=42, 20272 kept reads (115 discarded)
Warning: short reads (<20bp) will make TopHat quite slow and take large amount of memory because they are likely to be mapped in too many places
[2015-12-29 12:58:9] Building transcriptome data files transcriptome_data/RA_LBG01b_241_filteredQ
[2015-12-29 12:58:40] Building Bowtie index from RA_LBG01b_241_filteredQ.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1

Version Information:
TopHat v2.1.0
Bowtie2 version 2.2.4
Python 2.7.10 :: Anaconda 2.4.0 (64-bit)

System Information: CentOS Release 6.7

How I got here and what I have tried:

I am using E. coli (Accession: BA000007.2) for my reference genome which can be found here: http://gov/nuccore/BA000007.2

I obtained my GTF file from Ensembl (ftp:///pub/release-29/bacteria//gtf/bacteria_9_collection/escherichia_coli_o157_h7_str_sakai/)

I built my indices with bowtie2-build (before the tophat2 call):

bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2

I am aware that the error I am receiving is associated with a mismatch between the names in the first column of the *.gtf file and the name of the reference FASTA file. If I understand this correctly, every entry in the 1st column should be BA000007.2, whereas most of the names in the 1st column were "Chromosome". To fix this, I did the following:

awk 'BEGIN{FS=OFS="\t"}{print "BA000007.2", $2, $3, $4, $5, $6, $7, $8, $9}' pathToGTF/BA000007.2_ensemble.gtf > pathToGTF/BA000007.2.gtf

Please note: the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of the Ensembl GTF file, which would otherwise produce undesirable output from the awk command, has been addressed.
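One way to handle those header lines is to pass them through untouched and overwrite only the first column of the data lines. The following is a minimal sketch, not the exact command used above; the function name rename_gtf_seqname and the two-line demo file are made up for illustration.

```shell
#!/bin/sh
# Sketch: rename column 1 of a GTF while keeping '#' header lines as they are.
# rename_gtf_seqname and the demo input are hypothetical, for illustration only.
rename_gtf_seqname() {
    # $1 = new sequence name, $2 = input GTF; result goes to stdout
    awk -v name="$1" 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }      # header/comment lines pass through unchanged
        { $1 = name; print }      # data lines: overwrite only the first column
    ' "$2"
}

# Tiny tab-separated demo input written to a temporary file.
demo_gtf=$(mktemp)
printf '#!genome-build ASM80120v1\n' > "$demo_gtf"
printf 'Chromosome\tena\tgene\t190\t255\t.\t+\t.\tgene_id "demo1"; gene_name "demo";\n' >> "$demo_gtf"

renamed=$(rename_gtf_seqname BA000007.2 "$demo_gtf")
printf '%s\n' "$renamed"
rm -f "$demo_gtf"
```

Setting FS and OFS in a BEGIN block means the tab separator is already in effect when the first line is split, so the attribute column (which contains spaces) stays in one piece.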

I also changed the extension of the FASTA file from *.fasta to *.fa

Questions:

Did I properly put the kibosh on any problems arising from differences in naming between the 1st column of the gtf file and the name of the fasta file (BA000007.2, BA000007.2.fa)?

When I peruse output in the logs directory, there are several errors ( & similar errors in ftf_juncs.log) with lines beginning with:

Warning: invalid start coordinate at line: BA000007.2 ena gene -194 2502 . + . gene_id "BAA1757"; gene_version "1"; gene_name "tagA"; gene_source "ena"; gene_biotype "protein_coding";

There are indeed negative numbers in the gtf files, but not in the genbank file (quick search in vim). Could this be the source of the error? I commented out the specific lines and deleted them from the file -- both approaches still result in the error.
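GTF coordinates are 1-based, so a start value below 1 cannot be valid. A quick way to list such lines before deciding what to do with them is sketched below. The demo file and the gene_id values are invented, and treating "end < start" as invalid as well is an assumption of this sketch, not something TopHat is confirmed to check.

```shell
#!/bin/sh
# Sketch: report GTF data lines whose coordinates look invalid.
# Assumption: a line is invalid when start < 1 or end < start.
demo_gtf=$(mktemp)
printf 'BA000007.2\tena\tgene\t-194\t2502\t.\t+\t.\tgene_id "demoA";\n' > "$demo_gtf"
printf 'BA000007.2\tena\tgene\t190\t255\t.\t+\t.\tgene_id "demoB";\n' >> "$demo_gtf"

# Line number plus the offending coordinates, skipping '#' header lines.
bad_lines=$(awk -F '\t' '
    /^#/ { next }
    ($4 + 0) < 1 || ($5 + 0) < ($4 + 0) { print NR ": start=" $4 " end=" $5 }
' "$demo_gtf")

# Count of data lines that pass the same check.
good_count=$(awk -F '\t' '
    !/^#/ && ($4 + 0) >= 1 && ($5 + 0) >= ($4 + 0) { n++ }
    END { print n + 0 }
' "$demo_gtf")

printf 'invalid: %s\nvalid lines: %s\n' "$bad_lines" "$good_count"
rm -f "$demo_gtf"
```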

Is there anything readily seen that could be causing the "Couldn't build bowtie index with err = 1" error?

I have been stuck on this for a couple of days so any help is greatly appreciated.

Best answer

I found the source of the problem. It was the header in the reference FASTA file. The initial header was:

>gi|4711801|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome

whereas it should have been

>BA000007

So...if the fasta file is called abc12.fa, then the header in the fasta file must be >abc12. The first column in the gtf file must also be abc12.
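That consistency check can be scripted. The sketch below compares the names in the FASTA headers with the names in column 1 of the GTF, then rewrites the header and compares again. The abc12 files, their contents, and the helper names fasta_names and gtf_names are invented for illustration, and the header rewrite assumes a FASTA file that holds a single sequence.

```shell
#!/bin/sh
# Sketch: check that FASTA header names and GTF column-1 names agree.
# All file names and contents here are hypothetical demo data.
workdir=$(mktemp -d)
fasta="$workdir/abc12.fa"
gtf="$workdir/abc12.gtf"
printf '>gi|000|dbj|abc12.1| example organism, complete genome\nACGT\n' > "$fasta"
printf 'abc12\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "demo1";\n' > "$gtf"

fasta_names() {   # first whitespace-delimited token of each header, without '>'
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | sort -u
}
gtf_names() {     # distinct values of column 1, skipping '#' header lines
    awk -F '\t' '!/^#/ { print $1 }' "$1" | sort -u
}

before=$(fasta_names "$fasta")

# Replace the whole header line with the bare name (single-sequence FASTA assumed).
sed '1s/^>.*/>abc12/' "$fasta" > "$fasta.tmp" && mv "$fasta.tmp" "$fasta"

after=$(fasta_names "$fasta")
if [ "$after" = "$(gtf_names "$gtf")" ]; then status=match; else status=mismatch; fi
printf 'before: %s\nafter: %s\nstatus: %s\n' "$before" "$after" "$status"
rm -rf "$workdir"
```

After changing the header, any Bowtie2 index built from the old FASTA file would need to be rebuilt, since the index stores the sequence names it was built with.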

Please note that I changed the basename from BA000007.2 to BA000007 in all of my calls, and I renamed all files to drop the .2 from the name. It may still work with the .2, but I did not test it ("The basename is the name of any of the index files up to but not including the first period." [tophat manual]) (Thank you AM). Lastly, I renamed the FASTA files from *.fasta to *.fa.
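Read literally, the quoted rule would cut an index file name at its first period. The sketch below only illustrates that literal reading with shell parameter expansion; whether TopHat resolves names exactly this way was not tested here, and index_basename is a made-up helper. Bowtie2 index files are named like <base>.1.bt2, which is what the two example names imitate.

```shell
#!/bin/sh
# Sketch: "up to but not including the first period", taken literally.
# index_basename is a hypothetical helper, not part of TopHat or Bowtie2.
index_basename() {
    file=${1##*/}                 # drop any leading directory
    printf '%s\n' "${file%%.*}"   # drop everything from the first period on
}

with_version=$(index_basename indices/BA000007.2.1.bt2)
without_version=$(index_basename indices/BA000007.1.bt2)
printf '%s\n%s\n' "$with_version" "$without_version"
```

Under this reading both names reduce to the same string, which is one reason dropping the .2 everywhere sidesteps the question.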
