Building Bowtie index failure (tophat2, bowtie2)
(Note: tags should be tophat2 and bowtie2, but I do not have the points to create new tags.)
Greetings: I am using TopHat2 (command line) to analyze RNA-seq data and I am encountering some errors.
Here is the call:
tophat2 -o tophat2_results/ -G ref_data/BA000007.2.gtf --transcriptome-index=transcriptome_data/RA_LBG01b_241_filteredQ indices/BA000007.2 data_files/RA_LBG01b_241_filteredQ.fastq

Here is the error:

[2015-12-29 12:58:] Checking for Bowtie
    Bowtie version: 2.2.4.0
[2015-12-29 12:58:] Checking for Bowtie index files (genome)..
[2015-12-29 12:58:] Checking for reference FASTA file
[2015-12-29 12:58:] Generating SAM header for indices/BA000007.2
[2015-12-29 12:58:] Reading known junctions from GTF file
    Warning: TopHat did not find any junctions in GTF file
[2015-12-29 12:58:] Preparing reads
    left reads: min. length=12, max. length=42, 20272 kept reads (115 discarded)
    Warning: short reads (<20bp) will make TopHat quite slow and take large amount of memory because they are likely to be mapped in too many places
[2015-12-29 12:58:9] Building transcriptome data files transcriptome_data/RA_LBG01b_241_filteredQ
[2015-12-29 12:58:40] Building Bowtie index from RA_LBG01b_241_filteredQ.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1

Version Information:
TopHat v2.1.0
Bowtie2 version 2.2.4
Python 2.7.10 :: Anaconda 2.4.0 (64-bit)
System Information: CentOS Release 6.7
How I got here and what I have tried:
I am using E. coli (Accession: BA000007.2) for my reference genome which can be found here: http://gov/nuccore/BA000007.2
I obtained my GTF file from Ensembl (ftp:///pub/release-29/bacteria//gtf/bacteria_9_collection/escherichia_coli_o157_h7_str_sakai/)
I made my indices using bowtie2-build (before the tophat2 call):
bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2
I am aware that the error I am receiving is related to a mismatch between the names in the first column of the *.gtf file and the name of the reference FASTA file. If I understand this correctly, every entry in the 1st column should be BA000007.2, whereas most of the names in the 1st column were "Chromosome". To fix this, I did the following:
awk '{FS=OFS="\t"}{print "BA000007.2", $2, $3, $4, $5, $6, $7, $8, $9}' pathToGTF/BA000007.2_ensemble.gtf > pathToGTF/BA000007.2.gtf
# Please note that the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of the Ensembl GTF file, which would create undesirable output from the awk command, has been addressed.
I also changed the extension of the FASTA file from *.fasta to *.fa
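For anyone repeating the column-1 rename described above, here is a minimal sketch of a variant that sets the field separators in a BEGIN block and passes comment lines through untouched. Under POSIX awk, assigning FS inside the main block only takes effect from the next record, so the first line would otherwise be split on whitespace. The temporary directory and the two-line sample GTF are hypothetical stand-ins, not the real Ensembl file.

```shell
# Hypothetical sample input standing in for the Ensembl GTF (not the real file).
tmpdir=$(mktemp -d)
printf '#!genome-build ASM80120v1\nChromosome\tena\tgene\t190\t255\t.\t+\t.\tgene_id "g1";\n' > "$tmpdir/in.gtf"

# BEGIN sets the separators before the first record is read.
# Lines starting with '#' are printed unchanged; every other line
# gets its first column replaced by the reference name.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/  { print; next }
           { $1 = "BA000007.2"; print }' "$tmpdir/in.gtf" > "$tmpdir/out.gtf"

cat "$tmpdir/out.gtf"
```

Assigning to $1 rebuilds the record with OFS, so all nine tab-separated columns are kept without listing them one by one.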
Questions:
Did I properly put the kibosh on any problems arising from differences in naming between the 1st column of the gtf file and the name of the fasta file (BA000007.2, BA000007.2.fa)?
When I peruse the output in the logs directory, there are several errors (and similar errors in gtf_juncs.log) with lines beginning with:
Warning: invalid start coordinate at line: BA000007.2 ena gene -194 2502 . + . gene_id "BAA1757"; gene_version "1"; gene_name "tagA"; gene_source "ena"; gene_biotype "protein_coding";
There are indeed negative numbers in the GTF file, but not in the GenBank file (quick search in vim). Could this be the source of the error? I tried commenting out the specific lines and also deleting them from the file; both approaches still result in the error.
Is there anything readily seen that could be causing the "Couldn't build bowtie index with err = 1" error? I have been stuck on this for a couple of days, so any help is greatly appreciated.
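As a rough way to locate the records behind the "invalid start coordinate" warnings, the sketch below lists GTF lines whose start (column 4) is below 1, since GTF coordinates are 1-based, and writes the remaining records to a separate file. The sample file is hypothetical and only reuses the coordinates from the warning above. As noted, removing such lines did not make the index error go away, so this only addresses the warnings.

```shell
# Hypothetical two-record GTF; the first record has a negative start.
tmpdir=$(mktemp -d)
printf 'BA000007.2\tena\tgene\t-194\t2502\t.\t+\t.\tgene_id "a";\nBA000007.2\tena\tgene\t190\t255\t.\t+\t.\tgene_id "b";\n' > "$tmpdir/in.gtf"

# Report non-comment records whose start coordinate (column 4) is below 1.
awk 'BEGIN { FS = "\t" }
     !/^#/ && $4 < 1 { print "line " NR ": start=" $4 }' "$tmpdir/in.gtf"

# Keep comment lines and records with a valid start in a cleaned copy.
awk 'BEGIN { FS = "\t" }
     /^#/ || $4 >= 1' "$tmpdir/in.gtf" > "$tmpdir/clean.gtf"
```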
Best answer
I found the source of the problem. It was the header in the reference FASTA file. The initial header was:
>gi|4711801|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome
whereas it should have been:
>BA000007
So, if the FASTA file is called abc12.fa, then the header in the FASTA file must be >abc12. The first column in the GTF file must also be abc12.
Please note that I changed the basename from BA000007.2 to BA000007 in all of my calls, and I renamed all files to drop the .2 from the name. It may still work with the .2, but I did not test it ("The basename is the name of any of the index files up to but not including the first period." [tophat manual]) (Thank you AM). Lastly, I renamed the FASTA file from *.fasta to *.fa.
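The fix above can be sketched in shell as follows: rewrite the FASTA header to the bare name, then check that the set of names in the FASTA headers equals the set of names in column 1 of the GTF. The file contents here are hypothetical placeholders (a made-up header and an eight-base sequence), and the check is only a sanity test, not something TopHat itself provides.

```shell
# Hypothetical stand-ins for the reference FASTA and the GTF.
tmpdir=$(mktemp -d)
printf '>gi|0|dbj|BA000007.2| placeholder description\nACGTACGT\n' > "$tmpdir/BA000007.fasta"
printf 'BA000007\tena\tgene\t190\t255\t.\t+\t.\tgene_id "b";\n' > "$tmpdir/BA000007.gtf"

# Replace the first (header) line with the bare name; sequence lines are untouched.
sed '1s/^>.*/>BA000007/' "$tmpdir/BA000007.fasta" > "$tmpdir/BA000007.fa"

# Sequence names: text after '>' up to the first whitespace.
fasta_names=$(grep '^>' "$tmpdir/BA000007.fa" | sed 's/^>//; s/[[:space:]].*//' | sort -u)
# Names used in column 1 of the GTF, ignoring comment lines.
gtf_names=$(grep -v '^#' "$tmpdir/BA000007.gtf" | cut -f1 | sort -u)

if [ "$fasta_names" = "$gtf_names" ]; then
    echo "names match: $fasta_names"
else
    echo "mismatch between FASTA headers and GTF column 1"
fi
```

The sed expression only touches line 1, which assumes a single-sequence FASTA like the one in this question; a multi-sequence file would need each header handled separately.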