Downloading and processing sequences from NCBI for multiple sequence alignment: a record of a less successful attempt

First, the problem

Find the conserved sequences in the hrdB gene promoter (hrdBp) across the genus Streptomyces, in the hope of deducing the -10 region and the -35 region.

Second, the process

1. Download 15-20 hrdB gene promoter sequences and process them into a FASTA file

1.1 Using the hrdB gene of Streptomyces coelicolor A3(2) as the query, run BLAST and take the 50 highest-scoring hits. Download the results in Hit Table (txt) format; the file header tells you what each column contains.

Next, open the file in Excel. First delete the rows whose alignment length is less than 1500 bp, then locate the three columns subject acc.ver, s. start and s. end. These three values are used to generate the URLs.
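The Excel filtering step could also be done in Python. The following is only a minimal sketch, not what I actually did: it assumes the Hit Table text file is tab-separated and that its comment header contains a "# Fields:" line naming the columns. The function name parse_hit_table and the sample rows are made up for illustration.

```python
def parse_hit_table(text, min_length=1500):
    """Return the rows (as dicts) whose alignment length is at least min_length."""
    fields = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('#'):
            # Assumption: the "# Fields:" comment lists the column names
            if line.startswith('# Fields:'):
                fields = [f.strip() for f in line[len('# Fields:'):].split(',')]
            continue
        if fields is None:
            raise ValueError('no "# Fields:" header seen before the data rows')
        row = dict(zip(fields, line.split('\t')))
        if int(row['alignment length']) >= min_length:
            rows.append(row)
    return rows


# Made-up rows in the Hit Table layout (values are illustrative only)
sample = (
    "# blastn\n"
    "# Fields: query acc.ver, subject acc.ver, % identity, alignment length, "
    "mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score\n"
    "Q1\tLT629768.1\t95.0\t1688\t80\t2\t1\t1688\t6444177\t6445864\t0.0\t2500\n"
    "Q1\tXX000001.1\t90.0\t900\t85\t5\t1\t900\t100\t999\t0.0\t1100\n"
)
kept = parse_hit_table(sample)
for row in kept:
    print(row['subject acc.ver'], row['s. start'], row['s. end'])
```

Reading the column names from the header avoids hard-coding column positions, which shift if columns are added or removed in a spreadsheet.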

Example URL: https://www.ncbi.nlm.nih.gov/nuccore/LT629768.1?report=fasta&from=6444177&to=6445864. The code that generates the URLs is as follows:

# Read the data
fo = open('D:\\temporary\\hrdb_related\\ZECB16FT01N-Alignment.csv', 'r')
ls = []
for line in fo:
    line = line.replace('\n', '')
    ls.append(line.split(','))
fo.close()
# Write out the URLs
fo1 = open('D:\\temporary\\hrdb_related\\output.csv', 'w')
for line in ls:
    # A 0 in the last column marks a forward sequence; s.start-400 and
    # s.start+5 could also be used as the start and end positions
    if line[-1] == '0':
        fo1.write('https://www.ncbi.nlm.nih.gov/nuccore/' + line[1] +
                  '?report=fasta&from=' + line[9] + '&to=' + line[11] + '\n')
fo1.close()
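The comment in the code mentions using s.start-400 and s.start+5 as the start and end positions. A rough sketch of that idea follows. It assumes that a hit with s.start greater than s.end lies on the reverse strand, and that the hit start is close to the start codon, which depends on the query used. The helper names promoter_window and nuccore_url are hypothetical.

```python
def promoter_window(s_start, s_end, upstream=400, downstream=5):
    """Return (seq_from, seq_to, is_reverse) for the region around a hit's start."""
    if s_start <= s_end:
        # Forward hit: upstream of the gene means smaller coordinates
        return max(1, s_start - upstream), s_start + downstream, False
    # Reverse hit (assumed when s_start > s_end): upstream lies at larger coordinates
    return max(1, s_start - downstream), s_start + upstream, True


def nuccore_url(accession, seq_from, seq_to):
    """Build a URL in the same form as the example URL above."""
    return ('https://www.ncbi.nlm.nih.gov/nuccore/' + accession +
            '?report=fasta&from=' + str(seq_from) + '&to=' + str(seq_to))


seq_from, seq_to, is_reverse = promoter_window(6444177, 6445864)
print(nuccore_url('LT629768.1', seq_from, seq_to))
```

A sequence downloaded for a reverse hit would still need to be reverse-complemented before alignment; that step is not shown here.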

1.2 I had wanted to fetch the sequences behind the twenty URLs directly with a web crawler, but the sequence in the page appeared to be encrypted. After a quick search online, this seems to be related to asynchronous loading. So I opened each URL by hand and downloaded the corresponding sequence, which gave me 20 FASTA files, each containing one sequence.
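An alternative to crawling the web page, which I did not use here, is the NCBI E-utilities efetch endpoint, which returns plain FASTA text for a given accession and coordinate range. The sketch below only builds the request URL; fetch_fasta needs network access and is not called. The function names are my own, and the coordinates are the ones from the example URL.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'


def efetch_url(accession, seq_start, seq_stop):
    """Build an efetch URL for one subsequence in FASTA format."""
    params = {
        'db': 'nuccore',
        'id': accession,
        'rettype': 'fasta',
        'retmode': 'text',
        'seq_start': seq_start,
        'seq_stop': seq_stop,
    }
    return EFETCH + '?' + urlencode(params)


def fetch_fasta(accession, seq_start, seq_stop):
    """Download one subsequence as FASTA text (requires network access)."""
    with urlopen(efetch_url(accession, seq_start, seq_stop)) as response:
        return response.read().decode('utf-8')


url = efetch_url('LT629768.1', 6444177, 6445864)
print(url)
# fetch_fasta('LT629768.1', 6444177, 6445864) would return the FASTA text
```

When looping over twenty sequences it is worth pausing between requests, since NCBI limits the request rate for clients without an API key.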

1.3 Next I wanted to combine the 20 FASTA files into one file for the multiple sequence alignment. I had heard that the cat command on Linux can do this, so I tried to send the files to my Linux virtual machine with WinSCP3, but the connection never succeeded: the virtual machine could reach the network in NAT mode but not after switching to bridged mode, so WinSCP3 could not connect to it. Later I found that the copy command in the Windows 10 command prompt does the same job. The only nuisance is that all the file names to be merged have to be joined with plus signs, so I generated that string with code.

import os

# Print the file names in each directory joined by '+',
# ready to paste into the copy command
for (dirname, subdir, subfile) in os.walk(r'D:\temporary\hrdb_related'):
    print('+'.join(subfile))
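The merge itself can also be done in Python, without cat or copy. This is a minimal sketch under the assumption that the downloaded files end in .fasta; the function name merge_fasta is made up, and the demo runs in a temporary directory with two toy records rather than on the real files.

```python
import glob
import os
import tempfile


def merge_fasta(src_dir, out_path, pattern='*.fasta'):
    """Concatenate all files matching pattern in src_dir into out_path."""
    paths = sorted(glob.glob(os.path.join(src_dir, pattern)))
    out_abs = os.path.abspath(out_path)
    count = 0
    with open(out_path, 'w') as out:
        for path in paths:
            if os.path.abspath(path) == out_abs:
                continue  # never read the output file into itself
            with open(path) as handle:
                text = handle.read()
            if not text:
                continue  # skip empty files
            # Make sure each record ends with a newline before the next header
            out.write(text if text.endswith('\n') else text + '\n')
            count += 1
    return count


# Demo on two toy files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.fasta'), 'w') as f:
        f.write('>seq_a\nACGT')          # no trailing newline on purpose
    with open(os.path.join(tmp, 'b.fasta'), 'w') as f:
        f.write('>seq_b\nGGCC\n')
    merged_path = os.path.join(tmp, 'merged.fa')
    merged_count = merge_fasta(tmp, merged_path)
    with open(merged_path) as f:
        merged_text = f.read()
print(merged_count)
print(merged_text)
```

For the real data the call would look like merge_fasta(r'D:\temporary\hrdb_related', r'D:\temporary\hrdb_related\merged.fa'), assuming that naming.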

The resulting FASTA file can now be used for multiple sequence alignment.

2. MEME analysis and MEGA analysis

Aligning the sequences in MEGA showed that the 200 bp upstream of the start codon are highly conserved. MEME reported three conserved regions but, compared with the Clustal result, it missed the region 100-150 bp upstream, even though part of that region is actually highly conserved. It seems MEME has some limitations.

That is the end of this less successful attempt. The only conclusion is that the 200 bp upstream of the start codon are highly conserved, which was reported long ago, and I did not find sequences corresponding to the -10 region and the -35 region.

3. Besides the analysis above, I also tried online software that predicts promoters directly. Possibly because the GC content of Streptomyces is too high, the promoter prediction tools either gave no prediction for this particular bacterium or were extremely inaccurate, and I could not find a suitable online tool.
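To put a number on "highly conserved", one simple measure is the fraction of sequences that share the most common base at each alignment column. The sketch below is my own illustration, not something MEGA or MEME outputs. It assumes the sequences are already aligned to equal length with '-' for gaps, and the toy sequences are invented, not Streptomyces data.

```python
from collections import Counter


def column_conservation(aligned):
    """Per column, the fraction of sequences that carry the most common base."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError('aligned sequences must all have the same length')
    scores = []
    for column in zip(*aligned):
        bases = [b.upper() for b in column if b != '-']
        if not bases:
            scores.append(0.0)  # column made of gaps only
            continue
        top_count = Counter(bases).most_common(1)[0][1]
        # Gaps count against conservation: divide by the number of sequences
        scores.append(top_count / len(column))
    return scores


# Invented toy alignment, for illustration only
toy = ['TTGACA', 'TTGACT', 'TTGCCA', 'T-GACA']
scores = column_conservation(toy)
print(scores)
```

Runs of columns with scores close to 1.0 correspond to the conserved stretches seen by eye in the alignment viewer.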

Origin www.cnblogs.com/s-qw/p/12089150.html