My input data looks like this:
Forward primer
CGAAGCCTGGGGTGCCCGCGATTT Plus 24 1 24 71.81 66.67 4.00 2.00
Reverse primer
AAATCGGTCCCATCACCTTCTTAT Minus 24 420 397 59.83 41.67 5.00 2.00
Product length
420
Products on potentially unintended templates
>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24
Template 1930054 ........................ 1930077
Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24
Template 1930548 ........................ 1930525
product length = 2946
Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24
Template 1927603 .......C....C..T..T...G. 1927626
Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24
Template 1930548 ........................ 1930525
>CP046728.2 Mycobacterium tuberculosis strain TCDC11 chromosome, complete genome product length = 420
Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24
Template 2150761 ........................ 2150784
Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24
Template 2151180 ........................ 2151157
product length = 2595
Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24
Template 2148586 .......C....C..T..T...G. 2148609
Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24
Template 2151180 ........................ 2151157
>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24
Template 2166300 ........................ 2166323
Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24
Template 2166644 ........................ 2166621
What I need is
>CP049108.1 = 495 1930054 1930548
>CP046728.2 = 420 2150761 2151180
>CP047258.1 = 345 2166300 2166644
I am a microbiologist and a Python beginner. I tried:
import re
file = open(r"C:\Users\Lab\Desktop\amplicons\ETRA", "r")
handle = file.read()
file.close()
# Matches only the accession number, e.g. >CP049108.1
pattern1 = re.compile(r'>.{5,10}\.\d')
matches1 = pattern1.finditer(handle)
for match1 in matches1:
    print(match1.group(0))
This prints the accession numbers, but I also need specific values that come after each accession number (for example, >CP049108.1 is an accession number). I will adapt what I learn here to my other work too.
I appreciate your help, and thank you in advance.
Here's what I came up with: >([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)
Let's look at an example with only one record. You can feed in the whole data set you pasted and still get every result: Python has no separate global flag to set, because re.finditer (and re.findall) already return all non-overlapping matches.
>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24
Template 1930054 ........................ 1930077
Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24
Template 1930548 ........................ 1930525
The first group will be - CP049108.1
The second group will be - 495
The third group will be - 1930054
The fourth (and final) group will be - 1930548
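To check those four groups yourself, here is a minimal sketch that runs the pattern on the single record above. The variable name record is chosen just for this example, and the string assumes an empty line after the first Template line, because the pattern contains \n{2} at that position.

```python
import re

# One record pasted into a string. Assumption: there is an empty line after
# the first Template line, which the \n{2} in the pattern requires.
record = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495\n"
    "Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24\n"
    "Template 1930054 ........................ 1930077\n"
    "\n"
    "Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24\n"
    "Template 1930548 ........................ 1930525\n"
)

pattern = re.compile(r'>([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)')

# search() returns the first match, or None if the text does not fit the pattern
match = pattern.search(record)
groups = match.groups() if match else ()

for number, value in enumerate(groups, start=1):
    print('group {}: {}'.format(number, value))
```

If match comes back as None on your own file, the line layout (for example the number of empty lines) most likely differs from what the pattern expects.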
Of course, you can now restructure the data however you want. If you're reading the data from a text file, you can use this code snippet:
import re

with open('test.txt', 'r') as file:
    content = file.read()

# Groups: 1 = accession, 2 = product length, 3 = forward position, 4 = reverse position
pattern = re.compile(r'>([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)')
for match in pattern.finditer(content):
    output = '>{} = {} {} {}'.format(match.group(1), match.group(2), match.group(3), match.group(4))
    print(output)
If I put exactly the data set you provided into test.txt, I get this output:
>CP049108.1 = 495 1930054 1930548
>CP046728.2 = 420 2150761 2151180
>CP047258.1 = 345 2166300 2166644
Regex Explanation
>(\w+\.\d+) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)
(The first group is written here as \w+\.\d+, a simpler form that matches the same accession numbers in this data as [\w\d]*?\.\d*? does.)
Let's analyze the first line first:
>(\w+\.\d+) .+= (\d+)\n
After the literal >, \w+ matches CP049108 and stops at the . (dot). Then \.\d+ matches the dot and the digits after it, in this case 1. Together these give CP049108.1 in a single capture group. Next, .+= skips the rest of the line up to the = sign, and (\d+) grabs the digits right after it, in this case 495. The final \n moves on to the next line.
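As a small illustration of this first piece on its own, the sketch below applies only the header part of the pattern to the header line from the question. The names header and header_pattern are made up for this example.

```python
import re

# The header line of the first record, ending with a newline
header = ('>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, '
          'complete genome product length = 495\n')

# Only the first-line part of the full pattern
header_pattern = re.compile(r'>(\w+\.\d+) .+= (\d+)\n')

header_match = header_pattern.match(header)
# group 1 is the accession number, group 2 the product length
accession, product_length = header_match.groups()
print(accession, product_length)
```

Both values come back as strings; wrap product_length in int() if you need to do arithmetic with it.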
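Time for the second line:
.+\n
The second line (the forward primer line) is simply skipped.
Now, the third line:
.*?(\d+).+\n{2}
It skips everything up to the first run of digits, captures those digits, then consumes the rest of the line and two newline characters. In this case the result is 1930054. Note that \n{2} requires two consecutive newlines, so the pattern expects an empty line between this Template line and the Reverse primer line.
Now, the fourth line:
.+\n
This line (the reverse primer line) is also skipped.
Finally, the last line:
.*?(\d+)
This works exactly the same as the third line. The result is 1930548.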
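If your file has no empty line after the first Template line, or a varying number of them, the \n{2} part will not match. As a hedged alternative (not the pattern from this answer), here is a sketch that accepts one or more newlines between lines and uses named groups with re.VERBOSE so each part is labeled. The function name extract_products, the group names, and the shortened sample text are all made up for this example.

```python
import re

# Verbose pattern: whitespace and # comments inside it are ignored,
# so literal spaces are written as "\ ".
PRODUCT = re.compile(r"""
    >(?P<accession>\w+\.\d+)[^\n]*=\s*(?P<length>\d+)\n+  # header line
    Forward\ primer[^\n]*\n+                               # skipped
    Template\s+(?P<start>\d+)[^\n]*\n+                     # forward position
    Reverse\ primer[^\n]*\n+                               # skipped
    Template\s+(?P<end>\d+)                                # reverse position
""", re.VERBOSE)


def extract_products(text):
    """Return one '>accession = length start end' line per matched record."""
    return [
        '>{} = {} {} {}'.format(
            m.group('accession'), m.group('length'),
            m.group('start'), m.group('end'))
        for m in PRODUCT.finditer(text)
    ]


# Shortened sample: the first record has an empty line, the second has none.
# The extra product without a leading > is not reported.
sample = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495\n"
    "Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24\n"
    "Template 1930054 ........................ 1930077\n"
    "\n"
    "Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24\n"
    "Template 1930548 ........................ 1930525\n"
    "product length = 2946\n"
    "Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24\n"
    "Template 1927603 .......C....C..T..T...G. 1927626\n"
    "Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24\n"
    "Template 1930548 ........................ 1930525\n"
    ">CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome "
    "product length = 345\n"
    "Forward primer 1 CGAAGCCTGGGGTGCCCGCGATTT 24\n"
    "Template 2166300 ........................ 2166323\n"
    "Reverse primer 1 AAATCGGTCCCATCACCTTCTTAT 24\n"
    "Template 2166644 ........................ 2166621\n"
)

results = extract_products(sample)
for line in results:
    print(line)
```

Because the pattern anchors on the literal words Forward primer, Template and Reverse primer, it is stricter about line content than the original but looser about empty lines; adjust those words if your report uses different labels.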