How to use Python regex to retrieve specific terms in a pattern

Chooseel Bunsuwansakul :

My data follows this pattern:

Forward primer
CGAAGCCTGGGGTGCCCGCGATTT Plus 24 1 24 71.81 66.67 4.00 2.00 

Reverse primer
AAATCGGTCCCATCACCTTCTTAT Minus 24 420 397 59.83 41.67 5.00 2.00 

Product length
420 


Products on potentially unintended templates


>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


product length = 2946
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1927603  .......C....C..T..T...G.  1927626

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


>CP046728.2 Mycobacterium tuberculosis strain TCDC11 chromosome, complete genome product length = 420
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2150761  ........................  2150784

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2151180  ........................  2151157


product length = 2595
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2148586  .......C....C..T..T...G.  2148609

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2151180  ........................  2151157


>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2166300  ........................  2166323

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2166644  ........................  2166621

What I need is

>CP049108.1 = 495   1930054 1930548 
>CP046728.2 = 420   2150761 2151180
>CP047258.1 = 345   2166300 2166644

I am a microbiologist and a Python beginner. I tried:

import re
file = open(r"C:\Users\Lab\Desktop\amplicons\ETRA", "r")
handle = file.read()
file.close()

pattern1 = re.compile(r'>.{5,10}\.\d')
matches1 = pattern1.finditer(handle)

for match1 in matches1:
    print(match1.group(0))

but I also need specific terms that come after the accession number (for example, the accession number is >CP049108.1). I will adapt what I learn here to my other work too.

I appreciate your help. Thank you in advance.

Chase :

Here's what I came up with: >([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)

Let's look at an example with only one record. You can feed in the whole data you pasted and still get every result, as long as you iterate over all matches, for example with re.finditer or re.findall. Python's re module has no global flag; those functions already return every non-overlapping match.

>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525

The first group will be - CP049108.1

The second group will be - 495

The third group will be - 1930054

The fourth (and final) group will be - 1930548
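
To check these four groups yourself, here is a minimal sketch that runs the pattern against the single record above. The variable names `record` and `groups` are only for illustration, and the record is typed in as a string instead of being read from a file.

```python
import re

# One record copied from the question, with its original line breaks.
record = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495\n"
    "Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24\n"
    "Template        1930054  ........................  1930077\n"
    "\n"
    "Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24\n"
    "Template        1930548  ........................  1930525\n"
)

pattern = re.compile(r'>([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)')

# search() returns the first match or None; groups() returns all capture groups as a tuple.
match = pattern.search(record)
groups = match.groups() if match else ()
print(groups)  # ('CP049108.1', '495', '1930054', '1930548')
```

Note that every group is a string; wrap a group in int() if you need to do arithmetic on the positions.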

Of course, you can now restructure the data however you want. If you're reading the data from a text file, you can use this code snippet:

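import re

# Read the whole report into one string so the pattern can span several lines.
with open('test.txt', 'r') as file:
    content = file.read()

# Four capture groups: accession, product length, forward Template start, reverse Template start.
pattern = re.compile(r'>([\w\d]*?\.\d*?) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)')

# finditer yields one match for each '>' record that fits the pattern.
for match in pattern.finditer(content):
    output = '>{} = {} {} {}'.format(match.group(1), match.group(2), match.group(3), match.group(4))
    print(output)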

If I feed exactly the data set you provided into test.txt, I get this output:

>CP049108.1 = 495 1930054 1930548
>CP046728.2 = 420 2150761 2151180
>CP047258.1 = 345 2166300 2166644

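If the multi-line pattern feels hard to adapt to other reports, one alternative (not the approach used in this answer) is to read the text line by line and apply two small patterns. The sketch below assumes the layout shown in the question; `HEADER`, `TEMPLATE`, and `parse_hits` are hypothetical names chosen for illustration.

```python
import re

# Assumed layout: a '>' header line ending in "product length = N",
# followed by two "Template <start> ... <end>" lines.
HEADER = re.compile(r'^>(\S+).*product length = (\d+)')
TEMPLATE = re.compile(r'^Template\s+(\d+)')


def parse_hits(text):
    """Return (accession, length, forward_start, reverse_start) for the
    first product listed under each '>' header."""
    results = []
    current = None  # fields collected for the record in progress
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = [header.group(1), int(header.group(2))]
            continue
        template = TEMPLATE.match(line)
        if template and current is not None:
            current.append(int(template.group(1)))
            if len(current) == 4:
                results.append(tuple(current))
                current = None  # ignore extra products until the next '>' header
    return results


sample = """\
>CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, complete genome product length = 495
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1930054  ........................  1930077

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


product length = 2946
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        1927603  .......C....C..T..T...G.  1927626

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        1930548  ........................  1930525


>CP047258.1 Mycobacterium tuberculosis strain TCDC3 chromosome product length = 345
Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24
Template        2166300  ........................  2166323

Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24
Template        2166644  ........................  2166621
"""

hits = parse_hits(sample)
for accession, length, forward, reverse in hits:
    print(f">{accession} = {length} {forward} {reverse}")
```

The second product under CP049108.1 (length 2946) is skipped on purpose, matching the output the question asks for. The trade-off is a few more lines of code in exchange for patterns that are easier to change one at a time.
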
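Regex Explanation

>(\w+\.\d+) .+= (\d+)\n.+\n.*?(\d+).+\n{2}.+\n.*?(\d+)

This explanation writes the first group as (\w+\.\d+), a slightly simpler spelling of ([\w\d]*?\.\d*?) that captures the same text on this data.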

  • Let's analyze the first line first: >(\w+\.\d+) .+= (\d+)\n

    After the literal >, \w+ matches CP049108, \. matches the literal dot, and \d+ matches the version digits that follow, in this case 1. Together they form the first capture group, CP049108.1.

    Next, " .+= " skips the rest of the description. Because .+ is greedy and does not cross a newline, it ends at the "= " that sits directly before the digits at the end of the line. (\d+) then captures those digits, in this case 495, and \n moves on to the next line.

  • Time for the second line: .+\n

    The second line (the Forward primer line) is matched but not captured, so it is simply skipped.

  • Now, the third line: .*?(\d+).+\n{2}

    The lazy .*? skips everything up to the first run of digits, (\d+) captures it, .+ consumes the rest of the line, and \n{2} consumes the line break plus the blank line that follows. In this case the result is 1930054.

  • Now, the fourth line: .+\n

    This line (the Reverse primer line) is also skipped.

  • Finally, the last line: .*?(\d+)

    This works exactly like the third line; the result is 1930548.
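
The same walkthrough can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a sketch under the assumption that the input has the layout shown in the question; because verbose mode ignores plain spaces, the two literal spaces are written as [ ].

```python
import re

pattern = re.compile(r"""
    >(\w+\.\d+)      # group 1: accession with version, e.g. CP049108.1
    [ ].+=[ ]        # skip the description up to the final "= " on the header line
    (\d+)\n          # group 2: product length, then the end of the header line
    .+\n             # skip the Forward primer line
    .*?(\d+).+       # group 3: first number on the Template line, then the rest of it
    \n{2}            # the line break plus the blank line that follows
    .+\n             # skip the Reverse primer line
    .*?(\d+)         # group 4: first number on the second Template line
""", re.VERBOSE)

# One record from the question, typed in with its original line breaks.
record = (
    ">CP049108.1 Mycobacterium tuberculosis strain 5005 chromosome, "
    "complete genome product length = 495\n"
    "Forward primer  1        CGAAGCCTGGGGTGCCCGCGATTT  24\n"
    "Template        1930054  ........................  1930077\n"
    "\n"
    "Reverse primer  1        AAATCGGTCCCATCACCTTCTTAT  24\n"
    "Template        1930548  ........................  1930525\n"
)

match = pattern.search(record)
verbose_groups = match.groups() if match else ()
print(verbose_groups)  # ('CP049108.1', '495', '1930054', '1930548')
```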
