[python] Intersecting two BED files

Preface

In fact, a ready-made tool for intersecting BED files already exists: bedtools intersect. Its parameters were covered in an earlier post: intersect command parameters of bedtools.

The Python script here is purely an exercise in reinventing the wheel; it runs very slowly and needs improvement.

In addition, I ran a small ChitGPT test, which turned out to be quite interesting.

Implementation code

The idea is: expand the intervals of both BED files into sets of positions (those of bed1 are stored in a dictionary keyed by chromosome), then take the intersection and the set differences of the two.
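
As a minimal sketch of this idea on a single pair of intervals (the numbers are chromosome 2 from the test files used later in this post; the variable names are illustrative and not taken from the script):

```python
# Expand each interval into the set of positions it covers, end included,
# matching the script's range(start, end + 1) convention.
bed1_positions = set(range(10, 50 + 1))   # chrom 2 in tst1.bed: 10..50
bed2_positions = set(range(20, 90 + 1))   # chrom 2 in tst2.bed: 20..90

shared = bed1_positions & bed2_positions      # intersection
only_in_1 = bed1_positions - bed2_positions   # positions unique to bed1
only_in_2 = bed2_positions - bed1_positions   # positions unique to bed2

print(min(shared), max(shared))        # 20 50
print(min(only_in_1), max(only_in_1))  # 10 19
print(min(only_in_2), max(only_in_2))  # 51 90
```

Every covered base becomes one set element, which is probably the main reason the script is slow on large intervals.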

Script name: inter_bed.py

# inter_bed.py
from collections import defaultdict
import sys

bed1 = sys.argv[1]
bed2 = sys.argv[2]
output_dir = sys.argv[3]

def bed_dict(bed):
    dic = defaultdict(list)
    with open(bed, 'r') as f:
        for line in f:
            chrom, start, end = line.strip().split('\t')[:3]
            dic[chrom].extend(list(range(int(start), int(end)+1)))  # [modified 2023.5.13] +1 so that end is included
    return dic


def inters(b1, b2, outdir):
    bedd1 = bed_dict(b1)
    with open(b2, 'r') as f, open(outdir+'/inter.bed', 'w') as pf1, open(outdir+'/p1only.bed', 'w') as pf2, open(outdir+'/p2only.bed', 'w') as pf3:
        for line in f:
            chrom, start, end = line.strip().split('\t')[:3]
            poss1 = set(bedd1[chrom])
            poss2 = set(range(int(start), int(end)+1))  # [modified 2023.5.13] +1 so that end is included
            inter = poss1.intersection(poss2)
            p1only = poss1.difference(poss2)
            p2only = poss2.difference(poss1)
            if inter:
                inter_lst_split = split_num_l(inter)
                inter_lst_mg = mg_str_lst(inter_lst_split)
                for i in inter_lst_mg:
                    pf1.write('{m}\t{s}\n'.format(m=chrom, s=i))
            if p1only:
                p1o_lst_split = split_num_l(p1only)
                p1o_lst_mg = mg_str_lst(p1o_lst_split)
                for i in p1o_lst_mg:
                    pf2.write('{m}\t{s}\n'.format(m=chrom, s=i))
            if p2only:
                p2o_lst_split = split_num_l(p2only)
                p2o_lst_mg = mg_str_lst(p2o_lst_split)
                for i in p2o_lst_mg:
                    pf3.write('{m}\t{s}\n'.format(m=chrom, s=i))
            
    
def split_num_l(num_lst):
    """Sort the numbers in ascending order and split them into runs of consecutive values.
    eg: [1, 3,4,5,6, 9,10] -> [[1], [3, 4, 5, 6], [9, 10]]
    """
    num_lst_tmp = [int(n) for n in num_lst]
    sort_lst = sorted(num_lst_tmp)  # ascending
    len_lst = len(sort_lst)
    i = 0
    split_lst = []
    
    tmp_lst = [sort_lst[i]]
    while True:
        if i + 1 == len_lst:
            break
        next_n = sort_lst[i+1]
        if sort_lst[i] + 1 == next_n:
            tmp_lst.append(next_n)
        else:
            split_lst.append(tmp_lst)
            tmp_lst = [next_n]
        i += 1
    split_lst.append(tmp_lst)
    return split_lst


def mg_str_lst(mylst):
    """Format each run as 'first<TAB>last' (a single value is written twice).
    eg: [[1], [3, 4, 5, 6], [9, 10]] -> ['1<TAB>1', '3<TAB>6', '9<TAB>10']
    """
    mg_l = []
    for num_l in mylst:
        # For a one-element run, num_l[0] and num_l[-1] are the same value.
        mg_l.append(str(num_l[0]) + '\t' + str(num_l[-1]))
    return mg_l


if __name__ == '__main__':
    inters(bed1, bed2, output_dir)
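
The two helpers split_num_l and mg_str_lst turn a set of positions back into intervals. A more compact sketch of the same run-splitting step uses itertools.groupby from the standard library; the function name positions_to_intervals is hypothetical and not part of inter_bed.py:

```python
from itertools import groupby

def positions_to_intervals(positions):
    """Collapse positions into (first, last) pairs, one per consecutive run."""
    ordered = sorted(set(positions))
    intervals = []
    # Inside a run of consecutive integers, value - index stays constant,
    # so grouping on that difference separates the runs.
    for _, group in groupby(enumerate(ordered), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in group]
        intervals.append((run[0], run[-1]))
    return intervals

print(positions_to_intervals({1, 3, 4, 5, 6, 9, 10}))
# [(1, 1), (3, 6), (9, 10)]
```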

For example, take two BED files:

$ cat tst1.bed 
1	100	2000
2	10	50

$ cat tst2.bed 
1	120	2030
2	20	90

Usage

python inter_bed.py tst1.bed tst2.bed outdir

After execution, the outdir directory (which must already exist) contains three files: inter.bed (the intersection), p1only.bed (positions unique to bed1), and p2only.bed (positions unique to bed2).

$ head inter.bed  p1only.bed  p2only.bed
==> inter.bed <==
1	120	2000
2	20	50

==> p1only.bed <==
1	100	119
2	10	19

==> p2only.bed <==
1	2001	2030
2	51	90

Of these, inter.bed matches the output of the bedtools intersect command:

bedtools intersect -a tst1.bed -b tst2.bed
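
Since the position-by-position approach is slow, one possible improvement is to compute overlaps directly from the interval endpoints. The following is only a sketch under stated assumptions: it treats intervals as 0-based and half-open, as in the standard BED convention, it reports the intersection only, and it compares every pair of records, so it is not a replacement for bedtools. The function name intersect_intervals is hypothetical.

```python
def intersect_intervals(records_a, records_b):
    """Return pairwise overlaps, treating intervals as half-open [start, end)."""
    overlaps = []
    for chrom_a, start_a, end_a in records_a:
        for chrom_b, start_b, end_b in records_b:
            if chrom_a != chrom_b:
                continue
            # The overlap runs from the later start to the earlier end.
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # an empty overlap means the intervals do not intersect
                overlaps.append((chrom_a, start, end))
    return overlaps

# The contents of tst1.bed and tst2.bed, written out as tuples.
tst1 = [('1', 100, 2000), ('2', 10, 50)]
tst2 = [('1', 120, 2030), ('2', 20, 90)]

for chrom, start, end in intersect_intervals(tst1, tst2):
    print(chrom, start, end, sep='\t')
# 1	120	2000
# 2	20	50
```

No positions are expanded, so the cost depends on the number of records rather than on the interval lengths.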

ChitGPT test

I asked two questions. First I entered just "bedtools"; by default a bare term seems to be treated as "what is xxx?", and the answer was fairly professional. I then asked how to get an intersection with bedtools: it not only gave examples but also explained the parameters, which is much more direct than going through search-engine results.
[Screenshot: ChitGPT answer]
For the same question, it can also give some additional description:
[Screenshot: ChitGPT answer]


Modification log:

[2023.5.13] Thanks to the readers who pointed out a problem in inter_bed.py: range(start, end) does not include the final value end, so it was changed to range(start, end+1), after which the result matches bedtools intersect. Also added a small ChitGPT test.
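
The off-by-one in this entry comes from range() excluding its stop value. A quick check, using the first interval of tst1.bed:

```python
start, end = 100, 2000

without_end = list(range(start, end))    # stops at end - 1
with_end = list(range(start, end + 1))   # includes end

print(without_end[-1])  # 1999
print(with_end[-1])     # 2000
```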


Origin blog.csdn.net/sinat_32872729/article/details/127400882