Preface
A ready-made tool for intersecting BED files already exists: bedtools intersect. Its parameters were covered in an earlier post: intersect command parameters of bedtools.
The Python script written here is purely a reinvention of the wheel; it runs very slowly and needs improvement.
I also ran a small test of ChitGPT, which turned out to be quite interesting.
Implementation code
The idea is: store the positions covered by the intervals of the first BED file in a dictionary keyed by chromosome, then, for each interval of the second file, compute the intersection and the two set differences of the position sets.
Script name: inter_bed.py
# inter_bed.py
from collections import defaultdict
import sys

bed1 = sys.argv[1]
bed2 = sys.argv[2]
output_dir = sys.argv[3]


def bed_dict(bed):
    """Map each chromosome to the list of positions covered by its intervals."""
    dic = defaultdict(list)
    with open(bed, 'r') as f:
        for line in f:
            chrom, start, end = line.strip().split('\t')[:3]
            # modified 2023.5.13: end + 1 so that the end position itself is included
            dic[chrom].extend(list(range(int(start), int(end)+1)))
    return dic


def inters(b1, b2, outdir):
    """Write inter.bed, p1only.bed and p2only.bed into outdir (which must already exist)."""
    bedd1 = bed_dict(b1)
    with open(b2, 'r') as f, open(outdir+'/inter.bed', 'w') as pf1, open(outdir+'/p1only.bed', 'w') as pf2, open(outdir+'/p2only.bed', 'w') as pf3:
        for line in f:
            chrom, start, end = line.strip().split('\t')[:3]
            # all bed1 positions on this chromosome vs. the current bed2 interval
            poss1 = set(bedd1[chrom])
            poss2 = set(range(int(start), int(end)+1))  # modified 2023.5.13
            inter = poss1.intersection(poss2)
            p1only = poss1.difference(poss2)
            p2only = poss2.difference(poss1)
            if inter:
                inter_lst_split = split_num_l(inter)
                inter_lst_mg = mg_str_lst(inter_lst_split)
                for i in inter_lst_mg:
                    pf1.write('{m}\t{s}\n'.format(m=chrom, s=i))
            if p1only:
                p1o_lst_split = split_num_l(p1only)
                p1o_lst_mg = mg_str_lst(p1o_lst_split)
                for i in p1o_lst_mg:
                    pf2.write('{m}\t{s}\n'.format(m=chrom, s=i))
            if p2only:
                p2o_lst_split = split_num_l(p2only)
                p2o_lst_mg = mg_str_lst(p2o_lst_split)
                for i in p2o_lst_mg:
                    pf3.write('{m}\t{s}\n'.format(m=chrom, s=i))


def split_num_l(num_lst):
    """Sort the numbers in ascending order and group consecutive ones.
    eg: [1, 3,4,5,6, 9,10] -> [[1], [3, 4, 5, 6], [9, 10]]
    num_lst must not be empty.
    """
    num_lst_tmp = [int(n) for n in num_lst]
    sort_lst = sorted(num_lst_tmp)  # ascending
    len_lst = len(sort_lst)
    i = 0
    split_lst = []
    tmp_lst = [sort_lst[i]]
    while True:
        if i + 1 == len_lst:
            break
        next_n = sort_lst[i+1]
        if sort_lst[i] + 1 == next_n:
            tmp_lst.append(next_n)  # still in the same consecutive run
        else:
            split_lst.append(tmp_lst)  # gap found: close the run, start a new one
            tmp_lst = [next_n]
        i += 1
    split_lst.append(tmp_lst)
    return split_lst


def mg_str_lst(mylst):
    """Turn each run into a 'first<TAB>last' string.
    eg: [[1], [3, 4, 5, 6], [9, 10]] -> runs 1-1, 3-6, 9-10
    """
    mg_l = []
    for num_l in mylst:
        # a single-element run has the same first and last value
        mg_l.append(str(num_l[0]) + '\t' + str(num_l[-1]))
    return mg_l


if __name__ == '__main__':
    inters(bed1, bed2, output_dir)
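The core of the script is two steps: set operations on positions, then collapsing the resulting positions back into runs. As a rough sketch of just those two steps, the snippet below uses `itertools.groupby` in place of the explicit `while` loop. The name `collapse_runs` is made up for this illustration and is not part of inter_bed.py; the coordinates are the chromosome 1 intervals from the example files, with the end position included as in the script.

```python
from itertools import groupby


def collapse_runs(positions):
    """Collapse integers into (first, last) pairs, one per run of consecutive values."""
    runs = []
    # Within a sorted run of consecutive integers, value - index stays constant,
    # so grouping on that difference separates the runs.
    for _, grp in groupby(enumerate(sorted(positions)), key=lambda t: t[1] - t[0]):
        members = [value for _, value in grp]
        runs.append((members[0], members[-1]))
    return runs


a = set(range(100, 2001))  # tst1.bed, chromosome 1: 100..2000 inclusive
b = set(range(120, 2031))  # tst2.bed, chromosome 1: 120..2030 inclusive

print(collapse_runs(a & b))  # [(120, 2000)]
print(collapse_runs(a - b))  # [(100, 119)]
print(collapse_runs(b - a))  # [(2001, 2030)]
print(collapse_runs([1, 3, 4, 5, 6, 9, 10]))  # [(1, 1), (3, 6), (9, 10)]
```

Like the script, this still materializes every position, so it is no faster; it only shortens the run-collapsing step.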
For example, there are two bed files:
$ cat tst1.bed
1 100 2000
2 10 50
$ cat tst2.bed
1 120 2030
2 20 90
Usage
python inter_bed.py tst1.bed tst2.bed outdir
After it runs, the outdir directory contains three files: inter.bed, p1only.bed and p2only.bed. They hold the intersection of the two BED files, the positions unique to bed1, and the positions unique to bed2, respectively.
$ head inter.bed p1only.bed p2only.bed
==> inter.bed <==
1 120 2000
2 20 50
==> p1only.bed <==
1 100 119
2 10 19
==> p2only.bed <==
1 2001 2030
2 51 90
Of these, inter.bed matches the output of the bedtools intersect command:
bedtools intersect -a tst1.bed -b tst2.bed
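Since the script is slow mainly because it expands every interval into individual positions, one possible improvement is to compare interval endpoints directly. The following is only a sketch under stated assumptions, not the author's method and not how bedtools is implemented: the function names `read_bed` and `intersect_intervals` are hypothetical, coordinates are treated as half-open (end excluded, as in the BED convention), and the intervals within each file are assumed not to overlap one another.

```python
from collections import defaultdict


def read_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)} (first 3 columns only)."""
    intervals = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end = line.split()[:3]
        intervals[chrom].append((int(start), int(end)))
    for chrom in intervals:
        intervals[chrom].sort()
    return intervals


def intersect_intervals(a, b):
    """Two-pointer sweep over two sorted lists of half-open intervals.

    Assumption: intervals inside each list do not overlap one another.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # non-empty overlap
            out.append((start, end))
        # Advance the interval that ends first; it cannot overlap anything later.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


bed_a = read_bed(["1\t100\t2000", "2\t10\t50"])   # contents of tst1.bed
bed_b = read_bed(["1\t120\t2030", "2\t20\t90"])   # contents of tst2.bed
for chrom in bed_a:
    for start, end in intersect_intervals(bed_a[chrom], bed_b.get(chrom, [])):
        print(chrom, start, end, sep="\t")
```

For the two example files this prints 1 120 2000 and 2 20 50. The work grows with the number of intervals rather than the number of bases, which is the point of the design; the two difference files are not covered by this sketch.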
ChitGPT test
I asked two questions. First I simply entered "bedtools"; by default it seems to treat a bare term as the question "what is xxx?", and the answer was fairly professional. I then asked how to get an intersection with bedtools; it not only gave examples but also explained the parameters, which is indeed much more direct than search-engine results.
For the same question, it can also give some further description.
Modification log:
【2023.5.13】Thanks to a reader for pointing out a problem in the script inter_bed.py: range(start, end) does not include the last value, end, so it was changed to range(start, end+1), after which inter.bed is the same as the output of bedtools intersect. Also added a small ChitGPT test.
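The change recorded above comes down to how Python's `range` treats its upper bound. A minimal illustration using the chromosome 1 intersection from the example (the variable names are only for this sketch):

```python
start, end = 120, 2000

exclusive = list(range(start, end))      # stops before end
inclusive = list(range(start, end + 1))  # includes end

print(exclusive[-1])  # 1999
print(inclusive[-1])  # 2000
print(len(exclusive), len(inclusive))  # 1880 1881
```

Because the script writes the last position of each run as the end column, including end is what makes inter.bed report 2000 rather than 1999. The boundaries in p1only.bed and p2only.bed are likewise inclusive position runs, so they need not match what a tool using half-open BED coordinates would print.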