A Complete Collection of Code for Converting Common NER Dataset Formats to JSON

Foreword

I have recently been working on general information extraction with large models. To build instruction fine-tuning datasets, all datasets first have to be converted into one format, which means turning NER datasets in many different formats into easy-to-handle JSON. This is tedious work, because the NER field has no unified format specification. I collected nearly 30 NER datasets and found that the common formats include BIO, BIEO, BIO in Excel files, separate data and label files, embedded JSON, and others. Each format may be used by only two or three datasets, so writing a converter for each one individually takes a lot of effort and slows the work down. Although many open-source datasets on GitHub have already been processed into JSON, they do not cover all NER datasets. It is better to teach people to fish than to give them a fish, so this article summarizes the common dataset formats in the NER field and provides code for converting each of them to JSON. In addition, the processed datasets can be downloaded here. If this helps you, please give it a like to encourage the blogger~


1. Overview of NER datasets

There are many types of datasets in the NER domain, and the common dataset formats are shown in the following table:
[Table image: common NER dataset formats]

Here I first analyze the pros and cons of the different dataset formats and explain why I chose the JSON format.

1.1 Embedded json

The first is embedded JSON, represented by the Boson-NER dataset, where entity information is marked inline in the text. The entity types are clear, but position information is not annotated, and for a given sample you cannot tell directly how many entities or entity types it contains; this information is hard to obtain without parsing the text.
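The post does not include a converter for this format, so here is a minimal sketch of one. It assumes entities are embedded as `{{type:mention}}`; that markup, the sample sentence, and the function name `embedded_to_standard` are assumptions for illustration, so check them against the actual dataset before use.

```python
import re

# Assumed markup: {{type:mention}} embedded in running text.
EMBEDDED = re.compile(r"\{\{([^:{}]+):(.+?)\}\}")

def embedded_to_standard(marked_text):
    """Strip the inline markup and recover each entity's character offsets."""
    parts = []      # pieces of the plain sentence
    entities = []
    cursor = 0      # position in the marked-up input
    length = 0      # length of the plain sentence built so far
    for match in EMBEDDED.finditer(marked_text):
        plain = marked_text[cursor:match.start()]
        parts.append(plain)
        length += len(plain)
        entity_type, mention = match.group(1), match.group(2)
        entities.append({"name": mention, "type": entity_type,
                         "pos": [length, length + len(mention)]})
        parts.append(mention)
        length += len(mention)
        cursor = match.end()
    parts.append(marked_text[cursor:])
    return {"sentence": "".join(parts), "entities": entities}

# Hypothetical sample, not taken from the dataset.
sample = "{{location:杭州}}的{{org_name:西湖大学}}今天开学"
result = embedded_to_standard(sample)
```

The offsets are computed against the plain sentence rather than the marked-up text, which is what makes the missing position information recoverable.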

1.2 BIO

BIO can be subdivided further. In one form, each line of a txt file holds a single token followed by its tag, so the content has to be read vertically. In the other form, the original text is kept and an entity type is appended after each token; this is easier to read but more complicated to process.
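For the vertical form, a small reader that groups lines into sentences can look like the sketch below. The function name `read_sentences` and its `sep` parameter are hypothetical; it assumes sentences are separated by blank lines and that the tag is the last column.

```python
def read_sentences(lines, sep=None):
    """Yield (tokens, tags) for each blank-line-separated sentence.

    sep=None splits on any whitespace; pass "\t" for tab-separated files.
    """
    tokens, tags = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # blank line: the current sentence is complete
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        parts = line.split(sep)
        tokens.append(parts[0])   # first column is the token
        tags.append(parts[-1])    # last column is the tag
    if tokens:
        yield tokens, tags

sample = ["青 B-ORG\n", "岛 I-ORG\n", "队 I-ORG\n", "\n", "和 O\n"]
sentences = list(read_sentences(sample))
```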

1.3 Layered JSON

Layered (nested) JSON is closest to the final unified format. Each sample is wrapped in a pair of curly braces containing the text and the labels, but the label, mention, and position information are nested layer by layer, which makes them inconvenient to extract.

1.4 BIEO

The BIEO format is similar to BIO, except that it adds an E (end) tag marking the last token of an entity; this extra tag has to be handled separately during processing.
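Because the tag schemes differ only in their prefixes, one decoder can cover them. The sketch below is not the code used later in this post; `decode_entities` is a hypothetical helper that accepts the B/I/M/E/S prefixes and assumes tags have the form `PREFIX-TYPE`.

```python
def decode_entities(tokens, tags):
    """Decode BIO / BIEO / BMEO(S) tags into standard entity dicts."""
    entities = []
    name, etype, start = "", "", 0
    offset = 0  # character offset of the current token in the sentence

    def flush():
        # close the entity currently being built, if any
        nonlocal name, etype
        if name:
            entities.append({"name": name, "type": etype,
                             "pos": [start, start + len(name)]})
        name, etype = "", ""

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("B", "S"):
            flush()
            name, etype, start = token, label, offset
            if prefix == "S":      # single-token entity closes immediately
                flush()
        elif prefix in ("I", "M", "E") and name:
            name += token
            if prefix == "E":      # explicit end tag closes the entity
                flush()
        else:                      # "O", or a continuation tag with no open entity
            flush()
        offset += len(token)
    flush()
    return entities

bmeo = decode_entities(list("高勇:男,中国国籍"),
                       ["B-NAME", "E-NAME", "O", "O", "O",
                        "B-CONT", "M-CONT", "M-CONT", "E-CONT"])
bio = decode_entities(list("和广州队的"), ["O", "B-ORG", "I-ORG", "I-ORG", "O"])
```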

1.5 Data Label Separation

The format that separates data and labels into two files is the least readable: you can see neither the entities nor their positions directly. Its advantage is that it is easy to process: you only need to align the corresponding positions in the two files to extract each entity and its position.

1.6 Standard JSON

The standard here is the JSON format that I convert everything to in the end. Take the following data sample as an example:
[Image: a sample in the standard JSON format]
Each sample includes a sentence and a collection of entities. The sentence is the content of the sample, and the entity collection contains the mention, type, and position of each entity. I think this format is the easiest to process, and it is the format produced by the code in this blog.
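Going by the generated outputs shown later in this post, a sample in this format can be written as the Python dict below, with `pos` as a half-open `[start, end)` character span. The checker `check_sample` is a hypothetical helper, not part of the original code; it only verifies that slicing the sentence with `pos` gives back the mention.

```python
sample = {
    "sentence": "中国将加快人才市场体系建设。",
    "entities": [
        {"name": "中国", "type": "GPE", "pos": [0, 2]},
    ],
}

def check_sample(sample):
    """Return the entities whose pos does not slice back to their name."""
    sentence = sample["sentence"]
    return [e for e in sample["entities"]
            if sentence[e["pos"][0]:e["pos"][1]] != e["name"]]

problems = check_sample(sample)
```

Running a check like this over a converted file is a cheap way to catch off-by-one errors in the position arithmetic.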

2. BIO_to_JSON

Original data format:

相 O
比 O
之 O
下 O
, O
青 B-ORG
岛 I-ORG
海 I-ORG
牛 I-ORG
队 I-ORG
和 O
广 B-ORG
州 I-ORG
松 I-ORG
日 I-ORG
队 I-ORG
的 O
雨 O
中 O
之 O
战 O
虽 O
然 O
也 O
是 O
0 O
∶ O
0 O
, O
但 O
乏 O
善 O
可 O
陈 O
。 O

Code:

import json
import os

def bio_to_json(input_files, output_files, label_output_file):
    label_set = set()  # entity types seen across all input files

    for input_file, output_file in zip(input_files, output_files):
        data = []
        with open(input_file, 'r', encoding='utf-8', errors='ignore') as f:
            lines = f.readlines()
            sentence = ""
            entities = []
            entity_name = ""
            entity_type = ""
            start_position = 0
            for line in lines:
                if not line.strip():
                    # a blank line marks the end of a sentence;
                    # if there's an entity already being processed, append it to entities
                    if entity_name:
                        entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    # append the processed sentence to data (repeated blank lines are skipped)
                    if sentence:
                        data.append({'sentence': sentence, 'entities': entities})
                    sentence = ""
                    entities = []
                else:
                    # token and tag are tab-separated; change '\t' if your file uses another separator
                    word, tag = line.rstrip('\r\n').split('\t')
                    if tag.startswith('B'):
                        # a new entity starts; close the previous one if it is still open
                        if entity_name:
                            entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = word
                        entity_type = tag.split('-', 1)[1]
                        label_set.add(entity_type)  # add this entity type to the set
                        start_position = len(sentence)
                    elif tag.startswith('I'):
                        entity_name += word
                    else:
                        # an O tag closes any entity that is still open
                        if entity_name:
                            entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    sentence += word
            # for the last entity of the last sentence
            if entity_name:
                entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
            if sentence:
                data.append({'sentence': sentence, 'entities': entities})

        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    with open(label_output_file, 'w', encoding='utf-8') as f:
        json.dump(sorted(label_set), f, ensure_ascii=False, indent=4)


currPath = os.path.join("datasets", "Weibo")
input_files = [os.path.join(currPath, "train.txt"), os.path.join(currPath, "test.txt"), os.path.join(currPath, "dev.txt")]
output_files = [os.path.join(currPath, "train.json"), os.path.join(currPath, "test.json"), os.path.join(currPath, "dev.json")]
label_output_file = os.path.join(currPath, "label.json")
bio_to_json(input_files, output_files, label_output_file)

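Generated JSON format (here and in the sections below, type is shown as the Chinese label name; the code itself writes the raw tag, e.g. ORG):

{
    "sentence": "相比之下,青岛海牛队和广州松日队的雨中之战虽然也是0∶0,但乏善可陈。",
    "entities": [
        {
            "name": "青岛海牛队",
            "type": "机构",
            "pos": [
                5,
                10
            ]
        },
        {
            "name": "广州松日队",
            "type": "机构",
            "pos": [
                11,
                16
            ]
        }
    ]
},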
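The output above shows the type as 机构, while the tag in the raw data is ORG. One way to bridge the two is a lookup table applied after conversion. The sketch below is an assumption about how that step could look, not the author's code; `LABEL_NAMES` and `rename_types` are hypothetical names, and the table contains only pairs that appear in this post's samples.

```python
# Hypothetical mapping; extend it to match the label set written to label.json.
LABEL_NAMES = {"ORG": "机构", "GPE": "国家", "NAME": "姓名"}

def rename_types(samples, label_names):
    """Return a copy of samples with entity types replaced via label_names."""
    renamed = []
    for sample in samples:
        # unknown tags are kept unchanged instead of raising
        entities = [dict(e, type=label_names.get(e["type"], e["type"]))
                    for e in sample["entities"]]
        renamed.append({"sentence": sample["sentence"], "entities": entities})
    return renamed

raw = [{"sentence": "青岛海牛队",
        "entities": [{"name": "青岛海牛队", "type": "ORG", "pos": [0, 5]}]}]
converted = rename_types(raw, LABEL_NAMES)
```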

3. BIEO_to_JSON

Original data format:

中 B-GPE
国 E-GPE
将 O
加 O
快 O
人 O
才 O
市 O
场 O
体 O
系 O
建 O
设 O
。 O

Code:

import json
import os

def bieo_to_json(input_files, output_files, label_output_file):
    num = 0  # number of sentences converted

    label_set = set()

    for input_file, output_file in zip(input_files, output_files):
        data = []
        with open(input_file, 'r', encoding='utf-8') as f:
            lines = f.readlines()
            sentence = ""
            entities = []
            entity_name = ""
            entity_type = ""
            start_position = 0
            for line in lines:
                if not line.strip():
                    # if there's an entity already being processed, append it to entities
                    if entity_name:
                        entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    # append the processed sentence to data (repeated blank lines are skipped)
                    if sentence:
                        data.append({'sentence': sentence, 'entities': entities})
                        num += 1
                    sentence = ""
                    entities = []
                else:
                    # columns are tab-separated; some files carry an extra middle column,
                    # so take the first column as the token and the last as the tag
                    parts = line.rstrip('\r\n').split('\t')
                    word, tag = parts[0], parts[-1]
                    if tag.startswith('B'):
                        entity_name = word
                        entity_type = tag.split('-', 1)[1]
                        label_set.add(entity_type)  # add this entity type to the set
                        start_position = len(sentence)
                    elif tag.startswith('I'):
                        entity_name += word
                    elif tag.startswith('E'):
                        # E closes the entity
                        entity_name += word
                        entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    elif tag.startswith('S'):
                        # S is a single-token entity
                        entity_name = word
                        entity_type = tag.split('-', 1)[1]
                        label_set.add(entity_type)  # add this entity type to the set
                        start_position = len(sentence)
                        entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    else:
                        # if there's an entity already being processed, append it to entities
                        if entity_name:
                            entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    sentence += word
            # for the last entity of the last sentence
            if entity_name:
                entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
            if sentence:
                data.append({'sentence': sentence, 'entities': entities})
                num += 1

        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    with open(label_output_file, 'w', encoding='utf-8') as f:
        json.dump(sorted(label_set), f, ensure_ascii=False, indent=4)

    print(num)


currPath = os.path.join("datasets", "CCKS2017-NER")
input_files = [os.path.join(currPath, "train.txt"), os.path.join(currPath, "test.txt")]
output_files = [os.path.join(currPath, "train.json"), os.path.join(currPath, "test.json")]
label_output_file = os.path.join(currPath, "label.json")
bieo_to_json(input_files, output_files, label_output_file)

Generated JSON format:

{
    "sentence": "中国将加快人才市场体系建设。",
    "entities": [
        {
            "name": "中国",
            "type": "国家",
            "pos": [
                0,
                2
            ]
        }
    ]
},

4. BMEO_to_JSON

Original data format:

高 B-NAME
勇 E-NAME
: O
男 O
, O
中 B-CONT
国 M-CONT
国 M-CONT
籍 E-CONT
, O
无 O
境 O
外 O
居 O
留 O
权 O
, O

Code:

import json
import os

def bmeo_to_json(input_files, output_files, label_output_file):
    label_set = set()

    for input_file, output_file in zip(input_files, output_files):
        data = []
        with open(input_file, 'r', encoding='utf-8') as f:
            lines = f.readlines()
            sentence = ""
            entities = []
            entity_name = ""
            entity_type = ""
            start_position = 0
            for line in lines:
                if not line.strip():
                    # if there's an entity already being processed, append it to entities
                    if entity_name:
                        entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    # append the processed sentence to data (repeated blank lines are skipped)
                    if sentence:
                        data.append({'sentence': sentence, 'entities': entities})
                    sentence = ""
                    entities = []
                else:
                    # token and tag are separated by a single space in this dataset
                    word, tag = line.rstrip('\r\n').split(' ')
                    if tag.startswith('B'):
                        entity_name = word
                        entity_type = tag.split('-', 1)[1]
                        label_set.add(entity_type)  # add this entity type to the set
                        start_position = len(sentence)
                    elif tag.startswith('M'):
                        # M marks a token in the middle of an entity
                        entity_name += word
                    elif tag.startswith('E'):
                        # E closes the entity
                        entity_name += word
                        entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    elif tag.startswith('S'):
                        # S is a single-token entity
                        entity_name = word
                        entity_type = tag.split('-', 1)[1]
                        label_set.add(entity_type)  # add this entity type to the set
                        start_position = len(sentence)
                        entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    else:
                        # if there's an entity already being processed, append it to entities
                        if entity_name:
                            entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
                        entity_name = ""
                        entity_type = ""
                    sentence += word
            # for the last entity of the last sentence
            if entity_name:
                entities.append({'name': entity_name, 'type': entity_type, 'pos': [start_position, start_position + len(entity_name)]})
            if sentence:
                data.append({'sentence': sentence, 'entities': entities})

        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    with open(label_output_file, 'w', encoding='utf-8') as f:
        json.dump(sorted(label_set), f, ensure_ascii=False, indent=4)


currPath = os.path.join("datasets", "简历-NER")
input_files = [os.path.join(currPath, "train.txt"), os.path.join(currPath, "test.txt"), os.path.join(currPath, "dev.txt")]
output_files = [os.path.join(currPath, "train.json"), os.path.join(currPath, "test.json"), os.path.join(currPath, "dev.json")]
label_output_file = os.path.join(currPath, "label.json")
bmeo_to_json(input_files, output_files, label_output_file)

Generated JSON format:

{
    "sentence": "高勇:男,中国国籍,无境外居留权,",
    "entities": [
        {
            "name": "高勇",
            "type": "姓名",
            "pos": [
                0,
                2
            ]
        },
        {
            "name": "中国国籍",
            "type": "国籍",
            "pos": [
                5,
                9
            ]
        }
    ]
},

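5. D_BIO_to_JSON

Original data format (the first line comes from the text file, the second from the label file; the code below expects the tokens in the text file to be separated by spaces):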

交行14年用过,半年准备提额,却直接被降到1K,半年期间只T过一次三千,其它全部真实消费,第六个月的时候为了增加评分提额,还特意分期两万,但降额后电话投诉,申请提...
B-BANK I-BANK O O O O O O O O O O B-COMMENTS_N I-COMMENTS_N O O O O O B-COMMENTS_ADJ I-COMMENTS_ADJ O O O O O O O O O O O O O O O O O O O O O B-COMMENTS_N I-COMMENTS_N O O O O O O O O O O B-COMMENTS_N I-COMMENTS_N O O B-COMMENTS_N I-COMMENTS_N O O O O B-PRODUCT I-PRODUCT O O O O B-COMMENTS_ADJ O O O O O O O O O O O O O

Code:

import json
import os

def d_bio_to_json(text_file, label_file, output_file, output_label_file):
    # text_file: one sentence per line, tokens separated by spaces
    # label_file: one line of space-separated tags per sentence, aligned with the tokens
    with open(text_file, 'r', encoding='utf-8') as f_text, open(label_file, 'r', encoding='utf-8') as f_label:
        texts = f_text.read().splitlines()
        labels = f_label.read().splitlines()

    num = 0  # number of sentences converted

    data = []
    label_set = set()
    for text, label in zip(texts, labels):
        entities = []
        entity = None
        offset = 0  # character offset of the current token in the joined sentence

        tokens = text.split()
        tags = label.split()

        for token, tag in zip(tokens, tags):
            end = offset + len(token)
            if tag.startswith('B'):
                # a new entity starts; close the previous one if it is still open
                if entity is not None:
                    entities.append(entity)
                entity = {"name": token, "type": tag[2:], "pos": [offset, end]}
                label_set.add(tag[2:])
            elif tag.startswith('I'):
                if entity is None:
                    # I- tag without a preceding B-: treat it as the start of an entity
                    entity = {"name": token, "type": tag[2:], "pos": [offset, end]}
                    label_set.add(tag[2:])
                else:
                    entity["name"] += token
                    entity["pos"][1] = end
            else:
                # an O tag closes any entity that is still open
                if entity is not None:
                    entities.append(entity)
                    entity = None
            offset = end

        if entity is not None:
            entities.append(entity)

        sentence = ''.join(tokens)  # join the tokens, removing the spaces
        data.append({
            "sentence": sentence,
            "entities": entities
        })
        num += 1

    with open(output_file, 'w', encoding='utf-8') as f_out:
        json.dump(data, f_out, ensure_ascii=False, indent=4)

    with open(output_label_file, 'w', encoding='utf-8') as f_label:
        json.dump(sorted(label_set), f_label, ensure_ascii=False, indent=4)

    print(num)

currPath = os.path.join("datasets", "人民日报2014")
text_file = os.path.join(currPath, "source.txt")
label_file = os.path.join(currPath, "target.txt")
output_file = os.path.join(currPath, "train.json")
output_label_file = os.path.join(currPath, "label.json")
d_bio_to_json(text_file, label_file, output_file, output_label_file)

Generated JSON format:

{
  "sentence": "交行14年用过,半年准备提额,却直接被降到1K,半年期间只T过一次三千,其它全部真实消费,第六个月的时候为了增加评分提额,还特意分期两万,但降额后电话投诉,申请提...",
  "entities": [
    {
      "name": "交行",
      "type": "银行",
      "pos": [
        0,
        2
      ]
    },
    {
      "name": "提额",
      "type": "金融操作",
      "pos": [
        12,
        14
      ]
    },
    {
      "name": "降到",
      "type": "形容词",
      "pos": [
        19,
        21
      ]
    },
    {
      "name": "消费",
      "type": "金融操作",
      "pos": [
        42,
        44
      ]
    },
    {
      "name": "增加",
      "type": "金融操作",
      "pos": [
        54,
        56
      ]
    },
    {
      "name": "提额",
      "type": "金融操作",
      "pos": [
        58,
        60
      ]
    },
    {
      "name": "分期",
      "type": "产品",
      "pos": [
        64,
        66
      ]
    },
    {
      "name": "降",
      "type": "形容词",
      "pos": [
        70,
        71
      ]
    }
  ]
},

6. BIO_JSON_to_JSON

Original data format (one JSON object per line):

{"text": "来一首周华健的花心", "labels": ["O", "O", "O", "B-singer", "I-singer", "I-singer", "O", "B-song", "I-song"]}

Code:

import json

def bio_json_to_json(input_file, output_file, label_file):
    label_set = set()

    # the input holds one JSON object per line
    with open(input_file, 'r', encoding='utf-8') as f:
        data = f.read().splitlines()

    converted_data = []

    for sample in data:
        if not sample.strip():
            continue  # skip empty lines
        sample = json.loads(sample)
        sentence = sample['text']
        labels = sample['labels']  # one label per character of the text
        entities = []
        entity_name = ""
        entity_start = None
        entity_type = None

        for i, label in enumerate(labels):
            if label.startswith('B-'):
                # a new entity starts; close the previous one if it is still open
                if entity_name:
                    entities.append({
                        'name': entity_name,
                        'type': entity_type,
                        'pos': [entity_start, i]
                    })
                    label_set.add(entity_type)
                entity_name = sentence[i]
                entity_start = i
                entity_type = label[2:]
            elif label.startswith('I-'):
                # an I- label without an open entity is ignored
                if entity_name:
                    entity_name += sentence[i]
            else:
                if entity_name:
                    entities.append({
                        'name': entity_name,
                        'type': entity_type,
                        'pos': [entity_start, i]
                    })
                    label_set.add(entity_type)
                    entity_name = ""
                    entity_start = None
                    entity_type = None

        # for an entity that runs to the end of the sentence
        if entity_name:
            entities.append({
                'name': entity_name,
                'type': entity_type,
                'pos': [entity_start, len(labels)]
            })
            label_set.add(entity_type)

        converted_data.append({
            'sentence': sentence,
            'entities': entities
        })

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(converted_data, f, ensure_ascii=False, indent=2)

    with open(label_file, 'w', encoding='utf-8') as f_label:
        json.dump(sorted(label_set), f_label, ensure_ascii=False, indent=4)

Generated JSON format:

{
  "sentence": "来一首周华健的花心",
  "entities": [
    {
      "name": "周华健",
      "type": "歌手",
      "pos": [
        3,
        6
      ]
    },
    {
      "name": "花心",
      "type": "歌曲",
      "pos": [
        7,
        9
      ]
    }
  ]
},

7. JSON_to_JSON

Original data format:

{
  "text": "呼吸肌麻痹和呼吸中枢受累患者因呼吸不畅可并发肺炎、肺不张等。",
  "entities": [
    {"start_idx": 0, "end_idx": 2, "type": "bod", "entity": "呼吸肌"},
    {"start_idx": 0, "end_idx": 4, "type": "sym", "entity": "呼吸肌麻痹"},
    {"start_idx": 6, "end_idx": 9, "type": "bod", "entity": "呼吸中枢"},
    {"start_idx": 6, "end_idx": 11, "type": "sym", "entity": "呼吸中枢受累"},
    {"start_idx": 15, "end_idx": 18, "type": "sym", "entity": "呼吸不畅"},
    {"start_idx": 22, "end_idx": 23, "type": "dis", "entity": "肺炎"},
    {"start_idx": 25, "end_idx": 27, "type": "dis", "entity": "肺不张"}
  ]
}

Code:

import json
import os

def json_to_json(input_files, output_files, label_output_file):
    label_set = set()

    for input_file, output_file in zip(input_files, output_files):
        # each input file is a JSON array of samples
        with open(input_file, 'r', encoding='utf-8') as f_in:
            data = json.load(f_in)

        converted_data = []

        for item in data:
            sentence = item['text']
            entities = []

            for entity in item['entities']:
                start_idx = entity['start_idx']
                end_idx = entity['end_idx']
                entity_type = entity['type']
                entity_name = entity['entity']
                entities.append({
                    'name': entity_name,
                    'type': entity_type,
                    # end_idx in the raw data is inclusive; store an exclusive end
                    # so that sentence[pos[0]:pos[1]] == name
                    'pos': [start_idx, end_idx + 1]
                })
                label_set.add(entity_type)

            converted_data.append({
                'sentence': sentence,
                'entities': entities
            })

        with open(output_file, 'w', encoding='utf-8') as f_out:
            json.dump(converted_data, f_out, ensure_ascii=False, indent=4)

    with open(label_output_file, 'w', encoding='utf-8') as f_label:
        json.dump(sorted(label_set), f_label, ensure_ascii=False, indent=4)


currPath = os.path.join("datasets", "CMeEE-V2")
input_files = [os.path.join(currPath, "CMeEE-V2_train.json"), os.path.join(currPath, "CMeEE-V2_test.json"), os.path.join(currPath, "CMeEE-V2_dev.json")]
output_files = [os.path.join(currPath, "train.json"), os.path.join(currPath, "test.json"), os.path.join(currPath, "dev.json")]
label_output_file = os.path.join(currPath, "label.json")
json_to_json(input_files, output_files, label_output_file)

Generated JSON format:

{
    "sentence": "呼吸肌麻痹和呼吸中枢受累患者因呼吸不畅可并发肺炎、肺不张等。",
    "entities": [
        {
            "name": "呼吸肌",
            "type": "部位",
            "pos": [
                0,
                3
            ]
        },
        {
            "name": "呼吸肌麻痹",
            "type": "症状",
            "pos": [
                0,
                5
            ]
        },
        {
            "name": "呼吸中枢",
            "type": "部位",
            "pos": [
                6,
                10
            ]
        },
        {
            "name": "呼吸中枢受累",
            "type": "症状",
            "pos": [
                6,
                12
            ]
        },
        {
            "name": "呼吸不畅",
            "type": "症状",
            "pos": [
                15,
                19
            ]
        },
        {
            "name": "肺炎",
            "type": "疾病",
            "pos": [
                22,
                24
            ]
        },
        {
            "name": "肺不张",
            "type": "疾病",
            "pos": [
                25,
                28
            ]
        }
    ]
},
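In the raw sample, end_idx for 呼吸肌麻痹 is 4, while the generated pos ends at 5: the raw end index is inclusive, and the standard format uses an exclusive end. A minimal sketch of that conversion follows; `to_half_open` is a hypothetical helper name.

```python
def to_half_open(start_idx, end_idx):
    """Convert an inclusive [start, end] span to a half-open [start, end) span."""
    return [start_idx, end_idx + 1]

sentence = "呼吸肌麻痹和呼吸中枢受累患者因呼吸不畅可并发肺炎、肺不张等。"
pos = to_half_open(0, 4)
mention = sentence[pos[0]:pos[1]]  # slicing with the half-open span recovers the mention
```

With a half-open span, `pos[1] - pos[0]` is the mention length and a plain Python slice returns the mention, which is why the exclusive end is convenient downstream.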

8. Nested_JSON_to_JSON

Original data format (one JSON object per line):

{"text": "生生不息CSOL生化狂潮让你填弹狂扫", "label": {"game": {"CSOL": [[4, 7]]}}}

Code:

import json
import os

def nested_json_to_json(input_files, output_files, label_output_file):
    num = 0  # number of sentences converted
    label_set = set()

    for input_file, output_file in zip(input_files, output_files):
        # the input holds one JSON object per line
        with open(input_file, 'r', encoding='utf-8') as f_in:
            data = f_in.read().splitlines()

        converted_data = []

        for item in data:
            if not item.strip():
                continue  # skip empty lines
            item = json.loads(item)
            sentence = item['text']
            entities = []

            # label -> {mention: [[start, end], ...]}; one type can have several
            # mentions, and one mention can occur at several spans
            for entity_type, mentions in item['label'].items():
                for entity_name, spans in mentions.items():
                    for start_idx, end_idx in spans:
                        entities.append({
                            'name': entity_name,
                            'type': entity_type,
                            # the raw end index is inclusive (CSOL is [4, 7]);
                            # store an exclusive end like the other converters
                            'pos': [start_idx, end_idx + 1]
                        })
                label_set.add(entity_type)

            converted_data.append({
                'sentence': sentence,
                'entities': entities
            })
            num += 1

        with open(output_file, 'w', encoding='utf-8') as f_out:
            json.dump(converted_data, f_out, ensure_ascii=False, indent=4)

    with open(label_output_file, 'w', encoding='utf-8') as f_label:
        json.dump(sorted(label_set), f_label, ensure_ascii=False, indent=4)

    print(num)


currPath = os.path.join("datasets", "CLUENER")
input_files = [os.path.join(currPath, "CLUENER_train.json"), os.path.join(currPath, "CLUENER_dev.json")]
output_files = [os.path.join(currPath, "train.json"), os.path.join(currPath, "dev.json")]
label_output_file = os.path.join(currPath, "label.json")
nested_json_to_json(input_files, output_files, label_output_file)

Generated JSON format:

{
    "sentence": "生生不息CSOL生化狂潮让你填弹狂扫",
    "entities": [
        {
            "name": "CSOL",
            "type": "游戏",
            "pos": [
                4,
                8
            ]
        }
    ]
},

Summary

This article is a relatively comprehensive collection of format conversions for NER datasets, and almost any dataset can be converted with the code above. Some datasets use slightly different tags: for example, the "B-" prefix in BIO may be written "B_", or the token and the label may be separated by something other than a space. In such cases only small changes to the code are needed. I hope this blog is helpful to readers. If you know of a data format that is not covered here, please contact the blogger~
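As an illustration of the kind of small change meant here, the sketch below normalizes a "B_" style prefix to "B-" and splits a line on any whitespace. Both helper names, `normalize_tag` and `split_line`, are hypothetical, and the sketch assumes the tag is the last column of the line.

```python
import re

# prefix letter, then "-" or "_", then the entity type
TAG = re.compile(r"^([BIEMS])[-_](.+)$")

def normalize_tag(tag):
    """Map variants such as 'B_ORG' to 'B-ORG'; return other tags unchanged."""
    match = TAG.match(tag)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    return tag

def split_line(line):
    """Split a 'token<sep>tag' line where the separator is any whitespace."""
    parts = line.rstrip("\r\n").split()
    return parts[0], normalize_tag(parts[-1])

pair = split_line("青\tB_ORG\n")
```

Normalizing the lines up front keeps the converters above unchanged apart from how each line is split.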

Origin blog.csdn.net/HERODING23/article/details/131610215