Natural Language Data Annotation Method (Script)

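This data is mainly used to evaluate the relevance between natural-language words and programming-language APIs.

Each pair consists of one word and one API. If the two are judged to be related, the pair is labeled 1; if they are judged to be unrelated, it is labeled 0.

Judgment criterion: the decision is based mainly on the meaning of the word and the functionality the API provides. If the API's functionality involves the meaning of the word, the word and the API can be considered related.

For example, for the noun "bean", the two are considered related if the API operates on a bean or has bean-related properties; for the verb "exchange", the two are considered related if the API's functionality includes actions such as receiving and sending data.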
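As a rough illustration of the pair format, the sketch below writes a few labeled pairs as CSV. The column names word, API and rel are the ones the script in this post reads and writes; the three rows are assumptions chosen only for illustration and are not taken from the real dataset.

```python
import csv
import io

# Hypothetical labeled pairs: 1 = related, 0 = unrelated.
rows = [
    {"word": "bean", "API": "BeanFactory.getBean(String name)", "rel": 1},
    {"word": "bean", "API": "Math.sqrt(double a)", "rel": 0},
    {"word": "exchange", "API": "Exchanger.exchange(V x)", "rel": 1},
]

# Write the pairs to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["word", "API", "rel"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each row is one judgment, so the same word can appear many times with different APIs.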

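Annotation data example

For each word in the word column, check whether the API text in the same row contains a word with a similar meaning. If it does, rel is set to 1; otherwise it is set to 0.

In this example, an API that contains a word close in meaning to "current" is labeled 1; otherwise it is labeled 0.

Synonyms of "current": ["current","present","existing","recent","up-to-date","contemporary","present-day","modern","in progress","up to date","dated"]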
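The rule for a single word can be sketched in a few lines. This is a minimal sketch, assuming a case-insensitive substring check; the helper name is_related and the two sample API strings are hypothetical and only serve the illustration.

```python
from typing import Iterable

CURRENT_SYNONYMS = ["current", "present", "existing", "recent", "up-to-date",
                    "contemporary", "present-day", "modern", "in progress",
                    "up to date", "dated"]


def is_related(api: str, synonyms: Iterable[str]) -> int:
    """Return 1 if any synonym occurs in the API text (ignoring case), else 0."""
    api_upper = api.upper()
    return int(any(s.upper() in api_upper for s in synonyms))


# Hypothetical API strings used only for illustration.
sample_apis = ["Thread.currentThread()", "List.add(E e)"]
labels = [is_related(api, CURRENT_SYNONYMS) for api in sample_apis]
print(labels)  # [1, 0]
```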

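import pandas as pd
# Check whether the API column contains a synonym of the word in the word column.
# Install the pandas package, then run with the CSV file in the same directory as test.py.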

data_map = {
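    # Key: the word to annotate; value: its synonyms (for example, "load": get, load, read, import).
    # 改改改!!!! (change this) Replace with the words and synonyms you need to annotate.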
    "current":["current","present","existing","recent","up-to-date","contemporary","present-day","modern","in progress","up to date","dated"],
    "agent":["agent","go-between","manager","negotiator","mediator","representative","proxy"],
    "cache":["cache","hoard","store","supply","accumulation","reserve","collection"],
    "mode": ["mode", "pattern", "model"],
    "message": ["message", "uri", "url", "trace", "print","get"]
}

# 改改改!!!! (change this) Name of the file to annotate.
src_name = "18.csv"
# Name of the file generated after annotation.
target_name = "18answer.csv"


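def find_rel(row):
    # Read columns by name; the input file may not have a rel column yet.
    word = row["word"]
    api = str(row["API"]).upper()
    # Case-insensitive substring match against each synonym.
    # A word missing from data_map is matched against itself only.
    for word_alike in data_map.get(word, [word]):
        if word_alike.upper() in api:
            return 1
    return 0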


df = pd.read_csv(src_name)
df["rel"] = df.apply(find_rel, axis=1)
df.to_csv(target_name, index=False, columns=["word", "API", "rel"])

The places in the Python script that need to be changed are marked with 改改改!!!! ("change this").
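One limitation of the substring check in the script is that a synonym can match inside a longer, unrelated word: "present" occurs inside "Representation". The sketch below shows a stricter, token-based variant. It is not part of the original script; the helper names split_identifier, substring_match and token_match are hypothetical, and the tokenization rule (camelCase, punctuation and digits as boundaries) is an assumption.

```python
import re
from typing import Iterable, Set


def split_identifier(text: str) -> Set[str]:
    """Split text into lowercase tokens at camelCase and non-letter boundaries."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", text)
    return {p.lower() for p in parts}


def substring_match(api: str, synonyms: Iterable[str]) -> int:
    """The rule used by the script: case-insensitive substring search."""
    api_upper = api.upper()
    return int(any(s.upper() in api_upper for s in synonyms))


def token_match(api: str, synonyms: Iterable[str]) -> int:
    """Stricter rule: every token of a synonym must be a whole token of the API."""
    api_tokens = split_identifier(api)
    for synonym in synonyms:
        synonym_tokens = split_identifier(synonym)
        if synonym_tokens and synonym_tokens <= api_tokens:
            return 1
    return 0


synonyms = ["current", "present"]
for api in ["Thread.currentThread()", "getRepresentation()"]:
    print(api, substring_match(api, synonyms), token_match(api, synonyms))
```

The token rule avoids the false positive on "getRepresentation()" but also misses inflected forms such as "presented", so the labels it produces may still need a manual pass.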

Reposted from blog.csdn.net/u013177138/article/details/122104314