[When Artificial Intelligence Meets Security] 10. Threat Intelligence Entity Recognition (Part 1): A Detailed Explanation of Entity Recognition Based on BiLSTM-CRF

As you probably know, the author will be sharing fewer and fewer pure cybersecurity articles in the future. If you want to learn about applying artificial intelligence to security, however, you are in for a treat: the author is starting a new blog series, "When Artificial Intelligence Meets Security", which will introduce papers and practical work on AI and security in detail, and share cases covering malicious code detection, malicious request identification, intrusion detection, adversarial samples, and more. The goal is simply to help beginners and to share new knowledge in a more systematic way. This series will be more focused, more academic, and more in-depth, and it will also be a slow record of the author's growth. Changing research directions is hard, and system security is a hard nut to crack, but I will give it a try and see how far I can get in the next four years. It will be a long journey, but it will be a journey up the mountain. Enjoy the process and let's work hard together~

The previous article explained in detail how to learn from extracted API sequence features and build a deep learning model to classify malicious families, which is a typical task in the security field. This article explains how to implement threat intelligence entity recognition, using the BiLSTM-CRF algorithm to extract ATT&CK-related technique and tactic entities; this is an important building block for constructing security knowledge graphs. This is a basic article. I hope it is helpful to you, and please forgive any errors or shortcomings.

Version Information:

  • keras-contrib V2.0.8
  • keras V2.3.1
  • tensorflow V2.2.0


As a relative newcomer to network security, the author shares some basic self-study tutorials, mainly online notes, and hopes you find them useful. I also hope you will run the experiments and make progress together with me. Going forward, I will dig deeper into AI security and system security and share the related experiments. In short, I hope this series is helpful. Writing articles is not easy; if the experts find it lacking, please be kind. If the article helps you, that is the biggest motivation for my writing - feel free to like, comment, or message me. Let's work hard together!


1. ATT&CK data collection

Readers familiar with threat intelligence will know MITRE's ATT&CK website. This article collects the attack technique and tactic data of APT organizations from this site and uses it for a threat intelligence entity recognition experiment. The URL is https://attack.mitre.org/groups/.


The first step is to analyze the page source of the ATT&CK website to locate the APT organization names and collect them systematically.


First install the BeautifulSoup extension package. The code for this part is as follows:


01-get-aptentity.py

#encoding:utf-8
#By:Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------------------------
#Get APT organization names and links

#Set the browser User-Agent header (a dictionary)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
        AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'

#Send the request to the server
r = requests.get(url=url, headers=headers).text

#Parse the DOM tree
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])
filename = []
for name in names:
    filename.append(name.strip())
print(filename)

#Links
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print("\n")

At this point the output includes the names of the APT organizations and the corresponding URLs.


The second step is to access the URL corresponding to the APT organization and collect detailed information (text description).


The third step is to collect the corresponding technique and tactic (TTP) information from the techniques table on each organization's page.


The fourth step is to write the code that completes the threat intelligence data collection. The complete code (01-spider-mitre.py) is as follows:

#encoding:utf-8
#By:Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------------------------
#Get APT organization names and links

#Set the browser User-Agent header (a dictionary)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
        AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'

#Send the request to the server
r = requests.get(url=url, headers=headers).text
#Parse the DOM tree
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])
#Links
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print("\n")

#-------------------------------------------------------------------------------------------
#Get the detailed information for each organization
k = 0
while k<len(names):
    filename = str(names[k]).strip() + ".txt"
    url = "https://attack.mitre.org" + urls[k]
    print(url)

    #Fetch the page
    page = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(page)
    contents = page.read()
    soup = BeautifulSoup(contents, "html.parser")

    #Get the description text
    content = ""
    for tag in soup.find_all(attrs={"class":"description-body"}):
        #contents = tag.find("p").get_text()
        contents = tag.find_all("p")
        for con in contents:
            content += con.get_text().strip() + "###\n"  #mark the end of a sentence (used for splitting in Part 2)
    #print(content)

    #Get the technique information from the table
    for tag in soup.find_all(attrs={"class":"table techniques-used table-bordered mt-2"}):
        contents = tag.find("tbody").find_all("tr")
        for con in contents:
            value = con.find("p").get_text()           #rows have 4 or 5 columns, so take the <p> text
            #print(value)
            content += value.strip() + "###\n"         #mark the end of a sentence (used for splitting in Part 2)

    #Remove reference brackets [n] from the content
    result = re.sub(u"\\[.*?]", "", content)
    print(result)

    #Write to file
    filename = "Mitre//" + filename
    print(filename)
    f = open(filename, "w", encoding="utf-8")
    f.write(result)
    f.close()    
    k += 1

After running the script, information for 100 organizations is collected in total, one file per organization.



Warm reminder:
Since the site's layout keeps changing and being optimized, readers need to master the basic methods of data collection and DOM-tree positioning so that the code can be adapted. Readers can also try to collect all of the technique pages and even the content behind the linked URLs - please experiment and extend this on your own! A small robustness sketch is given below.
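As a hedged illustration of "adapting to layout changes" (this is an assumption about the page structure, not code from the original article): instead of hard-coding the table's full class string, you can match the /groups/Gxxxx link pattern directly, which tends to survive cosmetic redesigns.

#A minimal robustness sketch (assumes group pages are still linked as /groups/Gxxxx)
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://attack.mitre.org/groups/', headers=headers).text
soup = BeautifulSoup(html, 'html.parser')

groups = {}
for a in soup.find_all('a', href=re.compile(r'^/groups/G\d{4}/?$')):
    name = a.get_text(strip=True)
    if name:  #skip icon-only anchors
        groups[name] = 'https://attack.mitre.org' + a['href']

print(len(groups), list(groups.items())[:3])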


2. Data splitting and content statistics

1. Paragraph splitting

In order to expand the data set and better perform NLP processing, we need to segment the text data. The method adopted is:

  • Split on the previously inserted "###" end-of-sentence marker
  • Write a new TXT file every five sentences, named "10XX-organization name"

02-dataset-split.py complete code:

#encoding:utf-8
#By:Eastmount CSDN
import re
import os

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories (unused placeholder)
    files = os.listdir(path)   #list the directory
    return files

#-----------------------------------------------------------------------
#Get file content
def get_content(filename):
    content = ""
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content += line.replace("\n", " ")
    return content
            
#---------------------------------------------------------------------
#Split the text on the custom separator
def split_text(text):
    pattern = '###'
    nums = text.split(pattern)  #split into sentences
    return nums
    
#-----------------------------------------------------------------------
#Main function
if __name__ == '__main__':
    #Get the file names
    path = "Mitre"
    savepath = "Mitre-Split"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    #Iterate over the files
    k = 0
    begin = 1001  #counter used in the file names
    while k<len(filenames):
        filename = "Mitre//" + filenames[k]
        print(filename)
        content = get_content(filename)
        print(content)

        #Split into sentences
        nums = split_text(content)

        #Write a TXT file for every five sentences
        n = 0
        result = ""
        while n<len(nums):
            if n>0 and (n%5)==0: #save
                savename = savepath + "//" + str(begin) + "-" + filenames[k]
                print(savename)
                f = open(savename, "w", encoding="utf8")
                f.write(result)
                result = ""
                result = nums[n].lstrip() + "### "  #first sentence of the next file
                begin += 1
                f.close()
            else:               #accumulate
                result += nums[n].lstrip() + "### "
            n += 1
        k += 1

It was eventually split into 381 files, located in the "Mitre-Split" folder.




2. Sentence splitting

Before data annotation, the named entity recognition task requires the following preparation:

  • Split paragraphs into sentences
  • Split each sentence into words, one word per line, so that each word can be given a label in the next step
  • Key code: text.split(" ")


The complete code is shown below and generates the "Mitre-Split-Word" folder.

#encoding:utf-8
#By:Eastmount CSDN
import re
import os

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories (unused placeholder)
    files = os.listdir(path)   #list the directory
    return files

#-----------------------------------------------------------------------
#Get file content
def get_content(filename):
    content = ""
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content += line.replace("\n", " ")
    return content
            
#---------------------------------------------------------------------
#Split the text into English words on spaces
def split_word(text):
    nums = text.split(" ")
    #print(nums)
    return nums

#-----------------------------------------------------------------------
#Main function
if __name__ == '__main__':
    #Get the file names
    path = "Mitre-Split"
    savepath = "Mitre-Split-Word"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    #Iterate over the files
    k = 0
    while k<len(filenames):
        filename = path + "//" + filenames[k]
        print(filename)
        content = get_content(filename)
        content = content.replace("###","\n")

        #Split into words
        nums = split_word(content)
        #print(nums)
        savename = savepath + "//" + filenames[k]
        f = open(savename, "w", encoding="utf8")
        for n in nums:
            if n != "":
                #Strip punctuation
                n = n.replace(",", "")
                n = n.replace(";", "")
                n = n.replace("!", "")
                n = n.replace("?", "")
                n = n.replace(":", "")
                n = n.replace('"', "")
                n = n.replace('(', "")
                n = n.replace(')', "")
                n = n.replace('’', "")
                n = n.replace('\'s', "")
                #Handle sentence-final periods
                if ("." in n) and (n not in ["U.S.","U.K."]):
                    n = n.rstrip(".")
                    n = n.rstrip(".\n")
                    n = n + "\n"
                f.write(n+"\n")
        f.close()
        k += 1

3. Data annotation

Data annotation is done in a brute-force way: different entity categories are defined as word lists, and BIO tagging is applied by matching against ATT&CK technique and tactic terms. The result can be corrected manually later, and more entity types can be defined as well.

  • BIO labeling

Entity type               Number of entities   Examples
APT attack organization   128                  APT32, Lazarus Group
Attack vulnerability      56                   CVE-2009-0927
Regional location         72                   America, Europe
Targeted industry         34                   companies, finance
Attack technique          65                   C&C, RAT, DDoS
Software used             48                   7-Zip, Microsoft
Operating system          10                   Linux, Windows
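To make the labeling concrete, here is an illustrative (made-up) sentence annotated word by word with the label set used in the code below. Note that the dictionaries only emit B- tags, so multi-word entities appear as consecutive B- labels rather than B-/I- pairs:

APT32          B-AG
used           O
spearphishing  B-AM
emails         O
against        O
targets        O
in             O
Vietnam        B-RL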

Common data annotation tools:

  • Image annotation: labelme, LabelImg, Labelbox, RectLabel, CVAT, VIA
  • Semi-automatic OCR labeling: PPOCRLabel
  • NLP labeling tool: Label Studio

The complete code for this part (04-BIO-data-annotation.py) is as follows:

#encoding:utf-8
import re
import os
import csv

#-----------------------------------------Entity type definitions--------------------------
#APT attack organizations
aptName = ['admin@338', 'Ajax Security Team', 'APT-C-36', 'APT1', 'APT12', 'APT16', 'APT17', 'APT18', 'APT19', 'APT28', 'APT29', 'APT3', 'APT30', 'APT32',
           'APT33', 'APT37', 'APT38', 'APT39', 'APT41', 'Axiom', 'BlackOasis', 'BlackTech', 'Blue Mockingbird', 'Bouncing Golf', 'BRONZE BUTLER',
           'Carbanak', 'Chimera', 'Cleaver', 'Cobalt Group', 'CopyKittens', 'Dark Caracal', 'Darkhotel', 'DarkHydrus', 'DarkVishnya', 'Deep Panda',
           'Dragonfly', 'Dragonfly 2.0', 'DragonOK', 'Dust Storm', 'Elderwood', 'Equation', 'Evilnum', 'FIN10', 'FIN4', 'FIN5', 'FIN6', 'FIN7', 'FIN8',
           'Fox Kitten', 'Frankenstein', 'GALLIUM', 'Gallmaker', 'Gamaredon Group', 'GCMAN', 'GOLD SOUTHFIELD', 'Gorgon Group', 'Group5', 'HAFNIUM',
           'Higaisa', 'Honeybee', 'Inception', 'Indrik Spider', 'Ke3chang', 'Kimsuky', 'Lazarus Group', 'Leafminer', 'Leviathan', 'Lotus Blossom',
           'Machete', 'Magic Hound', 'menuPass', 'Moafee', 'Mofang', 'Molerats', 'MuddyWater', 'Mustang Panda', 'Naikon', 'NEODYMIUM', 'Night Dragon',
           'OilRig', 'Operation Wocao', 'Orangeworm', 'Patchwork', 'PittyTiger', 'PLATINUM', 'Poseidon Group', 'PROMETHIUM', 'Putter Panda', 'Rancor',
           'Rocke', 'RTM', 'Sandworm Team', 'Scarlet Mimic', 'Sharpshooter', 'Sidewinder', 'Silence', 'Silent Librarian', 'SilverTerrier', 'Sowbug', 'Stealth Falcon',
           'Stolen Pencil', 'Strider', 'Suckfly', 'TA459', 'TA505', 'TA551', 'Taidoor', 'TEMP.Veles', 'The White Company', 'Threat Group-1314', 'Threat Group-3390',
           'Thrip', 'Tropic Trooper', 'Turla', 'Volatile Cedar', 'Whitefly', 'Windigo', 'Windshift', 'Winnti Group', 'WIRTE', 'Wizard Spider', 'ZIRCONIUM',
           'UNC2452', 'NOBELIUM', 'StellarParticle']

#Attack vulnerabilities with special names
cveName = ['CVE-2009-3129', 'CVE-2012-0158', 'CVE-2009-4324', 'CVE-2009-0927', 'CVE-2011-0609', 'CVE-2011-0611', 'CVE-2012-0158',
           'CVE-2017-0262', 'CVE-2015-4902', 'CVE-2015-1701', 'CVE-2014-4076', 'CVE-2015-2387', 'CVE-2015-1701', 'CVE-2017-0263']

#Regional locations
locationName = ['China-based', 'China', 'North', 'Korea', 'Russia', 'South', 'Asia', 'US', 'U.S.', 'UK', 'U.K.', 'Iran', 'Iranian', 'America', 'Colombian',
                'Chinese', "People’s",  'Liberation', 'Army', 'PLA', 'General', 'Staff', "Department’s", 'GSD', 'MUCD', 'Unit', '61398', 'Chinese-based',
                "Russia's", "General", "Staff", "Main", "Intelligence", "Directorate", "GRU", "GTsSS", "unit", "26165", '74455', 'Georgian', 'SVR',
                'Europe', 'Asia', 'Hong Kong', 'Vietnam', 'Cambodia', 'Thailand', 'Germany', 'Spain', 'Finland', 'Israel', 'India', 'Italy', 'South Asia',
                'Korea', 'Kuwait', 'Lebanon', 'Malaysia', 'United', 'Kingdom', 'Netherlands', 'Southeast', 'Asia', 'Pakistan', 'Canada', 'Bangladesh',
                'Ukraine', 'Austria', 'France', 'Korea']

#Targeted industries
industryName = ['financial', 'economic', 'trade', 'policy', 'defense', 'industrial', 'espionage', 'government', 'institutions', 'institution', 'petroleum',
                'industry', 'manufacturing', 'corporations', 'media', 'outlets', 'high-tech', 'companies', 'governments', 'medical', 'defense', 'finance',
                'energy', 'pharmaceutical', 'telecommunications', 'high', 'tech', 'education', 'investment', 'firms', 'organizations', 'research', 'institutes',
                ]

#Attack methods
methodName = ['RATs', 'RAT', 'SQL', 'injection', 'spearphishing', 'spear', 'phishing', 'backdoors', 'vulnerabilities', 'vulnerability', 'commands', 'command',
              'anti-censorship', 'keystrokes', 'VBScript', 'malicious', 'document', 'scheduled', 'tasks', 'C2', 'C&C', 'communications', 'batch', 'script',
              'shell', 'scripting', 'social', 'engineering', 'privilege', 'escalation', 'credential', 'dumping', 'control', 'obfuscates', 'obfuscate', 'payload', 'upload',
              'payloads', 'encode', 'decrypts', 'attachments', 'attachment', 'inject', 'collect', 'large-scale', 'scans', 'persistence', 'brute-force/password-spray',
              'password-spraying', 'backdoor', 'bypass', 'hijacking', 'escalate', 'privileges', 'lateral', 'movement', 'Vulnerability', 'timestomping',
              'keylogging', 'DDoS', 'bootkit', 'UPX' ]

#Software used
softwareName = ['Microsoft', 'Word', 'Office', 'Firefox', 'Google', 'RAR', 'WinRAR', 'zip', 'GETMAIL', 'MAPIGET', 'Outlook', 'Exchange', "Adobe's", 'Adobe',
                'Acrobat', 'Reader', 'RDP', 'PDFs', 'PDF', 'RTF', 'XLSM', 'USB', 'SharePoint', 'Forfiles', 'Delphi', 'COM', 'Excel', 'NetBIOS',
                'Tor', 'Defender', 'Scanner', 'Gmail', 'Yahoo', 'Mail', '7-Zip', 'Twitter', 'gMSA', 'Azure', 'Exchange', 'OWA', 'SMB', 'Netbios',
                'WinRM']

#Operating systems
osName = ['Windows', 'windows', 'Mac', 'Linux', 'Android', 'android', 'linux', 'mac', 'unix', 'Unix']

#Lists used to accumulate and print the matched entities
saveCVE = cveName
saveAPT = aptName
saveLocation = locationName
saveIndustry = industryName
saveMethod = methodName
saveSoftware = softwareName
saveOS = osName

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories (unused placeholder)
    files = os.listdir(path)   #list the directory
    return files
    
#-----------------------------------------------------------------------
#Get file content (one word per line)
def get_content(filename):
    content = []
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content.append(line.strip())
    return content
            
#---------------------------------------------------------------------
#BIO annotation: assign a label to each word
def data_annotation(text):
    n = 0
    nums = []
    while n<len(text):
        word = text[n].strip()
        if word == "":   #blank line marks a sentence boundary
            n += 1
            nums.append("")
            continue
        
        #APT attack organizations
        if word in aptName:
            nums.append("B-AG")
        #Attack vulnerabilities
        elif "CVE-" in word or 'MS-' in word:
            nums.append("B-AV")
            print("CVE vulnerability:", word)
            if word not in saveCVE:
                saveCVE.append(word)
        #Regional locations
        elif word in locationName:
            nums.append("B-RL")
        #Targeted industries
        elif word in industryName:
            nums.append("B-AI")
        #Attack methods
        elif word in methodName:
            nums.append("B-AM")
        #Software used
        elif word in softwareName:
            nums.append("B-SI")
        #Operating systems
        elif word in osName:
            nums.append("B-OS")
       
        #Special cases - APT organizations
        #Ajax Security Team, Deep Panda, Sandworm Team, Cozy Bear, The Dukes, Dark Halo
        elif ((word in "Ajax Security Team") and (text[n+1].strip() in "Ajax Security Team") and word!="a" and word!="it") or \
              ((word in "Ajax Security Team") and (text[n-1].strip() in "Ajax Security Team") and word!="a" and word!="it") or \
              ((word=="Deep") and (text[n+1].strip()=="Panda")) or \
              ((word=="Panda") and (text[n-1].strip()=="Deep")) or \
              ((word=="Sandworm") and (text[n+1].strip()=="Team")) or \
              ((word=="Team") and (text[n-1].strip()=="Sandworm")) or \
              ((word=="Cozy") and (text[n+1].strip()=="Bear")) or \
              ((word=="Bear") and (text[n-1].strip()=="Cozy")) or \
              ((word=="The") and (text[n+1].strip()=="Dukes")) or \
              ((word=="Dukes") and (text[n-1].strip()=="The")) or \
              ((word=="Dark") and (text[n+1].strip()=="Halo")) or \
              ((word=="Halo") and (text[n-1].strip()=="Dark")):
            nums.append("B-AG")
            if "Deep Panda" not in saveAPT:
                saveAPT.append("Deep Panda")
            if "Sandworm Team" not in saveAPT:
                saveAPT.append("Sandworm Team")
            if "Cozy Bear" not in saveAPT:
                saveAPT.append("Cozy Bear")
            if "The Dukes" not in saveAPT:
                saveAPT.append("The Dukes")
            if "Dark Halo" not in saveAPT:
                saveAPT.append("Dark Halo")     
         
        #Special cases - targeted industries
        elif ((word=="legal") and (text[n+1].strip()=="services")) or \
              ((word=="services") and (text[n-1].strip()=="legal")):
            nums.append("B-AI")
            if "legal services" not in saveIndustry:
                saveIndustry.append("legal services")
                
        #Special cases - attack methods
        #watering hole attack, bypass application control, take screenshots
        elif ((word in "watering hole attack") and (text[n+1].strip() in "watering hole attack") and word!="a" and text[n+1].strip()!="a") or \
              ((word in "watering hole attack") and (text[n-1].strip() in "watering hole attack") and word!="a" and text[n+1].strip()!="a") or \
              ((word in "bypass application control") and (text[n+1].strip() in "bypass application control") and word!="a" and text[n+1].strip()!="a") or \
              ((word in "bypass application control") and (text[n-1].strip() in "bypass application control") and word!="a" and text[n-1].strip()!="a") or \
              ((word=="take") and (text[n+1].strip()=="screenshots")) or \
              ((word=="screenshots") and (text[n-1].strip()=="take")):
            nums.append("B-AM")
            if "watering hole attack" not in saveMethod:
                saveMethod.append("watering hole attack")
            if "bypass application control" not in saveMethod:
                saveMethod.append("bypass application control")
            if "take screenshots" not in saveMethod:
                saveMethod.append("take screenshots")
                
        #Special cases - software used
        #MAC address, IP address, Port 22, Delivery Service, McAfee Email Protection
        elif ((word=="legal") and (text[n+1].strip()=="services")) or \
              ((word=="services") and (text[n-1].strip()=="legal")) or \
              ((word=="MAC") and (text[n+1].strip()=="address")) or \
              ((word=="address") and (text[n-1].strip()=="MAC")) or \
              ((word=="IP") and (text[n+1].strip()=="address")) or \
              ((word=="address") and (text[n-1].strip()=="IP")) or \
              ((word=="Port") and (text[n+1].strip()=="22")) or \
              ((word=="22") and (text[n-1].strip()=="Port")) or \
              ((word=="Delivery") and (text[n+1].strip()=="Service")) or \
              ((word=="Service") and (text[n-1].strip()=="Delivery")) or \
              ((word in "McAfee Email Protection") and (text[n+1].strip() in "McAfee Email Protection")) or \
              ((word in "McAfee Email Protection") and (text[n-1].strip() in "McAfee Email Protection")):
            nums.append("B-SI")
            if "MAC address" not in saveSoftware:
                saveSoftware.append("MAC address")
            if "IP address" not in saveSoftware:
                saveSoftware.append("IP address")
            if "Port 22" not in saveSoftware:
                saveSoftware.append("Port 22")
            if "Delivery Service" not in saveSoftware:
                saveSoftware.append("Delivery Service")
            if "McAfee Email Protection" not in saveSoftware:
                saveSoftware.append("McAfee Email Protection")
   
        #Special cases - regional locations
        #Russia's Foreign Intelligence Service, the Middle East
        elif ((word in "Russia's Foreign Intelligence Service") and (text[n+1].strip() in "Russia's Foreign Intelligence Service")) or \
             ((word in "Russia's Foreign Intelligence Service") and (text[n-1].strip() in "Russia's Foreign Intelligence Service")) or \
             ((word in "the Middle East") and (text[n+1].strip() in "the Middle East")) or \
             ((word in "the Middle East") and (text[n-1].strip() in "the Middle East")) :
            nums.append("B-RL")
            if "Russia's Foreign Intelligence Service" not in saveLocation:
                saveLocation.append("Russia's Foreign Intelligence Service")
            if "the Middle East" not in saveLocation:
                saveLocation.append("the Middle East")
            
        else:
            nums.append("O")
        n += 1
    return nums
    
#-----------------------------------------------------------------------
#Main function
if __name__ == '__main__':
    path = "Mitre-Split-Word"
    savepath = "Mitre-Split-Word-BIO"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    #Iterate over the files
    k = 0
    while k<len(filenames):
        filename = path + "//" + filenames[k]
        print("-------------------------")
        print(filename)
        content = get_content(filename)

        #Annotate the words with BIO labels
        nums = data_annotation(content)
        #print(nums)
        print(len(content),len(nums))

        #Save the results
        filename = filenames[k].replace(".txt", ".csv")
        savename = savepath + "//" + filename
        f = open(savename, "w", encoding="utf8", newline='')
        fwrite = csv.writer(f)
        fwrite.writerow(['word','label'])
        n = 0
        while n<len(content):
            fwrite.writerow([content[n],nums[n]])
            n += 1
        f.close()
        print("-------------------------\n\n")
        
        #if k>=28:
        #    break
        k += 1

    #-------------------------------------------------------------------------------------------------
    #Print the collected entity dictionaries
    saveCVE.sort()
    print(saveCVE)
    print("CVE vulnerabilities:", len(saveCVE))

    saveAPT.sort()
    print(saveAPT)
    print("APT organizations:", len(saveAPT))

    saveLocation.sort()
    print(saveLocation)
    print("Regional locations:", len(saveLocation))

    saveIndustry.sort()
    print(saveIndustry)
    print("Targeted industries:", len(saveIndustry))

    saveSoftware.sort()
    print(saveSoftware)
    print("Software used:", len(saveSoftware))

    saveMethod.sort()
    print(saveMethod)
    print("Attack methods:", len(saveMethod))

    saveOS.sort()
    print(saveOS)
    print("Operating systems:", len(saveOS))


Warm reminder:
Please think through how this data annotation process can be corrected and optimized. In addition, the BIO annotation code above still needs adjustment; more accurate annotations will benefit all downstream entity recognition work.


4. Data set division

Before training the entity recognition models, we randomly divide the data set into a training set, a test set, and a validation set.

  • Randomly divide the files in Mitre-Split-Word-BIO into three folders (train, test, val); a sketch of this step is given below the list
  • Then merge each folder into a single TXT file; the subsequent training and testing code reads these files
    – dataset-train.txt, dataset-test.txt, dataset-val.txt
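The random division itself is not shown in the original code; the following is a minimal sketch of that step, assuming an 8:1:1 split ratio and the folder names train, test, and val used by the merging code below:

#encoding:utf-8
#Minimal sketch of the random train/test/val split (assumed 8:1:1 ratio)
import os
import random
import shutil

source = "Mitre-Split-Word-BIO"

files = os.listdir(source)
random.shuffle(files)

#Compute the boundaries of each subset
n = len(files)
n_train = int(n * 0.8)
n_test = int(n * 0.1)
splits = {
    "train": files[:n_train],
    "test": files[n_train:n_train + n_test],
    "val": files[n_train + n_test:]
}

#Copy the files into the three folders
for folder, subset in splits.items():
    os.makedirs(folder, exist_ok=True)
    for name in subset:
        shutil.copy(os.path.join(source, name), os.path.join(folder, name))
    print(folder, len(subset))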


The complete code for merging each folder into a single TXT file is as follows:

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv

#------------------------------------------------------------------------
#Get file paths and names
def get_filepath(path):
    entities = {}              #entity categories (unused placeholder)
    files = os.listdir(path)   #list the directory
    return files

#-----------------------------------------------------------------------
#Get file content
def get_content(filename):
    content = ""
    fr = open(filename, "r", encoding="utf8")
    reader = csv.reader(fr)
    k = 0
    for r in reader:
        if k>0 and (r[0]!="" or r[0]!=" ") and r[1]!="":
            content += r[0] + " " + r[1] + "\n"
        elif (r[0]=="" or r[0]==" ") and r[1]!="":
            content += "UNK" + " " + r[1] + "\n"
        elif (r[0]=="" or r[0]==" ") and r[1]=="":
            content += "\n"
        k += 1
    return content
    
#-----------------------------------------------------------------------
#Main function
if __name__ == '__main__':
    #Get the file names
    path = "train"
    #path = "test"
    #path = "val"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")
    savefilename = "dataset-train.txt"
    #savefilename = "dataset-test.txt"
    #savefilename = "dataset-val.txt"
    f = open(savefilename, "w", encoding="utf8")

    #Iterate over the files
    k = 0
    while k<len(filenames):
        filename = path + "//" + filenames[k]
        print(filename)
        content = get_content(filename)
        print(content)
        f.write(content)
        k += 1
    f.close()



5. CRF-based entity recognition

With the data prepared, we can now run entity recognition experiments. We start with the representative Conditional Random Field (CRF) model; readers are asked to study the principles of CRF on their own.
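As a quick reference (this is the standard formulation of a linear-chain CRF layer, not something specific to this article's code): given an input sequence $x = (x_1, \dots, x_n)$, the score of a tag sequence $y = (y_1, \dots, y_n)$ combines per-token emission scores with tag-transition scores, and the model normalizes over all possible tag sequences:

$$
\mathrm{score}(x, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=1}^{n-1} T_{y_i, y_{i+1}}, \qquad
P(y \mid x) = \frac{\exp\big(\mathrm{score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)}
$$

Training maximizes the log-likelihood of the gold tag sequence (this is what crf_loss computes), and decoding finds the highest-scoring sequence with the Viterbi algorithm (hence crf_viterbi_accuracy).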


1. Install keras-contrib

For the CRF model, the author installed the keras-contrib extension package, which provides the CRF layer.

First, if the reader directly runs "pip install keras-contrib", an error may be reported; installing remotely from GitHub may also fail:

  • pip install git+https://www.github.com/keras-team/keras-contrib.git

You may even see the error ModuleNotFoundError: No module named 'keras_contrib'.


In the second step, the author downloads the source from GitHub and installs it locally:

git clone https://www.github.com/keras-team/keras-contrib.git
cd keras-contrib
python setup.py install

If these commands finish without errors, keras-contrib is installed successfully.

Readers can download the code and the extension packages from my resources.


2. Install Keras

You also need to install the keras and TensorFlow extension packages.


If the TensorFlow download is too slow, you can switch to the Tsinghua University mirror; version 2.2 is installed here.

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow==2.2



3. Complete code

The complete code is as follows:

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv
import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import Model
from keras.layers import Masking, Embedding, Bidirectional, LSTM, Dense
from keras.layers import Input, TimeDistributed, Activation
from keras.models import load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras import backend as K
from sklearn import metrics

#------------------------------------------------------------------------
#Step 1: Data preprocessing
#------------------------------------------------------------------------
train_data_path = "dataset-train.txt"  #training data
test_data_path = "dataset-test.txt"    #test data
val_data_path = "dataset-val.txt"      #validation data
char_vocab_path = "char_vocabs.txt"    #vocabulary file

special_words = ['<PAD>', '<UNK>']     #special tokens

#BIO labels
label2idx = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
             "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}

# Mapping from index to BIO label
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)

# Read the vocabulary file
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs
print(char_vocabs)
print("--------------------------------------------\n\n")

# Mapping between words and indices, e.g. {'<PAD>': 0, '<UNK>': 1, 'APT-C-36': 2, ...}
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}
print(idx2vocab)
print("--------------------------------------------\n\n")
print(vocab2idx)
print("--------------------------------------------\n\n")

#------------------------------------------------------------------------
#Step 2: Read the training corpus
#------------------------------------------------------------------------
def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
    sent_, tag_ = [], []
    for line in lines:
        if line != '\n':        #still inside a sentence
            line = line.strip()
            [char, label] = line.split()
            sent_.append(char)
            tag_.append(label)
        else:
            #a blank line ends the sentence; map words and labels to indices
            #vocab2idx[0] => <PAD>
            sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
            tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
            datas.append(sent_ids)
            labels.append(tag_ids)
            sent_, tag_ = [], []
    return datas, labels

#Raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)

#Sanity-check output: 1639 1639 923 923
print(len(train_datas_), len(train_labels_), len(test_datas_), len(test_labels_))
print(train_datas_[5])
print([idx2vocab[idx] for idx in train_datas_[5]])
print(train_labels_[5])
print([idx2label[idx] for idx in train_labels_[5]])

#------------------------------------------------------------------------
#Step 3: Pad the data and one-hot encode the labels
#------------------------------------------------------------------------
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

# padding data
print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)

test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)
# (1639, 100) (923, 100)

# encoder one-hot
train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)
# (1639, 100, 8) (923, 100, 8)

#------------------------------------------------------------------------
#Step 4: Build the CRF model
#------------------------------------------------------------------------
EPOCHS = 20
BATCH_SIZE = 64
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)
K.clear_session()
print(VOCAB_SIZE, CLASS_NUMS, '\n') #3860 8

#Model construction: CRF
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, 32, mask_zero=False)(x)
x = TimeDistributed(Dense(CLASS_NUMS))(x)
outputs = CRF(CLASS_NUMS)(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

flag = "test"
if flag=="train":
    #Train the model
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=BATCH_SIZE)
    print(model.metrics_names)
    print(score)
    model.save("ch_ner_model.h5")
else:
    #------------------------------------------------------------------------
    #Step 5: Load the trained model and predict
    #------------------------------------------------------------------------
    char_vocab_path = "char_vocabs.txt"   #vocabulary file
    model_path = "ch_ner_model.h5"        #model file
    ner_labels = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
                  "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}
    special_words = ['<PAD>', '<UNK>']
    MAX_LEN = 100
    
    #Prediction
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)         #predicted label indices
    z_labels = np.argmax(test_labels, axis=2)    #ground-truth label indices
    word_labels = test_datas                     #word indices
    
    k = 0
    final_y = []       #predicted labels
    final_z = []       #ground-truth labels
    final_word = []    #corresponding words
    while k<len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        #print("Predicted:", [idx2label[idx] for idx in y])
        z = z_labels[k]
        #print(z)
        for idx in z:    
            final_z.append(idx2label[idx])
        #print("Ground truth:", [idx2label[idx] for idx in z])
        word = word_labels[k]
        #print(word)
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print("最终结果大小:", len(final_y),len(final_z))
    
    n = 0
    numError = 0
    numRight = 0
    while n<len(final_y):
        if final_y[n]!=final_z[n] and final_z[n]!='O':
            numError += 1
        if final_y[n]==final_z[n] and final_z[n]!='O':
            numRight += 1
        n += 1
    print("预测错误数量:", numError)
    print("预测正确数量:", numRight)
    print("Acc:", numRight*1.0/(numError+numRight))
    print(y_pred.shape)
    print(len(test_datas_), len(test_labels_))
    print("预测单词:", [idx2vocab[idx] for idx in test_datas_[0]])
    print("真实结果:", [idx2label[idx] for idx in test_labels_[0]])

    #Save the results to a CSV file
    fw = open("Final_CRF_Result.csv", "w", encoding="utf8", newline='')
    fwrite = csv.writer(fw)
    fwrite.writerow(['pre_label','real_label', 'word'])
    n = 0
    while n<len(final_y):
        fwrite.writerow([final_y[n],final_z[n],final_word[n]])
        n += 1
    fw.close()

The constructed model structure is printed by model.summary().

The running results are shown below. Set flag to "train" to train the model; after training completes, change flag to "test" to run prediction.

  32/1475 [..............................] - ETA: 0s - loss: 0.0102 - crf_viterbi_accuracy: 0.9997
 416/1475 [=======>......................] - ETA: 5s - loss: 0.0143 - crf_viterbi_accuracy: 0.9982
 736/1475 [=============>................] - ETA: 4s - loss: 0.0147 - crf_viterbi_accuracy: 0.9981
1056/1475 [====================>.........] - ETA: 2s - loss: 0.0141 - crf_viterbi_accuracy: 0.9983
1344/1475 [==========================>...] - ETA: 0s - loss: 0.0138 - crf_viterbi_accuracy: 0.9984
1472/1475 [============================>.] - ETA: 0s - loss: 0.0136 - crf_viterbi_accuracy: 0.9984
['loss', 'crf_viterbi_accuracy']
[0.021301430796362854, 0.9972449541091919]



6. Entity recognition based on BiLSTM-CRF

The following code builds a BiLSTM-CRF model to implement entity recognition.

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv
import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import Model
from keras.layers import Masking, Embedding, Bidirectional, LSTM, Dense
from keras.layers import Input, TimeDistributed, Activation
from keras.models import load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras import backend as K
from sklearn import metrics

#------------------------------------------------------------------------
#Step 1: Data preprocessing
#------------------------------------------------------------------------
train_data_path = "dataset-train.txt"  #training data
test_data_path = "dataset-test.txt"    #test data
val_data_path = "dataset-val.txt"      #validation data
char_vocab_path = "char_vocabs.txt"    #vocabulary file
special_words = ['<PAD>', '<UNK>']     #special tokens

#BIO labels
label2idx = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
             "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}

# Mapping from index to BIO label
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)

# Read the vocabulary file
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs

# Mapping between words and indices, e.g. {'<PAD>': 0, '<UNK>': 1, 'APT-C-36': 2, ...}
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}

#------------------------------------------------------------------------
#Step 2: Read the training corpus
#------------------------------------------------------------------------
def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
    sent_, tag_ = [], []
    for line in lines:
        if line != '\n':        #still inside a sentence
            line = line.strip()
            [char, label] = line.split()
            sent_.append(char)
            tag_.append(label)
        else:
            sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['<UNK>'] for char in sent_]
            tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
            datas.append(sent_ids)
            labels.append(tag_ids)
            sent_, tag_ = [], []
    return datas, labels

#Raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)

#------------------------------------------------------------------------
#Step 3: Pad the data and one-hot encode the labels
#------------------------------------------------------------------------
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)
test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)

train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)

#------------------------------------------------------------------------
#Step 4: Build the BiLSTM+CRF model
#------------------------------------------------------------------------
EPOCHS = 12
BATCH_SIZE = 64
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)
K.clear_session()
print(VOCAB_SIZE, CLASS_NUMS, '\n') #3860 8

#Model construction: BiLSTM-CRF
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=False)(x) #masking disabled here (mask_zero=False)
x = Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True))(x)
x = TimeDistributed(Dense(CLASS_NUMS))(x)
outputs = CRF(CLASS_NUMS)(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

flag = "train"
if flag=="train":
    #Train the model
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=BATCH_SIZE)
    print(model.metrics_names)
    print(score)
    model.save("bilstm_ner_model.h5")
else:
    #------------------------------------------------------------------------
    #Step 5: Load the trained model and predict
    #------------------------------------------------------------------------
    char_vocab_path = "char_vocabs.txt"       #vocabulary file
    model_path = "bilstm_ner_model.h5"        #model file
    ner_labels = {"O": 0, "B-AG": 1, "B-AV": 2, "B-RL": 3,
                  "B-AI": 4, "B-AM": 5, "B-SI": 6, "B-OS": 7}
    special_words = ['<PAD>', '<UNK>']
    MAX_LEN = 100
    
    #Prediction
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)         #predicted label indices
    z_labels = np.argmax(test_labels, axis=2)    #ground-truth label indices
    word_labels = test_datas                     #word indices
    
    k = 0
    final_y = []       #predicted labels
    final_z = []       #ground-truth labels
    final_word = []    #corresponding words
    while k<len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        z = z_labels[k]
        for idx in z:    
            final_z.append(idx2label[idx])
        word = word_labels[k]
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print("最终结果大小:", len(final_y),len(final_z))
    
    n = 0
    numError = 0
    numRight = 0
    while n<len(final_y):
        if final_y[n]!=final_z[n] and final_z[n]!='O':
            numError += 1
        if final_y[n]==final_z[n] and final_z[n]!='O':
            numRight += 1
        n += 1
    print("预测错误数量:", numError)
    print("预测正确数量:", numRight)
    print("Acc:", numRight*1.0/(numError+numRight))
    print("预测单词:", [idx2vocab[idx] for idx in test_datas_[0]])
    print("真实结果:", [idx2label[idx] for idx in test_labels_[0]])

Again, the constructed model structure is printed by model.summary().

Readers are encouraged to run comparative experiments and tune the hyperparameters on their own; the author will share parameter-tuning notes when time allows. A sketch of a more detailed evaluation is given below.
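The accuracy computed above only counts non-O tokens. As a hedged suggestion (not part of the original article's code), per-entity precision, recall, and F1 can be obtained from the final_z and final_y lists collected in the prediction branch, using scikit-learn's classification_report (sklearn is already imported in the scripts above):

#Per-entity evaluation sketch (assumes final_z / final_y from the prediction branch above)
from sklearn.metrics import classification_report

entity_labels = ["B-AG", "B-AV", "B-RL", "B-AI", "B-AM", "B-SI", "B-OS"]

#Report precision/recall/F1 per entity type; the labels argument leaves out O and padded positions
print(classification_report(final_z, final_y, labels=entity_labels, digits=4))

This gives a per-class breakdown instead of a single accuracy number, which makes it easier to see which entity types the CRF and BiLSTM-CRF models actually differ on.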


7. Summary

This article ends here; I hope it is helpful to you. A later article will cover the classic BERT model. September and October are really busy - I am working on my project proposal and graduation thesis - so I will write a few more security blogs once that work is done. Thank you for your support and companionship, and especially for the encouragement and support of my family. Keep up the good work!


The road of life is made up of crossroads: games, entanglements, gains and losses. Different choices bring different kinds of excitement. Although tired and busy, I was quite content to see little Luoluo, and I thank my family for their company. I hope Xiaoluo grows up happy and healthy. I love you all. Keep working hard!


(By: Eastmount, 2023-11-14, written at night in Guiyang, http://blog.csdn.net/eastmount/ )

