In-depth BERT hands-on (PyTorch)


https://www.bilibili.com/video/BV1K5411t7MD?p=5
https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw/videos
"In-depth BERT hands-on (PyTorch)" by ChrisMcCormickAI
This is the second article in ChrisMcCormickAI's 8-episode BERT series on YouTube; it walks through the PyTorch code for WordPiece Embeddings. There is a download link under the YouTube video. If you cannot access it, leave your email address in the comments and I will send you the materials once I have organized them.

Load model

Install the Hugging Face implementation

!pip install pytorch-pretrained-bert

import torch
from pytorch_pretrained_bert import BertTokenizer

# Load the pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
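
As a quick sanity check on how WordPiece works (a minimal sketch; the exact subword split depends on the vocabulary file shipped with the model), tokenize a word and watch it break into pieces:

# Tokenize a word; pieces that continue a word are prefixed with '##'.
tokens = tokenizer.tokenize("embeddings")
print(tokens)   # e.g. ['em', '##bed', '##ding', '##s'] with bert-base-uncased

# Map the pieces back to their vocabulary ids.
print(tokenizer.convert_tokens_to_ids(tokens))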

View the BERT vocabulary

Retrieve the entire token list and write it to a text file so you can browse through it.

with open("vocabulary.txt", 'w', encoding='utf-8') as f:
    
    # For each token in the vocabulary...
    for token in tokenizer.vocab.keys():
        
    # Write each token on its own line.
        f.write(token + '\n')
  • Print it out and you can see:

  • The first 999 entries are reserved positions with names of the form [unused957].

  • Line 1: [PAD] — the padding token
    Line 101: [UNK] — unknown token
    Line 102: [CLS] — marks the start of a sentence; used for classification tasks
    Line 103: [SEP] — separates the two input sentences in BERT
    Line 104: [MASK] — used by the masking mechanism

  • Lines 1000-1996 appear to be individual characters, one per line.

  • They do not seem to be sorted by frequency (for example, the letters of the alphabet appear in alphabetical order).

  • The first whole word is "the", at line 1997.

  • From here on, the words appear to be sorted by frequency.
    The first 18 tokens are whole words; line 2016 is ##s, which is probably the most common subword.
    The last whole word is "necessitated", at line 29612. (A quick spot check of these positions follows below.)
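
A quick way to spot-check these observations (a small sketch, assuming the standard bert-base-uncased vocabulary; the indices below are 0-based, one less than the line numbers above):

vocab_list = list(tokenizer.vocab.keys())

# Special tokens near the top of the file.
print(vocab_list[0])          # [PAD]
print(vocab_list[100:104])    # ['[UNK]', '[CLS]', '[SEP]', '[MASK]']

# The first whole word and the first '##' subword in the frequency-sorted region.
print(vocab_list[1996])       # the
print(vocab_list[2015])       # ##s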

Single-character tokens

The following code prints all the single-character tokens in the vocabulary, as well as all the single-character tokens preceded by '##'.

It turns out these are matching sets: each standalone character also has a '##' version. There are 997 single-character tokens.

The following cell traverses the vocabulary and collects all the single-character tokens.

one_chars = []
one_chars_hashes = []

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    
    # Record any single-character tokens.
    if len(token) == 1:
        one_chars.append(token)
    
    # Record single-character tokens preceded by the two hashes.
    elif len(token) == 3 and token[0:2] == '##':
        one_chars_hashes.append(token)

Print the single characters

print('Number of single character tokens:', len(one_chars), '\n')

# Print all of the single characters, 40 per row.

# For every batch of 40 tokens...
for i in range(0, len(one_chars), 40):
    
    # Limit the end index so we don't go past the end of the list.
    end = min(i + 40, len(one_chars))
    
    # Print out the tokens, separated by a space.
    print(' '.join(one_chars[i:end]))

Print the '##' single characters

print('Number of single character tokens with hashes:', len(one_chars_hashes), '\n')

# Print all of the single characters, 40 per row.

# Strip the hash marks, since they just clutter the display.
tokens = [token.replace('##', '') for token in one_chars_hashes]

# For every batch of 40 tokens...
for i in range(0, len(tokens), 40):
    
    # Limit the end index so we don't go past the end of the list.
    end = min(i + 40, len(tokens))
    
    # Print out the tokens, separated by a space.
    print(' '.join(tokens[i:end]))
Number of single character tokens: 997 

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \ ] ^ _ ` a b
c d e f g h i j k l m n o p q r s t u v w x y z {
    
     | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬
® ° ± ² ³ ´ µ ¶ · ¹ º » ¼ ½ ¾ ¿ × ß æ ð ÷ ø þ đ ħ ı ł ŋ œ ƒ ɐ ɑ ɒ ɔ ɕ ə ɛ ɡ ɣ ɨ
ɪ ɫ ɬ ɯ ɲ ɴ ɹ ɾ ʀ ʁ ʂ ʃ ʉ ʊ ʋ ʌ ʎ ʐ ʑ ʒ ʔ ʰ ʲ ʳ ʷ ʸ ʻ ʼ ʾ ʿ ˈ ː ˡ ˢ ˣ ˤ α β γ δ
ε ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω а б в г д е ж з и к л м н о п р с т у
ф х ц ч ш щ ъ ы ь э ю я ђ є і ј љ њ ћ ӏ ա բ գ դ ե թ ի լ կ հ մ յ ն ո պ ս վ տ ր ւ
ք ־ א ב ג ד ה ו ז ח ט י ך כ ל ם מ ן נ ס ע ף פ ץ צ ק ר ש ת ، ء ا ب ة ت ث ج ح خ د
ذ ر ز س ش ص ض ط ظ ع غ ـ ف ق ك ل م ن ه و ى ي ٹ پ چ ک گ ں ھ ہ ی ے अ आ उ ए क ख ग च
ज ट ड ण त थ द ध न प ब भ म य र ल व श ष स ह ा ि ी ो । ॥ ং অ আ ই উ এ ও ক খ গ চ ছ জ
ট ড ণ ত থ দ ধ ন প ব ভ ম য র ল শ ষ স হ া ি ী ে க ச ட த ந ன ப ம ய ர ல ள வ ா ி ு ே
ை ನ ರ ಾ ක ය ර ල ව ා ก ง ต ท น พ ม ย ร ล ว ส อ า เ ་ ། ག ང ད ན པ བ མ འ ར ལ ས မ ა
ბ გ დ ე ვ თ ი კ ლ მ ნ ო რ ს ტ უ ᄀ ᄂ ᄃ ᄅ ᄆ ᄇ ᄉ ᄊ ᄋ ᄌ ᄎ ᄏ ᄐ ᄑ ᄒ ᅡ ᅢ ᅥ ᅦ ᅧ ᅩ ᅪ ᅭ ᅮ
ᅯ ᅲ ᅳ ᅴ ᅵ ᆨ ᆫ ᆯ ᆷ ᆸ ᆼ ᴬ ᴮ ᴰ ᴵ ᴺ ᵀ ᵃ ᵇ ᵈ ᵉ ᵍ ᵏ ᵐ ᵒ ᵖ ᵗ ᵘ ᵢ ᵣ ᵤ ᵥ ᶜ ᶠ ‐ ‑ ‒ – — ―
‖ ‘ ’ ‚ “ ” „ † ‡ • … ‰ ′ ″ › ‿ ⁄ ⁰ ⁱ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁿ ₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₍
₎ ₐ ₑ ₒ ₓ ₕ ₖ ₗ ₘ ₙ ₚ ₛ ₜ ₤ ₩ € ₱ ₹ ℓ № ℝ ™ ⅓ ⅔ ← ↑ → ↓ ↔ ↦ ⇄ ⇌ ⇒ ∂ ∅ ∆ ∇ ∈ − ∗
∘ √ ∞ ∧ ∨ ∩ ∪ ≈ ≡ ≤ ≥ ⊂ ⊆ ⊕ ⊗ ⋅ ─ │ ■ ▪ ● ★ ☆ ☉ ♠ ♣ ♥ ♦ ♭ ♯ ⟨ ⟩ ⱼ ⺩ ⺼ ⽥ 、 。 〈 〉
《 》 「 」 『 』 〜 あ い う え お か き く け こ さ し す せ そ た ち っ つ て と な に ぬ ね の は ひ ふ へ ほ ま み
む め も や ゆ よ ら り る れ ろ を ん ァ ア ィ イ ウ ェ エ オ カ キ ク ケ コ サ シ ス セ タ チ ッ ツ テ ト ナ ニ ノ ハ
ヒ フ ヘ ホ マ ミ ム メ モ ャ ュ ョ ラ リ ル レ ロ ワ ン ・ ー 一 三 上 下 不 世 中 主 久 之 也 事 二 五 井 京 人 亻 仁
介 代 仮 伊 会 佐 侍 保 信 健 元 光 八 公 内 出 分 前 劉 力 加 勝 北 区 十 千 南 博 原 口 古 史 司 合 吉 同 名 和 囗 四
国 國 土 地 坂 城 堂 場 士 夏 外 大 天 太 夫 奈 女 子 学 宀 宇 安 宗 定 宣 宮 家 宿 寺 將 小 尚 山 岡 島 崎 川 州 巿 帝
平 年 幸 广 弘 張 彳 後 御 德 心 忄 志 忠 愛 成 我 戦 戸 手 扌 政 文 新 方 日 明 星 春 昭 智 曲 書 月 有 朝 木 本 李 村
東 松 林 森 楊 樹 橋 歌 止 正 武 比 氏 民 水 氵 氷 永 江 沢 河 治 法 海 清 漢 瀬 火 版 犬 王 生 田 男 疒 発 白 的 皇 目
相 省 真 石 示 社 神 福 禾 秀 秋 空 立 章 竹 糹 美 義 耳 良 艹 花 英 華 葉 藤 行 街 西 見 訁 語 谷 貝 貴 車 軍 辶 道 郎
郡 部 都 里 野 金 鈴 镇 長 門 間 阝 阿 陳 陽 雄 青 面 風 食 香 馬 高 龍 龸 fi fl ! ( ) , - . / : ? ~

The two pieces of output above show that the '##' single-character tokens and the plain single-character tokens cover exactly the same set of characters:

# Returns True
print('Are the two sets identical?', set(one_chars) == set(tokens))

Subwords vs. Whole-words

Plot some statistics about token lengths.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (10,5)

# Measure the length of every token in the vocab.
token_lengths = [len(token) for token in tokenizer.vocab.keys()]

# Plot the number of tokens of each length.
sns.countplot(token_lengths)
plt.title('Vocab Token Lengths')
plt.xlabel('Token Length')
plt.ylabel('# of Tokens')

print('Maximum token length:', max(token_lengths))

[Figure: Vocab Token Lengths]
Count the tokens beginning with '##'.

num_subwords = 0

subword_lengths = []

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    
    # If it's a subword...
    if len(token) >= 2 and token[0:2] == '##':
        
        # Tally all subwords
        num_subwords += 1

        # Measure the sub word length (without the hashes)
        length = len(token) - 2

        # Record the lengths.        
        subword_lengths.append(length)

How much of the full vocabulary do these subwords occupy?

vocab_size = len(tokenizer.vocab.keys())

print('Number of subwords: {:,} of {:,}'.format(num_subwords, vocab_size))

# Calculate the percentage of words that are '##' subwords.
prcnt = float(num_subwords) / vocab_size * 100.0

print('%.1f%%' % prcnt)
Number of subwords: 5,828 of 30,522
19.1%

Plot the subword length distribution.

sns.countplot(subword_lengths)
plt.title('Subword Token Lengths (w/o "##")')
plt.xlabel('Subword Length')
plt.ylabel('# of ## Subwords')

[Figure: Subword Token Lengths (w/o "##")]
You can check some misspelling examples yourself:

'misspelled' in tokenizer.vocab  # True  (correct spelling)
'mispelled' in tokenizer.vocab   # False (misspelling)
'government' in tokenizer.vocab  # True  (correct spelling)
'goverment' in tokenizer.vocab   # False (misspelling)
'beginning' in tokenizer.vocab   # True  (correct spelling)
'begining' in tokenizer.vocab    # False (misspelling)
'separate' in tokenizer.vocab    # True  (correct spelling)
'seperate' in tokenizer.vocab    # False (misspelling)
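
Because the misspellings are missing from the vocabulary, the tokenizer does not produce [UNK]; it falls back to subword pieces instead. A small sketch (the exact pieces depend on the vocabulary):

# A misspelled word is assembled from whatever subword pieces cover it.
print(tokenizer.tokenize('mispelled'))    # split into several subword pieces
print(tokenizer.tokenize('misspelled'))   # the correct spelling is a single token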

For contractions

"can't" in tokenizer.vocab    # False
"cant" in tokenizer.vocab    # False

Word-initial vs. mid-word subwords

For single characters, every character has both a standalone version and a '##' version. Is the same true for multi-character subwords?

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    
    # If it's a subword...
    if len(token) >= 2 and token[0:2] == '##':
        if not token[2:] in tokenizer.vocab:
            print('Did not find a token for', token[2:])
            break

You can see that the loop stops at the first subword whose bare form is missing: '##ly' is in the vocabulary, but 'ly' is not.

Did not find a token for ly

'##ly' in tokenizer.vocab    # True
'ly' in tokenizer.vocab    # False
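
So '##ly' is only used when 'ly' continues a longer word; as a standalone string, 'ly' has to be assembled from other pieces. A hedged sketch (the exact splits depend on the vocabulary):

# Standalone 'ly' is not in the vocab, so WordPiece builds it from smaller pieces.
print(tokenizer.tokenize('ly'))       # e.g. ['l', '##y']

# '##ly' can appear when 'ly' ends a word that is not itself a whole-word token.
print(tokenizer.tokenize('bigly'))    # e.g. ['big', '##ly'] ('bigly' is just an illustrative made-up word)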

For names

Download data

!pip install wget

import wget
import random 

print('Beginning file download with wget module')

url = 'http://www.gutenberg.org/files/3201/files/NAMES.TXT'
wget.download(url, 'first-names.txt')

Decode, lowercase, and count the names

# Read them in.
with open('first-names.txt', 'rb') as f:
    names_encoded = f.readlines()

names = []

# Decode the names, convert to lowercase, and strip newlines.
for name in names_encoded:
    try:
        names.append(name.rstrip().lower().decode('utf-8'))
    except:
        continue

print('Number of names: {:,}'.format(len(names)))
print('Example:', random.choice(names))

See how many names are in the BERT vocabulary

num_names = 0

# For each name in our list...
for name in names:

    # If it's in the vocab...
    if name in tokenizer.vocab:
        # Tally it.
        num_names += 1

print('{:,} names in the vocabulary'.format(num_names))
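
Names that are missing from the vocabulary are handled like any other out-of-vocabulary word: they get split into subword pieces. A small sketch (which name gets picked and how it splits is illustrative only):

# Check one name directly, then see how the tokenizer represents it.
name = random.choice(names)
print(name, name in tokenizer.vocab)
print(tokenizer.tokenize(name))   # a single token if present, otherwise subword pieces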

For numbers

# Count how many numbers are in the vocabulary.
count = 0

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():

    # Tally if it's a number.
    if token.isdigit():
        count += 1
        
        # Any numbers >= 10,000?
        if len(token) > 4:
            print(token)

print('Vocab includes {:,} numbers.'.format(count))

Count how many of the years from 1600 to 2020 are in the vocabulary

# Count how many dates between 1600 and 2021 are included.
count = 0 
for i in range(1600, 2021):
    if str(i) in tokenizer.vocab:
        count += 1

print('Vocab includes {:,} of 421 dates from 1600 - 2020'.format(count))
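
In practice this means most common four-digit years pass through as a single token, while longer numbers are chopped into pieces. A hedged sketch (exact splits depend on the vocabulary):

# A common year is usually its own token; a long number is split into subword pieces.
print(tokenizer.tokenize('1997'))       # e.g. ['1997']
print(tokenizer.tokenize('1234567'))    # split into several digit pieces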

Source: blog.csdn.net/qq_42388742/article/details/112984643