(1) The necessity of language analysis:
Suppose your company releases a brand-new mobile phone. The launch brings media coverage and user feedback from many different sources. Faced with this data, you may want to know
which features of the phone people are paying attention to,
what people think of the phone,
and which users have expressed a willingness to buy.
With data at this scale, analyzing it by hand is clearly impractical. This is the scenario where language analysis comes in handy.
Having machines perform these analytical tasks in place of humans is exactly what language analysis does.
(2) Common operations of language analysis:
(1) Word segmentation
Chinese word segmentation (Word Segmentation, WS) is the task of splitting a sequence of Chinese characters into a sequence of words. In Chinese, the word is the most basic unit that carries meaning, so word segmentation underlies many Chinese natural language processing tasks such as information retrieval, text classification, and sentiment analysis.
For example, take the sentence
Premier Li Keqiang of the State Council, while inspecting Shanghai Waigaoqiao, proposed supporting Shanghai in actively exploring new mechanisms. (国务院总理李克强调研上海外高桥时提出,支持上海积极探索新机制。)
The correct word segmentation is
State Council (国务院) / Premier (总理) / Li Keqiang (李克强) / inspect (调研) / Shanghai (上海) / Waigaoqiao (外高桥) / when (时) / propose (提出) / , / support (支持) / Shanghai (上海) / actively (积极) / explore (探索) / new (新) / mechanism (机制) / .
If the word segmentation system instead produces
State Council (国务院) / Premier (总理) / Li Ke (李克) / emphasize (强调) / research (研) / Shanghai (上海) …
Since "emphasize" (强调) is also a common word, this segmentation is quite likely to occur. A search engine would then have difficulty retrieving this document for a query about Li Keqiang.
Disambiguation is the main difficulty in word segmentation tasks.
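The ambiguity described above can be reproduced with a toy segmenter. The sketch below uses forward maximum matching, a simple dictionary-based method chosen only for illustration; it is not the algorithm LTP uses. The function name, the tiny vocabulary, and the `max_len` parameter are all assumptions made for this example, and the Chinese string is the original-language form of the first clause of the example sentence.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


sentence = "国务院总理李克强调研上海外高桥时提出"

# Hypothetical toy dictionary; note that it contains 强调 ("emphasize").
base_vocab = {"国务院", "总理", "强调", "调研", "上海", "外高桥", "提出"}

# Without the person name in the dictionary, 强调 wins and the name is broken up.
without_name = forward_max_match(sentence, base_vocab)
# With 李克强 in the dictionary, the longest match recovers the intended split.
with_name = forward_max_match(sentence, base_vocab | {"李克强"})

print("/".join(without_name))
print("/".join(with_name))
```

The two runs differ only in whether the name is in the dictionary, which is why real systems rely on statistical disambiguation rather than on dictionary lookup alone.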
(2) Part-of-speech tagging
Part-of-speech (POS) tagging is the task of assigning a part-of-speech category to each word in a sentence. The categories may be noun, verb, adjective, and so on. The sentence below is an example of part-of-speech tagging. In it, v stands for verb, n for noun, a for adjective, nh for person name, ni for organization name, ns for place name, and wp for punctuation mark; other tags in the same set include c for conjunction and d for adverb.
Different part-of-speech-tagged corpora follow different specifications. Here the Language Technology Platform (LTP) behind the Language Cloud of Harbin Institute of Technology is used as the example:
State Council/ni Premier/n Li Keqiang/nh Research/v Shanghai/ns Waigaoqiao/ns Time/n Proposed/v ,/wp Support/v Shanghai/ns Active/a Explore/v New/a Mechanism/n ./wp
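The tagged sentence above uses a plain word/tag layout, so it can be split back into (word, tag) pairs with a few lines of Python. This is a minimal sketch that assumes tokens are separated by spaces and that the last slash in each token separates the word from its tag. The helper name `parse_tagged` is hypothetical, and the Chinese tokens are the original-language words of the example sentence.

```python
def parse_tagged(line):
    """Split 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a word containing '/' stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = ("国务院/ni 总理/n 李克强/nh 调研/v 上海/ns 外高桥/ns 时/n 提出/v ,/wp "
          "支持/v 上海/ns 积极/a 探索/v 新/a 机制/n 。/wp")

pairs = parse_tagged(tagged)

# Filter by tag: v = verb; nh / ni / ns = person, organization, place name.
verbs = [word for word, tag in pairs if tag == "v"]
proper_names = [word for word, tag in pairs if tag in {"nh", "ni", "ns"}]

print(verbs)
print(proper_names)
```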
Part-of-speech tag set: LTP uses the 863 part-of-speech tag set, and the meaning of each tag is as follows:
(3) Named Entity Recognition
Named Entity Recognition (NER) is the task of locating and identifying entities such as person names, place names, and institution names in word sequences in sentences.
For the example above, the result of named entity recognition is:
Premier Li Keqiang (person name) of the State Council (organization name) proposed supporting Shanghai (place name) in actively exploring new mechanisms while inspecting Shanghai Waigaoqiao (place name).
Named entity recognition plays an important role in mining the entities in a text and analyzing them further.
The entity types to recognize are generally task-specific. LTP recognizes the three most basic entity types: person name, place name, and organization name.
Users can easily extend these to other entity types such as brand names and software names.
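NER output is commonly delivered as one position tag per word, which then has to be merged back into entity spans. The sketch below assumes a B/I/E/S/O scheme (begin, inside, end, single, outside) with the type codes Nh, Ni, and Ns for person, organization, and place names. The tag sequence is hand-written for illustration and is not actual LTP output, and `decode_entities` is a hypothetical helper name.

```python
def decode_entities(words, tags):
    """Collect (entity_text, entity_type) pairs from B/I/E/S/O position tags."""
    entities, buffer = [], []
    for word, tag in zip(words, tags):
        if tag == "O":                      # outside any entity
            buffer = []
            continue
        position, _, kind = tag.partition("-")
        if position == "S":                 # single-word entity
            entities.append((word, kind))
            buffer = []
        elif position == "B":               # first word of a multi-word entity
            buffer = [word]
        elif position == "I" and buffer:    # middle word
            buffer.append(word)
        elif position == "E" and buffer:    # last word: emit the whole span
            buffer.append(word)
            entities.append(("".join(buffer), kind))
            buffer = []
    return entities


# Original-language words of the first clause of the example sentence.
words = ["国务院", "总理", "李克强", "调研", "上海", "外高桥", "时", "提出"]
# Illustrative tags only (an assumption, not parser output).
tags = ["S-Ni", "O", "S-Nh", "O", "B-Ns", "E-Ns", "O", "O"]

entities = decode_entities(words, tags)
print(entities)
```

Adding a new entity type such as a brand name would only require a new type code in the tags; the decoding logic stays the same.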
(4) Dependency syntax analysis
Dependency Parsing (DP) reveals the syntactic structure of a language unit by analyzing the dependencies between its components.
Intuitively, dependency parsing identifies the grammatical components of a sentence, such as subject, predicate, and object, or attributive, adverbial, and complement, and analyzes the relationships between them. For the example above, the analysis result is:
From the analysis result we can see that the core predicate of the sentence is "proposed", its subject is "Li Keqiang", its object is "support Shanghai...", "when investigating..." is its (time) adverbial, the modifier of "Li Keqiang" is "Premier of the State Council", and the object of "support" is "exploring new mechanisms". With this syntactic analysis we can easily see that the one who "proposed" is "Li Keqiang", not "Shanghai" or "Waigaoqiao", even though all three are nouns and the latter two are closer to "proposed".
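The relations described in the paragraph above can be written down as labeled arcs and queried directly. The sketch below hand-encodes only the relations that the text mentions; it is not a complete parse and not parser output. The labels HED (head), SBV (subject-verb), VOB (verb-object), ATT (attribute), and ADV (adverbial) are assumed to follow LTP's dependency label set, and the Chinese tokens are the original-language words of the example sentence.

```python
# (dependent, relation, head) triples; head is None for the sentence root.
arcs = [
    ("提出", "HED", None),       # "proposed" is the core predicate
    ("李克强", "SBV", "提出"),    # "Li Keqiang" is the subject of "proposed"
    ("总理", "ATT", "李克强"),    # "Premier" modifies "Li Keqiang"
    ("时", "ADV", "提出"),        # "when ..." is the adverbial of "proposed"
    ("支持", "VOB", "提出"),      # "support ..." is the object of "proposed"
    ("探索", "VOB", "支持"),      # "explore ..." is the object of "support"
]


def dependents(head, relation):
    """Return every word attached to `head` by an arc labeled `relation`."""
    return [dep for dep, rel, h in arcs if h == head and rel == relation]


root = next(dep for dep, rel, _ in arcs if rel == "HED")
who_proposed = dependents(root, "SBV")
what_was_proposed = dependents(root, "VOB")

print(root, who_proposed, what_was_proposed)
```

Answering "who proposed?" becomes a lookup on the SBV arc of the root, regardless of which nouns happen to stand closest to the verb.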
The dependency relation labels (15 types in total) and their meanings are as follows:
(5) Semantic role labeling
Semantic Role Labeling (SRL) is a shallow semantic analysis technique that labels certain phrases in a sentence as arguments (semantic roles) of a given predicate, such as agent, patient, time, and place. It benefits applications such as question answering, information extraction, and machine translation. For the example above, the result of semantic role labeling is:
The sentence contains three predicates: propose, investigate, and explore. Taking "explore" as an example, "actively" is its manner (generally labeled ADV), and "new mechanism" is its patient (generally labeled A1).
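One way to hold such a result in code is a predicate together with a list of (label, phrase) pairs. The sketch below encodes only what is stated above for "explore" and separates core arguments from the rest. The dictionary layout and the helper `is_core` are assumptions for illustration, as is the rule that core labels are exactly A0 through A5.

```python
import re

# Roles of the predicate 探索 ("explore"), as described in the text.
frame = {
    "predicate": "探索",
    "arguments": [
        ("ADV", "积极"),     # "actively": manner of exploring
        ("A1", "新机制"),    # "new mechanism": what is explored
    ],
}


def is_core(label):
    """Core argument labels are assumed to be A0 through A5."""
    return re.fullmatch(r"A[0-5]", label) is not None


core = [(label, phrase) for label, phrase in frame["arguments"] if is_core(label)]
additional = [(label, phrase) for label, phrase in frame["arguments"] if not is_core(label)]

print(core)
print(additional)
```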
The core semantic roles are A0-5: A0 usually denotes the agent of the action, A1 usually denotes what the action affects, and A2-5 take different meanings depending on the predicate verb. The remaining 15 semantic roles are additional roles, such as LOC for location and TMP for time. The list of additional semantic roles is as follows:
(6) Semantic Dependency Analysis
Semantic Dependency Parsing (SDP) analyzes the semantic associations between the language units of a sentence and presents them as a dependency structure. The advantage of describing sentence semantics with semantic dependencies is that the words themselves need not be abstracted. Instead, each word is described through the semantic frame it bears, and the number of argument types is always far smaller than the number of words. The goal of semantic dependency analysis is to get past the constraints of surface syntactic structure and obtain deep semantic information directly. For example, the following three sentences use different wordings to express the same semantic information: Zhang San performed an eating action, and the object of that action was an apple.
Semantic dependency analysis is not affected by syntactic structure. Language units that are directly related in meaning are connected directly by dependency arcs, and each arc is labeled with the corresponding semantic relation.
This is also an important difference between semantic dependency analysis and syntactic dependency analysis.
The example above compares the results of syntactic dependency analysis and semantic dependency analysis, and two significant differences stand out. First, syntactic dependency pays some attention to the role of function words (such as prepositions) in sentence structure, whereas semantic dependency tends to draw dependency arcs directly between content words that are semantically related, with function words serving only as auxiliary markers. Second, the relations labeled on the arcs are completely different. Semantic dependency relations derive from argument relations and can be used to answer questions such as where I am drinking soup, what I am drinking soup with, who is drinking soup, and what I am drinking. Syntactic dependency relations cannot do this.
Semantic dependency is also related to semantic role labeling. Semantic role labeling attends only to the arguments of the main predicates of a sentence and to the relations between predicates and arguments. Semantic dependency attends not only to the relations between predicates and arguments, but also to the semantic relations between predicates, between arguments, and within arguments. Semantic dependencies therefore describe the semantic information of a sentence more completely.
Semantic dependencies fall into three categories: main semantic roles, each of which also has a nested form and a reverse form; event relations, which describe the relation between two events; and semantic attachment markers, which mark attached information such as the speaker's tone.
Relation type | Tag | Description | Example |
---|---|---|---|
Agent relation | Agt | Agent | I send her a bouquet of flowers (I <-- send) |
Experiencer relation | Exp | Experiencer | I run fast (run --> I) |
Affection relation | Aft | Affection | I miss my hometown (miss --> I) |
Possessor relation | Poss | Possessor | He has a good read (he <-- has) |
Patient relation | Pat | Patient | He hit Xiao Ming (hit --> Xiao Ming) |
Content relation | Cont | Content | He heard firecrackers (heard --> firecrackers) |
Product relation | Prod | Product | He wrote a novel (wrote --> novel) |
Origin relation | Orig | Origin | Our army captured four enemy tanks (captured --> tanks) |
Dative relation | Datv | Dative | He told me a secret (told --> me) |
Comparison role | Comp | Comitative | His grades are better than mine (he --> me) |
Belongings role | Belg | Belongings | Lao Zhao has two daughters (Lao Zhao <-- has) |
Classification role | Clas | Classification | He is a middle school student (is --> middle school student) |
Basis role | Accd | According | This court pronounces judgment according to law (law <-- pronounces judgment) |
Reason role | Reas | Reason | He is worrying about his daughter's marriage (worrying --> marriage) |
Intention role | Int | Intention | He worked hard for the gold medal (gold medal <-- worked hard) |
Consequence role | Cons | Consequence | He ran himself sweaty (ran --> sweaty) |
Manner role | Mann | Manner | The ball slowly rolls into the empty goal (slowly <-- rolls) |
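Tool role | Tool | Tool | She cooks porridge in a clay pot (clay pot <-- cooks porridge) |
Material role | Malt | Material | She cooks porridge with millet (millet <-- cooks porridge) |
Time role | Time | Time | In the Tang dynasty there was a Li Bai (Tang dynasty <-- there was) |
Location role | Loc | Location | This house faces south (faces --> south) |
Process role | Proc | Process | The train is crossing the Yangtze River Bridge (crossing --> bridge) |
Direction role | Dir | Direction | The troops rushed south (rushed --> south) |
Scope role | Sco | Scope | Products should compete on quality (compete --> quality) |
Quantity role | Quan | Quantity | A year has 365 days (has --> days) |
Quantity phrase | Qp | Quantity-phrase | 三本书, "three books" (三, "three" --> 本, measure word) |
Frequency role | Freq | Frequency | He reads every day (every day <-- reads) |
Sequence role | Seq | Sequence | He ran in first place (ran --> first place) |
Description role | Desc(Feat) | Description | He has grown fat (grown --> fat) |
Host role | Host | Host | housing area (housing <-- area) |
Name-modifier role | Nmod | Name-modifier | Gogol Street (Gogol <-- Street) |
Time-modifier role | Tmod | Time-modifier | Monday morning (Monday <-- morning) |
Reverse role | r + main role | | the little girl playing basketball (playing basketball <-- girl) |
Nested role | d + main role | | Grandpa saw his grandson running (saw --> running) |
Coordination relation | eCoo | event Coordination | I like singing and dancing (singing --> dancing) |
Selection relation | eSelt | event Selection | Would you like tea or coffee (tea --> coffee) |
Equivalence relation | eEqu | event Equivalent | The three of them walked together (they --> three people) |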
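Precedent relation | ePrec | event Precedent | 首先 (first of all), 先 (first) |
Successor relation | eSucc | event Successor | 随后 (afterwards), 然后 (then) |
Progression relation | eProg | event Progression | 况且 (moreover), 并且 (and also) |
Adversative relation | eAdvt | event Adversative | 却 (but), 然而 (however) |
Cause relation | eCau | event Cause | 因为 (because), 既然 (since) |
Result relation | eResu | event Result | 因此 (therefore), 以致 (so that) |
Inference relation | eInf | event Inference | 才 (only then), 则 (then) |
Condition relation | eCond | event Condition | 只要 (as long as), 除非 (unless) |
Supposition relation | eSupp | event Supposition | 如果 (if), 要是 (if) |
Concession relation | eConc | event Concession | 纵使 (even if), 哪怕 (even though) |
Method relation | eMetd | event Method | |
Purpose relation | ePurp | event Purpose | 为了 (in order to), 以便 (so that) |
Abandonment relation | eAban | event Abandonment | 与其 (rather than), 也不 (nor) |
Preference relation | ePref | event Preference | 不如 (better to), 宁愿 (would rather) |
Summary relation | eSum | event Summary | 总而言之 (in short) |
Recount relation | eRect | event Recount | 例如 (for example), 比方说 (for instance) |
Conjunction marker | mConj | Recount Marker | 和 (and), 或 (or) |
的-particle marker | mAux | Auxiliary | 的, 地, 得 (structural particles) |
Preposition marker | mPrep | Preposition | 把 (object marker), 被 (passive marker) |
Tone marker | mTone | Tone | 吗, 呢 (sentence-final particles) |
Time marker | mTime | Time | 才 (just), 曾经 (once) |
Range marker | mRang | Range | 都 (all), 到处 (everywhere) |
Degree marker | mDegr | Degree | 很 (very), 稍微 (slightly) |
Frequency marker | mFreq | Frequency Marker | 再 (again), 常常 (often) |
Direction marker | mDir | Direction Marker | 上去 (up), 下来 (down) |
Parenthesis marker | mPars | Parenthesis Marker | 总的来说 (generally speaking), 众所周知 (as everyone knows) |
Negation marker | mNeg | Negation Marker | 不, 没, 未 (not) |
Modality marker | mMod | Modal Marker | 幸亏 (fortunately), 会 (will), 能 (can) |
Punctuation marker | mPunc | Punctuation Marker | ,。! |
Repetition marker | mPept | Repetition Marker | 走啊走, "walk and walk" (走 --> 走) |
Plural marker | mMaj | Majority Marker | 们 (plural suffix), 等 (and so on) |
Bleached content word marker | mVain | Vain Marker | |
Separation marker | mSepa | Separation Marker | 吃了个饭, "had a meal" (吃 --> 饭); 洗了个澡, "took a bath" (洗 --> 澡) |
Root node | Root | Root | core node of the whole sentence |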
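The tags in the table follow a visible naming pattern: event relations start with a lowercase e, markers with a lowercase m, reverse and nested roles prefix a main role with r or d, and main roles start with an uppercase letter. The sketch below sorts a tag into its category from that pattern alone. It is an assumption drawn from the table rather than a documented rule, and `classify` is a hypothetical helper name.

```python
def classify(tag):
    """Guess the category of a semantic dependency tag from its prefix."""
    if tag == "Root":
        return "root"
    first, rest = tag[0], tag[1:]
    # A lowercase prefix followed by an uppercase letter marks a derived tag.
    if first.islower() and rest[:1].isupper():
        prefixes = {
            "e": "event relation",
            "m": "marker",
            "r": "reverse role",
            "d": "nested role",
        }
        return prefixes.get(first, "unknown")
    return "main role"


for tag in ["Pat", "rPat", "dPat", "eCoo", "mNeg", "Dir", "Root"]:
    print(tag, "->", classify(tag))
```

Checking the second character as well as the first keeps main roles such as Dir and Datv from being mistaken for nested roles.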
The material above was compiled from the Language Cloud of Harbin Institute of Technology.
20180503, at Qiushi Garden (求是园)