UCAS-AI Academy-Special Course on Natural Language Processing-Lecture 1-Course Notes

Introduction

Basic Information

  • 50 class hours, 3 credits
  • Instructors: Zong Chengqing, Zhang Jiajun
  • Assignment: hands-on method practice + technical report (group or individual)

Problem Statement

  • Analyzing the relationships between people and events is of great significance
  • Large volumes of complex data are difficult to process manually
  • Goal: have computers understand natural language text automatically or semi-automatically
  • Natural language processing: have computers automatically process, mine, and make effective use of massive amounts of language text, so as to meet the varied needs of different users and deliver personalized services

Basic Concepts

  • Linguistics:
    • The scientific study of language
    • The discipline that studies the nature, structure, and laws of development of language
    • Speech and text are two basic attributes of language
  • Computational Linguistics (CL):
    • The discipline of analyzing, understanding, and generating natural language by building formal computational models
    • Interdisciplinary
    • Focuses more on basic theories and methods than natural language processing does
    • Considers language modeling, mathematical models, and methods
    • Distinguishing focus (1 of 3): language modeling and computation
  • NLU (Natural Language Understanding):
    • The discipline that studies natural language processing methods and implementation techniques that mimic the human cognitive process of language
    • Interdisciplinary (including cognitive science)
    • Considers the relationship between language and thought
    • The criterion for "understanding": how do we judge whether a computer is intelligent?
      • Acting (act), reacting (react), interacting (interact)
      • How does it compare with a conscious individual (a person)? The Turing test
    • Distinguishing focus (2 of 3): language cognition
  • NLP (Natural Language Processing):
    • The discipline of using computer technology to process language text
    • Recognition, classification, extraction, conversion, and generation of lexical, syntactic, semantic, and pragmatic information
    • Distinguishing focus (3 of 3): implementation of language engineering systems
  • Unified view of the three: Human Language Technology (HLT) research
    • NLP -> CL -> NLU
  • Language types (morphological typology):
    • Inflectional (fusional) languages: grammatical relations are expressed through morphological changes in words (English)
    • Agglutinative languages: words contain affixes dedicated to expressing grammatical meaning, and the root or stem is only loosely bound to those affixes (Japanese)
    • Isolating languages: little morphological change; grammatical relations are expressed through word order and function words (Chinese)
  • Chinese Information Processing: natural language processing technology for the Chinese language

Emergence and Development of the Discipline

  • Early period: rationalism, symbolic logic (rules, dictionary + algorithm)
  • Middle period: empiricism, statistical learning (corpus, features + model)
  • Later period: connectionism, neural networks (corpus + model)

Research Content

  • Machine translation

    • Automatically translate from one language into another
  • Information retrieval

    • Use computer systems to find, within a large collection of documents, the information relevant to a user's needs
  • Automatic summarization

    • Automatically extract the main content of the original document, or some specific information from it, to form a summary or abstract
  • Question answering system

    • The system understands the user's question, uses automatic reasoning to find the answer in knowledge resources, and responds accordingly
    • Can be combined with speech technology to form a human-machine dialogue system
    • Community Q & A
  • Information filtering

    • Automatically identify and filter document information that meets certain conditions
  • Information extraction

    • Extract information of interest to users from specified documents or massive texts
    • Entity relationship extraction
    • Social network
  • Document classification

    • Automatic document classification or information classification
    • A large number of documents are automatically classified according to certain classification criteria (theme, content)
    • Sentiment classification
  • Text editing and automatic proofreading

    • Automatically check, proofread, and lay out text, covering spelling, wording, and even grammar and document format
    • A comparatively difficult task
  • Language teaching

  • Text recognition

  • Speech Recognition

    • Automatically convert input voice signals into written text
  • Text-to-speech conversion (speech synthesis)

    • Automatically convert written text into corresponding speech representations
  • Speaker recognition

    • Determine or verify the identity of the speaker from a segment of speech
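
To make the information retrieval item above more concrete, here is a minimal sketch of ranking documents against a query with a simple TF-IDF score. It is only an illustration and not a method taken from the lecture: the function names (`tokenize`, `rank`), the smoothed IDF formula, and the toy documents are all assumptions chosen for this example.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization is enough for this English toy example.
    return text.lower().split()

def rank(query, docs):
    """Return document indices ordered from most to least relevant."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    def score(tokens):
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch): terms that occur
        # in every document get weight 0, rarer terms get more weight.
        return sum(
            tf[t] * math.log((n + 1) / (df[t] + 1))
            for t in tokenize(query)
            if t in tf
        )

    scores = [score(tokens) for tokens in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Hypothetical toy collection, made up for illustration.
DOCS = [
    "machine translation converts text between languages",
    "speech recognition converts voice signals into text",
    "information retrieval finds relevant documents",
]

print(rank("voice to text", DOCS))
```

The query term "voice" appears in only one document, so it contributes more to the score than "text", which appears in two; real retrieval systems add normalization, stemming, and indexing on top of this idea.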

Problems and Challenges

  • Morphology: how words are built from their basic meaningful units, morphemes
    • Inflectional morphological changes and word recognition
    • Chinese word segmentation
    • Morphemes: root, prefix, suffix, inflectional ending
  • Syntax: the relationships between the structural constituents of a sentence, and the rules by which sentences are composed
  • Semantics: how to derive the meaning of a sentence from the meanings of its words and the roles those words play in the syntactic structure
  • Pragmatics: how different contexts and settings of language use affect the understanding of an utterance
    • Context reflected in language structure
    • Meaning not covered by semantics
  • Difficulty: pervasive ambiguity
    • Lexical ambiguity: morphological changes, Chinese word segmentation
    • Part-of-speech ambiguity
    • Structural ambiguity
    • Semantic ambiguity
    • Polyphonic-character and prosodic ambiguity: characters with multiple pronunciations, prosody, intonation, etc.
  • Difficulty: a large number of unknown language phenomena
    • New words, names, place names, terms
    • New meaning
    • New usage and new sentence patterns
  • Challenges
    • Pervasive uncertainty
    • Unpredictability of unknown language phenomena
    • Data is always insufficient
    • Complexity of knowledge representation
    • Non-equivalence of mapping units in machine translation
  • Human understanding of language is a complex cognitive process
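
The segmentation ambiguity mentioned under Chinese word segmentation and lexical ambiguity can be seen with a small sketch. The code below applies forward and backward maximum matching, two classic dictionary-based heuristics that are used here purely as an illustration and are not named in the notes. The tiny dictionary `VOCAB`, the example sentence, and the function names are assumptions made for this example.

```python
def forward_max_match(text, vocab, max_len=3):
    """Greedy segmentation, scanning left to right."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, vocab, max_len=3):
    """Greedy segmentation, scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]

# Hypothetical toy dictionary and sentence ("study the origin of life").
VOCAB = {"研究", "研究生", "生命", "的", "起源"}
SENTENCE = "研究生命的起源"

print(forward_max_match(SENTENCE, VOCAB))   # ['研究生', '命', '的', '起源']
print(backward_max_match(SENTENCE, VOCAB))  # ['研究', '生命', '的', '起源']
```

The two scan directions disagree because "研究生" (graduate student) and "研究" + "生命" (study + life) are both valid dictionary readings of the same characters; a dictionary alone cannot decide between them, which is why segmentation is listed as a source of ambiguity.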

Basic Methods and State of the Art

  • Basic methods
    • Rationalist approach: rule-based
    • Empiricist approach: data-driven
    • Connectionist approach: data-driven, neural networks
  • Rationalism: by studying representative sentences or language phenomena, gain an understanding of human language competence, summarize the rules of language use, and use those rules to analyze and infer the expected result for a test sample
    • Builds a symbol processing system based on rule-based analysis
    • Knowledge base + inference system
    • Theoretical basis: Chomsky's theory of grammar
    • Rule-based methods: work well on content with regular structure, but struggle with irregular content
  • Empiricism: using a large amount of real language data, with human help (annotation and feature selection), statistically discover the regularities of language use and their probabilities, and use these to predict the likely result for a test sample
    • The statistical units are discrete events
    • Builds a computational model from large-scale real data
    • Corpus + statistical model
    • Theoretical basis: statistics, information theory, machine learning
    • Bayes' formula
  • Connectionism: build a model from large-scale real language data, statistically discover the regularities of language use and their probabilities, and use these to predict the likely result for a test sample
    • The statistical units are representations in a continuous real-valued space (vectors)
    • Builds a computational model from large-scale real data
    • Corpus + neural network + statistical model
    • Theoretical basis: statistics, deep learning
    • Vectorized representations, neural network models optimized toward a target objective, RNNs, attention mechanisms
    • Data-driven approaches: require no deep linguistic analysis, or even basic linguistic knowledge, and rely on the amount of data. However, obtaining enough data is itself a hard problem, these approaches struggle with complex sentences, unfamiliar vocabulary, reference, and translation consistency, and they lack interpretability
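
The empiricist recipe above (corpus + statistical model, Bayes' formula, discrete events as statistical units) can be sketched with a tiny naive Bayes sentiment classifier. By Bayes' formula, P(c | d) is proportional to P(c) · Π P(w | c) under the naive assumption that words are independent given the class. This is only an illustrative sketch: the four-sentence corpus, the add-one smoothing, and the function names `train` and `predict` are assumptions, not material from the lecture.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Count classes and per-class words from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_counts):
        # Prior P(c), estimated from class frequencies.
        logp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                # Likelihood P(w | c) with add-one smoothing.
                logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical toy corpus, made up for illustration.
CORPUS = [
    ("a great and moving film", "pos"),
    ("great acting and a fine plot", "pos"),
    ("a dull and boring film", "neg"),
    ("boring plot and poor acting", "neg"),
]

model = train(CORPUS)
print(predict("great film", model))     # pos
print(predict("boring acting", model))  # neg
```

Nothing here encodes a linguistic rule; every decision comes from counts over annotated data, which also shows the weakness noted above: with so little data, any word absent from the corpus carries no evidence at all.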

Origin blog.csdn.net/cary_leo/article/details/105642905