Sogou open-sources SMRC, its machine reading comprehension toolkit: a quick introduction

Source: HowNet. Original title: Sogou open-sources its latest NLP research to build SMRC, the industry's most complete machine reading comprehension toolkit

Last week, Sogou quietly released its machine reading comprehension toolkit SMRC (Sogou Machine Reading Comprehension) on GitHub.

It is the industry's most complete TensorFlow-based reading comprehension toolkit, covering everything from downloading datasets to training and testing the final models.

By open-sourcing it, Sogou aims to help NLP practitioners quickly get to grips with existing machine reading comprehension models, so that they can develop new models more efficiently.

Over the past two years, NLP has seen many breakthroughs, yet open-source resources for machine reading comprehension remain scarce. Among the entrants currently on the CoQA leaderboard, only Sogou and Microsoft have released source code.

Sogou's release of SMRC comes at the right time to fill this gap. Just one week after its release, SMRC has become one of the most popular open-source research projects.

What is SMRC?

To talk about SMRC, we first have to mention one of the hottest topics in NLP in recent years: machine reading comprehension. Given a question and an article, the goal is to extract a text span from the article, or to generate a rewritten one, as the answer.
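
As an illustration of the extractive case, many reading comprehension models score every token as a possible answer start and answer end, then pick the best-scoring span. The sketch below shows only that selection step. The function name `best_span`, the `max_len` limit, and the scores are made up for the example and are not taken from SMRC.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 15) -> Tuple[int, int]:
    """Return (start, end) maximising start_scores[i] + end_scores[j]
    subject to i <= j < i + max_len."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        # The end token may not precede the start token, and the span is capped.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Hypothetical per-token scores, as a model might produce for the question
# "Who released SMRC?".
tokens = "SMRC was released by Sogou on GitHub".split()
start_scores = [0.1, 0.0, 0.2, 0.1, 2.5, 0.3, 0.4]
end_scores = [0.0, 0.1, 0.1, 0.2, 2.0, 0.1, 0.9]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # prints "Sogou"
```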

Sogou breaks the machine reading comprehension pipeline into four steps: dataset reading, preprocessing, model building, and training and evaluation. Each step is abstracted and modularized, and exposed through a concise interface.
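
For the evaluation step, SQuAD-style datasets are usually scored with exact match and token-level F1. The following is a minimal sketch of those two metrics, not SMRC's evaluator. It assumes a simplified normalization (lowercasing and whitespace splitting only), whereas the official scripts also strip punctuation and articles.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    # Simplified normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def f1_score(prediction: str, reference: str) -> float:
    pred = normalize(prediction)
    ref = normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Sogou", "sogou"))
print(f1_score("the Sogou toolkit", "Sogou toolkit"))
```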

In the open-source SMRC toolkit, each of these steps can be used on its own and embedded in a developer's own workflow, which keeps the toolkit easy to use and extensible.

SMRC also integrates or reproduces a variety of published machine reading comprehension datasets and models.

The code is divided into the following modules:

1. Dataset reading (dataset_reader)

This module integrates reading and preprocessing for SQuAD 1.0/2.0, CoQA, and the Chinese dataset CMRC.

2. Data preprocessing (data, utils)

The data part builds the vocabulary, the feature transformers, and the batch generator that streams data to the model. utils is used to extract linguistic features.

3. Model building (nn, models)

nn (Neural Network) contains the components commonly used in machine reading comprehension, so that prototype models can be built and trained quickly without duplicating work. models integrates common machine reading comprehension models such as BiDAF, DrQA, FusionNet, and QANet.

4. Model training and evaluation (examples)

This part contains examples of running the different models.
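
To make the preprocessing step more concrete, here is a small self-contained sketch of what a vocabulary builder and a batch generator typically do. It is independent of SMRC's actual classes: the names `build_vocab`, `encode`, and `batches`, and the `<pad>` and `<unk>` tokens, are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterator, List

PAD, UNK = "<pad>", "<unk>"


def build_vocab(texts: List[str], min_count: int = 1) -> Dict[str, int]:
    """Map every sufficiently frequent token to an integer id."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {PAD: 0, UNK: 1}
    for tok, count in sorted(counts.items()):
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Turn text into ids; unseen tokens fall back to the <unk> id."""
    return [vocab.get(tok, vocab[UNK]) for tok in text.lower().split()]


def batches(texts: List[str], vocab: Dict[str, int],
            batch_size: int) -> Iterator[List[List[int]]]:
    """Yield batches of id sequences, padded to the longest one in each batch."""
    for i in range(0, len(texts), batch_size):
        chunk = [encode(text, vocab) for text in texts[i:i + batch_size]]
        width = max(len(ids) for ids in chunk)
        yield [ids + [vocab[PAD]] * (width - len(ids)) for ids in chunk]


passages = ["Sogou released SMRC", "SMRC supports SQuAD and CoQA", "Sogou"]
vocab = build_vocab(passages)
for batch in batches(passages, vocab, batch_size=2):
    print(batch)
```

Padding per batch rather than to a global maximum keeps short batches small, which is a common choice when passage lengths vary widely.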

SMRC is easy to install and use. The official documentation uses the SQuAD 1.0 dataset and the DrQA model as an example: training and testing a mainstream machine reading comprehension model takes only about two dozen lines of code.

With so many models and datasets already available, why did Sogou bother to integrate them?

The reason is that some machine reading comprehension models have no official implementation, while other open-source models are written in different frameworks. Developers therefore have to understand, adapt, and reproduce them on different platforms, which greatly reduces development efficiency.

To solve these problems, Sogou open-sourced this "reading comprehension toolkit". But SMRC is more than a simple integration: it also contains Sogou's own NLP research results from recent years.

Sogou's technology in SMRC

Sogou CEO Wang Xiaochuan believes that the future of search is question answering, and that machine reading comprehension is one of the core technologies of question answering today.

Driven by core businesses such as search and its input method, Sogou has built up deep technical expertise in NLP, especially in machine reading comprehension.

It is fair to say that the SMRC project condenses Sogou's most advanced research results of recent years.

In January, Sogou topped the CoQA leaderboard with BERT + Answer Verification (single model), surpassing many well-known research institutions and universities in China and abroad, including Microsoft, iFlytek, Tsinghua University, Fudan University, and Stanford.

Sogou's theoretical research has not stopped there. In April, Sogou, in cooperation with the Institute of Automation of the Chinese Academy of Sciences, published the paper "Document Gated Reader for Open-Domain Question Answering" at SIGIR 2019, a top international conference in information retrieval, proposing a new reading comprehension algorithm.

Open-domain question answering means that a system is asked questions of any type and must find the answers in any available resource. More and more open-domain question answering methods use machine reading comprehension techniques to produce the answer.

However, traditional open-domain question answering based on machine reading comprehension suffers from very noisy data, biased answer probabilities, and other problems, so the final answers are often poor.

To address this, Sogou adds a document gate (Document Gate) to the traditional model to control the output of the final answer, so that document selection information is incorporated into the final result.
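
The general idea of letting document selection influence the final answer can be sketched as follows. This is an illustrative assumption, not the formulation from the SIGIR paper: each candidate answer's span probability is weighted by a softmax over hypothetical document relevance scores, and the weighted scores of identical answers are summed across documents. The function name `gated_answer` and all numbers are made up for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_answer(doc_scores: List[float],
                 candidates: List[List[Tuple[str, float]]]) -> Tuple[str, float]:
    """Pick the answer with the highest gate-weighted score.

    doc_scores[d] is a relevance score for document d (assumed input);
    candidates[d] lists (answer, span_probability) pairs read from document d.
    """
    gates = softmax(doc_scores)
    totals: Dict[str, float] = defaultdict(float)
    for gate, doc_candidates in zip(gates, candidates):
        for answer, span_prob in doc_candidates:
            totals[answer] += gate * span_prob
    return max(totals.items(), key=lambda item: item[1])


# The reader is more confident about the span in the second document,
# but the first document is judged far more relevant to the question.
doc_scores = [2.0, 0.0]
candidates = [[("Sogou", 0.6)], [("Microsoft", 0.9)]]
print(gated_answer(doc_scores, candidates)[0])  # prints "Sogou"
```

Without the gate, the span with probability 0.9 from the less relevant document would win, which is the kind of answer probability bias described above.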

In addition, Sogou uses bootstrapping to generate weakly supervised data, which addresses the heavy noise found in conventional weakly supervised data.

Sogou not only publishes theoretical research but also attaches great importance to putting it into practice. Its research results have already found their way into its search products, quietly serving users.

When we use Sogou web search and the query entered is a question, especially in areas of public concern such as medicine and law, the intelligent question answering system tries to find the answer in the search result pages and presents it to the user with the highest priority.

Sharing the results with the industry

In the CoQA challenge mentioned earlier, 29 companies and institutions have submitted results so far, but only three have open-sourced their algorithms: Microsoft, the Allen Institute, and Sogou.

Moreover, SDNet and FlowQA open-sourced only the models themselves, along with some data preprocessing tools, whereas Sogou's open-source release is a complete machine reading comprehension solution.

Sogou's willingness to share its latest research results is a boon for both academia and industry.

If you are a university student doing NLP research, SMRC can help you quickly combine your own model with other techniques through a single interface. This greatly lowers the barrier for students interested in NLP, reduces duplicated work for researchers, and speeds up related academic research.

And if you work in industry, the ready-to-use SMRC can help you integrate Sogou's research results into your own product solutions.

It is fair to say that open-sourcing SMRC addresses a series of developer pain points, from data acquisition to model training, and is a major benefit for the entire field of machine reading comprehension research. As for the general public, you may see more intelligent dialogue systems and problem-solving applications in the future, perhaps quietly backed by Sogou's open-source technology.

Origin www.cnblogs.com/xinzhihao/p/10955748.html