Mid-term Notes on the Visitor Information Q&A System

For my school's professional practice course, I chose the topic of a Visitor Information Q&A system. It is consistent with my future research direction, and it could also be added as a feature to the Tutor WeChat mini program that is currently under development.

Unfortunately, although the topic has real research value, it ran into problems such as needing large amounts of labeled data and the feature not being practical enough. I changed topics halfway, so I am writing a mid-term record here.

Research Plan

Requirements

First we need to make the requirements of the question answering system clear. The goal of the system is to answer questions about scholars, such as a scholar's title, graduate school, research interests, and so on.

Since existing generative models give only mediocre answers, this system is retrieval-based: given a question, it looks for the answer in an answer library and returns it.

Answer Source

When I chose the topic there was no ready-made data resource about scholars, so the answers come from the web.

There are two ways to crawl the data:

  • For a specific scholar information site, write a dedicated crawler, which can obtain structured data
  • Crawl the results returned by a search engine, which requires writing filters to find the information relevant to the scholar

I implemented the second one first because it is more general, with the first serving as a supplement to it.

The basic idea of answer matching

The simplest answer matching model is usually a neural network that takes a question and an answer passage as input and outputs a matching score.

Because data is limited (I would have to prepare it myself, and transfer learning is difficult), I expected the matching quality to be poor.

At the same time, I noticed that most questions about scholars revolve around common resume information: education, job title, research interests, and so on. Such questions can be handled separately. For the frequently asked questions, train extractors in advance that pull the data out of a scholar's introductory text and store it in a database. When answering a question, first determine whether it belongs to one of these categories; if it does, give the answer directly from the database.

In summary, the process of matching the answer is as follows:

  1. Determine whether the question asks about common information
  2. If so, determine the question category and answer from the database. If the database has no data for the scholar, crawl the scholar's resume through a search engine, then use the filters to find the common information.
  3. If not, search for the question with a search engine, compare the sentences in the results with the question one by one, and return the best-matching sentence as the answer.
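The three steps above can be sketched as a small dispatcher. This is only an illustration under assumptions: the keyword table, the function names, and the callables passed in (`extract_from_resume`, `search_sentences`, `match_score`) are hypothetical stand-ins for the components listed in the next section, not the code actually written for the project.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical keyword table mapping a common-information category to trigger words.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "title": ["title", "position"],
    "graduate_school": ["graduate", "degree", "alma mater"],
    "research": ["research", "interest"],
}


def classify_question(question: str) -> Optional[str]:
    """Step 1: return the common-information category, or None if there is none."""
    text = question.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return None


def answer(
    scholar: str,
    question: str,
    database: Dict[str, Dict[str, str]],
    extract_from_resume: Callable[[str], Dict[str, str]],
    search_sentences: Callable[[str], List[str]],
    match_score: Callable[[str, str], float],
) -> Optional[str]:
    category = classify_question(question)
    if category is not None:
        # Step 2: answer from the database, filling it from the resume on a miss.
        if scholar not in database:
            database[scholar] = extract_from_resume(scholar)
        return database[scholar].get(category)
    # Step 3: fall back to sentence-level matching over search results.
    sentences = search_sentences(f"{scholar} {question}")
    if not sentences:
        return None
    return max(sentences, key=lambda sentence: match_score(question, sentence))
```

Passing the crawler, extractor, and matcher in as callables keeps the control flow testable before any of those components exist.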

Task List

After the above analysis, there are quite a few components to complete, listed below:

  • Web crawler
  • Text extractor
  • Resume filter: determines whether a text is a resume
  • Basic information extractors: extract frequently asked information such as title, graduate school, and research interests from resume text
  • Answer matching model, which I want to try

Research Process

Web crawling

I used the Google Custom Search API, which offers 1000 free requests per day. The API is meant for searching particular sites, but it can be configured to include results from the whole web, which makes for a roundabout workaround.

Text extraction

There are libraries for extracting the main text of a web page. I tried three:

  • boilerpipe
  • goose3
  • Line-block distribution extraction

None of them worked well, because pages containing resumes differ from general articles, so in the end I wrote my own.

Resume filters

I searched for resumes with "school name +". I found that the top search results may also contain other content such as news, so they need to be screened.

The resulting classifier works as follows:

  1. Extract high-frequency words from the training set
  2. For each text, compute the TF-IDF value of each of these words, based on the training set, and use the values as features
  3. Fit a logistic regression model

You can refer to this article.
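A minimal sketch of the three classifier steps, written with the standard library only so that each step is visible. It assumes the texts are already tokenized (Chinese pages would first go through a word segmenter), and it fits the logistic regression with plain batch gradient descent, which is a stand-in for whatever library was actually used. All function names and the toy data are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def build_vocab(docs: List[List[str]], top_k: int) -> List[str]:
    """Step 1: keep the top_k most frequent words of the training set."""
    counts = Counter(word for doc in docs for word in doc)
    return [word for word, _ in counts.most_common(top_k)]


def idf_table(docs: List[List[str]], vocab: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency, computed on the training set."""
    n = len(docs)
    return {
        w: math.log((1 + n) / (1 + sum(w in doc for doc in docs))) + 1.0
        for w in vocab
    }


def tfidf(doc: List[str], vocab: List[str], idf: Dict[str, float]) -> List[float]:
    """Step 2: TF-IDF feature vector of one tokenized text."""
    counts = Counter(doc)
    total = max(len(doc), 1)
    return [counts[w] / total * idf[w] for w in vocab]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    X: List[List[float]], y: List[int], lr: float = 0.5, epochs: int = 500
) -> Tuple[List[float], float]:
    """Step 3: logistic regression fitted by batch gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * f for w, f in zip(weights, features)) + bias
            error = sigmoid(z) - label
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(features: List[float], weights: List[float], bias: float) -> bool:
    """True means the text is classified as a resume."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(z) >= 0.5


# Toy training set (made up for illustration): 1 = resume, 0 = other page.
train_docs = [
    "professor phd university research interests".split(),
    "education phd university professor publications".split(),
    "news university holds sports meeting today".split(),
    "news report students visit museum today".split(),
]
train_labels = [1, 1, 0, 0]
vocab = build_vocab(train_docs, top_k=10)
idf = idf_table(train_docs, vocab)
weights, bias = fit_logistic([tfidf(d, vocab, idf) for d in train_docs], train_labels)
```

A new page is then classified with `predict(tfidf(tokens, vocab, idf), weights, bias)`.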

I labeled the dataset myself. To label more efficiently and avoid eye strain, I also wrote a tagger. (In fact, I wrote three taggers in total.) The taggers use the simple tkinter framework, which is easy to write with, although both its interface design and its documentation feel rather bare.

Basic information extractor

I started the research with one piece of information, "master's graduate school". However, I found that many teachers have no master's graduate school, because they went straight into a PhD program...

My first idea for a model was to take a string of words as input and output the start and end positions of the master's graduate school. But the answer may appear several times in one sentence, which interferes with this, and the model's attention may also be scattered over the school name itself, even though what the school is called does not matter. So I settled on a different extraction approach: first find the sentences that contain a school name, replace the school name with a symbol, then segment the sentence into words, vectorize it, and feed it into a classifier that decides whether the symbol stands for the master's graduate school.
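The replace-with-a-symbol step can be sketched as follows. This is an illustration under several assumptions: a simple English regular expression stands in for real school-name detection, whitespace splitting stands in for word segmentation, and the symbol names `<SCHOOL>` and `<OTHER>` are made up. The text does not say how sentences with several schools were handled; emitting one sample per mention, with the other mentions masked by a second symbol, is one possible choice.

```python
import re
from typing import List, Tuple

TARGET = "<SCHOOL>"  # symbol for the mention being classified (hypothetical name)
OTHER = "<OTHER>"    # symbol for any other school in the same sentence (assumption)

# Assumed pattern: optional capitalized words, an institution keyword,
# and an optional "of X" tail, e.g. "Tsinghua University", "University of Cambridge".
SCHOOL_PATTERN = re.compile(
    r"(?:[A-Z][a-z]+ )*(?:University|College|Institute)(?: of(?: [A-Z][a-z]+)+)?"
)


def make_samples(sentence: str) -> List[Tuple[str, List[str]]]:
    """Return one (school name, token list) pair per school mention in the sentence."""
    matches = list(SCHOOL_PATTERN.finditer(sentence))
    samples = []
    for target in matches:
        pieces, cursor = [], 0
        for match in matches:
            pieces.append(sentence[cursor:match.start()])
            pieces.append(TARGET if match is target else OTHER)
            cursor = match.end()
        pieces.append(sentence[cursor:])
        tokens = "".join(pieces).split()  # whitespace split as a stand-in for segmentation
        samples.append((target.group(), tokens))
    return samples


samples = make_samples(
    "He got his master's degree from Tsinghua University "
    "and his PhD from University of Cambridge"
)
```

Each token list would then be vectorized and passed to the classifier, which predicts whether `<SCHOOL>` is the master's graduate school.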

At first I looked for a list of schools, but I could not find one that covers colleges and universities both at home and abroad. A list-based approach is also accurate but hard to generalize, and not elegant. I switched to NER: find all institutions whose names contain "school", "college", or "institute", which should cover most of the schools that appear in resumes.

Because data is limited, I chose a simple model together with pre-trained word vectors. The model is a CNN whose convolution window is as wide as the word vector, and the input sentences are aligned to the same length by padding with 0.
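To make the padding and the full-width window concrete, here is a tiny pure-Python sketch of one convolution filter with max-over-time pooling. It assumes that "as wide as the word vector" means the kernel spans the whole embedding dimension and slides only along word positions, as in common text CNNs. The function names and numbers are illustrative, and a real model would use a deep learning framework with many filters.

```python
from typing import List

Vector = List[float]


def pad(sentence: List[Vector], length: int, dim: int) -> List[Vector]:
    """Truncate or right-pad with zero vectors so the sentence has `length` rows."""
    padded = sentence[:length]
    return padded + [[0.0] * dim for _ in range(length - len(padded))]


def conv_max_pool(sentence: List[Vector], kernel: List[Vector]) -> float:
    """Slide a (window x dim) kernel over word positions, apply ReLU, then max-pool.

    Assumes the (padded) sentence has at least as many rows as the kernel.
    """
    window = len(kernel)
    outputs = []
    for start in range(len(sentence) - window + 1):
        total = sum(
            kernel[i][j] * sentence[start + i][j]
            for i in range(window)
            for j in range(len(kernel[i]))
        )
        outputs.append(max(total, 0.0))
    return max(outputs)


# Two "words" with 2-dimensional vectors, padded to a fixed length of 4.
padded_sentence = pad([[1.0, 0.0], [0.0, 1.0]], length=4, dim=2)
feature = conv_max_pool(padded_sentence, kernel=[[1.0, 0.0], [0.0, 1.0]])
```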

Then came the data problem, for which I wrote a tagger and labeled nearly 1,000 sentences... More embarrassingly, I later discovered that fewer than 30 of the school names were master's graduate schools...

I then handled the class imbalance with class weights, but because there were too few positive samples, accuracy was only about 50%. More positive samples are needed to improve the results.
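As an illustration of class weighting, the sketch below computes "balanced" weights as n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for `class_weight="balanced"`. The text does not say which weighting scheme was used, so this is an assumption, and the label counts are round numbers chosen to resemble the situation described.

```python
from collections import Counter
from typing import Dict, Hashable, List


def balanced_class_weights(labels: List[Hashable]) -> Dict[Hashable, float]:
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Illustrative counts: 30 positive sentences out of 1000.
labels = [1] * 30 + [0] * 970
class_weights = balanced_class_weights(labels)
```

With these counts each positive sample weighs roughly 32 times as much as a negative one in the loss. Weighting rebalances the loss but cannot add information that 30 examples do not contain, which fits the conclusion that more positive samples are needed.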

The above is my work on the Visitor Information Q&A system. Because the topic changed, I am not continuing it for now.

Pitfalls

The biggest pitfall was the jpype package, which cost me more than ten hours...

While debugging the code that extracts the master's graduate school from resumes, I got an error saying that the JVM can only be started once. I knew that pyhanlp calls jpype.startJVM() when it is imported, and I mistakenly assumed that Stanfordnlp also depends on jpype (it actually does not: its code launches a Java server process and then communicates with it from Python). I tried the following:

  • Shut the JVM down before starting it a second time. I imported pyhanlp locally inside a function and called jpype.shutdownJVM() before the import pyhanlp statement. This did not work; the jpype documentation says "Because of lack of JVM support, you can not shutdown the JVM and then restart it."??? (Introductory Python articles say local imports are useless, but isn't handling package conflicts exactly where they come in handy?)
  • Since the JVM cannot be shut down, kill the process. I ran the Stanfordnlp NER code in a new process and then killed that process. I tried both the multiprocessing package and the subprocess package. Since the conflict was not between Stanfordnlp and pyhanlp, this of course had no effect, but I believe the kill-the-process plan itself is feasible. In addition, when using subprocess, the child process passes an object back through the pipe as a pickle string, but I had printed debugging information deep inside the child's function calls, which corrupted the pickle string and cost me an evening of debugging... Lesson learned.
  • Having failed with killing the process, I turned to jpype.addClassPath(). I modified pyhanlp's initialization code (even though editing a package's code is not pretty) and added a check: if the JVM has already started, add the required jar directly to the class path. It still reported ClassNotFoundException. I looked it up: modifying the Java class path dynamically requires reflection and permission from the security manager, which is a very hacky method. jpype.addClassPath() is rarely mentioned online.

Finally, I found that what conflicted with pyhanlp was boilerpipe, which I had used earlier when trying out text extraction. Its import statement was already grayed out as unused, and deleting it fixed everything...
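The pickle corruption from the subprocess attempt is avoidable if the child's standard output carries nothing but the pickle bytes and all debug output goes to standard error. A minimal sketch of that arrangement follows; the child code and the result object are made up for illustration, and in the real case the child would be the script that runs the JVM-backed library, with the JVM disappearing when the child exits.

```python
import pickle
import subprocess
import sys

# Code executed in the child process (illustrative stand-in for the NER step).
CHILD_CODE = r"""
import pickle, sys
print("debug: running NER step", file=sys.stderr)  # diagnostics go to stderr
result = {"entities": ["Tsinghua University"]}      # hypothetical result object
sys.stdout.buffer.write(pickle.dumps(result))       # stdout carries only the pickle
"""


def run_in_child() -> object:
    """Run CHILD_CODE in a fresh interpreter and unpickle what it writes to stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return pickle.loads(completed.stdout)


result = run_in_child()
```

Only unpickle data from a process you control, since loading a pickle can execute arbitrary code.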

Impressions

I worked on this topic for about a month. In fact, after taking out the National Day holiday, I really spent only about half of the remaining time on research. Recently I keep feeling that I have not made good use of my free time; I must push myself to do the meaningful things I have wanted to do but never got to.

Working through this topic gave me a basic understanding of question answering systems. I became familiar with basic NLP operations such as word segmentation, named entity recognition, and word vectors, and got to know toolkits such as Stanfordnlp, jieba, pyhanlp, and nltk. I also practiced a lot of Python programming: multiprocessing, tkinter, jpype, and so on. In addition, I tried managing the project with a Python virtualenv, which keeps the code and its dependencies together so that nothing gets overlooked and the project is easy to move.

I think the most valuable parts of doing a topic like this are surveying all aspects of the problem, designing a research plan for the specific circumstances, solving the difficulties that come up while implementing the plan, and building programming ability.

The next steps would have been to tune the basic information extractor and to implement and debug the answer matching model. They are very interesting, but unfortunately this topic has to be suspended.

I would like to complain a bit about Python. The debugger can hang when evaluating an expression. A collapsed object may carry dynamic attributes mixed into a pile of underscore-prefixed internal variables, which makes it hard to read. Because it is a dynamically typed language without type declarations, I have to keep track in my head of what type a variable is and which attributes and methods it has, and the IDE often cannot infer them. Without type declarations, the types of function parameters and return values are lost as well, so I often have to rely on the documentation or read the code, unlike in Java, where the type is visible from the declaration. pip and other package managers are not reliable and fail from time to time. In fact, many modules are written in C++ or Java, leaving Python as just a simple interface plus documentation. "Glue language" really is a bit derogatory: the surface simplicity comes at the price of engineering capability, so it is not that Python can serve as glue, but that it can only serve as glue. I probably have not yet grasped the Python way of thinking; writing more of it will likely help me adapt...

Original article (Dazhuanlan): https://www.dazhuanlan.com/2019/08/27/5d64b4586e4a4/


Origin www.cnblogs.com/petewell/p/11418196.html