Big Data Basics - Internet Big Data Processing (answers to the after-class exercises in Peng's "Big Data")

1. Describe how information is crawled from the Internet.

   The most common and effective way to collect Internet information automatically is to use a web crawler.
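   As a rough illustration of what a web crawler does, the sketch below fetches a page, extracts its links, and visits them breadth-first. The `crawl` function, the caller-supplied `fetch` parameter, and the in-memory `FAKE_WEB` pages are all assumptions made for this example so that it runs without network access; a real crawler would also need politeness rules such as robots.txt handling and rate limiting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl starting at seed_url.

    fetch is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the page HTML as a string, or None.
    """
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page they were found on.
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

   The `seen` set is what keeps the crawler from looping forever when pages link back to each other.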

2. Describe the architecture of a public opinion analysis system.

   User terminal -> acquisition layer -> analysis layer -> presentation layer -> user

                                              

3. Into which categories can Chinese word segmentation algorithms be divided?

   (1) Methods based on string matching. The Chinese string to be analyzed is matched, according to a given strategy, against the entries of a sufficiently large dictionary; if a string is found in the dictionary, the match succeeds and a word is recognized.

   (2) Methods based on word-frequency statistics. A word is a stable combination of characters, so the more often adjacent characters co-occur in context, the more likely they are to form a word.

   (3) Methods based on understanding. These methods resolve ambiguity by using the syntactic and semantic information of the sentence, performing syntactic and semantic analysis at the same time as segmentation.
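   The string-matching category can be illustrated with forward maximum matching, one common matching strategy. The sketch below is only an example under assumptions: the function name `forward_max_match`, the tiny `DICTIONARY`, and the `max_word_len` limit are made up for illustration and do not come from any particular tool.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text by greedily taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
DICTIONARY = {"研究", "研究生", "生命", "起源"}

print(forward_max_match("研究生命起源", DICTIONARY))
# ['研究生', '命', '起源']
```

   The greedy result differs from the intended reading 研究 / 生命 / 起源, which is the kind of ambiguity that the statistical and understanding-based methods try to resolve.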

4. What are the common text segmentation tools?

   (1) MMSEG segmentation tool

   (2) The Stanford segmentation tool, which can be called through NLTK

5. Describe the principle of the inverted index.

   An inverted index, also known as a "reverse index" or "inverted file", is an index data structure. It maps content (e.g., words or numbers) to the locations where that content is stored (e.g., a database, a file, or a group of files). Its goals are fast full-text retrieval and minimal processing cost when new files are added to the database. With an inverted index, the files that contain a given piece of content can be located quickly.
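   The mapping from content to location can be sketched in a few lines. In this minimal example the "content" is a lowercase whitespace-separated term and the "position" is a document id; the names `build_inverted_index`, `search`, and `DOCS` are assumptions chosen for illustration, and real systems typically also store term positions and frequencies.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents containing every term in the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical toy collection: document id -> text.
DOCS = {1: "big data processing", 2: "web data crawler", 3: "big web index"}

index = build_inverted_index(DOCS)
print(index["data"])             # [1, 2]
print(search(index, "big web"))  # [3]
```

   A lookup only touches the posting lists of the query terms, so the documents themselves never have to be scanned at query time.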

6. Describe the update strategies for an inverted index.

                                            

7. Briefly describe how an inverted index is implemented.

                                          

8. What are the popular web page ranking algorithms?

   (1) Ranking algorithms based on traffic. The idea is that the more important a page is, the more traffic it receives.

   (2) Ranking algorithms based on term frequency and weighted term position, for example TF-IDF and BM25.

   (3) Ranking algorithms based on link analysis, such as the PageRank algorithm and the Reputation algorithm.

   (4) Ranking algorithms based on intelligent techniques.
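   To make the link-analysis category concrete, here is a minimal power-iteration sketch of PageRank. It is an illustration under assumptions: the function name, the `damping` value of 0.85, the fixed iteration count, and the three-page `GRAPH` are choices made for this example, and every linked page is assumed to appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank; links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page link graph.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

   In this graph page C is linked from both A and B, so it ends up with the highest score, followed by A and then B.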

9. Briefly describe the main idea of the TF-IDF algorithm.
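
   In its usual formulation, TF-IDF treats a term as important to a document when it occurs often in that document (term frequency, TF) but in few other documents of the collection (inverse document frequency, IDF). The sketch below uses one common variant, tf = count / document length and idf = log(N / df); the function name `tf_idf` and the toy `DOCS` collection are assumptions for illustration, and other weighting variants exist.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (each a list of tokens)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized documents.
DOCS = [
    ["big", "data", "index"],
    ["big", "data", "crawler"],
    ["big", "web", "page"],
]

weights = tf_idf(DOCS)
for term, weight in sorted(weights[0].items()):
    print(term, round(weight, 3))
```

   Here "big" appears in every document, so its weight is zero, while "index" appears in only one document and receives the largest weight in the first document.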

                                 

10. Briefly describe the main idea of the BM25 algorithm.
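
   Assuming the question refers to BM25, the ranking function named in question 8, its usual formulation scores a document by summing, over the query terms, an IDF weight multiplied by a term-frequency factor that saturates as the term repeats and is normalized by document length. The sketch below is illustrative only: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the "+ 1" smoothed IDF variant, and the toy `DOCS` are all assumptions of this example.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        score = 0.0
        for term in query:
            tf = counts[term]
            if tf == 0:
                continue
            # Smoothed IDF; the "+ 1.0" variant keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # k1 controls TF saturation; b controls length normalization.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized documents.
DOCS = [
    ["big", "data", "data", "index"],
    ["web", "crawler"],
    ["big", "web", "page", "rank", "data"],
]

scores = bm25_scores(["data", "index"], DOCS)
print([round(score, 3) for score in scores])
```

   The first document matches both query terms and is the shorter of the two matching documents, so it scores highest; the second document matches nothing and scores zero.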

                                  

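11. Briefly describe the architecture of the history information retrieval system.

   The intelligent information retrieval engine for the history domain crawls the content of websites about major historical events from the Internet and, through data aggregation and integration, builds a dedicated database. It retrieves the records in the database that match the user's query conditions, optimizes the query results, and returns the final results to the user in a given ranking order.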

                                       

Origin www.cnblogs.com/lsm-boke/p/11964395.html