[A brief introduction to Apache Lucene]

1.  Introduction to Lucene

Lucene was originally developed by Doug Cutting and was first made available for download on SourceForge.

In September 2001, it joined the Jakarta family of the Apache Software Foundation as a high-quality open source Java product.

With each release, the project has been significantly enhanced, attracting more users and developers.

The Apache Lucene/Solr family (Lucene Core in Java, Solr, PyLucene) is backed by a large, vibrant community, and its goal is to provide world-class search capabilities.

 

The Apache Lucene™ project develops open-source search software, including:

Lucene Core, our flagship sub-project, provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Solr™ is a high-performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.

PyLucene is a Python port of the Core project.

 

Who is Doug Cutting? Anyone who works with big data knows Hadoop. The name Hadoop is not an acronym but a made-up word: it is the name the project creator Doug Cutting's child gave to a stuffed yellow elephant toy. As Cutting explains, his naming criteria are that a name be short, easy to pronounce and spell, not mean much, and not be used anywhere else, and kids are simply the best at coming up with names like that.



 

 

Lucene is a sub-project of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search toolkit rather than a complete full-text search engine: it provides a full-text search architecture, including a complete query engine and indexing engine, plus part of a text-analysis engine (for two Western languages, English and German). The purpose of Lucene is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.

Lucene provides a simple yet powerful API for full-text indexing and searching, and it is a mature, free, open-source tool for the Java development environment. It has been the most popular free Java information retrieval library for years. Note that although information retrieval libraries are related to search engines, the two should not be confused: Lucene is a library, not a ready-made search engine.

 

 

Eclipse 2.1, the open-source IDE that originated at IBM, uses Lucene as the full-text indexing engine of its help subsystem, and IBM's commercial product WebSphere also uses Lucene. Thanks to its open-source license, excellent index structure, and good system architecture, Lucene has seen wider and wider adoption.

Lucene is a high-performance, scalable information retrieval (IR) library that can add indexing and search capabilities to your application. It is a mature open-source project implemented in Java, a member of the well-known Apache Jakarta family, and released under the Apache Software License [ASF, License]. It is currently a very popular free Java IR library.

 

 

2. Advantages of Lucene

As a full-text search engine, Lucene has the following outstanding advantages:

(1) The index file format is independent of the application platform. Lucene defines a set of index file formats based on 8-bit bytes, so that compatible systems or applications on different platforms can share the created index files.

(2) On top of the traditional inverted index, Lucene implements incremental (segment-based) indexing: new documents can be written quickly to small index segments, which are later merged with the existing index for optimization.

(3) Its well-designed object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new features.

(4) It defines a text-analysis interface that is independent of language and file format. The indexer builds index files from a stream of tokens, so adding support for a new language or file format only requires implementing this text-analysis interface.

(5) A powerful query engine is provided out of the box, so users do not have to write their own query code. Lucene's query implementation supports Boolean operations, fuzzy queries (Fuzzy Search [11]), grouped queries, and more by default (see the sketch after this list).
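As an illustration of point (5), here is a minimal sketch of building a combined Boolean and fuzzy query with a recent Lucene release (5.x or later; the field names "title" and "body" and the query terms are made up for this example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QuerySketch {
    public static Query buildQuery() {
        // Exact term match on the illustrative "title" field
        Query exact = new TermQuery(new Term("title", "lucene"));
        // Fuzzy match tolerating small spelling differences (edit distance), e.g. "serch" vs "search"
        Query fuzzy = new FuzzyQuery(new Term("body", "serch"));
        // Combine the two with Boolean logic: the title term is required, the fuzzy term is optional
        return new BooleanQuery.Builder()
                .add(exact, BooleanClause.Occur.MUST)
                .add(fuzzy, BooleanClause.Occur.SHOULD)
                .build();
    }
}

Older Lucene versions construct BooleanQuery directly rather than through a Builder, but the idea is the same.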

Compared with existing commercial full-text search engines, Lucene also has considerable advantages.

First, its source code is distributed openly (under the Apache Software License [12]). On this basis, programmers can not only make full use of the powerful features Lucene provides, but also study in depth how a full-text search engine is built and how object-oriented programming is practiced, and then write a better full-text search engine that fits their own application. In this respect, commercial software is far less flexible than Lucene.

Second, in keeping with the consistently good architecture of open-source projects, Lucene has a well-designed, highly extensible object-oriented architecture. Programmers can build all kinds of features on top of it, for example adding Chinese-language processing, or extending plain-text handling to formats such as HTML and PDF [13]. Writing these extensions is not complicated, and because Lucene abstracts the underlying system properly, the extensions also gain cross-platform capabilities with little effort.

Finally, after moving to the Apache Software Foundation, programmers can use the Foundation's network platform to communicate with the project's developers and with other programmers, share resources, and even obtain ready-made extensions directly. Also, although Lucene is written in Java, open-source developers are working to port it to other languages and platforms (such as the .NET Framework [14]), so it can run in many environments, and system administrators can choose a suitable language for their platform.

Lucene has seven core packages that applications typically import: analysis, document, index, queryParser, search, store, and util.
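As a rough illustration of how those packages fit together, here is a minimal index-and-search sketch. It assumes a fairly recent Lucene release (8.x or later: ByteBuffersDirectory replaced the older RAMDirectory, and the query-parser package was renamed from queryParser to queryparser in Lucene 4); the field name "content" and the sample text are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;   // analysis
import org.apache.lucene.document.Document;                    // document
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;                // index
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;      // queryParser
import org.apache.lucene.search.IndexSearcher;                 // search
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;           // store
import org.apache.lucene.store.Directory;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();             // in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a single document
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "Lucene is a full-text search library", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("search library");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("content"));
            }
        }
    }
}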

 

3. Other search engines

Lucene is currently the most popular full-text search framework for Java

Nutch is an open source Java search engine

Elasticsearch is a distributed search engine built on the Lucene framework

Solandra is a real-time distributed search engine

IndexTank is a Java-based real-time full-text search engine implementation

Compass is a powerful, transactional, high-performance object/search engine mapping (OSEM) framework for the Java persistence layer

Solr is also implemented in Java on top of Lucene. Its main features include efficient and flexible caching, vertical search capabilities, and highlighting of search results.

LIRE is a Java-based image search framework whose core is also based on Lucene. Using this index, a content-based image retrieval (CBIR) system can be constructed to search for similar images.

Egothor is an open source and efficient full text search engine written in Java.

Sphinx is an open-source search engine written in C++ and one of the more mainstream engines. It is about 50% faster than Lucene at indexing, but its index files are about twice as large as Lucene's, essentially trading space for time. In retrieval speed it is not much different from Lucene, but Lucene is better in retrieval accuracy, and it is also easier to plug a Chinese word-segmentation engine into Lucene than into Sphinx. Sphinx supports real-time search and is relatively simple and convenient to use.

Xapian is a full-text retrieval library written in C++. Its API and retrieval principles are similar to Lucene's in many ways, and it can be regarded as filling the gap Lucene leaves in C++.

DataparkSearch is an open-source search engine implemented in C. It uses a neural-network model for web-page ranking, supports downloading pages over HTTP, HTTPS, FTP, NNTP, and other protocols, and includes an indexing engine, a retrieval engine, and a Chinese word-segmentation engine (the only open-source search engine in this list with a built-in Chinese word-segmentation engine). Search results can be customized, and it keeps complete logs.

Whoosh is an open source search engine written in pure python. 

 

 

4. How Search Engines Work

(crawl, fetch and store, preprocess, rank)

Step 1: Crawl

Search engines follow the links between web pages using specialized software that crawls from one link to the next, much like a spider crawling across a web, which is why these programs are called "spiders" or "robots". A spider's crawling follows certain rules and must obey the instructions in certain commands or files (such as robots.txt).
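The mechanics here are independent of Lucene, but the following tiny sketch shows the basic idea of a spider: keep a frontier of URLs, fetch each page, extract its outgoing links, and add them back to the frontier. It uses only the JDK (Java 11+ for java.net.http); the seed URL is illustrative, the regex link extraction is deliberately naive, and a real crawler would honor robots.txt, limit request rates, and use a proper HTML parser:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinySpider {
    // Very loose href pattern; real crawlers use an HTML parser instead
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add("https://example.com/");             // illustrative seed URL

        while (!frontier.isEmpty() && seen.size() < 10) { // small page budget for the demo
            String url = frontier.poll();
            if (!seen.add(url)) continue;                 // skip pages fetched already
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // Extract outgoing links and feed them back into the crawl frontier
            Matcher m = LINK.matcher(resp.body());
            while (m.find()) frontier.add(m.group(1));
            System.out.println("fetched " + url + " (" + resp.body().length() + " chars)");
        }
    }
}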

Step 2: Fetch and Store

By following links, spiders fetch web pages and store the crawled data in a raw-page database; the stored page data is exactly the same HTML the user's browser would receive. Spiders also perform some duplicate-content detection while crawling: if they encounter a lot of plagiarized, scraped, or copied content on a site with low authority, they are likely to stop crawling it.

Step 3: Preprocessing

The search engine takes the pages the spider has fetched and runs them through several preprocessing steps:

1. Extract text

2. Chinese word segmentation

3. Remove stop words

4. Remove noise (search engines need to identify and strip noise such as copyright notices, navigation bars, advertisements, and so on)

5. Build the forward index

6. Build the inverted index (see the sketch at the end of this step)

7. Compute link relationships

8. Handle special file types

In addition to HTML files, search engines can usually crawl and index many text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files, and we often see these file types in search results. However, search engines cannot handle non-text content such as images, video, and Flash, nor can they execute scripts or programs.
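The difference between the forward index (step 5) and the inverted index (step 6) is the heart of full-text search, so here is a tiny self-contained sketch (plain Java collections rather than Lucene's actual data structures; the documents and terms are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Forward index: document id -> the terms that document contains
        Map<Integer, List<String>> forward = new HashMap<>();
        forward.put(1, List.of("lucene", "full", "text", "search"));
        forward.put(2, List.of("solr", "search", "server"));

        // Inverted index: term -> the document ids that contain it
        Map<String, List<Integer>> inverted = new HashMap<>();
        forward.forEach((docId, terms) -> {
            for (String term : terms) {
                inverted.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        });

        // Looking up a term is now one map access instead of scanning every document
        System.out.println("docs containing \"search\": " + inverted.get("search"));  // [1, 2]
    }
}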

Step 4: Rank

After the user enters a keyword in the search box, the ranking program pulls data from the index database, computes the ranking, and displays the results to the user; ranking is the step that interacts with the user directly. Because search engines hold enormous amounts of data, even though small updates can happen every day, the ranking rules are generally updated at different levels on daily, weekly, and monthly cycles.
