1. Requirements analysis
- The company has a large number of equipment-maintenance office documents. To look up a specific piece of maintenance knowledge, equipment personnel must use the directory's index files to find a list of potentially related files on the server, then open them one by one to search. This is inefficient and makes for a poor experience.
- Users want to keep the existing document system unchanged (preparation, release, upgrades, and other document-control management are handled by dedicated staff) and build a document search engine on top of it that can search by keyword, improving efficiency and experience.
- This article builds such a file search engine from ElasticSearch (an open-source search engine), FSCrawler (a file crawler that "uploads" documents into Elasticsearch), and Search UI (a front-end page that searches through the Elasticsearch API).
2. ElasticSearch
- First, download Elasticsearch from https://www.elastic.co/cn/downloads/elasticsearch (this article takes the Windows version as an example).
- Unzip the files.
- Download and install the JDK and set the Java environment variables.
- Go to the unzipped bin directory and double-click elasticsearch.bat to run it.
- Verify that Elasticsearch started successfully: open http://localhost:9200 in a browser; a response like the one below means the installation succeeded.
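The exact response depends on the installed version; a typical reply looks roughly like this (field values here are illustrative):

```json
{
  "name" : "DESKTOP-NODE1",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "7.15.2"
  },
  "tagline" : "You Know, for Search"
}
```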
3. FSCrawler
- Next, download FSCrawler from https://fscrawler.readthedocs.io/en/fscrawler-2.7/installation.html (this article uses the Windows version as an example).
- Unzip the files.
- Create a file crawling job by starting FSCrawler with a job name, as shown below.
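A minimal sketch of the command, run from the unzipped FSCrawler directory; here `test` is the job name we chose (the job name also becomes the Elasticsearch index name):

```
bin\fscrawler.bat test
```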
- After running the above command, the program asks whether the job should be created.
- Select y, and the program creates the following configuration file in the user directory (by default under ~/.fscrawler/test/_settings.yaml); we then edit it for our task:
```yaml
---
name: "test"
fs:
  url: "d:\\test"        # monitor the D:\test directory on Windows
  update_rate: "15m"     # scan at 15-minute intervals
  excludes:
  - "*/~*"               # exclude files whose names start with ~
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
```
- After saving the configuration, we can start the FSCrawler crawler:
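Starting uses the same command as creating the job; once the configuration file exists, FSCrawler simply runs the job on its configured 15-minute schedule:

```
bin\fscrawler.bat test
```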
- After startup succeeds, an additional status file appears in the configuration directory; it records the crawler's periodic runs:
```json
{
  "name" : "test",
  "lastrun" : "2021-11-27T09:00:16.2043064",
  "indexed" : 0,
  "deleted" : 0
}
```
4. Search UI
- Finally, we download the front-end page from https://github.com/elastic/search-ui.
- After the file is unzipped, open the examples\elasticsearch directory with VS Code.
- Then modify the search.js, buildRequest.js, and buildState.js files in turn.
- Modify search.js: point the search request at our job's index.
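The file differs between Search UI versions, so treat this as a sketch: the only change needed is to send the query built by buildRequest.js to the `test` index (created by FSCrawler) on the local node:

```js
// search.js — a hedged sketch: POST the query body built by buildRequest.js
// to the FSCrawler index ("test") on the local Elasticsearch node
export default async function runRequest(body) {
  const response = await fetch("http://127.0.0.1:9200/test/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body)
  });
  return response.json();
}
```

Note that calling Elasticsearch directly from the browser may require enabling CORS on the node (http.cors.enabled and http.cors.allow-origin in elasticsearch.yml).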
- Modify buildRequest.js: build the Elasticsearch query from the page's search state.
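Again as a sketch, assuming FSCrawler's default mapping: the extracted document text lives in `content` and the file name in `file.filename`, so we search those fields and request highlights for snippets:

```js
// buildRequest.js — a hedged sketch: build the Elasticsearch query body
// from the Search UI state (search term plus paging)
export default function buildRequest(state) {
  const { searchTerm, current = 1, resultsPerPage = 10 } = state;
  return {
    from: (current - 1) * resultsPerPage,
    size: resultsPerPage,
    query: {
      multi_match: {
        query: searchTerm,
        fields: ["content", "file.filename"]
      }
    },
    // ask Elasticsearch for highlighted fragments of the matched content
    highlight: { fields: { content: {} } }
  };
}
```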
- Modify buildState.js: map the returned hits into results the page can render.
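Also a sketch: each hit becomes a Search UI result, and the download link is assembled from the IIS base address (see the note below) plus FSCrawler's `path.virtual`, the file's path relative to the crawled directory. The FILE_SERVER address here is a placeholder; substitute your own IIS binding:

```js
// buildState.js — a hedged sketch: map Elasticsearch hits to Search UI state
const FILE_SERVER = "http://127.0.0.1:8080"; // placeholder: the IIS download site

function buildResults(hits) {
  return hits.map(hit => ({
    id: { raw: hit._id },
    filename: { raw: hit._source.file.filename },
    content: {
      raw: hit._source.content,
      // prefer the highlighted fragment as the snippet when present
      snippet: hit.highlight && hit.highlight.content && hit.highlight.content[0]
    },
    // path.virtual is the file's path relative to the crawled root (fs.url)
    url: { raw: FILE_SERVER + hit._source.path.virtual }
  }));
}

export default function buildState(response, resultsPerPage) {
  const totalResults = response.hits.total.value;
  return {
    results: buildResults(response.hits.hits),
    totalResults,
    totalPages: Math.ceil(totalResults / resultsPerPage)
  };
}
```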
- Note: so that users can download files directly through the links on the search page, we expose the crawled directory as a file download service through IIS.
- That IIS address is the download base reflected in buildState.js.
- Finally, we modify app.js so that the fields returned by the search match the field names used on the page.
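A sketch of the wiring, assuming the three files above; the field names (`filename`, `url`) must match what buildState.js produces:

```jsx
// app.js — a hedged sketch: wire the custom request/state builders into
// Search UI and render results with filename as title and url as link
import React from "react";
import { SearchProvider, SearchBox, Results } from "@elastic/react-search-ui";
import runRequest from "./search";
import buildRequest from "./buildRequest";
import buildState from "./buildState";

const config = {
  onSearch: async state => {
    const requestBody = buildRequest(state);
    const responseJson = await runRequest(requestBody);
    return buildState(responseJson, state.resultsPerPage);
  }
};

export default function App() {
  return (
    <SearchProvider config={config}>
      <div className="App">
        <SearchBox />
        <Results titleField="filename" urlField="url" />
      </div>
    </SearchProvider>
  );
}
```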
5. Run the test
- Install the dependencies and run the program:

```
# install dependencies
npm install
# run the dev server
npm start
```
- Place files in the directory monitored by FSCrawler.
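Before testing from the page, one can confirm that FSCrawler has indexed the files by querying the index directly (a sketch; `test` is the job/index name and `pump` is just an example keyword):

```
curl "http://127.0.0.1:9200/test/_search?q=pump&pretty"
```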
- Test the search: enter a keyword on the page and verify that matching documents and their download links are returned.