Make it a habit: like first, then read!!
1. Search Engine
1.1- What is a search engine
Let's start with the official definition of a search engine, and then explain the concept in plainer terms.
Wikipedia's introduction to search engines:
A search engine is an information retrieval system designed to assist in searching for information stored in a computer system. Search results are generally called "hits" and are usually presented as a list. Web search engines are the most common and most public kind of search engine; their function is to search information stored on the World Wide Web.
Wikipedia's definition is on point. Put plainly, a search engine is a tool that helps us retrieve information quickly.
Some of you will say: I get the concept, but I don't think I have ever really used a search engine.
Believe me, we come into contact with search engines every day. Here is a very simple example.
If we use Google Chrome, the settings contain an option for changing the search engine. Looking at the choices there tells us what counts as a search engine. Chrome offers several options, and the familiar Baidu and 360 are both search engines. Of course, Alibaba's Quark, Sogou, and UC are search engines as well.
1.2- Why are search engines so fast?
Talking about search naturally brings database queries to mind, so we might ask: for the same search, why go through a search engine instead of searching the database directly?
The single biggest feature of a search engine is that its searches are very fast. We all know that once a database holds data at the million-row level, its search performance drops noticeably, and the SQL has to be optimized to bring the speed back up. A search engine's underlying search algorithm is different from a database's, and that is what makes searching with it feel like flying.
At this point you are surely thinking: fast is fast, but how exactly do the algorithms differ? Come on, just teach me!
To see why a search engine is so fast, we have to compare it side by side with a database. Only then is it clear where its strength comes from.
1.2.1- Forward Index
Let's first explain the search algorithm that underlies a database: the forward index.
Before explaining the forward index, we need to understand part of how a database search works. I believe everyone is familiar with the concept of a primary key in a database. We need to be clear about the following points concerning primary keys and searching for content. Understand these first, and the rest will be easier to follow:
- The primary key is generally defined as a numeric type, that is, int, and generally not as a character type. Industry-specific primary keys such as ID card numbers and phone numbers are the exception; otherwise the primary key is generally an int.
- Primary keys usually live in the backend and are generally not displayed on front-end pages. Even when something is displayed on the front end, it is a serial number written by the front end, that is, an ascending key: 1, 2, 3, 4, ..., n. So users of the front-end search function generally do not search by primary key directly. After all, they do not know what the primary key is or where to look for it.
- Users usually search with a string, so a search generally does not match the primary key directly. As the first point stressed, the primary key is generally defined as an int.
- Users generally do not search by primary key directly. As the second point said, the primary key generally stays in the backend and is not shown on the front-end page, so users do not know the primary key at all, because they never see it.
With these four points in mind, we can explain the forward index. The concept is actually very simple: search in primary-key order. After finding an object by its primary key, match the object's attributes one by one against the content the user entered. Stop if there is a match; if not, move on to the next primary key and repeat the process.
Next, let's use a simple example to understand it:
Suppose we go to a classroom to find a student named Xiaoming, but we only know the students' student numbers. Our search process would obviously look like this:
Obviously this is not efficient enough. First, we have to locate each object in primary-key order. Second, we have to check whether each attribute of that object matches what we are searching for. If an object has a great many attributes, the process takes even longer.
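The scan described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the student rows, the attribute names, and the function forward_scan are all made up for this example, and a real database does far more than this.

```python
# Hypothetical sample rows: primary key -> attributes (all names are made up).
students = {
    1: {"name": "Xiaohong", "hobby": "painting"},
    2: {"name": "Xiaogang", "hobby": "football"},
    3: {"name": "Xiaoming", "hobby": "chess"},
    4: {"name": "Xiaoli", "hobby": "reading"},
}


def forward_scan(rows, keyword):
    """Visit rows in primary-key order and compare each attribute to keyword.

    Returns (matching primary key or None, number of attribute comparisons).
    """
    comparisons = 0
    for primary_key in sorted(rows):
        for value in rows[primary_key].values():
            comparisons += 1
            if value == keyword:
                return primary_key, comparisons  # stop at the first match
    return None, comparisons


found_key, cost = forward_scan(students, "Xiaoming")
print(found_key, cost)
```

The comparison counter makes the cost visible: the further down the table the match sits and the more attributes each row has, the more comparisons are needed, and a keyword that is not present forces a check of every attribute of every row.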
1.2.2- Inverted Index
Next, we explain the search algorithm used by search engines: the inverted index.
The inverted index stores data in a different way. After being bound to the data in the database, it restructures that data: it first splits each attribute of every object into tokens (word segmentation), and once that is done it binds the resulting tokens to their primary keys. This binding is no longer in the form primary key -> attributes, but in the form token -> primary keys. As a result, a search can match against the attribute tokens directly and then read off the matching primary keys. If that is hard to follow, the example below should help:
Obviously this greatly reduces query time, because we can match our search content directly to the primary keys of the objects, instead of going to the trouble of first finding each object and then checking its attributes.
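As a rough sketch of the token -> primary keys structure, the following Python builds a tiny inverted index. The movie rows are hypothetical, and the whitespace tokenizer is a deliberate simplification: real engines use proper analyzers, and Chinese text in particular needs a real word segmenter.

```python
from collections import defaultdict

# Hypothetical rows: primary key -> title (illustrative data only).
movies = {
    1: "Kung Fu Panda",
    2: "Kung Fu Hustle",
    3: "Panda Diaries",
}


def tokenize(text):
    """Very naive tokenizer: lowercase the text and split on whitespace."""
    return text.lower().split()


def build_inverted_index(rows):
    """Map each token to the set of primary keys whose text contains it."""
    index = defaultdict(set)
    for primary_key, text in rows.items():
        for token in tokenize(text):
            index[token].add(primary_key)
    return dict(index)


index = build_inverted_index(movies)
for token in sorted(index):
    print(token, sorted(index[token]))
```

After the index is built once, answering "which rows contain panda" is a single dictionary lookup, with no need to walk the rows in primary-key order.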
If that is still unclear, the following example should deepen the understanding:
Suppose the data in our database looks like this:
Suppose we search for Kung Fu Panda. Then the database's search process obviously looks like this:
Now let's look at how a search engine handles the same search:
First, we restructure the data like this:
With the data restructured this way, the search engine's process is as follows: after the query is split into tokens, each token only needs to be checked against the stored tokens, and the primary keys are recorded by taking the intersection of the results. This makes the search much faster.
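The intersection strategy can be sketched as follows. The token table is written by hand with hypothetical primary keys, and search_all_tokens is a name invented for this example; it only shows the idea of narrowing the result set token by token.

```python
# A hand-written token -> primary keys table (hypothetical data).
inverted_index = {
    "kung": {1, 2, 5},
    "fu": {1, 2, 5},
    "panda": {1, 3, 5},
    "hustle": {2},
}


def search_all_tokens(index, query):
    """Return the primary keys that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Start from the first token's keys, then narrow with each further token.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


print(sorted(search_all_tokens(inverted_index, "Kung Fu Panda")))
```

Each token costs one lookup, and every extra token can only shrink the candidate set, so a query never has to touch rows that fail to contain all of its tokens.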
What follows is my own idea and may not be right, so just take it as a bit of fun!!
If we use two data structures to represent the forward index and the inverted index, it could look like this:
Linked list: forward index
Every search has to proceed in order, just as a linked list must be searched from the head, and the matching also resembles the comparison process in a linked list.
Map: inverted index
The tokens are still looked up one after another, but the search process becomes much easier. Once a token matches, the corresponding primary key values can be retrieved directly, much like Map's get() method, which returns the value for a key directly.
1.3- What are the mainstream search engine technologies
Now that we understand what a search engine is, let's look at which technologies today's mainstream search engines use.
At present, there are two main search engine technologies:
- Solr
- ElasticSearch
Next, we briefly introduce the two:
In fact, both Solr and ElasticSearch are built on top of Apache Lucene. Solr was developed first and ElasticSearch came later. Their basic functions do not differ much, but they do differ in some specific areas.
Solr:
Advantages:
- Supports multiple data formats: json, xml, html, etc.
- More mature and stable (after all, it was developed first, and old ginger is the spiciest)
- Searches faster when real-time search is not needed
Disadvantages:
- Search speed drops noticeably when indexing runs at the same time to provide real-time search
ElasticSearch:
Advantages:
- Supports real-time search, with no drop in search speed
- Supports distributed deployment
Disadvantages:
- The degree of automation is not high enough
With all of that covered, it is time to learn to use one of them. I chose ElasticSearch because it is friendlier to beginners and relatively simple to configure.
The next step is the installation of ElasticSearch.
2. ElasticSearch installation steps
2.1- Installation environment
- The installation environment: CentOS 7 + JDK 1.8
2.2- Main configuration files
- Configuration files: elasticsearch.yml (mainly configures the ElasticSearch cluster information) and jvm.options (JVM memory settings)
2.3- Create a folder and upload files and unzip
- Create the folder, upload the files, and unzip them
mkdir -p /opt/es
Upload our files to this directory.
At this point no permissions have been granted on these files, so you need to assign permissions to them.
Unzip the files:
tar -zxvf elasticsearch-6.3.1.tar.gz
2.4- Modify the configuration file
- Modify the configuration file
ES depends on three system limits: the maximum number of threads, the maximum memory, and the maximum number of open files.
On CentOS 6, all three need to be configured; otherwise Linux will not allow the process to use that many threads.
On CentOS 7, only the maximum number of open files needs to be configured.
The main change is in jvm.options: ElasticSearch's default memory setting is too large and may exceed what our server can bear. The default here is 1G (the -Xms and -Xmx heap settings).
Here we change it to 256M.
Then try to start:
We may then run into this problem:
The reason is as follows. After ElasticSearch 5.0, many large companies began to use ElasticSearch as their search engine technology. After using it, they discovered a security vulnerability: because ElasticSearch was then run as the root user, some hackers exploited this to obtain the root user's password and other information, and information was leaked. So from version 5 onward, ElasticSearch adopted a new policy: operations can no longer be performed as root, and a separate user must be created to run ElasticSearch. If we start ElasticSearch with the default setup, it is still started by the root user, so we need to create a new user and start ElasticSearch as that user.
# Create a new user
adduser es
# Switch to the es user
su es
After switching, we can see that the user name at the front of the prompt has changed, and so has the symbol before the command: it is no longer # but $.
Then let's try to start elasticSearch again.
Then we hit another problem, which means that our es user does not have permission to access the jvm.options file.
So we need to switch back to the root user and change the access permissions for the es user:
# Switch to the root user
su root
# Go up to the parent directory
cd ..
# Enter config
cd config
# Give every file under config full permissions
chmod 777 *
Now our es user can access the jvm.options file.
After that, we restart elasticSearch, but run into the following problem:
The problem is mainly that the es user does not have permission to access the data folder (data is the directory for the es software and log data).
To get rid of every permission problem we might hit later, we decided to switch to the root user and open up the permissions of all files under the elasticSearch root directory. However, I do not recommend doing this. It is better to start ElasticSearch and, whenever it reports insufficient permissions, switch to the root user and open up the permissions of just that file.
After switching to the root user, use the following command to open up all files under elasticSearch recursively.
chmod 777 -R elasticsearch-6.3.1
Now the permissions of all files are open. Switch back to the es user for the next steps.
Next we need to configure the default IP and port number so that elasticSearch can be reached from the external network.
We configure this in the elasticsearch.yml file:
After opening it, we mainly configure these two parameters:
If you are not on a cloud server, you can configure it directly as shown below:
If you are on a cloud server, you cannot configure it this way. If you still use the configuration above, starting elasticSearch produces the following error:
In that case we need to configure it like this:
Remember not to fill in the cloud server's public IP address here, otherwise the connection will fail.
Once this is configured, on a cloud server we also need to open ports 9200 and 9300 in both the firewall and the Alibaba Cloud console; otherwise it still cannot be reached.
With this configuration the error above no longer appears, but another problem is reported after restarting:
It means that elasticSearch looks down on our current system: the maximum number of files the system can open and the maximum amount of memory it can use are too low and must be raised to its minimum requirements.
So we have to modify the Linux configuration (to meet the startup requirements of es). This must be done as the root user, otherwise it will report insufficient permissions:
1. Modify the Linux limits configuration file to set the memory, thread, and file limits
The location of this file: /etc/security/limits.conf
Add the following lines (the soft limit must not be larger than the hard limit):
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
These lines need to be written before # End of file, otherwise they will not take effect. On a cloud server, you also need to modify the parameters that follow # End of file, otherwise the same error is reported after startup.
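After that, the new limits need to take effect. limits.conf is not a shell script, so it cannot be loaded with source; it is applied by PAM when a user logs in. Log out and log back in as the es user, then confirm the new value with:
ulimit -n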
2. Modify the Linux sysctl configuration file to configure the memory the system may use
File location: /etc/sysctl.conf
Add the following lines:
vm.max_map_count=655360
fs.file-max=655360
After saving and exiting, make the configuration take effect:
sysctl -p
This configuration has taken effect.
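To double-check the two settings before starting ES again, a small Python sketch like the one below can read the values the current user actually gets. The thresholds are the minimums that ElasticSearch 6.x reports in its startup errors; the helper names is_enough and read_max_map_count are made up for this example, and the script should be run as the es user so that it sees that user's limits.

```python
import resource
from pathlib import Path

MIN_OPEN_FILES = 65536       # minimum file descriptors ElasticSearch 6.x asks for
MIN_MAX_MAP_COUNT = 262144   # minimum vm.max_map_count ElasticSearch 6.x asks for


def is_enough(current, minimum):
    """True when the limit satisfies the minimum (RLIM_INFINITY means unlimited)."""
    return current == resource.RLIM_INFINITY or current >= minimum


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read vm.max_map_count from procfs; returns None if the file is missing."""
    proc_file = Path(path)
    if not proc_file.exists():
        return None
    return int(proc_file.read_text().strip())


# Limits of the current process, inherited from the login session.
soft_nofile, hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files (soft/hard):", soft_nofile, hard_nofile,
      "ok" if is_enough(soft_nofile, MIN_OPEN_FILES) else "too low")

map_count = read_max_map_count()
if map_count is not None:
    print("vm.max_map_count:", map_count,
          "ok" if map_count >= MIN_MAX_MAP_COUNT else "too low")
```

If either line prints "too low", the corresponding file from the two steps above has not taken effect for this user yet.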
After that, we start elasticSearch again as the es user and can see that the startup succeeds:
Although the publish address it shows is our internal network IP address, from a browser we can access it directly at the public IP on port 9200.
With that, elasticSearch is installed and started successfully.
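The browser check can also be scripted. The sketch below assumes the node answers plain HTTP on port 9200 with no authentication; fetch_root and summarize are names invented for this example, the sample body is a trimmed illustration of the JSON a node returns at its root URL, and the address in the final comment is only a placeholder.

```python
import json
from urllib.request import urlopen


def fetch_root(host, port=9200, timeout=5):
    """GET http://host:port/ and return the decoded JSON body."""
    with urlopen(f"http://{host}:{port}/", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(info):
    """Pick the cluster name and version number out of the root response."""
    return f"{info['cluster_name']} running ElasticSearch {info['version']['number']}"


# Trimmed, illustrative response body; a real node returns more fields.
sample = json.loads(
    '{"cluster_name": "elasticsearch", "version": {"number": "6.3.1"}, '
    '"tagline": "You Know, for Search"}'
)
print(summarize(sample))

# Against a live node (the address is a placeholder):
# print(summarize(fetch_root("203.0.113.10")))
```

If the call to fetch_root times out on a cloud server, the usual suspects are the firewall and the cloud console port rules for 9200 mentioned earlier.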
Writing this up took effort. If you found it helpful, you can follow my official account. A newcomer needs your support!!!