ElasticSearch study notes (1): search engine introduction and ElasticSearch installation

Make it a habit: like first, then read!!

1. Search Engine

1.1- What is a search engine

Let's first see what a search engine is according to an official definition. We will explain the concept in plainer terms afterwards.

Wikipedia's introduction to search engines:

A search engine (English: search engine) is an information retrieval system designed to assist in searching for information stored in a computer system. Search results are generally called "hits" and are usually presented as a list. The Internet search engine is the most common and most public kind of search engine; its function is to search the information stored on the World Wide Web.

In fact, Wikipedia puts it well. Put plainly, a search engine is a tool that helps us retrieve information quickly.

But some of you will say: I know the concept, yet I really don't think I have ever used a search engine.

Believe me, we actually come into contact with search engines every day. Here is a very simple example.

If we use Google Chrome, the settings contain an option for changing the search engine. Once you see the options listed there, you know what search engines are. Chrome offers several choices, and the ones we often use, such as Baidu and 360, are search engines. Of course, Alibaba's Quark, Sogou and UC are search engines too.

1.2- Why are search engines so fast?

Since we are talking about searching with a search engine, we naturally think of searching in a database. So why do we end up searching through a search engine rather than through the database, when both are searches?

In fact, the single biggest feature of a search engine is that its searches are very fast. We all know that once a database table reaches the million-row level, search performance drops noticeably and the SQL has to be optimized to bring the speed back up. A search engine, however, uses a different underlying search algorithm from a database, and that is what makes its searches feel lightning fast.

At this point we will certainly ask: fast is fast, but how do their algorithms actually differ? Come on! Just teach me!

To see why a search engine is so fast, we need to compare it side by side with a database. That shows where its strength comes from.

1.2.1- Forward Index

Let's first explain the search algorithm underlying a database, the forward index:

Before explaining the forward index, we need to understand part of how a database searches. I believe everyone knows the concept of a primary key in a database, so we need to be clear about the following points on primary keys and on the content being searched for. Understanding them first makes the rest easier to follow:

  • The primary key is generally defined as a numeric type, that is, int, and generally not as a character type. Leaving aside industry-specific keys such as ID card numbers or phone numbers, the primary key is generally an int.
  • Primary keys are generally used in the back end and not shown on front-end pages. Even when something is shown on the front end, it is a sequence number written by the front end, that is, an ascending key: 1, 2, 3, 4, ..., n. So users of the front-end search function generally do not search by primary key directly. After all, they do not know what the primary key is or where to look for it.
  • Users generally search with a string, so the search content generally cannot be matched directly against the primary key. As the first point stressed, the primary key is generally defined as an int.
  • Users generally do not search by primary key directly. As the second point said, the primary key is generally used in the back end and not shown on front-end pages, so the user generally has no idea what the primary key is, because it is never visible.

With these four points in mind, we can explain the forward index. The concept is actually very simple: search in primary key order. After finding an object by its primary key, compare the object's attributes one by one with the content the user entered. Stop if one matches; if not, move on to the next primary key and repeat the process.

Next, let's understand it through a simple example:
Suppose we go to a classroom to find a student called Xiaoming, but we only know the students' student numbers. Obviously, our search has to go through the students in student-number order and check each one until we find Xiaoming.

Obviously this is not efficient enough. First, we have to find the objects one by one in primary key order. Second, we have to check whether each attribute of each object matches the content we are searching for. If an object has many complicated attributes, the process becomes even more time-consuming.
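The sequential lookup described above can be sketched in a few lines of bash. This is only an illustration: the student numbers and names are made-up sample data, and `find_student` is a hypothetical helper, not something a database exposes.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: index = student number (primary key), value = name.
students=([1]="Xiaohong" [2]="Xiaogang" [3]="Xiaoming" [4]="Xiaoli")

# Walk the primary keys in ascending order and compare each record with the
# search text. Prints the matching key and how many records were inspected.
find_student() {
  local wanted=$1 checked=0 id
  for id in "${!students[@]}"; do
    checked=$((checked + 1))
    if [[ ${students[$id]} == "$wanted" ]]; then
      echo "found id=$id after checking $checked record(s)"
      return 0
    fi
  done
  echo "not found after checking $checked record(s)"
  return 1
}

find_student "Xiaoming"   # prints: found id=3 after checking 3 record(s)
```

The number of records inspected grows with the position of the match, and a search for a name that is not present has to inspect every record.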

1.2.2- Inverted Index

Next, we will explain the search algorithm of search engines, the inverted index:

The inverted index stores the data in another way. After being connected to the data in the database, it restructures that data: it first splits each attribute of every object into terms (word segmentation), and once that is done it binds the terms to their primary keys. This binding is no longer in primary key -> attributes form but in term -> primary keys form, so a search can match terms directly and then read off the primary keys it was looking for.
This clearly reduces query time a great deal, because the search content can be matched straight to primary keys, without the trouble of first finding each object and then comparing its attributes.

If this is still hard to follow, let's deepen our understanding with the following example:
Suppose our database holds a small table of records, and we search for the content "Kung Fu Panda". The database's search process is then clearly the one described earlier: walk the rows in primary key order and compare the attributes of each row with the search content.

Now let's look at the search process of a search engine:

First, the data is restructured so that each term produced by word segmentation points to the primary keys of the records that contain it.

After the data has been restructured this way, the search works as follows: the search content is split into terms, each term is only checked against the stored terms, and the primary keys recorded for the matching terms are combined by taking their intersection. This makes the search much faster.
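The restructuring and the intersection step can be sketched in bash as follows. This is a toy model under stated assumptions, not how ElasticSearch or Lucene is implemented: the titles are made-up sample data, word segmentation is simply "split on spaces and lowercase", and `search` is a hypothetical helper. It needs bash 4 or later for associative arrays.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: index = primary key, value = title.
titles=([1]="Kung Fu Panda" [2]="Kung Fu Hustle" [3]="Panda Express")

declare -A index   # term -> space-separated list of primary keys

# Build step: split every title into lowercase terms and record term -> key.
for id in "${!titles[@]}"; do
  for term in ${titles[$id],,}; do
    index[$term]+="$id "
  done
done

# Query step: look up each query term and intersect the key lists.
search() {
  local result="" first=1 term id keep
  for term in ${1,,}; do
    if ((first)); then
      result=${index[$term]:-}
      first=0
    else
      keep=""
      for id in $result; do
        if [[ " ${index[$term]:-} " == *" $id "* ]]; then
          keep+="$id "
        fi
      done
      result=$keep
    fi
  done
  echo "${result% }"
}

search "Kung Fu Panda"   # prints: 1
```

Each term lookup is a direct access into the map; no record is visited unless it already contains at least one of the query terms.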

What follows is my own idea and may not be right, so just take it as food for thought!!
If we use two data structures to represent the forward index and the inverted index, it could look like this:

Linked list - forward index

Every search must proceed in order. Just like a linked list, the search has to start from the head, and the comparison process is also like that of a linked list.

Map - inverted index

It also has to search in order, but the search process becomes much easier. After a match, the corresponding primary key value can be retrieved directly, similar to the get() method of a Map, which directly returns the value for a key.

1.3- What are the mainstream search engine technologies

Now that we fully understand what a search engine is, let's look at which technologies today's mainstream search engines use.

At present, there are two main search engine technologies:

  • Solr

  • ElasticSearch

Next, we briefly introduce the two:

In fact, both Solr and ElasticSearch are built on top of Apache Lucene; Solr was developed first and ElasticSearch later. Their basic functions are not very different, but they do differ in some specific areas.

Solr:
Advantages:

  • Supports multiple data formats: json, xml, html, etc.
  • More mature and stable (after all, it was developed first, and experience counts)
  • Searches faster when real-time search is not required

Disadvantages:

  • Search speed drops noticeably when indexing runs at the same time to provide real-time search

ElasticSearch:
Advantages:

  • Supports real-time search without a drop in search speed
  • Supports distributed deployment

Disadvantages:

  • The degree of automation is not high enough

After introducing all of the above, it is time to learn to use one of them. I chose ElasticSearch because it is friendlier to beginners and relatively simple to configure.

The next step is the installation of ElasticSearch.

2. ElasticSearch installation steps

2.1- Installation environment

  • The installation environment:
    CentOS 7 + JDK 1.8
    

2.2- Main configuration files

  • Configuration files: elasticsearch.yml (mainly ElasticSearch cluster settings) and jvm.options (JVM memory settings)

2.3- Create a folder, upload the files and unpack them

  • Create the folder, upload the files and unpack them

    mkdir -p /opt/es
    

    Upload our files to this directory.

    At this point the uploaded files have not been granted any permissions, so we need to assign permissions to them.

    Unpack the archive:

    tar -zxvf elasticsearch-6.3.1.tar.gz

2.4- Modify the configuration file

  • Modify the configuration files

    ES needs high limits for the maximum number of threads, the maximum memory, and the maximum number of open files.

    On CentOS 6, all three need to be configured, otherwise Linux will not allow the environment to use that many threads.

    On CentOS 7, only the maximum number of open files needs to be configured.

    We also change the JVM memory settings, mainly because the default memory of ElasticSearch is too large and may exceed what our server can handle. The default here is 1G.

    Here we change it to 256M.
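A minimal sketch of this change, assuming the heap is set by the `-Xms` and `-Xmx` lines of config/jvm.options and that they currently read `-Xms1g` and `-Xmx1g` as described above. `set_es_heap` is a hypothetical helper name, and the path in the usage comment assumes the /opt/es layout from section 2.3.

```shell
#!/usr/bin/env bash
# set_es_heap FILE SIZE: rewrite the -Xms and -Xmx lines of a jvm.options file
# so that both use SIZE. A backup copy is kept next to the file as FILE.bak.
set_es_heap() {
  local file=$1 size=$2
  sed -i.bak -E "s/^-Xm([sx]).*/-Xm\1${size}/" "$file"
}

# Example (path assumed, adjust to your installation):
# set_es_heap /opt/es/elasticsearch-6.3.1/config/jvm.options 256m
```

Only lines that start with `-Xms` or `-Xmx` are touched, so commented-out examples in the file stay as they are.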

    Then try to start ElasticSearch.

    We may then run into a problem: ElasticSearch refuses to start.

    The reason is that many large companies have adopted ElasticSearch as their search engine technology. Before version 5.0, ElasticSearch could be run as the root user, and after using it the large companies discovered that this was a security vulnerability: some attackers used it to obtain the root user's password and other information directly.

    So from version 5 on, ElasticSearch adopted this rule: no operation may be performed as root any more, and a separate user must be created to run ElasticSearch.

    If we start ElasticSearch as we are now, it is still being started by the root user, so we need to create a new user and start ElasticSearch as that user.

    # create a new user
    adduser es
    # switch to the es user
    su es
    

    After switching, we can see that the user name at the front of the prompt has changed, and so has the symbol before the command: it is no longer # but $.

    Then let's try to start ElasticSearch again.

    This time we hit another problem: our es user does not have permission to access the file jvm.options.

    So we need to switch back to the root user and change the access permissions for the es user.

    # switch to the root user
    su root
    # go up one directory
    cd ..
    # enter config
    cd config
    # give every file under config full permissions
    chmod 777 *
    

    Now our es user can access the jvm.options file.

    After that we start ElasticSearch again, but we run into yet another problem.

    This time the es user does not have permission to access the data folder (data is the directory for ES data and logs).

    To settle all the permission problems we might still run into, we decided to switch to the root user and open up the permissions of every file under the ElasticSearch root directory. I do not recommend doing this, though. It is better to start ElasticSearch, see where it reports insufficient permissions, and then switch to root each time to open up only the files concerned.

    After switching to the root user, use the following command to open up all files under the ElasticSearch directory recursively.

    chmod 777 -R elasticsearch-6.3.1 
    

    Now the permissions of all files are open. We switch back to the es user for the next steps.
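Because 777 makes every file writable by every account, a narrower alternative worth considering is to hand ownership of the installation directory to the es user instead. This is a suggestion of this sketch, not what the steps above do. `give_ownership` is a hypothetical helper, and changing the owner to another user has to be done as root.

```shell
#!/usr/bin/env bash
# give_ownership USER DIR: make USER the owner of DIR and everything below it,
# then give the owner read/write access (X adds execute only on directories
# and on files that already have an execute bit) and remove write access for
# group and others.
give_ownership() {
  local user=$1 dir=$2
  chown -R "$user" "$dir"
  chmod -R u+rwX,go-w "$dir"
}

# Example, as root (path assumed, adjust to your installation):
# give_ownership es /opt/es/elasticsearch-6.3.1
```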

    Next we need to configure the default IP address and port number so that our ElasticSearch can be reached from outside.

    This is configured in the elasticsearch.yml file.

    There we mainly configure these two parameters, the IP address and the port number:

    If it is not a cloud server, the two parameters can be set directly.

    On a cloud server this does not work. If we configure it the same way, ElasticSearch reports an error when we start it, and the address has to be configured differently.

    Remember that the public IP address of the cloud server must not be filled in here, otherwise ElasticSearch cannot be reached.

    Once this is configured on a cloud server, we also need to open ports 9200 and 9300 in the firewall and in the Alibaba Cloud console, otherwise it still cannot be reached.
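A sketch of what this step might look like, written as shell so it can be replayed. The original screenshots are not available, so the two settings are assumed to be `network.host` and `http.port`; `0.0.0.0` (bind all interfaces) is one common choice on a cloud server, not necessarily the value used here. The helper names are hypothetical. The firewall commands assume firewalld, the default on CentOS 7, and need root.

```shell
#!/usr/bin/env bash
# configure_es_network FILE HOST PORT: append the bind address and the HTTP
# port to an elasticsearch.yml file.
configure_es_network() {
  local file=$1 host=$2 port=$3
  printf 'network.host: %s\nhttp.port: %s\n' "$host" "$port" >> "$file"
}

# open_es_ports: open 9200 (HTTP) and 9300 (transport) in firewalld.
# Run as root.
open_es_ports() {
  firewall-cmd --permanent --add-port=9200/tcp
  firewall-cmd --permanent --add-port=9300/tcp
  firewall-cmd --reload
}

# Example (path and values assumed, adjust to your installation):
# configure_es_network /opt/es/elasticsearch-6.3.1/config/elasticsearch.yml 0.0.0.0 9200
# open_es_ports
```

The helper only appends, so it is meant for a file in which both settings are still commented out; if either setting is already active, edit that line instead of adding a second one.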

    With this configuration the error above no longer appears, but another problem is reported after restarting.

    It means that ElasticSearch looks down on our current system: the maximum number of files the system can open and the maximum amount of memory it can use are too low and must be raised to ElasticSearch's minimum requirements.

    So we have to modify the Linux configuration to meet the startup requirements of ES. This must be done as the root user, otherwise we will be told that permissions are insufficient:

    1. Modify the Linux limits configuration file to set the thread and file limits

    The location of this file: /etc/security/limits.conf

    Add the following lines:

    # a soft limit must not exceed its hard limit
    * hard nofile 65536
    * soft nofile 65536
    * hard nproc 4096
    * soft nproc 2048
    

    These lines need to be written before the # End of file marker, otherwise they will not take effect. On a cloud server, the parameters that follow # End of file need to be changed as well, otherwise the same error is reported after startup.

    limits.conf is not a shell script, so it cannot be reloaded with source. The new limits apply to login sessions started after the change; log in again and check the value:

    ulimit -Hn

    2. Modify the Linux sysctl configuration file to set the memory the system may use

    File location: /etc/sysctl.conf

    Add the following lines:

    vm.max_map_count=655360
    fs.file-max=655360
    

    After saving and exiting, we need to make the configuration take effect

    sysctl -p
    

    The configuration has now taken effect.
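To confirm the new values before starting ElasticSearch, a small check like the following can help. It is a sketch: `meets_minimum` and `check_es_limits` are hypothetical helpers, and the thresholds (65536 open files, 4096 threads, 262144 for vm.max_map_count) are the minimums that ElasticSearch 6.x is understood to enforce at startup, so treat them as assumptions and compare them with the error message you actually see.

```shell
#!/usr/bin/env bash
# meets_minimum NAME ACTUAL REQUIRED: report whether ACTUAL >= REQUIRED.
# The value "unlimited" (as printed by ulimit) always passes.
meets_minimum() {
  local name=$1 actual=$2 required=$3
  if [[ $actual == unlimited ]] || (( actual >= required )); then
    echo "ok   $name = $actual (need >= $required)"
  else
    echo "LOW  $name = $actual (need >= $required)"
    return 1
  fi
}

# check_es_limits: run as the es user in a fresh login session.
# Returns non-zero if any value is below its threshold.
check_es_limits() {
  local rc=0
  meets_minimum "open files (hard)" "$(ulimit -Hn)" 65536 || rc=1
  meets_minimum "threads (hard)" "$(ulimit -Hu)" 4096 || rc=1
  meets_minimum "vm.max_map_count" "$(cat /proc/sys/vm/max_map_count)" 262144 || rc=1
  return $rc
}

# Example:
# check_es_limits
```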

    After that we start ElasticSearch as the es user again, and this time the startup succeeds.

    Although the log shows the publish address as our intranet IP address, from a browser we can reach it directly at public-IP:9200.

With that, our ElasticSearch is installed and started successfully.
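The same check can be made from the command line instead of a browser. A minimal sketch, assuming curl is installed; `es_is_up` is a hypothetical helper, and the host and port in the usage comment are placeholders for whatever was configured in elasticsearch.yml.

```shell
#!/usr/bin/env bash
# es_is_up HOST PORT: succeed if an HTTP GET to http://HOST:PORT/ returns a
# successful response within 5 seconds.
es_is_up() {
  curl -fsS --max-time 5 "http://$1:$2/" > /dev/null
}

# Example (replace the address with the one you configured):
# es_is_up 127.0.0.1 9200 && echo "ElasticSearch is answering"
```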

Writing this up was not easy. If you found it helpful, you can follow my official account. A newcomer needs your support!!!


Origin blog.csdn.net/lovely__RR/article/details/110678315