The implementation of ip2region

Mobile Internet applications often need to run statistical analysis based on where their users are. There are generally two ways to get a user's location: GPS positioning and the user's IP address. Since not every phone has GPS turned on, and city-level precision is often enough, locating users by IP address is a good choice. Doing so requires a library that maps IP addresses to geographic locations, and a service built on that library that converts an IP into a location. Starting from these requirements, this article uses ip2region, which has 8.4k stars on GitHub, to analyze how such a mapping library is designed and how an IP can be converted into a geographic location quickly.

Introduction

IP location services are very common. Many companies offer paid services, such as Alibaba, Gaode, and Baidu, and there are also free public services such as GeoIP and Pure IP. These services are consumed either by parsing an HTML page or by calling an interface, but either way each lookup costs an HTTP request, and most services also limit QPS. The following table lists some common ways to look up a location by IP.

Open API service | Method | Limit | Sample
Taobao IP address library | Interface | Each user's QPS must be less than 1 | curl -d "ip=218.97.9.25&accessKey=alibaba-inc" http://ip.taobao.com/outGetIpInfo
Gaode Map | Interface | 100,000 requests per day per user; 30 million for enterprise developers | curl "https://restapi.amap.com/v3/ip?ip=218.97.9.25&key=f4cf14aca974dfbb0501c582ce3fce77"
GeoIP | HTML parsing | - | curl -d "ip=218.97.9.25&submit=提交" https://www.geoip.com
Pure IP | HTML parsing | - | curl http://www.cz88.net/ip/?ip=218.97.9.25

In day-to-day work, we usually need to convert a large volume of user request logs into location information for later analysis. The key requirements are handling large amounts of data and doing so quickly. Calling a public API for every record clearly cannot meet these needs.

Brute-force generation of the IP library

A simple, crude approach is to fetch the location of every public IP through an API in advance. Based on the TIPS below, traversing the 330 million IP addresses in China through the Taobao IP address library at 1 QPS would take about 10 years. With Gaode's enterprise developer quota of 30 million requests per day, it would take about 11 days, which is still acceptable.

TIPS:
IP addresses here mean IPv4 addresses. IPv4 uses 32-bit (4-byte) addresses, so the address space contains $2^{32} = 4294967296$ addresses, about 4.29 billion.
Some addresses are reserved for special uses, such as private networks (about 18 million) and multicast (about 270 million), which reduces the number of addresses that can be routed on the Internet.
According to statistics on Wikipedia, China holds about 330 million IPv4 addresses, while the United States holds about 1.54 billion.
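The traversal estimates above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 1 QPS limit, the 30 million daily quota, and the 330 million address count are the figures quoted in this article, and the variable names are made up for illustration.

```python
# Figures taken from the text: 330 million addresses,
# 1 request per second (Taobao), 30 million requests per day (Gaode enterprise).
CHINA_IPV4 = 330_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600

# One request per second means one second per address.
taobao_years = CHINA_IPV4 / 1 / SECONDS_PER_YEAR

# Daily quota of 30 million requests.
gaode_days = CHINA_IPV4 / 30_000_000

print(f"Taobao at 1 QPS: about {taobao_years:.1f} years")
print(f"Gaode enterprise quota: about {gaode_days:.0f} days")
```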

We fix the format of the location information as country|region|province|city|ISP. If the interface returns no information for a field, that field is filled with 0. Requesting the API for each address in order then produces a file like the following (addresses in increasing order; 内网IP means a private-network IP):

0.0.0.0|0|0|0|内网IP|内网IP
0.0.0.1|0|0|0|内网IP|内网IP
...
1.0.15.255|中国|0|广东省|广州市|电信
...
255.255.255.255|0|0|0|内网IP|内网IP

With this file, a program can load it into memory as a dictionary whose key is the IP address and whose value is the location information, and return a location in O(1) time. But let us roughly estimate how large that file would be.
Suppose the file is stored as UTF-8. The shortest possible record, 0.0.0.0|0|0|0|0|0, occupies 17 bytes, so the file is at least 17 × 4294967296 = 73014444032 B = 69632 MiB = 68 GiB. That is far too large for a program to hold in memory.
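As a quick check of this lower bound, the following sketch recomputes it in Python. It counts only the shortest record and ignores line terminators, so a real file would be larger.

```python
# Shortest possible record, as given in the text.
RECORD_BYTES = len("0.0.0.0|0|0|0|0|0".encode("utf-8"))
TOTAL_IPS = 2 ** 32

total_bytes = RECORD_BYTES * TOTAL_IPS
total_gib = total_bytes / 2 ** 30

print(RECORD_BYTES, total_bytes, total_gib)
```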

Space optimization

IP library file optimization

The file above shows that many adjacent IPs share the same location information (address blocks are usually allocated to a customer as contiguous ranges), so such records can be merged into a single record per range. This gives the following file (address segments in increasing order):

0.0.0.0|0.255.255.255|0|0|0|内网IP|内网IP
...
1.0.8.0|1.0.15.255|中国|0|广东省|广州市|电信
...
224.0.0.0|255.255.255.255|0|0|0|内网IP|内网IP

The latest ip.merge.txt in the ip2region repository has 658207 records in total, and the file size is 39 MB.
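The merging step can be sketched as follows. This is not the code used to produce ip.merge.txt; it is a minimal illustration that assumes the per-IP records are already sorted, that IPs are given as integers, and that location strings compare equal when they describe the same place. The function name and sample values are made up.

```python
from typing import Iterable, List, Tuple

def merge_adjacent(records: Iterable[Tuple[int, str]]) -> List[Tuple[int, int, str]]:
    """Collapse sorted (ip, location) pairs into (start_ip, end_ip, location) segments."""
    segments: List[Tuple[int, int, str]] = []
    for ip, location in records:
        last = segments[-1] if segments else None
        # Extend the current segment only if the IP is contiguous
        # and the location is unchanged.
        if last and last[2] == location and last[1] + 1 == ip:
            segments[-1] = (last[0], ip, location)
        else:
            segments.append((ip, ip, location))
    return segments

records = [(0, "A"), (1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "A")]
segments = merge_adjacent(records)
print(segments)
```

Note that the last record starts a new segment even though location "A" was seen before: merging only removes repetition between neighbours.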

IP address optimization

In the file above, IP addresses are stored as strings, while an IPv4 address is just a 32-bit number. Storing each address as an integer therefore saves a lot of space. For example, the shortest string, 0.0.0.0, takes 7 bytes and the longest, such as 111.111.111.111, takes 15 bytes, whereas as integers both take 4 bytes: 0.0.0.0 becomes int(0) and 111.111.111.111 becomes int(1869573999).
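A minimal sketch of this conversion is shown below. The name ip2long matches the method mentioned later in this article; long2ip is a hypothetical inverse added only for illustration.

```python
def ip2long(ip: str) -> int:
    """Convert a dotted IPv4 string to its 32-bit integer value."""
    parts = [int(p) for p in ip.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    a, b, c, d = parts
    return (a << 24) | (b << 16) | (c << 8) | d

def long2ip(value: int) -> str:
    """Convert a 32-bit integer back to a dotted IPv4 string."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(ip2long("0.0.0.0"))            # 0
print(ip2long("111.111.111.111"))    # 1869573999
print(long2ip(1869573999))           # 111.111.111.111
```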

Location information optimization

The file above also shows that the same location information appears under many different IP segments (a customer may apply for address segments at different times), so the IP library file still contains a lot of duplicated location information. We can keep a single copy of each location string and have every segment refer to it by pointer, or by file offset plus data length.
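The idea of storing each location once and referring to it by offset and length can be sketched like this. The function name and the sample strings are assumptions for illustration, not part of ip2region.

```python
from typing import Dict, List, Tuple

def dedupe_locations(
    segments: List[Tuple[int, int, str]]
) -> Tuple[bytes, List[Tuple[int, int, int, int]]]:
    """Return (data_area, index) where index rows are (start, end, offset, length)."""
    data = bytearray()
    seen: Dict[str, Tuple[int, int]] = {}   # location -> (offset, length)
    index = []
    for start_ip, end_ip, location in segments:
        if location not in seen:
            encoded = location.encode("utf-8")
            seen[location] = (len(data), len(encoded))
            data += encoded                  # written only the first time it is seen
        offset, length = seen[location]
        index.append((start_ip, end_ip, offset, length))
    return bytes(data), index

data_area, index = dedupe_locations([(0, 2, "CN|GD"), (3, 4, "US|CA"), (5, 5, "CN|GD")])
print(data_area, index)
```

The first and third segments end up pointing at the same five bytes, so the data area grows with the number of distinct locations rather than the number of segments.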

Optimized IP library

Applying the optimizations above produces the final IP library, ip2region.db, which is only 8.1 MB.

The structure of the IP library

The structure of the IP library file ip2region.db is divided into four parts: super block, header index area, data area, and index area. The details are shown in the figure below:

[Figure: structure of ip2region.db (ip2region_db.png)]

  • Super block
    Stores the start and end addresses of the index area. The first index pointer points to the start of the index area, that is, the first index block of the first index partition. The last index pointer points to the end of the index area minus 12, that is, the head address of the last index block of the last index partition. A query can therefore read just the 8 bytes of the super block to learn the address range of the index blocks.
  • Header index area
    A secondary index over the index blocks, used by the b+tree search. The total length of the index area divided by the length of one index partition, 12*(1024*8/12-1) bytes, gives the number of header index entries actually in use. The area is 2048*8 bytes in size and consists of 2048 header index blocks of 8 bytes each. In each header index block, the first four bytes store the start IP of the first index block of an index partition, and the last four bytes store the address of that index block.
    The header index area is sized at about 16 KB so that the whole area can be read with four disk reads and then searched in memory. That search identifies the index partition that contains the IP. The 8 KB index partition is then read into memory with two disk reads and searched in memory as well, which keeps the number of disk reads low.
  • Data area
    Stores the location data in the following format: 中国|华南|广东省|深圳市|鹏博士, where the fields are country, region, province, city, and operator.
  • Index area
    Composed of index blocks. Each index block occupies 12 bytes and holds the start IP, the end IP, and the data information. Within the data information, the first three bytes store the data address and the last byte stores the data length. Each index block corresponds to one record in ip.merge.txt and is the index entry for one IP segment.
    During a search, an index block is hit when the queried IP lies between its start IP and end IP. The data address and data length in that block are then used to read the location information from ip2region.db.
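To make the 12-byte index block layout concrete, here is a small packing and unpacking sketch. It assumes the integers are stored in little-endian byte order, which this article does not state, so treat the byte order as an assumption; the function names are hypothetical.

```python
import struct

INDEX_BLOCK_LEN = 12

def pack_index_block(start_ip: int, end_ip: int, data_addr: int, data_len: int) -> bytes:
    """Pack one index block: start IP, end IP, 3-byte address plus 1-byte length."""
    if not (0 <= data_addr < 1 << 24 and 0 <= data_len < 1 << 8):
        raise ValueError("address must fit in 3 bytes and length in 1 byte")
    # Little-endian (assumed): the low three bytes hold the address,
    # the last byte holds the length.
    return struct.pack("<III", start_ip, end_ip, (data_len << 24) | data_addr)

def unpack_index_block(block: bytes):
    """Return (start_ip, end_ip, data_addr, data_len) from a 12-byte block."""
    start_ip, end_ip, info = struct.unpack("<III", block)
    return start_ip, end_ip, info & 0x00FFFFFF, info >> 24

# 1.0.8.0 .. 1.0.15.255 pointing at 35 bytes of data at file offset 16392
block = pack_index_block(16779264, 16781311, 16392, 35)
print(len(block), unpack_index_block(block))
```

One consequence of this layout is that a data address cannot exceed 2^24 - 1 (about 16 MB) and a single location record cannot exceed 255 bytes.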

IP library generation

The code that generates ip2region.db is provided in the ip2region GitHub repository. It is written in Java, and its class diagram is as follows:
[Figure: class diagram of the ip2region.db generator (https://blog.haojunyu.com/imgs/ip2region_class.svg)]

Based on a reading of the source code that generates ip2region.db, the generation process is briefly as follows:

  1. Reserve the 8-byte super block and the 2048*8-byte header index area at the start of the file through RandomAccessFile.
  2. Scan the ip.merge.txt file and process each record as follows:
    From the start IP, end IP, and data of the record, generate an index block: the first four bytes store the start IP, the middle four bytes store the end IP, and the last four bytes store the computed data address. The data itself is written through RandomAccessFile, and a dictionary from location information to file position is maintained so that the same location information is written only once. The index block is kept temporarily in the indexPool linked list. After this step, all location information in the data area has been determined.
  3. After all records in ip.merge.txt have been scanned, write all index blocks in indexPool after the data area. During this process, every int(1024*8/12-1)=681 index blocks form an index partition, and the start IP and address of the first index block of each partition (a header block) are recorded and kept temporarily in the headerPool linked list. The start and end positions of the index area are recorded as well.
  4. Move RandomAccessFile back to the beginning of the file. Write the start position of the index area into the first four bytes of the super block and the end position into the last four bytes.
  5. Continue by writing the header blocks in headerPool into the header index area.
  6. Move RandomAccessFile to the end of the file and write the timestamp and copyright information.
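The steps above can be sketched in Python as follows. This is not a port of the Java generator: it assumes little-endian integers, takes already parsed (start_ip, end_ip, location) tuples instead of reading ip.merge.txt, and omits step 6. The function name build_db is hypothetical.

```python
import io
import struct

SUPER_BLOCK_LEN = 8
HEADER_AREA_LEN = 2048 * 8
INDEX_BLOCK_LEN = 12
BLOCKS_PER_PARTITION = 1024 * 8 // 12 - 1   # 681

def build_db(segments, out):
    """Write sorted, non-empty (start_ip, end_ip, location) segments to a seekable file."""
    # Step 1: reserve the super block and the header index area.
    out.write(b"\x00" * (SUPER_BLOCK_LEN + HEADER_AREA_LEN))

    # Step 2: write each distinct location once and collect the index blocks.
    seen = {}            # location -> (address, length)
    index_pool = []
    for start_ip, end_ip, location in segments:
        if location not in seen:
            data = location.encode("utf-8")
            seen[location] = (out.tell(), len(data))
            out.write(data)
        addr, length = seen[location]
        if addr > 0xFFFFFF or length > 0xFF:
            raise ValueError("data address or length does not fit the index block")
        index_pool.append(struct.pack("<III", start_ip, end_ip, (length << 24) | addr))

    # Step 3: write the index area and record one header block per partition.
    index_start = out.tell()
    header_pool = []
    for i, block in enumerate(index_pool):
        if i % BLOCKS_PER_PARTITION == 0:
            first_ip = struct.unpack_from("<I", block)[0]
            header_pool.append(struct.pack("<II", first_ip, out.tell()))
        out.write(block)
    index_end = out.tell() - INDEX_BLOCK_LEN     # head of the last index block
    if len(header_pool) > 2048:
        raise ValueError("too many partitions for the header index area")

    # Steps 4 and 5: fill in the super block, then the header index area.
    out.seek(0)
    out.write(struct.pack("<II", index_start, index_end))
    out.write(b"".join(header_pool))
    return index_start, index_end

out = io.BytesIO()
index_start, index_end = build_db([(0, 9, "A|B"), (10, 19, "C|D"), (20, 29, "A|B")], out)
print(index_start, index_end, len(out.getvalue()))
```

In this tiny example the first and third segments share one copy of "A|B" in the data area, and all three index blocks fall into a single partition.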

TIPS:
The ip2region repository also uses a global_region.csv file. This file has 5 columns (including line number, region, and zip code) that describe each region in detail, and this information can be attached to each location record in the data area.

Fast search

ip2region provides three query algorithms, and even the slowest answers a query within milliseconds. They are memory-based binary search, b+tree search, and binary search, listed from fastest to slowest. The search structure is shown below:
[Figure: ip2region search structure (ip2region_search.png)]

Binary search

The start and end positions of the index area can be read from the super block, each index block is 12 bytes, and the index blocks are sorted by increasing IP, so binary search can quickly find the location information. The steps are as follows:

  1. Convert the IP to an integer with the ip2long method.
  2. Read the super block to get the start and end positions of the index area; (end - start) / 12 + 1 is the total number of index blocks.
  3. Run binary search over the index blocks, comparing the queried IP with each block's start IP and end IP until the block that contains it is found. The last four bytes of that block give the data address and data length, which are then used to read the location information.
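The steps above can be sketched as follows, taking an already converted integer IP. The sketch assumes little-endian integers and uses a hypothetical helper, make_tiny_db, to build a small in-memory file with the described layout; it is an illustration, not the ip2region client code.

```python
import io
import struct

INDEX_BLOCK_LEN = 12

def make_tiny_db(segments):
    """Hypothetical helper: super block, empty header area, data area, index area."""
    buf = io.BytesIO()
    buf.write(b"\x00" * (8 + 2048 * 8))
    blocks = []
    for start_ip, end_ip, location in segments:
        data = location.encode("utf-8")
        blocks.append(struct.pack("<III", start_ip, end_ip, (len(data) << 24) | buf.tell()))
        buf.write(data)
    first = buf.tell()
    buf.write(b"".join(blocks))
    last = buf.tell() - INDEX_BLOCK_LEN
    buf.seek(0)
    buf.write(struct.pack("<II", first, last))
    return buf

def binary_search(db, ip: int):
    """Return the location for integer ip, or None if no segment contains it."""
    db.seek(0)
    first, last = struct.unpack("<II", db.read(8))        # super block
    lo, hi = 0, (last - first) // INDEX_BLOCK_LEN          # hi = total blocks - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        db.seek(first + mid * INDEX_BLOCK_LEN)             # one file read per probe
        start_ip, end_ip, info = struct.unpack("<III", db.read(INDEX_BLOCK_LEN))
        if ip < start_ip:
            hi = mid - 1
        elif ip > end_ip:
            lo = mid + 1
        else:
            db.seek(info & 0x00FFFFFF)                     # 3-byte data address
            return db.read(info >> 24).decode("utf-8")     # 1-byte data length
    return None

db = make_tiny_db([(0, 9, "A"), (10, 19, "B"), (30, 39, "C")])
print(binary_search(db, 15), binary_search(db, 25))
```

Every probe of the loop seeks and reads from the file, which is why this variant is the slowest of the three.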

b+tree search

The b+tree search uses the header index area. It first runs a binary search over the header index area to locate an index partition, and then runs a binary search inside that partition. It is faster than plain binary search because it needs far fewer disk reads. The steps are as follows:

  1. Convert the IP to an integer with ip2long.
  2. Run binary search over the header index area to find the header index block, and hence the index partition, that covers the IP.
  3. Read that index partition and run binary search inside it to find the matching index block and obtain the location information.
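A two-level lookup along these lines might look like the sketch below. It assumes little-endian integers and that unused header entries are zero-filled, and it makes the partition size a parameter so that a five-segment example spans several partitions (the real value described above is 681). make_db and btree_search are hypothetical names.

```python
import bisect
import io
import struct

INDEX_BLOCK_LEN = 12
HEADER_AREA_LEN = 2048 * 8

def make_db(segments, per_partition):
    """Hypothetical helper: build a tiny in-memory file including the header index."""
    buf = io.BytesIO()
    buf.write(b"\x00" * (8 + HEADER_AREA_LEN))
    blocks = []
    for start_ip, end_ip, location in segments:
        data = location.encode("utf-8")
        blocks.append(struct.pack("<III", start_ip, end_ip, (len(data) << 24) | buf.tell()))
        buf.write(data)
    first = buf.tell()
    headers = [struct.pack("<II", segments[i][0], first + i * INDEX_BLOCK_LEN)
               for i in range(0, len(segments), per_partition)]
    buf.write(b"".join(blocks))
    last = buf.tell() - INDEX_BLOCK_LEN
    buf.seek(0)
    buf.write(struct.pack("<II", first, last) + b"".join(headers))
    return buf

def btree_search(db, ip, per_partition):
    # Read the super block and the whole header index area, then search in memory.
    db.seek(0)
    first, last = struct.unpack("<II", db.read(8))
    entries = [e for e in struct.iter_unpack("<II", db.read(HEADER_AREA_LEN)) if e[1] != 0]
    part = bisect.bisect_right([e[0] for e in entries], ip) - 1
    if part < 0:
        return None
    # Read the selected partition in one go and binary search inside it.
    begin = entries[part][1]
    end = min(begin + per_partition * INDEX_BLOCK_LEN, last + INDEX_BLOCK_LEN)
    db.seek(begin)
    chunk = db.read(end - begin)
    lo, hi = 0, len(chunk) // INDEX_BLOCK_LEN - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start_ip, end_ip, info = struct.unpack_from("<III", chunk, mid * INDEX_BLOCK_LEN)
        if ip < start_ip:
            hi = mid - 1
        elif ip > end_ip:
            lo = mid + 1
        else:
            db.seek(info & 0x00FFFFFF)
            return db.read(info >> 24).decode("utf-8")
    return None

segments = [(0, 9, "A"), (10, 19, "B"), (20, 29, "C"), (30, 39, "D"), (50, 59, "E")]
db = make_db(segments, per_partition=2)
print(btree_search(db, 35, 2), btree_search(db, 45, 2))
```

Compared with plain binary search, the file is touched a fixed number of times per query (header area, one partition, one data record), regardless of how many index blocks exist.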

Memory-based binary search

This method is the same as binary search, except that it reads the whole ip2region.db file into memory once and searches there, whereas binary search reads from the ip2region.db file on every probe.
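A sketch of the memory-based variant is shown below. The class name MemorySearcher is hypothetical, little-endian integers are assumed, and the example bytes are built by hand; with a real file the bytes would come from reading ip2region.db once at startup.

```python
import struct

INDEX_BLOCK_LEN = 12

class MemorySearcher:
    """Binary search over a database image held entirely in memory."""

    def __init__(self, raw: bytes):
        self.raw = raw
        self.first, last = struct.unpack_from("<II", raw, 0)        # super block
        self.count = (last - self.first) // INDEX_BLOCK_LEN + 1     # total index blocks

    def search(self, ip: int):
        lo, hi = 0, self.count - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            offset = self.first + mid * INDEX_BLOCK_LEN
            start_ip, end_ip, info = struct.unpack_from("<III", self.raw, offset)
            if ip < start_ip:
                hi = mid - 1
            elif ip > end_ip:
                lo = mid + 1
            else:
                addr, length = info & 0x00FFFFFF, info >> 24
                return self.raw[addr:addr + length].decode("utf-8")  # no file access
        return None

# Hand-built image: super block, zeroed header area, data "AB", two index blocks.
header_area = b"\x00" * (2048 * 8)
data_start = 8 + len(header_area)
data = b"AB"
index = (struct.pack("<III", 0, 9, (1 << 24) | data_start)
         + struct.pack("<III", 10, 19, (1 << 24) | (data_start + 1)))
first = data_start + len(data)
raw = struct.pack("<II", first, first + INDEX_BLOCK_LEN) + header_area + data + index

searcher = MemorySearcher(raw)
print(searcher.search(5), searcher.search(12), searcher.search(20))
```

The trade-off is memory for speed: the whole file (about 8.1 MB according to this article) stays resident, and each query becomes a handful of in-memory comparisons.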

Summary

The ip2region library solves just one very common problem, IP geolocation, but it does so with a small footprint and fast lookups (and it also ships clients for many languages), which has earned it 8.4k stars on GitHub.

Small memory usage

  1. Adjacent IPs usually share the same location information, so records are stored per IP segment rather than per IP, which avoids storing the same location repeatedly.
  2. IPs are converted to integers: the string 111.111.111.111 becomes int(1869573999), shrinking from 15 bytes to 4 bytes.
  3. Different IP segments can also share the same location information, so index entries point to a single stored copy of each location (a dictionary built during the full scan ensures each one is saved only once).

Fast search

  1. The IP segments are ordered, so binary search brings the lookup time down to O(log N).
  2. The header index area, a secondary index, reduces the number of disk reads: first determine the index partition, then find the index block within that partition, and finally read the location data.

Multilingual client support

Clients are available for Java, C#, PHP, C, Python, Node.js, PHP extension (php5 and php7), Golang, Rust, Lua, lua_c, and nginx.

References

  1. ip2region database file structure and principle
  2. ip2region source code
  3. Wikipedia: IPv4
  4. List of IPv4 address allocations by country
  5. Gaode Map API
  6. Baidu Map API


Origin blog.csdn.net/haojunyu2012/article/details/112797753