Fat Tiger once asked me: "How do you run a fuzzy query on encrypted data?"

Encrypted data is not friendly to fuzzy queries. This article walks through several ways to implement fuzzy queries on encrypted data, in the hope of giving you some ideas.

For data security, we often encrypt important data before storing it. Common examples include passwords, mobile phone numbers, landline numbers, detailed addresses, bank card numbers, and credit card verification codes. These kinds of data have different requirements for encryption and decryption. Passwords, for example, are generally stored with an irreversible slow hash algorithm. A slow hash resists brute-force cracking by trading time for security. Passwords never need decryption or fuzzy search: we simply compare the stored ciphertext for an exact match.

That does not work for a mobile phone number, because we need to display the original number and we need to support fuzzy search on it. So today we look at what options exist for fuzzy queries on data that is encrypted reversibly.

I group the ways to fuzzy-query encrypted data into three categories:

  • Sand sculpture methods (Chinese internet slang for silly approaches: just get the feature working and never think any deeper about the problem)
  • Conventional practice (consider query performance, trade some storage space for speed, and so on)
  • Supernatural approach (the high-end option, which tackles the problem at the algorithm level)

Let's go through the idea behind each category and its advantages and disadvantages, one by one, starting with the sand sculpture methods.

1. Sand sculpture method

  • Load all the data into memory, decrypt it, and then do the fuzzy matching in application code
  • Keep a plaintext mapping table for the ciphertext (commonly called a tag table), run the fuzzy query against the tags, and join back to the ciphertext rows

Sand sculpture one

Let's look at the first approach: load all the data into memory and decrypt it. With a small amount of data this works, and it is simple and cheap. With a large amount of data it is a disaster. A rough calculation: an English letter (upper or lower case) occupies one byte, and a Chinese character occupies two bytes (in an encoding such as GBK). Using DES as an example, 13800138000 encrypts to the string HE9T75xNx6c5yLmS5l4r6Q==, which occupies 24 bytes.

Rows           Bytes          MB
1 million      24 million     22.89
10 million     240 million    228.88
100 million    2.4 billion    2288.82

The ciphertext alone ranges from hundreds of megabytes to several gigabytes, enough to drive the application out of memory within minutes. With a few hundred, a few thousand, or tens of thousands of rows this approach is perfectly workable, but it is strongly discouraged for large amounts of data.
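The rough calculation above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, 24 bytes per value is the DES example from the text, and the result counts raw ciphertext only. Decrypted copies and per-object overhead in a real program would add to it.

```python
BYTES_PER_VALUE = 24  # length of the Base64 DES ciphertext in the example above


def ciphertext_megabytes(rows: int, bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Raw size in MB (1 MB = 1024 * 1024 bytes) of `rows` encrypted values."""
    return rows * bytes_per_value / (1024 * 1024)


if __name__ == "__main__":
    for rows in (1_000_000, 10_000_000, 100_000_000):
        print(f"{rows:>11,} rows -> {ciphertext_megabytes(rows):8.2f} MB")
```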

Sand sculpture two

Now the second method: map the ciphertext to a plaintext mapping table, then run the fuzzy query against that table to find the associated ciphertext rows. What?! Then why encrypt the data at all? It would be simpler not to encrypt it in the first place. We encrypt data because there is a security requirement, and adding a plaintext mapping table violates that requirement. This approach is neither safe nor convenient.

2. Conventional practice

Now the conventional approaches, which are also the most widely used. They satisfy the data security requirement and stay query-friendly.

  • Implement the encryption and decryption functions in the database and use them in the fuzzy query: decode(key) like '%partial%'
  • Split the plaintext into fixed-length segments, encrypt each segment separately, and store the results in an extension column. When querying, use key like '%partial%'

Routine one

Implement in the database the same encryption and decryption algorithm that the application uses, then change the fuzzy query condition so that the database function decrypts the column first and the fuzzy match runs on the result. The advantage is low implementation, development, and usage cost: a small change to the existing fuzzy search is enough. The disadvantage is just as obvious: because the column is wrapped in a function, the query cannot use a database index. Some databases may also be unable to reproduce exactly the algorithm the application uses, although for conventional algorithms consistency with the application can be guaranteed. If the query performance requirements are not particularly high and the data security requirements are moderate, using a common algorithm such as AES or DES this way is a reasonable choice. If the company uses its own algorithm and there is no implementation of it for multiple platforms, either find someone strong in algorithms to study it and complete the port, or give up on this method.
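Here is a minimal sketch of routine one, with several loudly stated assumptions: SQLite (through Python's sqlite3 module) stands in for whatever database you actually run, the table and column names are hypothetical, and a toy XOR plus Base64 cipher stands in for AES or DES because the Python standard library ships neither. The toy cipher is not secure; it only keeps the example self-contained. The point is the shape of the query: the decrypt function is registered in the database under the name decode, and the LIKE runs on its output.

```python
import base64
import sqlite3

KEY = b"demo-key"  # hypothetical key, for illustration only


def _xor(data: bytes) -> bytes:
    """XOR every byte with the repeating key (toy stand-in, NOT real encryption)."""
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(data))


def toy_encrypt(plaintext: str) -> str:
    return base64.b64encode(_xor(plaintext.encode("utf-8"))).decode("ascii")


def toy_decrypt(ciphertext: str) -> str:
    return _xor(base64.b64decode(ciphertext)).decode("utf-8")


conn = sqlite3.connect(":memory:")
# Make the application's decrypt routine callable from SQL as decode(...).
conn.create_function("decode", 1, toy_decrypt)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, phone_enc TEXT)")
conn.executemany(
    "INSERT INTO users (phone_enc) VALUES (?)",
    [(toy_encrypt(p),) for p in ("13800138000", "13912345678", "15000138111")],
)


def fuzzy_search(partial: str) -> list[str]:
    """Decrypt inside the query, then LIKE on the plaintext (scans every row)."""
    rows = conn.execute(
        "SELECT phone_enc FROM users WHERE decode(phone_enc) LIKE ? ORDER BY id",
        ("%" + partial + "%",),
    ).fetchall()
    return [toy_decrypt(enc) for (enc,) in rows]
```

Calling fuzzy_search("0138") decrypts every stored value inside the query, which is exactly why this routine cannot benefit from an index.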

Routine two

Split the plaintext into overlapping fixed-length segments, encrypt each segment separately, store the results in an extension column, and query with key like '%partial%'. This is a relatively cost-effective method. Let's analyze the idea first. Characters are grouped with a fixed length, so one field is split into several segments. For example, 4 English characters (half-width) or 2 Chinese characters (full-width) form one searchable unit. For example:

Take ningyu1 with 4 characters per group: the first group is ning, the second is ingy, the third is ngyu, the fourth is gyu1, and longer values continue in the same way. To retrieve all rows that contain a 4-character search term such as ingy, encrypt the term and then query the database with key like '%partial%', where partial is the encrypted term.
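The segmentation step can be sketched as follows. Everything here is illustrative rather than any platform's actual scheme: the function names and the key are made up, and a truncated HMAC-SHA256 encoded as Base64 stands in for the deterministic cipher, since all the extension column needs is that the same segment under the same key always yields the same token. Substitute your real segment encryption for token().

```python
import base64
import hashlib
import hmac

SECRET = b"demo-key"  # hypothetical key, for illustration only


def ngrams(text: str, size: int = 4) -> list[str]:
    """Every run of `size` consecutive characters (sliding window)."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def token(segment: str) -> str:
    """Deterministic keyed token for one segment.

    Stand-in for the real cipher: HMAC-SHA256 truncated to 8 bytes, which
    Base64 encodes as 12 characters ending in '='.
    """
    digest = hmac.new(SECRET, segment.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest[:8]).decode("ascii")


def index_value(plaintext: str, size: int = 4) -> str:
    """Value for the extension column: the tokens of all segments, concatenated."""
    return "".join(token(seg) for seg in ngrams(plaintext, size))


def like_pattern(term: str) -> str:
    """Pattern for: extension_column LIKE <pattern>."""
    return "%" + token(term) + "%"


if __name__ == "__main__":
    print(ngrams("ningyu1"))       # the four groups from the example above
    print(index_value("ningyu1"))  # what would be stored in the extension column
    print(like_pattern("ingy"))    # what the query would search for
```

A search term of exactly the segment length maps to one token, so the LIKE only has to find that token inside the stored value.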

We all know that ciphertext is longer than the plaintext, and storing that extra length is the additional cost we pay: a classic trade of space for speed. How much the ciphertext grows depends on the algorithm. With DES, for example, 13800138000 occupies 11 bytes before encryption and the encrypted string HE9T75xNx6c5yLmS5l4r6Q== occupies 24 bytes, a growth factor of 2.18.
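To get a feel for the storage cost, the length of the ciphertext can be estimated without running the cipher. The sketch below assumes DES with its 8-byte block, PKCS#5 padding, and Base64 output. That combination is an assumption, chosen because it is consistent with the 11-byte to 24-byte example above. The function names are made up for illustration.

```python
import math

DES_BLOCK = 8  # DES block size in bytes


def des_base64_length(plaintext_bytes: int) -> int:
    """Length of the Base64 text produced by encrypting `plaintext_bytes` bytes."""
    # PKCS#5 always pads, adding between 1 and 8 bytes to reach a full block.
    padded = (plaintext_bytes // DES_BLOCK + 1) * DES_BLOCK
    # Base64 emits 4 characters for every 3 input bytes, rounded up.
    return 4 * math.ceil(padded / 3)


def extension_column_length(plaintext_bytes: int, segment_bytes: int = 4) -> int:
    """Total length when every sliding-window segment is encrypted separately."""
    segments = max(plaintext_bytes - segment_bytes + 1, 0)
    return segments * des_base64_length(segment_bytes)


if __name__ == "__main__":
    print(des_base64_length(11))        # whole phone number
    print(extension_column_length(11))  # all 4-character segments of it
```

Under these assumptions an 11-digit phone number yields 8 segments of 12 characters each, so the extension column is several times larger than the encrypted number itself.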

Back to the topic. Although this method supports fuzzy queries on encrypted data, it constrains the length of the search term. In the example above, the plaintext search term must be at least 4 English letters or digits, or 2 Chinese characters. Shorter segment lengths are not recommended: the number of segments grows, which raises the storage cost and lowers security. Have you ever integrated with the APIs of Taobao, Pinduoduo, or JD? They encrypt the sensitive user data in the platform's order data and still support fuzzy queries, and this is the method they use. Below are the ciphertext field retrieval schemes published by several e-commerce platforms. If you are interested, follow the links.

  • Taobao ciphertext field retrieval scheme: https://open.taobao.com/docV3.htm?docId=106213&docType=1
  • Alibaba ciphertext field retrieval scheme: https://jaq-doc.alibaba.com/docs/doc.htm?treeId=1&articleId=106213&docType=1
  • Pinduoduo ciphertext field retrieval scheme: https://open.pinduoduo.com/application/document/browse?idStr=3407B605226E77F2
  • Jingdong ciphertext field retrieval scheme: https://jos.jd.com/commondoc?listId=345

The advantage of this method is that it is neither complicated to implement nor hard to use. It is a compromise: the extension column increases the storage cost, but the query can be optimized with a database index. This method is recommended.

3. Supernatural approach

Finally, the supernatural approaches. These are harder, and they all work at the algorithm level. Some even design a new algorithm to support fuzzy search. Ready-made algorithms exist as references, but most of them are half-finished and cannot be used directly, so someone still has to study them in depth and integrate them into the application.

This level is mostly the research field of professional algorithm engineers. Designing an algorithm that is order-preserving, reversible, and does not let the ciphertext length grow too fast is no simple matter. The general idea is to encrypt and decrypt by character-level transcoding, so that the ciphertext keeps the same order as the plaintext and fuzzy matching can run directly on the ciphertext. I am not an expert in this field and have not researched it further, so here is some material from the Internet for reference.

  • Fuzzy matching encryption method for character data in the database: https://www.jiamisoft.com/blog/6542-zifushujumohupipeijiamifangfa.html

The Hill cipher processing and the fuzzy matching encryption scheme FMES described in that article deserve particular attention.

  • How to encrypt a database while supporting fast queries: https://www.jiamisoft.com/blog/5961-kuaisuchaxunshujukujiami.html
  • Lucene-based cloud search and fuzzy query based on ciphertext: https://www.cnblogs.com/arthurqin/p/6307153.html

The Lucene-based idea is similar to routine two of the conventional practice above: the characters are split into segments of equal length, and the resulting segments are encrypted and stored. Only the storage differs: one uses a relational database, the other the ES (Elasticsearch) search engine.

That covers all the retrieval schemes for encrypted data. We started with the sand sculpture methods that can be found all over the Internet. They are not recommended and should be avoided as much as possible. If the company has professional algorithm talent, the algorithm-level supernatural methods are worth considering. Overall, in terms of input-output ratio and of implementation and usage cost, routine two of the conventional practice is the one to recommend.

Origin blog.csdn.net/qq_28165595/article/details/131586944