spark-sql: Custom UDFs to encrypt and decrypt sensitive fields

Requirements

Some user data contains information such as the user's mobile phone number, which is illegal to expose directly, so the data needs to be desensitized. Simply replacing the mobile phone number with *** means the phone number is lost, because *** cannot be turned back into the original number. A custom UDF is therefore needed to encrypt and decrypt sensitive data.

Two UDFs are implemented here, one for encryption and one for decryption. AES encryption is implemented with Java's built-in crypto module (javax.crypto).

In the code, the SecureRandom seed in both UDFs is hard-coded, so the generated key, and therefore the ciphertext for a given input, is always the same, and the decryption UDF can recover the original data.
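The effect of the fixed seed can be checked in isolation. The following is a minimal sketch, not part of the UDFs themselves: the class name FixedSeedKeyDemo and the helper deriveKey are made up for illustration, and the sketch assumes the JDK's built-in SHA1PRNG, which is deterministic when it is seeded before any output is requested.

```java
import java.nio.charset.StandardCharsets;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.KeyGenerator;

// Hypothetical demo class: shows that a fixed seed yields a fixed AES key.
class FixedSeedKeyDemo {

    // Derives a 128-bit AES key from a seed string, the same way the UDFs do.
    static byte[] deriveKey(String seed) {
        try {
            SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
            // Seeding before the first output replaces the self-seeding step.
            random.setSeed(seed.getBytes(StandardCharsets.UTF_8));
            KeyGenerator keygen = KeyGenerator.getInstance("AES");
            keygen.init(128, random);
            return keygen.generateKey().getEncoded();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] first = deriveKey("default");
        byte[] second = deriveKey("default");
        System.out.println("key length in bytes: " + first.length);
        System.out.println("same key both times: " + Arrays.equals(first, second));
    }
}
```

Because both UDFs derive the key from the same seed string, whoever knows that string can decrypt the data, so the seed has to be protected like a password.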

A custom spark-sql UDF normally implements one of the interfaces UDF1, UDF2, and so on, where the number is the number of parameters the UDF takes. However, spark-sql can also use the Hive function library, so defining a Hive UDF directly works too. I implemented the functions as Hive UDFs.

 

( The Advanced Encryption Standard (AES) is a block cipher standard. It was adopted to replace the original DES, has been analyzed by many parties, and is widely used all over the world.

So why was DES replaced? It uses a 56-bit key, which is relatively easy to crack. AES supports 128-, 192-, and 256-bit keys and encrypts and decrypts data in 128-bit blocks, which is considerably safer. In theory, a perfect encryption algorithm can only be broken by exhaustive search. Exhaustive search against a key of 128 bits or more is unrealistic and remains only a theoretical possibility: even with the fastest computer in the world, exhausting a 128-bit key space would take billions of years, let alone a 256-bit one.

Organizations around the world are still studying how to break AES. Because cracking still takes far too long, AES remains secure, but the time required is shrinking. As computers get faster and new algorithms emerge, attacks on AES will only intensify and will not stop.

AES is now widely used in finance, online transactions, wireless communication, digital storage, and other fields, and has withstood the most rigorous tests, but perhaps someday it will follow in the footsteps of DES. )
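To put the brute-force estimate above into numbers, here is a rough back-of-the-envelope sketch. The guessing rate of 10^18 keys per second is an assumption chosen purely for illustration, not a measured figure, and the class name BruteForceEstimate is made up.

```java
import java.math.BigInteger;

// Hypothetical helper: how long an exhaustive key search would take.
class BruteForceEstimate {

    // Whole years needed to try every key of the given length at a fixed rate.
    static BigInteger yearsToExhaust(int keyBits, BigInteger keysPerSecond) {
        BigInteger keySpace = BigInteger.valueOf(2).pow(keyBits);
        BigInteger secondsPerYear = BigInteger.valueOf(31_557_600L); // 365.25 days
        return keySpace.divide(keysPerSecond.multiply(secondsPerYear));
    }

    public static void main(String[] args) {
        // Assumed rate: 10^18 key trials per second (illustrative only).
        BigInteger rate = BigInteger.TEN.pow(18);
        System.out.println("56-bit key:  " + yearsToExhaust(56, rate) + " whole years");
        System.out.println("128-bit key: " + yearsToExhaust(128, rate) + " whole years");
    }
}
```

At that assumed rate a 56-bit key space is exhausted in well under a year, while a 128-bit key space takes on the order of 10^13 years.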

Implementation

Encryption UDF implementation code:

package com.zixuan.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Base64;

public class EncryptionUDF extends UDF {
    Cipher cipher;
    // Seed string; any value works, but encryption and decryption must use the same one, otherwise decryption fails
    String encodeRules = "default";
    {
        try {
            // 1. Create a key generator for the AES algorithm (the name is case-insensitive)
            KeyGenerator keygen = KeyGenerator.getInstance("AES");
            // 2. Initialize the key generator from encodeRules:
            //    a 128-bit key drawn from a random source seeded with the given byte array
            SecureRandom secureRandom = SecureRandom.getInstance("SHA1PRNG");
            secureRandom.setSeed(encodeRules.getBytes());
            keygen.init(128, secureRandom);
            // 3. Generate the raw symmetric key
            SecretKey original_key = keygen.generateKey();
            // 4. Get the byte array of the raw symmetric key
            byte[] raw = original_key.getEncoded();
            // 5. Build the AES key from the byte array
            SecretKey key = new SecretKeySpec(raw, "AES");
            // 6. Create the cipher for the AES algorithm
            cipher = Cipher.getInstance("AES");
            // 7. Initialize the cipher: the first argument is the operation (ENCRYPT_MODE or DECRYPT_MODE), the second is the key
            cipher.init(Cipher.ENCRYPT_MODE, key);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String evaluate(String str){
        try {
            // Convert the string to encrypt into a byte array
            byte[] byte_encode = str.getBytes("utf-8");
            // Encrypt
            byte[] byte_AES = cipher.doFinal(byte_encode);
            // java.util.Base64 replaces the JDK-internal sun.misc.BASE64Encoder, which newer JDKs no longer ship
            return Base64.getEncoder().encodeToString(byte_AES);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
/**
    public static void main(String[] args) {
        EncryptionUDF encryptionUDF = new EncryptionUDF();
        System.out.println(encryptionUDF.evaluate("MCWABKVKBD"));
    }
 */
}

Decryption UDF implementation code:

package com.zixuan.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Base64;

public class DecryptUDF extends UDF {
    Cipher cipher;
    // Seed string; any value works, but encryption and decryption must use the same one, otherwise decryption fails
    String encodeRules = "default";
    {
        try {
            // 1. Create a key generator for the AES algorithm (the name is case-insensitive)
            KeyGenerator keygen = KeyGenerator.getInstance("AES");
            // 2. Initialize the key generator from encodeRules:
            //    a 128-bit key drawn from a random source seeded with the given byte array
            SecureRandom secureRandom = SecureRandom.getInstance("SHA1PRNG");
            secureRandom.setSeed(encodeRules.getBytes());
            keygen.init(128, secureRandom);
            // 3. Generate the raw symmetric key
            SecretKey original_key = keygen.generateKey();
            // 4. Get the byte array of the raw symmetric key
            byte[] raw = original_key.getEncoded();
            // 5. Build the AES key from the byte array
            SecretKey key = new SecretKeySpec(raw, "AES");
            // 6. Create the cipher for the AES algorithm
            cipher = Cipher.getInstance("AES");
            // 7. Initialize the cipher: the first argument is the operation (ENCRYPT_MODE or DECRYPT_MODE), the second is the key
            cipher.init(Cipher.DECRYPT_MODE, key);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String evaluate(String str){
        try {
            // Decode the Base64 ciphertext into a byte array; the MIME decoder also
            // accepts line-wrapped input such as sun.misc.BASE64Encoder produced
            byte[] byte_content = Base64.getMimeDecoder().decode(str);
            // Decrypt
            byte[] byte_decode = cipher.doFinal(byte_content);
            return new String(byte_decode, "utf-8");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
/**
    public static void main(String[] args) {
        DecryptUDF decryptUDF = new DecryptUDF();
        System.out.println(decryptUDF.evaluate("dZlVbUSx/TLCeddjpPatBw=="));
    }
 */
}
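Before packaging the jar, the encrypt/decrypt pair can be sanity-checked without Hive on the classpath. The following is a minimal sketch that mirrors the key derivation of the two UDFs in plain Java. The class name AesRoundTripDemo and its helper methods are made up for illustration, and it uses java.util.Base64 for the Base64 step.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical standalone check of the AES round trip used by the UDFs.
class AesRoundTripDemo {

    // Builds an AES cipher whose key is derived from the seed string.
    static Cipher newCipher(int mode, String seed) throws GeneralSecurityException {
        SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
        random.setSeed(seed.getBytes(StandardCharsets.UTF_8));
        KeyGenerator keygen = KeyGenerator.getInstance("AES");
        keygen.init(128, random);
        SecretKeySpec key = new SecretKeySpec(keygen.generateKey().getEncoded(), "AES");
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(mode, key);
        return cipher;
    }

    // Encrypts the text and returns the ciphertext as Base64.
    static String encrypt(String plain, String seed) {
        try {
            byte[] encrypted = newCipher(Cipher.ENCRYPT_MODE, seed)
                    .doFinal(plain.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(encrypted);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decodes the Base64 ciphertext and decrypts it back to text.
    static String decrypt(String cipherText, String seed) {
        try {
            byte[] decrypted = newCipher(Cipher.DECRYPT_MODE, seed)
                    .doFinal(Base64.getDecoder().decode(cipherText));
            return new String(decrypted, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String cipherText = encrypt("MCWABKVKBD", "default");
        System.out.println(cipherText);
        System.out.println(decrypt(cipherText, "default"));
    }
}
```

Unlike the UDFs, this sketch creates a fresh Cipher for every call instead of sharing one instance, which keeps the helpers independent of each other at the cost of repeating the key derivation.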

Testing

Create temporary functions

Add the jar package:

add jar /root/udf-1.0-SNAPSHOT.jar;

Create the temporary functions:

-- Encrypt
create temporary function encryption as 'com.zixuan.hive.udf.EncryptionUDF';

-- Decrypt
create temporary function decryptUDF as 'com.zixuan.hive.udf.DecryptUDF';

Run the test

Raw data:

select ChannelMemoID from mdm_channel_memo where ChannelMemoID='MCWABKVKBD';

MCWABKVKBD

After encryption:

select encryption(ChannelMemoID) from ods.mdm_channel_memo where ChannelMemoID='MCWABKVKBD';

dZlVbUSx/TLCeddjpPatBw==

Decrypt the encrypted data:

select decryptUDF('dZlVbUSx/TLCeddjpPatBw==');

MCWABKVKBD

Pitfalls encountered

The test originally passed on Windows, but once the jar was packaged and deployed to Linux for the real test, decryption always failed with an error. Investigation showed that encrypting the same data on Windows always produced the same result, whereas on Linux the ciphertext was different every time.

The original code was:


            KeyGenerator keygen=KeyGenerator.getInstance("AES");
            // 2. Initialize the key generator from encodeRules:
            //    a 128-bit key drawn from a random source seeded with the given byte array
            keygen.init(128, new SecureRandom(encodeRules.getBytes())); // this line is the faulty code
            SecretKey original_key=keygen.generateKey();
            byte [] raw=original_key.getEncoded();
            SecretKey key=new SecretKeySpec(raw, "AES");
            Cipher cipher=Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte [] byte_encode=content.getBytes("utf-8");
            byte [] byte_AES=cipher.doFinal(byte_encode);

Reason: the SecureRandom implementation chosen by new SecureRandom(...) depends on the operating system and follows the operating system's own internal state, unless the caller obtains the instance with getInstance and then calls setSeed. With the seed constructor, the generated key is the same every time on Windows but differs on Solaris and some Linux systems, where the platform default typically mixes the supplied seed into operating-system entropy instead of replacing it. For a detailed introduction to the SecureRandom class, see http://yangzb.iteye.com/blog/325264

In other words, new SecureRandom(encodeRules.getBytes()) gives the same result every time on Windows but a different result on Linux. Replacing the faulty line with the following code solves the problem:

            SecureRandom secureRandom = SecureRandom.getInstance("SHA1PRNG") ;
            secureRandom.setSeed(encodeRules.getBytes());
            keygen.init(128, secureRandom);
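To see which behavior a given machine has, the two ways of constructing the generator can be compared directly. This is a diagnostic sketch under assumptions: the class name SecureRandomCheck and its helpers are made up, and the result of the platform-default lines depends on the JDK and operating system, so no output is shown for them.

```java
import java.nio.charset.StandardCharsets;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical diagnostic: compares the seed constructor with explicit SHA1PRNG.
class SecureRandomCheck {

    // First 16 bytes from an explicitly requested, explicitly seeded SHA1PRNG.
    static byte[] explicitBytes(String seed) {
        try {
            SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
            random.setSeed(seed.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[16];
            random.nextBytes(out);
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // First 16 bytes from the platform-default implementation via the seed constructor.
    static byte[] defaultBytes(String seed) {
        SecureRandom random = new SecureRandom(seed.getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[16];
        random.nextBytes(out);
        return out;
    }

    public static void main(String[] args) {
        // Name of the algorithm the platform picks when none is requested.
        System.out.println("default algorithm: " + new SecureRandom().getAlgorithm());
        // Platform-dependent: repeatable on some systems, not on others.
        System.out.println("seed constructor repeatable: "
                + Arrays.equals(defaultBytes("default"), defaultBytes("default")));
        // Repeatable wherever SHA1PRNG is seeded before its first output.
        System.out.println("explicit SHA1PRNG repeatable: "
                + Arrays.equals(explicitBytes("default"), explicitBytes("default")));
    }
}
```

If the seed constructor line reports a non-repeatable result, any key derived through it will change between runs, which matches the decryption failures described above.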

 


Origin blog.csdn.net/x950913/article/details/107206226