Simple ID card recognition with Baidu PaddleOCR

1. Introduction

This article uses Baidu's PaddleOCR to recognize Chinese ID cards. The models are used as shipped, without any additional training on ID card data, and out of the box the recognition rate for ID cards of non-minority holders exceeds 85%. The photo of the card must be upright and reasonably sharp; otherwise the results are poor. The article covers what PaddleOCR is, how to install it, its basic usage, and how to post-process the recognition results. With post-processing, and provided the photo is upright and reasonably sharp, the recognition rate rises above 95% for non-minority ID cards and above 90% for minority ID cards other than Korean-script ones.

2. What is PaddleOCR?

See the official description: README_ch.md on the release/2.6 branch of the PaddlePaddle/PaddleOCR repository on GitHub.

2.1: Introduction

PaddleOCR is an open-source OCR project built on Baidu's deep learning framework PaddlePaddle. It aims to provide a rich, leading, and practical OCR toolkit that helps users train better models and put them into production. PaddleOCR includes a wide range of text detection, text recognition, and end-to-end algorithms.

2.2: PaddleOCR features

  • Ultra-lightweight Chinese OCR model; the complete model is only 8.6M
  • A single model recognizes mixed Chinese, English, and digits, as well as vertical text and long text
  • The 8.6M consists of the DB detection model (4.1M) plus the CRNN recognition model (4.5M)
  • Practical general-purpose Chinese OCR model
  • Several inference deployment options, including server-side and on-device deployment
  • Several text detection training algorithms: EAST, DB, SAST
  • Several text recognition training algorithms: Rosetta, CRNN, STAR-Net, RARE, SRN
  • Runs on Linux, Windows, macOS, and other systems

3. Installation process

3.1: Installation of paddlepaddle environment

Official installation guide: "Getting Started" on the PaddlePaddle website (PaddlePaddle: an open-source deep learning platform derived from industrial practice)

See the linked tutorial for details. Note that PaddlePaddle comes in a CPU version and a GPU version, and you need to choose the one that matches your machine.

3.1.1: The difference between CPU version and GPU version

The CPU version of PaddlePaddle uses the CPU for both inference and training. It runs without a GPU, which is useful for users who have no GPU device or need to keep costs down.

With the CPU version, training and inference are slower than with the GPU version, but it can still handle simple problems. During development it also lets users verify that a model works and how well it performs before moving on to more extensive training on a GPU.

In short, the CPU version of PaddlePaddle lets users develop and evaluate models without a GPU, which reduces development costs.
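To see which of the two versions is actually installed, a small check along the following lines may help. This is only a sketch: the helper name describe_paddle_device is made up for this article, and it assumes a PaddlePaddle 2.x release where paddle.device.is_compiled_with_cuda() and paddle.device.get_device() are available.

```python
import importlib.util


def describe_paddle_device():
    """Return a short description of the installed PaddlePaddle build."""
    # find_spec avoids an ImportError when PaddlePaddle is not installed.
    if importlib.util.find_spec("paddle") is None:
        return "PaddlePaddle is not installed"
    import paddle

    # A GPU build reports True here; a CPU build reports False.
    if paddle.device.is_compiled_with_cuda():
        return "GPU build, current device: " + paddle.device.get_device()
    return "CPU build"


print(describe_paddle_device())
```

On a machine without a GPU, the CPU build is the one to install, and the message above should then read "CPU build".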

3.1.2: References for installing CUDA and cuDNN

  • "TensorFlow [CPU/GPU, CUDA, cuDNN]: the most detailed installation guide on the web, commonly used Python package mirrors, TensorFlow deep learning and reinforcement learning tutorials"

  • "How to install paddlepaddle-gpu on Windows 10" (AI Duckling Academy blog, CSDN)


3.2: Install PaddleOCR whl package

The following command downloads from a PyPI mirror, which is very fast:

pip install --index-url https://pypi.douban.com/simple paddleocr==2.6.1.3   

2.6.1.3 is the version to install; leave it out to install the latest version.

If some dependencies time out during the download, install them separately with the following command, replacing <package-name> with the name of the library:

 pip install <package-name> -i https://pypi.tuna.tsinghua.edu.cn/simple
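After installation, the pinned version can be confirmed from Python. The sketch below uses only the standard library; the helper name installed_version is made up for illustration. Note that the GPU build of PaddlePaddle is distributed under the name paddlepaddle-gpu, so it would show up under that name instead of paddlepaddle.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("paddlepaddle", "paddlepaddle-gpu", "paddleocr"):
    print(name, installed_version(name) or "not installed")
```

If the pip command above succeeded, the paddleocr line should report 2.6.1.3.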

4. Using PaddleOCR

4.1: Basic usage

from paddleocr import PaddleOCR


# Recognize an ID card
def findIdcardResult():
    # Path of the image to recognize
    img_path = r'C:\Users\Jewel\Desktop\身份证\蒙文.png'

    # Load the pretrained models.
    # The recognition language is switched with the lang parameter,
    # for example `ch`, `en`, `fr`, `german`, `korean`, `japan`.
    # use_angle_cls=False skips loading the text direction classifier.
    ocr = PaddleOCR(use_angle_cls=False, lang="ch")

    # To use models downloaded to the models directory instead, pass their paths:
    # ocr = PaddleOCR(use_angle_cls=True, lang="ch",
    #                 rec_model_dir='../models/ch_PP-OCRv3_rec_slim_infer/',
    #                 cls_model_dir='../models/ch_ppocr_mobile_v2.0_cls_slim_infer/',
    #                 det_model_dir='../models/ch_PP-OCRv3_det_slim_infer/')

    # Recognize the text in the image; cls=False matches use_angle_cls=False above
    result = ocr.ocr(img_path, cls=False)
    print(result)
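The code in this article reads the return value as result[0] for the first image, with line[1][0] as the recognized text and line[1][1] as its confidence score. The sketch below flattens such a structure into (text, score) pairs. The sample data is made up to show the shape only and is not real OCR output; the helper name extract_lines and the handling of an empty result are assumptions.

```python
def extract_lines(result, min_score=0.0):
    """Flatten one image's OCR output into (text, score) pairs."""
    # result[0] holds the lines of the first (here: only) image. A missing
    # or None entry is treated as "no text detected" (defensive assumption).
    lines = result[0] if result and result[0] else []
    return [
        (text, score)
        for _box, (text, score) in lines
        if score >= min_score
    ]


# Made-up sample in the shape [box, (text, score)]; not real OCR output.
sample_result = [[
    [[[30, 40], [200, 40], [200, 70], [30, 70]], ("姓名张三", 0.98)],
    [[[30, 90], [260, 90], [260, 120], [30, 120]], ("性别男民族汉", 0.95)],
    [[[30, 140], [90, 140], [90, 170], [30, 170]], ("xyz", 0.41)],
]]

print(extract_lines(sample_result, min_score=0.5))
```

With min_score=0.5 the low-confidence third line is dropped and the two remaining texts are returned in reading order.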
  

Version note: paddleocr uses the PP-OCRv3 models by default (--ocr_version PP-OCRv3). To use another version, set the --ocr_version parameter. The available versions are:

Version name | Description
PP-OCRv3 | Chinese and English detection and recognition, direction classifier, multilingual recognition
PP-OCRv2 | Chinese and English detection and recognition, direction classifier; multilingual models not yet updated
PP-OCR | Chinese and English detection and recognition, direction classifier, multilingual recognition
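From Python, the version can be chosen when the PaddleOCR object is created. The sketch below assumes that the ocr_version keyword argument mirrors the --ocr_version command-line flag in paddleocr 2.6; the helper build_ocr_kwargs is hypothetical and only checks the value against the three versions listed above before the arguments are handed to PaddleOCR.

```python
SUPPORTED_OCR_VERSIONS = ("PP-OCRv3", "PP-OCRv2", "PP-OCR")


def build_ocr_kwargs(ocr_version="PP-OCRv3", lang="ch", use_angle_cls=True):
    """Return keyword arguments for PaddleOCR after checking the version."""
    if ocr_version not in SUPPORTED_OCR_VERSIONS:
        raise ValueError(
            f"unsupported ocr_version {ocr_version!r}; "
            f"expected one of {SUPPORTED_OCR_VERSIONS}"
        )
    return {
        "ocr_version": ocr_version,
        "lang": lang,
        "use_angle_cls": use_angle_cls,
    }


kwargs = build_ocr_kwargs("PP-OCRv2")
print(kwargs)
# With paddleocr installed, the dictionary would be passed on like this:
# from paddleocr import PaddleOCR
# ocr = PaddleOCR(**kwargs)
```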

paddleocr needs three models in total: a detection model (det model), a direction classifier (cls model), and a recognition model (rec model).

To add a model you trained yourself, add its download link and fields in the paddleocr source and rebuild the package.

4.2: Using a trained model

For example, download a model from the official PaddlePaddle AI Studio (an AI learning and training community).

Take the [PP-OCRv3] models: after downloading them, create a models folder, extract each downloaded archive into it, and place the models folder in the root directory of PaddleOCR. Then pass the model paths as shown:

# Load the models stored in the models directory.
# use_angle_cls=True also loads the direction classifier (cls_model_dir).
ocr = PaddleOCR(use_angle_cls=True, lang="ch",
                rec_model_dir='../models/ch_PP-OCRv3_rec_slim_infer/',
                cls_model_dir='../models/ch_ppocr_mobile_v2.0_cls_slim_infer/',
                det_model_dir='../models/ch_PP-OCRv3_det_slim_infer/')

OCR models have usually already been trained on a sufficiently large data set. Roughly, the different kinds of model relate as follows:

  • Training model: usually a model that has had little or no training; suitable when you have a large amount of your own data to train on
  • Pre-trained model: already trained by the PaddleOCR team, but users can fine-tune it on new data; suitable when you have only a limited amount of data
  • Inference model: already tuned by the PaddleOCR team and directly usable for OCR recognition; not suitable for further training
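Before pointing det_model_dir, cls_model_dir, and rec_model_dir at the extracted folders, it can be worth checking that each one really contains an inference model. The sketch below assumes that an exported inference model consists of the files inference.pdmodel and inference.pdiparams, as in Paddle 2.x releases; the helper name missing_inference_files is made up for illustration.

```python
from pathlib import Path

# File names assumed for an exported Paddle inference model (2.x releases).
EXPECTED_FILES = ("inference.pdmodel", "inference.pdiparams")


def missing_inference_files(model_dir):
    """Return the expected inference files that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


for folder in (
    "../models/ch_PP-OCRv3_det_slim_infer/",
    "../models/ch_ppocr_mobile_v2.0_cls_slim_infer/",
    "../models/ch_PP-OCRv3_rec_slim_infer/",
):
    print(folder, "missing:", missing_inference_files(folder))
```

An empty list for every folder suggests the archives were extracted to the expected place.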

4.3: ID card recognition

from paddleocr import PaddleOCR
from common import IdCardStraight


# Recognize an ID card
def findIdcardResult(img_path=r'C:\Users\Jewel\Desktop\身份证\蒙文.png'):
    # Initialize the OCR model
    ocr = PaddleOCR(use_angle_cls=False, lang="ch")
    # Run detection and recognition; cls=False matches use_angle_cls=False
    result = ocr.ocr(img_path, cls=False)
    print(result)
    # Collect the recognized text of each line into a list
    txtArr = []
    # "or []" guards against an empty result for the image
    for line in result[0] or []:
        txt, score = line[1]
        # On Korean-script and Yi-script ID cards the field labels can be
        # merged into a single low-confidence line; skip such lines
        if (("姓" in txt and "性" in txt and "住" in txt) or ("名" in txt and "别" in txt and "生" in txt)) and score < 0.75:
            continue
        txtArr.append(txt)

    print(txtArr)
    # Pass the text lines to the post-processing class
    postprocessing = IdCardStraight(txtArr)
    id_result = postprocessing.run()
    print(id_result)
    return id_result
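The filter in the loop above can be tried on its own. The sketch below mirrors its condition: a line is dropped when it contains the label characters 姓, 性, 住 or 名, 别, 生 together and its confidence is below 0.75. The function name is_garbled_label_line and the sample pairs are made up for illustration.

```python
def is_garbled_label_line(text, score, threshold=0.75):
    """Mirror the filter above: fused field labels with a low score."""
    labels_a = all(ch in text for ch in "姓性住")
    labels_b = all(ch in text for ch in "名别生")
    return (labels_a or labels_b) and score < threshold


# Made-up (text, score) pairs for illustration only.
samples = [("姓名性别住址", 0.52), ("姓名张三", 0.97), ("名别生", 0.80)]
kept = [text for text, score in samples if not is_garbled_label_line(text, score)]
print(kept)
```

Only the first sample is dropped: the second does not contain a full label set, and the third has a score that is not below the threshold.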

Post-processing of the recognition results (the common module imported above):


import re
import json
import string


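def verifyByIDCard(idcard):
    """
    Check whether an 18-character ID number has a valid check character.
    """
    sz = len(idcard)
    # The first 17 characters must be digits, otherwise int() below would fail
    if sz != 18 or not idcard[:17].isdigit():
        return False

    weight = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    validate = ['1', '0', 'X', '9', '8', '7', '6', '5', '4', '3', '2']
    total = 0
    for i in range(len(weight)):
        total += weight[i] * int(idcard[i])
    m = total % 11
    # Accept a lowercase "x" as the check character as well
    return validate[m] == idcard[sz - 1].upper()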


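class IdCardStraight:
    """
    Post-processes the OCR output of an ID card. Ordinary ID cards are very likely to be
    recognized without problems, and minority ID cards in Zhuang, Tibetan, or Mongolian
    script are generally recognized without problems as well.
    """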
    # Names of the 56 ethnic groups as printed on ID cards, without the trailing "族"
    nation_list = ["汉", "蒙古", "回", "藏", "维吾尔", "苗", "彝", "壮", "布依", "朝鲜", "满", "侗", "瑶", "白",
                   "土家", "哈尼", "哈萨克", "傣", "黎", "傈僳", "佤", "畲", "高山", "拉祜", "水", "东乡", "纳西",
                   "景颇", "柯尔克孜", "土", "达斡尔", "仫佬", "羌", "布朗", "撒拉", "毛南", "仡佬", "锡伯", "阿昌",
                   "普米", "塔吉克", "怒", "乌孜别克", "俄罗斯", "鄂温克", "德昂", "保安", "裕固", "京", "塔塔尔",
                   "独龙", "鄂伦春", "赫哲", "门巴", "珞巴", "基诺"]

    def __init__(self, result):
        self.result = [
            i.replace(" ", "").translate(str.maketrans("", "", string.punctuation))
            for i in result
        ]
        print(self.result)
        self.out = {"result": {}}
        self.res = self.out["result"]
        self.res["name"] = ""
        self.res["idNumber"] = ""
        self.res["address"] = ""
        self.res["gender"] = ""
        self.res["nationality"] = ""

    def birth_no(self):
        """
        Extract the ID number and derive the gender from it.
        """
        for i in range(len(self.result)):
            txt = self.result[i]

            # ID number: digits followed by "X"/"x", or 18 digits
            if "X" in txt or "x" in txt:
                res = re.findall(r"\d*[Xx]", txt)
            else:
                res = re.findall(r"\d{18}", txt)

            if len(res) > 0:
                # Validate the number, because cards such as Tibetan-script ones
                # can yield a digit string longer than 18 characters
                if verifyByIDCard(res[0]):
                    self.res["idNumber"] = res[0]
                    self.res["gender"] = "男" if int(res[0][16]) % 2 else "女"
                    break

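    def full_name(self):
        """
        Extract the name.
        """
        # If text follows "名" in the same entry, take that text as the name. Without "名" there can
        # be no "姓名" either, so when "名" is missing it is enough to check for "姓".
        # A name has at least 2 characters, so only entries longer than 3 characters are checked.
        for i in range(len(self.result)):
            txt = self.result[i]
            if ("姓名" in txt or "名" in txt or "姓" in txt) and len(txt) > 3:
                resM = re.findall("名[\u4e00-\u9fa5]+", txt)
                resX = re.findall("姓[\u4e00-\u9fa5]+", txt)
                name = ""  # stays empty when nothing follows the label
                if len(resM) > 0:
                    name = resM[0].split("名")[-1]
                elif len(resX) > 0:
                    name = resX[0].split("姓")[-1]
                if len(name) > 1:
                    self.res["name"] = name
                    self.result[i] = "temp"  # keep the name from interfering with the address
                    return

        # If no text follows "姓名" or "名" but the label itself was found, take the next entry as the name.
        # If that entry has only one character, append the entry after it; two entries are normally enough.
        # On Uyghur-script or Yi-script cards the next entry may be Latin text, which would need to be removed.
        indexName = -1
        for i in range(len(self.result)):
            txt = self.result[i]
            if "姓名" in txt or "名" in txt:
                indexName = i
                break
        if indexName == -1:
            for i in range(len(self.result)):
                txt = self.result[i]
                if "姓" in txt:
                    indexName = i
                    break
        # Stop if no label was found or the label is the last entry
        if indexName == -1 or indexName + 1 >= len(self.result):
            return
        resName = self.result[indexName + 1]
        if len(resName) < 2 and indexName + 2 < len(self.result):
            resName = resName + self.result[indexName + 2]
            self.res["name"] = resName
            self.result[indexName + 2] = "temp"  # keep the name from interfering with the address
        else:
            self.res["name"] = resName
            self.result[indexName + 1] = "temp"  # keep the name from interfering with the address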

    def sex(self):
        """
        Extract the gender from an entry such as "性别女民族汉".
        """
        for i in range(len(self.result)):
            txt = self.result[i]
            if "男" in txt:
                self.res["gender"] = "男"

            elif "女" in txt:
                self.res["gender"] = "女"

    def national(self):
        # Typical entry: "性别女民族汉".
        # First look for an entry of the form "民族xx", "族xx", or "民xx". If one exists, take xx and
        # check it against the list of 56 ethnic groups; on a match, store the matching group.
        for i in range(len(self.result)):
            txt = self.result[i]
            if ("民族" in txt or "族" in txt or "民" in txt) and len(txt) > 2:
                resZ = re.findall("族[\u4e00-\u9fa5]+", txt)
                resM = re.findall("民[\u4e00-\u9fa5]+", txt)
                nationOcr = ""  # stays empty when nothing follows the label
                if len(resZ) > 0:
                    nationOcr = resZ[0].split("族")[-1]
                elif len(resM) > 0:
                    nationOcr = resM[0].split("民")[-1]

                for nation in self.nation_list:
                    if nation in nationOcr:
                        self.res["nationality"] = nation
                        self.result[i] = "nation"  # keep the ethnic group from being mistaken for a name
                        return
        # If "民族" or "族" is separated from the group name, remember the position of the label,
        # take the entry after it, and check that entry against the list in the same way.
        indexNational = -1
        for i in range(len(self.result)):
            txt = self.result[i]
            if "族" in txt:
                indexNational = i
                break
        # If neither "民族" nor "族" was found, look for "民" instead and proceed in the same way.
        if indexNational == -1:
            for i in range(len(self.result)):
                txt = self.result[i]
                if "民" in txt:
                    indexNational = i
                    break
        # Stop if no label was found or the label is the last entry
        if indexNational == -1 or indexNational + 1 >= len(self.result):
            return
        national = self.result[indexNational + 1]
        for nation in self.nation_list:
            if nation in national:
                self.res["nationality"] = nation
                self.result[indexNational + 1] = "nation"  # keep the ethnic group from being mistaken for a name
                break

    def address(self):
        """
        Extract the address.
        """
        addString = []
        for i in range(len(self.result)):
            txt = self.result[i]
            # Remove "号码" so that the "号" in "公民身份号码" is not mistaken for part of an address
            txt = txt.replace("号码", "")
            if "公民" in txt:
                txt = "temp"
            # Address keywords. 盟, 旗, 苏木, and 嘎查 are Mongolian administrative divisions;
            # "大学" covers collective households whose address is a university.

            if (
                    "住址" in txt
                    or "址" in txt
                    or "省" in txt
                    or "市" in txt
                    or "县" in txt
                    or "街" in txt
                    or "乡" in txt
                    or "村" in txt
                    or "镇" in txt
                    or "区" in txt
                    or "城" in txt
                    or "室" in txt
                    or "组" in txt
                    or "号" in txt
                    or "栋" in txt
                    or "巷" in txt
                    or "盟" in txt
                    or "旗" in txt
                    or "苏木" in txt
                    or "嘎查" in txt
                    or "大学" in txt
            ):
                # An address is assumed to start at the third entry or later. This avoids treating a name
                # that full_name() missed, and that happens to contain one of the keywords above, as an
                # address. The first address line must have at least 7 characters, and further lines are
                # only collected once the first line has been found.
                if i < 2 or (len(addString) < 1 and len(txt) < 7):
                    continue
                # An entry containing "住址", "省", or "址" is treated as the first address line;
                # split it on "址" and keep the last part.
                if "住址" in txt or "省" in txt or "址" in txt:
                    addString.insert(0, txt.split("址")[-1])
                else:
                    addString.append(txt)
                self.result[i] = "temp"

        if len(addString) > 0:
            self.res["address"] = "".join(addString)
        else:
            self.res["address"] = ""

    def predict_name(self):
        """
        If PaddleOCR did not return the name directly after the "姓名" label, guess the name.
        This heuristic still needs improvement.
        """
        for i in range(len(self.result)):
            txt = self.result[i]
            if self.res["name"] == "":
                if 1 < len(txt) < 5:
                    if (
                            "性别" not in txt
                            and "姓名" not in txt
                            and "民族" not in txt
                            and "住址" not in txt
                            and "出生" not in txt
                            and "号码" not in txt
                            and "身份" not in txt
                            and "nation" not in txt
                    ):
                        result = re.findall("[\u4e00-\u9fa5]{2,4}", txt)
                        if len(result) > 0:
                            self.res["name"] = result[0]
                            break

    def run(self):
        self.full_name()
        self.sex()
        self.national()
        self.birth_no()
        self.address()
        self.predict_name()
        return json.dumps(self.out, ensure_ascii=False)
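The check-character arithmetic used by verifyByIDCard can be followed step by step with a small standalone sketch. It uses the same weights and lookup table as the code above; the function names check_char and gender_from_id are made up for illustration, and the sample number is one often used in examples rather than a real person's ID.

```python
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CHARS = "10X98765432"


def check_char(first17):
    """Compute the 18th character from the first 17 digits."""
    total = sum(w * int(d) for w, d in zip(WEIGHTS, first17))
    return CHECK_CHARS[total % 11]


def gender_from_id(id_number):
    """Return the gender encoded in a valid 18-character ID number."""
    if len(id_number) != 18 or not id_number[:17].isdigit():
        raise ValueError("expected 17 digits followed by a check character")
    if check_char(id_number[:17]) != id_number[17].upper():
        raise ValueError("check character does not match")
    # The 17th digit is odd for male and even for female.
    return "男" if int(id_number[16]) % 2 else "女"


sample = "11010519491231002X"
print(check_char(sample[:17]))  # weighted sum is 167, and 167 % 11 == 2
print(gender_from_id(sample))
```

For the sample, the weighted sum of the first 17 digits is 167, the remainder modulo 11 is 2, and position 2 of the lookup table is "X", which matches the last character. The 17th digit is 2, an even number, so the gender is 女.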

Origin blog.csdn.net/u012693479/article/details/130830264