Teach You to Write a Web Crawler (8): Garbled Characters

 


Author: Takumi

Summary: Writing a Crawler from Scratch, A Crash Guide for Beginners!


 

Character encoding and decoding is essential knowledge for crawler developers. Sooner or later in your crawling career you will run into garbled web pages. Rather than panicking when that happens, it is better to learn the subject early and avoid garbled text altogether.

 

Introduction to Character Encodings

What is a character set

Before introducing character encoding, let's first understand what a character set is.

A character is any letter, symbol, or glyph: the characters of national scripts, punctuation marks, graphic symbols, digits, and so on. A character set is a collection of such characters. There are many character sets, and each contains a different number of characters. Common examples include the ASCII, GBK, and Unicode character sets.

What is character encoding

Character encodings and character sets are not the same thing. A character set is just a collection of characters; by itself it cannot be transmitted over a network or processed by software, and it becomes usable only once it is encoded. For example, the Unicode character set can be encoded as UTF-8, UTF-16, UTF-32, and so on, depending on the need.

A character encoding maps the characters of a character set to binary numbers. When countries and regions drew up their encoding standards, the "set of characters" and the "encoding" were usually defined at the same time, so the term "character set" as we commonly use it covers not only the set of characters but also their encoding.
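To make the distinction concrete, here is a small Java example (Java being the language of the crawler discussed later; this snippet is my own illustration, not code from the series). It encodes the same two characters with UTF-8 and with GBK, and the two byte sequences differ even though the characters are identical:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String text = "中文";                                    // two characters from the Unicode character set
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);     // 3 bytes per character under UTF-8
        byte[] gbk = text.getBytes(Charset.forName("GBK"));      // 2 bytes per character under GBK
        System.out.println("UTF-8: " + Arrays.toString(utf8));   // e.g. [-28, -72, -83, -26, -106, -121]
        System.out.println("GBK:   " + Arrays.toString(gbk));    // e.g. [-42, -48, -50, -60]
    }
}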

Common character sets

Let's briefly introduce a few common ones.

ASCII:

ASCII is usually the first character set a computer-science student encounters, typically through Mr. Tan Haoqiang's classic introductory textbook. Please allow me a moment of nostalgia for his explanation of it.

 

 

Chinese character sets:

GB2312: Contains 6763 Chinese characters.

GBK: Contains 21003 Chinese characters. GBK is compatible with GB2312, which means that Chinese characters encoded with GB2312 can be decoded with GBK.

GB18030: Contains about 70,000 Chinese characters; the number is so large because it also covers the scripts of ethnic minorities. It is likewise compatible with GBK and GB2312.

Unicode: Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a single, unique code to every character in every language, so that text can be converted and processed across languages and platforms. It can be serialized with a variety of encodings, such as UTF-7, UTF-8, UTF-16, and UTF-32.

 

Why garbled text appears

Simply put, garbled text appears because the character set used for decoding differs from the one used for encoding. A real-life analogy: a Briton writes "bless" on a piece of paper to express a blessing (encoding). A Frenchman picks up the paper, and since "blesser" means "to wound" in French, he reads it as an injury (decoding). Likewise, in a computer, if text encoded with UTF-8 is decoded with GBK, the two encodings map bytes to characters differently, the same Chinese character sits at different positions in the two code tables, and the result is garbled text.
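The same mistake, sketched in Java (a minimal illustration of my own, not code from this series): bytes produced with UTF-8 are interpreted as GBK, and what comes out is mojibake rather than the original text.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "编码与解码";                                  // the text the "writer" meant to send
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);    // encoded with UTF-8
        String wrong = new String(utf8Bytes, Charset.forName("GBK"));    // decoded with the wrong charset
        System.out.println(wrong);                                       // prints garbled characters
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);    // decoded with the charset actually used
        System.out.println(right);                                       // prints 编码与解码 again
    }
}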

So how does garbled text arise in a crawler, and how do we fix it?

 

Garbled text in the crawler

Suppose our crawler is written in Java, uses OkHttp as the HTTP client, and stores the fetched pages in MongoDB. Garbled text is produced as follows:

  1. OkHttp requests the given URL and receives a GBK-encoded byte stream for the page;
  2. OkHttp decodes the bytes with its default charset, UTF-8 (the text is already mangled at this point), stores them in a Java String, and returns it to the caller. (Why does the String end up as UTF-16? Because that is how Java represents String data in memory.)
  3. The crawler takes the wrongly decoded String, calls the MongoDB API, which encodes the data as UTF-8 and writes it to the database. The data we finally see in the database is therefore garbled.

 

 

Clearly, the root cause of the garbled text is that OkHttp used the wrong charset when it first decoded the response. To fix the problem, OkHttp must learn the page's actual encoding and decode with that.
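A minimal sketch of the fix, assuming OkHttp 3.x and a page we know (or have detected) to be GBK: read the raw bytes instead of calling string(), then decode them ourselves with the correct charset. The URL below is a placeholder.

import java.nio.charset.Charset;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class FetchWithCharset {
    public static void main(String[] args) throws Exception {
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder()
                .url("http://example.com/gbk-page.html")    // hypothetical URL of a GBK page
                .build();
        try (Response response = client.newCall(request).execute()) {
            byte[] raw = response.body().bytes();           // raw bytes, no decoding yet
            // Decode with the page's real charset instead of relying on the UTF-8 default.
            String html = new String(raw, Charset.forName("GBK"));
            System.out.println(html.substring(0, Math.min(200, html.length())));
        }
    }
}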

 

 

There are two conventional ways for a web page to tell a crawler which encoding it uses (a sketch of reading both follows the list):

  1. A charset declared in the HTTP response header:

Content-Type: text/html;charset=utf-8

  2. A charset declared in an HTML meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
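A rough sketch of reading both conventions, assuming OkHttp for the response header and a simple regular expression for the meta tag (a production crawler would more likely use an HTML parser); both helper names are hypothetical ones of my own:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import okhttp3.MediaType;
import okhttp3.Response;

public class CharsetFromConventions {
    // Hypothetical helper: charset from the Content-Type response header, or null if absent.
    static String charsetFromHeader(Response response) {
        MediaType contentType = response.body().contentType();
        if (contentType != null && contentType.charset() != null) {
            return contentType.charset().name();
        }
        return null;
    }

    // Hypothetical helper: charset from a <meta ... charset=...> declaration, or null if absent.
    static String charsetFromMeta(String htmlPrefix) {
        Pattern p = Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(htmlPrefix);
        return m.find() ? m.group(1) : null;
    }
}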

Once the crawler obtains the encoding from these conventions, OkHttp (or our own code) can decode the page correctly. In practice, however, the situation is not encouraging: many pages do not follow the conventions and carry neither piece of information. Someone used Alexa to count, for several countries, how many pages declare a charset in the HTTP header:

Language    URL suffix    Number of URLs    URLs whose HTTP header declares a charset
Chinese     .cn           10086             3776
English     .us/.uk       21565             13223
Russian     .ru           39453             28257
Japanese    .jp           20339             6833
Arabic      .iq           1904              1093
German      .de           35318             23225
Persian     .ir           7396              4018
Indian      .in           12236             4867
Total       all           148297            85292

The results show that we cannot passively rely on the page to tell us its encoding; we have to detect it actively from the page content.

 

Detecting character encoding

What is character encoding auto-detection?

It means that, when faced with a byte stream whose encoding we do not know, we try to determine an encoding that lets us read the text, much like trying to break a cipher without the decryption key.

Isn't that impossible?

Generally speaking, yes, it is impossible. However, some encodings are optimized for specific languages, and languages are not random. Certain character sequences occur in a language all the time, while others are meaningless in it. A fluent English reader who opens a newspaper and sees a sequence like "txzqJv 2!dasd0a QqdKjvz" immediately knows it is not English (even though it is built entirely from English letters). By studying large amounts of "typical" text, a computer algorithm can imitate this feel for language and make heuristic guesses about the language of a piece of text. In other words, detecting the encoding is really detecting the language, helped by extra information such as which encodings each language commonly uses.

Does such an algorithm exist?

It turns out that it does. All major browsers include automatic character-encoding detection, because the Internet is full of pages that lack encoding information. Mozilla Firefox ships a library that auto-detects character encodings; its Python port is called chardet.

Using chardet

Install:

pip install chardet

Use:

>>> from urllib.request import urlopen   # Python 3; on Python 2 this was urllib.urlopen
>>> rawdata = urlopen('http://www.jd.com/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'language': '', 'encoding': 'utf-8'}

Note the confidence field in the result: detection is probabilistic, so the answer is not guaranteed to be 100% accurate.

Friends using other languages need not worry: chardet has been ported to many of them. C++ does not seem to have a great option, though; consider IBM's ICU ( http://site.icu-project.org/ ).
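For the Java crawler in our example, one workable option (my suggestion, not something prescribed by this article) is the CharsetDetector bundled with ICU4J; a minimal sketch:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectWithIcu {
    // Returns the most likely charset name for the given raw page bytes.
    static String detectCharset(byte[] rawPage) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(rawPage);
        CharsetMatch match = detector.detect();   // best guess, with a 0-100 confidence score
        System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
        return match.getName();
    }
}

The confidence score plays a similar role to chardet's confidence field: the guess is statistical, not guaranteed.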

 

Extended reading

《A composite approach to language/encoding detection》

(https://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html)

This paper explains the detection algorithms behind chardet: the "coding scheme method", the "character distribution method", and the "two-character sequence distribution method". It then argues why the three must be used in combination and gives an example of how to combine them.

《Charset Encoding Detection of HTML Documents A Practical Experience》

(https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/wiki/Charset-Encoding-Detection-of-HTML-Documents.pdf)

This paper builds on existing detection techniques and adds a few tricks to improve accuracy. The core idea is to combine Mozilla CharDet with IBM ICU and, crucially, to strip the HTML tags before detection. Although the paper comes from a university in Iran, the method is said to work well in production and to be used in a large crawler handling on the order of one billion pages.
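My own rough approximation of the tag-stripping idea (not the paper's code): decode the bytes with ISO-8859-1, which maps every byte to a character losslessly, strip the markup, re-encode, and only then run a detector such as the ICU one above.

import java.nio.charset.StandardCharsets;

public class StripTagsBeforeDetect {
    // Removes HTML markup from raw page bytes without assuming their real encoding.
    static byte[] stripTags(byte[] rawPage) {
        // ISO-8859-1 maps bytes 0x00-0xFF one-to-one to characters, so no information is lost.
        String asLatin1 = new String(rawPage, StandardCharsets.ISO_8859_1);
        String withoutTags = asLatin1.replaceAll("<[^>]*>", " ");   // crude tag removal
        return withoutTags.getBytes(StandardCharsets.ISO_8859_1);
    }
}

Because HTML markup is plain ASCII, removing it leaves mostly natural-language bytes, which is exactly what the statistical detection methods need.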

 

Next step

The topics have been getting heavier lately, and you must be tired. In the next installment I plan to lighten things up with an easier subject. It has been half a year since the series began; the field has moved on and some useful new tools have appeared. Which of our tools should be replaced? Stay tuned for the next installment!

 
