Let's talk about .NET web scraping and encoding conversion

In this article, you'll learn about two class libraries for HTML parsing. We'll also cover web scraping, encoding conversion, and response compression: how to implement them in .NET, and how to optimize and improve the result.

1. Background

With Copilot's help, we can knock out development tasks quickly and finish a small utility in very little time. Who would have thought that today the comments we write are read by AI, and even without comments, AI can guess your intentions. The code itself is now almost worthless; only the product reflects its value.

My usual entertainment is reading novels, and I'm used to a local plain-text reader, which means downloading the novels first. Some sites offer a direct TXT download, but some novel sites don't. I've also used tools like uncle-novel, which works quite well, but it never felt entirely smooth to me.

2. Web scraping

In .NET, HtmlAgilityPack is a commonly used HTML parsing library. It provides solid support for DOM parsing and is often used for web scraping and analysis tasks.

var web = new HtmlWeb();
var doc = web.Load(url);
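
For example, once the document is loaded, nodes can be pulled out with XPath. A minimal sketch continuing from the lines above (the XPath expressions and page structure here are just assumptions for illustration, not from any particular novel site):

// Grab the page title and the chapter paragraphs via XPath
var title = doc.DocumentNode.SelectSingleNode("//h1")?.InnerText;
var paragraphs = doc.DocumentNode.SelectNodes("//div[@id='content']/p");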

I used this library in the little tool I wrote, and it worked fine. It wasn't until I scraped a novel a few days ago and got garbled text that I realized the pages I had scraped before were all UTF-8 encoded, while this one was GBK.

Although HtmlAgilityPack provides AutoDetectEncoding, and it is enabled by default, it doesn't seem to actually work. The same goes for enabling OptionReadEncoding on HtmlDocument after fetching the HTML stream with HttpClient.

3. Encoding conversion

That being the case, let's fetch the page directly with HttpClient, although parsing still can't escape HtmlAgilityPack. For GBK support, the System.Text.Encoding.CodePages package needs to be added here.

For the scraped page content, we first read the raw bytes and decode them as UTF-8, then extract the page's actual character encoding with a regular expression and convert as needed.

var client = new HttpClient();
var response = await client.GetAsync(url);
var bytes = await response.Content.ReadAsByteArrayAsync();
var htmldoc = Encoding.UTF8.GetString(bytes);
var match = Regex.Match(htmldoc, "<meta.*?charset=\"?(?<charset>.*?)\".*?>", RegexOptions.IgnoreCase);
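
With the charset extracted, the bytes can then be re-decoded. A rough sketch of that step, continuing the snippet above (the fallback to utf-8 is an assumption; the full version appears in the optimized function later):

// GB-family encodings require the CodePages provider mentioned above
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var charset = match.Success ? match.Groups["charset"].Value : "utf-8";
var encoding = Encoding.GetEncoding(charset);   // e.g. "gbk" or "gb2312"
var html = Encoding.UTF8.GetString(Encoding.Convert(encoding, Encoding.UTF8, bytes));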

4. Handling web page compression

When using HttpClient to fetch a web page, it's best to add request headers to disguise it as a browser. Copilot makes this easy: just type the comment "Set request headers", press Enter, and there's no need to go searching for a browser UA. Speaking of searching, besides being pestered by search engine ads, you can also get sidetracked by some tempting trending topics, and then... nothing gets done.

This time, though, I may have hit Enter a few times too many and knocked myself into a pit. Originally I only wanted to add a UA, but the suggestions looked useful, so I ended up adding a whole bunch:

// Set request headers
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36");
client.DefaultRequestHeaders.Add("Accept", "*/*");
client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
client.DefaultRequestHeaders.Add("Accept-Language", "zh-CN,zh;q=0.9");
client.DefaultRequestHeaders.Add("Connection", "keep-alive");

Then I tested it and found my code no longer worked. Annoying. Maybe the website has some fancy firewall:


After debugging for a long time, it finally occurred to me: could it be the compression request header?


I commented it out and tested again, and sure enough, that was it. I had thought I was doing everyone a favor: with compression enabled, fetching would be faster and the site would save bandwidth too.

Of course, simply commenting the header out is not a real fix; a problem should be solved properly, so I just asked GPT directly. There are plenty of convoluted solutions involving manual decompression, which I won't go into here. When I told GPT that I'm developing on the latest .NET and asked it to give me something elegant, it was indeed elegant:

var handler = new HttpClientHandler
{
    AutomaticDecompression = System.Net.DecompressionMethods.GZip | System.Net.DecompressionMethods.Deflate | System.Net.DecompressionMethods.Brotli
};
var httpClient = new HttpClient(handler);

It turns out HttpClient supports handling compression automatically: you can enable automatic decompression through HttpClientHandler, which is indeed far more convenient than digging through the official docs yourself.

5. Code optimization

With the adjustments above, the core code is basically done. Of course there is still plenty of room for optimization; here we can simply ask GPT-4 for help:

/// <summary>
/// Download the web page content and convert other encodings to UTF-8
/// </summary>
static async Task<string> GetWebHtml(string url)
{
    // Download the page content with HttpClient
    var handler = new HttpClientHandler();
    // Ignore certificate errors
    handler.ServerCertificateCustomValidationCallback = (message, cert, chain, errors) => true;
    handler.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate | DecompressionMethods.Brotli;
    var client = new HttpClient(handler);
    // Set request headers
    client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36");
    client.DefaultRequestHeaders.Add("Accept", "*/*");
    // Without automatic decompression, this header would cause garbled output
    client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
    client.DefaultRequestHeaders.Add("Accept-Language", "zh-CN,zh;q=0.9");
    client.DefaultRequestHeaders.Add("Connection", "keep-alive");
    var response = await client.GetAsync(url);
    var bytes = await response.Content.ReadAsByteArrayAsync();

    // Get the page encoding; ContentType may be empty, so fall back to the page itself
    var charset = response.Content.Headers.ContentType?.CharSet;
    if (string.IsNullOrEmpty(charset))
    {
        // Read the encoding declaration from the page
        var htmldoc = Encoding.UTF8.GetString(bytes);
        var match = Regex.Match(htmldoc, "<meta.*?charset=\"?(?<charset>.*?)\".*?>", RegexOptions.IgnoreCase);
        if (match.Success) charset = match.Groups["charset"].Value;
        else charset = "utf-8";
    }

    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    Encoding encoding;

    switch (charset.ToLower())
    {
        case "gbk":
            encoding = Encoding.GetEncoding("GBK");
            break;
        case "gb2312":
            encoding = Encoding.GetEncoding("GB2312");
            break;
        case "iso-8859-1":
            encoding = Encoding.GetEncoding("ISO-8859-1");
            break;
        case "ascii":
            encoding = Encoding.ASCII;
            break;
        case "unicode":
            encoding = Encoding.Unicode;
            break;
        case "utf-32":
            encoding = Encoding.UTF32;
            break;
        default:
            return Encoding.UTF8.GetString(bytes);
    }

    // Convert everything to UTF-8
    var html = Encoding.UTF8.GetString(Encoding.Convert(encoding, Encoding.UTF8, bytes));
    return html;
}
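
As a usage sketch, the returned HTML can then be fed to HtmlAgilityPack for parsing as before (the URL and XPath below are placeholders for illustration):

var html = await GetWebHtml("https://example.com/book/1.html");
var doc = new HtmlDocument();
doc.LoadHtml(html);
var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;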

5.1 Replacing the HTML parsing library

The whole thing started because HtmlAgilityPack has a problem with automatic encoding detection, so is there an alternative library?

Of course there is: GPT-4 recommends AngleSharp . I tested the library briefly, and it can detect the page encoding directly without any configuration; it seems even easier to use than HtmlAgilityPack. It also supports JavaScript, LINQ syntax, ID and class selectors, dynamically adding nodes, and XPath syntax.
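
A rough sketch of what reading a page with AngleSharp looks like (the URL and CSS selectors are placeholders, not from the original tool):

using AngleSharp;

var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
// AngleSharp detects the document encoding on its own, no extra configuration needed
var document = await context.OpenAsync("https://example.com/book/1.html");
var title = document.QuerySelector("title")?.TextContent;
var chapters = document.QuerySelectorAll("div.list a");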

All in all, although I reinvented the wheel this time, my programming knowledge has grown.


5.2 Optimizing our own wheel

Although there are the following points worth optimizing, it is honestly not as convenient as simply swapping the wheel, because these problems no longer exist once you switch:

  1. For real-world use, use a static HttpClient instance instead of creating a new one for each request. This avoids wasting resources. You can move it and its configuration into a separate helper class, such as HttpClientHelper, and access it when needed (see the sketch after this list).

  2. Here we registered the encoding provider with Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) inside the function; in practice it should be executed when the program starts. That way the provider is registered only once, at startup, rather than on every call.

  3. Some other stylistic tweaks, such as the switch statement and method naming.
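
For points 1 and 2, a minimal sketch of what that could look like (HttpClientHelper is just the hypothetical name from point 1, not an existing class):

static class HttpClientHelper
{
    // One shared instance for the whole application instead of one per request
    public static readonly HttpClient Client = new HttpClient(new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate | DecompressionMethods.Brotli
    });
}

// At program startup, register the code-pages provider exactly once
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);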

6. Finally

This article shares some of my experience with web scraping while developing my small tool BookMaker. It mainly introduces two HTML parsing libraries and works through some encoding-conversion and compression issues. I hope it is helpful.


Origin blog.csdn.net/marin1993/article/details/131499194