C#: why crawled page source comes out garbled, and how to display Chinese correctly

Analysis: To speed up page delivery, most websites compress their content before transmitting it. GZIP is the most commonly used compression algorithm and also the most widely supported.

Because the site sends its content GZIP-compressed, the data we receive in the WebResponse will look garbled if we display it without decompressing it first. So how do we tell whether a site transmits with GZIP or with some other compression?

Here is an example, as shown in the 360 browser:

 

As you can see, Baidu transfers data to the client using gzip, deflate. Now that we know the cause, we can solve the problem.
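The same check can also be made in code rather than in the browser. What follows is a minimal sketch, assuming the server reports its compression in the Content-Encoding response header. The class and method names (CompressionCheck, GetContentEncoding, LooksLikeGzip) are illustrative and do not come from the original post. As a fallback it also inspects the first two bytes of the body, since every gzip stream begins with 0x1F 0x8B.

```csharp
using System;
using System.Net;

public static class CompressionCheck
{
    // Asks the server for compressed content and returns the Content-Encoding
    // it reports (for example "gzip"), or an empty string if none is reported.
    public static string GetContentEncoding(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return response.ContentEncoding ?? string.Empty;
        }
    }

    // Fallback check on the raw body: every gzip stream starts with the
    // two magic bytes 0x1F 0x8B.
    public static bool LooksLikeGzip(byte[] data)
    {
        return data != null && data.Length >= 2 && data[0] == 0x1F && data[1] == 0x8B;
    }
}
```

If GetContentEncoding returns "gzip", or LooksLikeGzip returns true for the downloaded bytes, the body has to be decompressed before it is decoded as text.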

 

2. Decompressing GZIP

The method below takes a URL as input and returns the decompressed page content as a string.

// Requires: using System.IO; using System.IO.Compression; using System.Net; using System.Text;

// Downloads a gzip-compressed page and returns its decompressed text
private static string getGzip(string u)
{
    StringBuilder sb = new StringBuilder(204800); // 200K initial capacity; for frequent concatenation, StringBuilder saves memory and performs better than string
    WebClient wc = new WebClient(); // provides common methods for sending and receiving web data
    wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate"; // tell the server we accept compressed data
    wc.Headers[HttpRequestHeader.AcceptLanguage] = "zh-CN,zh"; // request header asking for Chinese-language content
    byte[] buffer = wc.DownloadData(u); // DownloadData() fetches the resource into a local buffer
    GZipStream g = new GZipStream(new MemoryStream(buffer), CompressionMode.Decompress); // stream object that decompresses the downloaded data
    byte[] tmpbuffer = new byte[20480]; // 20K temporary byte array
    int len = g.Read(tmpbuffer, 0, 20480); // read the first decompressed chunk
    while (len > 0)
    {
        // Decode with the page's encoding. Encoding.Default is the system default (GBK on the author's system);
        // if the page is written in UTF-8, use Encoding.UTF8. Right-click the page and view its source to find the encoding.
        sb.Append(Encoding.Default.GetString(tmpbuffer, 0, len));
        len = g.Read(tmpbuffer, 0, 20480); // read the next chunk; 0 means end of stream
    }
    g.Close();
    return sb.ToString();
}
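One caveat with decoding each 20K chunk on its own is that a multi-byte character (GBK or UTF-8) can straddle two reads and come out garbled at the boundary. A possible variant, sketched under the assumption that the page's encoding is known and passed in explicitly, decompresses everything into memory first and decodes once. The names GzipText, Decompress and Compress are illustrative and not from the original post. Data that does not start with the gzip magic bytes is treated as uncompressed, which is an assumption of this sketch.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipText
{
    // Decompresses gzip data (if it is gzip) and decodes the full result once,
    // so no multi-byte character is split across buffer boundaries.
    public static string Decompress(byte[] data, Encoding encoding)
    {
        bool isGzip = data.Length >= 2 && data[0] == 0x1F && data[1] == 0x8B;
        if (!isGzip)
        {
            return encoding.GetString(data); // assumed: body was sent uncompressed
        }

        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output); // reads until the end of the compressed stream
            return encoding.GetString(output.ToArray());
        }
    }

    // Helper used only to build sample input for trying out Decompress.
    public static byte[] Compress(string text, Encoding encoding)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = encoding.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray(); // ToArray still works after the stream is closed
        }
    }
}
```

For a UTF-8 page, the downloaded bytes would be passed as GzipText.Decompress(wc.DownloadData(u), Encoding.UTF8). When HttpWebRequest is used instead of WebClient, setting its AutomaticDecompression property to DecompressionMethods.GZip | DecompressionMethods.Deflate lets the framework handle decompression, which may be simpler.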

 

 


Origin www.cnblogs.com/arcticfish/p/11925642.html