C# anti-crawling: calling the Chrome browser

In a previous crawler post we described one strategy for dealing with IP-based access restrictions: crawling proxy IPs and rotating through them. But some sites do more than limit how often you can visit; they also apply clever processing to the returned page, for example loading the page content dynamically through JavaScript after the request. In that situation a simple request is not enough to crawl the page, and you may need to call a real browser to do the crawling. In this article we call the Chrome browser from C# to crawl dynamically loaded information.

The trouble with an ordinary crawler

Suppose we want to crawl the details of a blog post page: the title, body text, author, publication time, read count, and so on. The first idea is to send a GET request to the page's URL and parse the information we need out of the returned response with XPath. Take a blog post as the example, here the previous article "C# anti-crawling with proxy IPs":

    // Requires: using System; using System.IO; using System.Net; using System.Text;
    class Program
    {
        static void Main(string[] args)
        {
            BasicalMothed("https://blog.csdn.net/Leaderxin/article/details/102764234"); // call the plain request-based method
            Console.Read();
        }
        /// <summary>
        /// The ordinary crawler approach: request the url and scrape the returned response
        /// </summary>
        static void BasicalMothed(string url)
        {
            HttpWebRequest req = WebRequest.Create(url) as HttpWebRequest;
            if (req == null)
                return;
            HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
            Encoding bin = Encoding.GetEncoding("UTF-8");
            using (StreamReader sr = new StreamReader(resp.GetResponseStream(), bin))
            {
                string str = sr.ReadToEnd();
                Console.WriteLine(str);
                // parse the content of str with XPath here and store the results
                return;
            }
        }
    }
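
For completeness, here is a minimal sketch of what the XPath parsing step marked by the comment could look like, assuming the HtmlAgilityPack NuGet package (which the original post does not use); on this particular page it would find nothing, because the response comes back garbled:

    // A hypothetical helper: parse the returned HTML with XPath via HtmlAgilityPack
    static void ParseWithXpath(string html)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        // The class name matches the structure of the CSDN article page
        var titleNode = doc.DocumentNode.SelectSingleNode("//h1[@class='title-article']");
        if (titleNode != null)
            Console.WriteLine(titleNode.InnerText.Trim());
    }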

Let's look at the printed contents of the response:
[Screenshot: the response body printed to the console is garbled]
The returned content is garbled; clearly the platform has done some clever processing, and a direct request to the URL does not give us the payload we want. There are other scenarios as well: some pages have buttons such as "load more" that trigger additional requests and render more content when clicked, and those cannot be crawled with a direct request either.

Calling the browser through chromedriver

Let's crawl the page by having C# call chromedriver and control the browser to visit the page we want. First, install the Selenium library via NuGet:

Note: if you install the latest version of Selenium, the Chrome browser also needs to be updated to the latest version; otherwise the call will throw an exception.

[Screenshot: installing the Selenium packages from the NuGet package manager]
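
For reference, installing from the NuGet Package Manager Console looks roughly like this (assuming the standard Selenium.WebDriver and Selenium.WebDriver.ChromeDriver packages; the screenshot above uses the NuGet UI instead):

    Install-Package Selenium.WebDriver
    Install-Package Selenium.WebDriver.ChromeDriver
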
Next, open the page in the browser, press F12, and inspect its structure to see which tags hold the information we want to crawl. The result: the title is in an h1 with class='title-article', the publication time in a span with class='time', the blogger's name in an a tag with class='follow-nickName', the read count in a span with class='read-count', and the body text in the div with id='article_content'.
[Screenshot: the page structure in the browser's F12 developer tools]
Then write the code that calls the Chrome browser to open this page and fetch the elements we identified:

        /// <summary>
        /// Crawl by driving the Chrome browser through Selenium
        /// </summary>
        /// <param name="url"></param>
        static void SeleniumMothed(string url)
        {
            // Requires: using OpenQA.Selenium; using OpenQA.Selenium.Chrome; using System.Threading.Tasks;
            // Launch the Chrome browser
            IWebDriver selenium = new ChromeDriver();
            // Navigate the browser to the url we want to crawl
            selenium.Navigate().GoToUrl(url);
            // Crude wait: poll until the page title is available, i.e. the page has loaded
            while (string.IsNullOrEmpty(selenium.Title))
            {
                Task.Delay(100).GetAwaiter().GetResult();
            }

            // Grab the title via a CSS selector
            var title = selenium.FindElement(By.CssSelector("h1.title-article")).Text;
            // Publication time
            var time = selenium.FindElement(By.CssSelector("span.time")).Text;
            // Blogger's name
            var name = selenium.FindElement(By.CssSelector("a.follow-nickName")).Text;
            // Read count
            var nums = selenium.FindElement(By.CssSelector("span.read-count")).Text;
            // Body text; the id is fixed, so we fetch it directly with an id selector
            var content = selenium.FindElement(By.Id("article_content")).Text;
            Console.WriteLine("Title: " + title);
            Console.WriteLine("Published: " + time);
            Console.WriteLine("Blogger: " + name);
            Console.WriteLine("Reads: " + nums);
            Console.WriteLine("Body: " + content);
        }

Look at the results printed to the console:
[Screenshot: the crawled title, publication time, blogger name, read count, and body printed to the console]
As you can see, by calling the Chrome browser we have successfully crawled the information we wanted. Selenium can do far more than this; if you are interested, explore it on your own. Later posts will cover some of its other uses.
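
As one example (not covered in the original post, so treat it as a hedged sketch), Chrome can be started headless through ChromeOptions, letting the same crawl run without opening a visible browser window:

    // Assumes: using OpenQA.Selenium; using OpenQA.Selenium.Chrome;
    var options = new ChromeOptions();
    options.AddArgument("--headless");    // run Chrome without a visible window
    using (IWebDriver driver = new ChromeDriver(options))
    {
        // Same crawl as above, just without a browser window on screen
        driver.Navigate().GoToUrl("https://blog.csdn.net/Leaderxin/article/details/102764234");
        Console.WriteLine(driver.Title);
    }
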
Previous post: C# anti-crawling with proxy IPs

Origin blog.csdn.net/Leaderxin/article/details/102923172