C#: Crawling pages that load content on scroll with HtmlAgilityPack + Selenium

Many sites now load their content only as the user scrolls down the page, so fetching the static HTML does not return everything on the page. Selenium can drive a real browser and scroll the page so that all of the content gets loaded.

Recap

About Selenium

Selenium is a tool for automated testing of web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera and so on. Its main uses are browser compatibility testing (checking that an application works correctly on different browsers and operating systems) and functional testing (building regression tests that verify software functionality against user requirements). It can record user actions and automatically generate test scripts in languages such as .NET, Java and Perl. Selenium is an open-source framework released under the Apache License 2.0.

Installing Selenium for C#

This article only uses Selenium to scroll the page, so it does not introduce Selenium in much detail.
Search for "Selenium" in the NuGet Package Manager and install these two packages:

  • Selenium.WebDriver
  • Selenium.Chrome.WebDriver

Example (downloading the images on a site's home page)

Getting a page's HTML the usual way

ChromeDriver driver = new ChromeDriver();
driver.Navigate().GoToUrl(url);
string title = driver.Title; // page title
string html = driver.PageSource; // page HTML

Running Chrome without a window and hiding the ChromeDriver console

By default a Chrome window opens, along with a console window that prints information while the program runs. We need neither of them.

// Run Chrome headless (no browser window)
ChromeOptions options = new ChromeOptions();
options.AddArgument("headless");

// Hide the ChromeDriver console window
ChromeDriverService driverService = ChromeDriverService.CreateDefaultService();
driverService.HideCommandPromptWindow = true;

ChromeDriver driver = new ChromeDriver(driverService, options);
driver.Navigate().GoToUrl(url);

Scrolling the page to the bottom in sections

Calling scrollTo(0, document.body.scrollHeight) jumps straight to the bottom of the page, and the content in the middle of the page never gets loaded. Instead, scroll down in several steps and give the page enough time to load after each one.

for (int i = 1; i <= 10; i++)
{
    string jsCode = "window.scrollTo({top: document.body.scrollHeight / 10 * " + i + ", behavior: \"smooth\"});";
    // Run the JavaScript through the IJavaScriptExecutor interface
    IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
    js.ExecuteScript(jsCode);
    // Pause so the newly visible content has time to load
    Thread.Sleep(1000);
}
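The loop above assumes the page height is known up front. On pages that keep growing as you scroll, a variant that re-reads the height before every step may work better. The following is only a sketch of that idea, not part of the original program: ScrollHelper, ScrollToEnd and all of its parameters are hypothetical names, and the browser is passed in as delegates so the logic can be tried without Selenium.

```csharp
using System;

public static class ScrollHelper
{
    /// <summary>
    /// Scrolls down in fixed-size steps, re-reading the page height before
    /// each step, until the bottom is reached or maxSteps steps were taken.
    /// Returns the last scroll position that was requested.
    /// </summary>
    public static long ScrollToEnd(
        Func<long> getHeight,   // returns the current page height in pixels
        Action<long> scrollTo,  // scrolls the page to the given y offset
        Action pause,           // waits for lazy content to load
        long stepSize,
        int maxSteps)
    {
        if (getHeight == null) throw new ArgumentNullException(nameof(getHeight));
        if (scrollTo == null) throw new ArgumentNullException(nameof(scrollTo));
        if (pause == null) throw new ArgumentNullException(nameof(pause));
        if (stepSize <= 0) throw new ArgumentOutOfRangeException(nameof(stepSize));

        long position = 0;
        for (int step = 0; step < maxSteps; step++)
        {
            long height = getHeight();
            if (position >= height) break; // bottom reached and the page stopped growing

            position = Math.Min(position + stepSize, height);
            scrollTo(position);
            pause();
        }
        return position;
    }
}

// Possible wiring to Selenium (assumes the `driver` from the snippets above):
//   IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
//   ScrollHelper.ScrollToEnd(
//       () => Convert.ToInt64(js.ExecuteScript("return document.body.scrollHeight;")),
//       y => js.ExecuteScript("window.scrollTo(0, arguments[0]);", y),
//       () => Thread.Sleep(1000),
//       800, 50);
```

The maxSteps cap is there because some pages scroll without end; the step size of 800 pixels and the cap of 50 are arbitrary values chosen for illustration.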

Parsing the retrieved HTML with HtmlAgilityPack

This is essentially the same as in the previous article.

string title = driver.Title; // page title
string html = driver.PageSource; // page HTML

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // parse the HTML string
string imgPath = "//img"; // select all img elements
// Read the image from each img tag
foreach (HtmlNode node in doc.DocumentNode.SelectNodes(imgPath))
{
    ······
}
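The src values read from img nodes are often not absolute URLs: they may be protocol-relative (//host/path) or relative to the page. One way to handle all of these cases is to resolve each value against the page URL with System.Uri. This is a minimal sketch under that assumption; ImageUrlResolver and Resolve are hypothetical names, and the example hosts in the usage below are placeholders.

```csharp
using System;

public static class ImageUrlResolver
{
    /// <summary>
    /// Resolves an img src value against the URL of the page it came from.
    /// Returns null for empty values, inline data: URIs and values that
    /// cannot be resolved.
    /// </summary>
    public static string Resolve(string pageUrl, string src)
    {
        if (string.IsNullOrWhiteSpace(src)) return null;

        src = src.Trim();
        // Inline images carry their bytes in the attribute; there is nothing to download
        if (src.StartsWith("data:", StringComparison.OrdinalIgnoreCase)) return null;

        Uri baseUri = new Uri(pageUrl);
        // Handles absolute, protocol-relative and page-relative values alike
        return Uri.TryCreate(baseUri, src, out var absolute) ? absolute.AbsoluteUri : null;
    }
}
```

For example, with the page URL https://www.example.com/video/, the value //img.example.com/a.png should resolve to https://img.example.com/a.png and /img/logo.png to https://www.example.com/img/logo.png.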

Complete code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Net;
using System.IO;
using HtmlAgilityPack;
using System.Text.RegularExpressions;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using System.Threading;

namespace WebCrawlerDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient wc = new WebClient();

            int imgNum = 0; // running image number
            string url = "https://www.bilibili.com";


            string html = FinalHtml.GetFinalHtml(url, 10);

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

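            string imgPath = "//img"; // select all img elements

            //HtmlNode nodes = hd.DocumentNode.SelectSingleNode(path);

            // Download the image referenced by each img tag.
            // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
            foreach (HtmlNode node in doc.DocumentNode.SelectNodes(imgPath) ?? Enumerable.Empty<HtmlNode>())
            {
                if (node.Attributes["src"] != null)
                {
                    string imgUrl = node.Attributes["src"].Value.ToString();
                    if (imgUrl != "" && imgUrl != " ")
                    {
                        imgNum++;

                        // Build the file name; the extension is taken from the URL.
                        // GetImgName is a static method of ImgDownloader, so it must be qualified here.
                        string fileName = ImgDownloader.GetImgName(imgUrl, imgNum);

                        //Console.WriteLine(fileName);
                        //Console.WriteLine(imgUrl);
                        ImgDownloader.DownloadImg(wc, imgUrl, "images/", fileName);
                    }
                }
            }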
            // Download background images
            string bgImgPath = "//*[@style]"; // select every node that has a style attribute
            foreach (HtmlNode node in doc.DocumentNode.SelectNodes(bgImgPath) ?? Enumerable.Empty<HtmlNode>())
            {
                if (node.Attributes["style"].Value.Contains("background-image:url"))
                {
                    imgNum++;
                    string bgImgUrl = node.Attributes["style"].Value;
                    // Read the contents of url(...) and strip any surrounding quotes
                    bgImgUrl = Regex.Match(bgImgUrl, @"(?<=\().+?(?=\))").Value.Trim('"', '\'');
                    //Console.WriteLine(bgImgUrl);
                    // Build the file name; the extension is taken from the URL
                    string fileName = ImgDownloader.GetImgName(bgImgUrl, imgNum);

                    ImgDownloader.DownloadImg(wc, bgImgUrl, "images/bgcImg/", fileName);
                }
            }
            Console.WriteLine("----------END----------");
            Console.WriteLine($"Total images found: {imgNum}");
            Console.ReadKey();
        }
    }
    /// <summary>
    /// Image downloader
    /// </summary>
    public class ImgDownloader
    {
        /// <summary>
        /// Downloads an image
        /// </summary>
        /// <param name="webClient">WebClient used for the download</param>
        /// <param name="url">Image URL</param>
        /// <param name="folderPath">Target folder path</param>
        /// <param name="fileName">Image file name</param>
        public static void DownloadImg(WebClient webClient, string url, string folderPath, string fileName)
        {
            // Create the folder if it does not exist
            if (!Directory.Exists(folderPath))
            {
                Directory.CreateDirectory(folderPath);
            }
            // Complete URLs that lack a scheme (for example //host/path) by prepending https:
            if (url.IndexOf("https:") == -1 && url.IndexOf("http:") == -1)
            {
                url = "https:" + url;
            }
            // Download the image
            try
            {
                webClient.DownloadFile(url, folderPath + fileName);
                Console.WriteLine(fileName + " downloaded");
            }
            catch (Exception ex)
            {
                Console.Write(ex.Message);
                Console.WriteLine(url);
            }
        }
        /// <summary>
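        /// Builds the image file name
        /// </summary>
        /// <param name="imageUrl">Image URL</param>
        /// <param name="imageNum">Image number</param>
        /// <returns>The image number followed by the extension taken from the URL (.jpg if there is none)</returns>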
        public static string GetImgName(string imageUrl, int imageNum)
        {
            string imgExtension;
            if (imageUrl.LastIndexOf(".") != -1)
            {
                imgExtension = imageUrl.Substring(imageUrl.LastIndexOf("."));
            }
            else
            {
                imgExtension = ".jpg";
            }
            return imageNum + imgExtension;
        }
    }
    /// <summary>
    /// Gets the HTML of a page after its JavaScript has run
    /// </summary>
    public class FinalHtml
    {
        /// <summary>
        /// Gets the page HTML after scrolling down the page
        /// </summary>
        /// <param name="url">Page URL</param>
        /// <param name="sectionNum">Number of scroll steps</param>
        /// <returns>The HTML as a string</returns>
        public static string GetFinalHtml(string url, int sectionNum)
        {
            // Run Chrome headless (no browser window)
            ChromeOptions options = new ChromeOptions();
            options.AddArgument("headless");

            // Hide the ChromeDriver console window
            ChromeDriverService driverService = ChromeDriverService.CreateDefaultService();
            driverService.HideCommandPromptWindow = true;

            ChromeDriver driver = new ChromeDriver(driverService, options);
            try
            {
                driver.Navigate().GoToUrl(url);

                string title = driver.Title;
                Console.WriteLine($"Title: {title}");
                // Scroll the page to the bottom, one section at a time
                Console.Write("Scrolling the page, please wait");

                IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
                for (int i = 1; i <= sectionNum; i++)
                {
                    string jsCode = "window.scrollTo({top: document.body.scrollHeight / " + sectionNum + " * " + i + ", behavior: \"smooth\"});";
                    js.ExecuteScript(jsCode);
                    Console.Write(".");
                    // Give the newly visible content time to load
                    Thread.Sleep(1000);
                }
                Console.WriteLine();

                return driver.PageSource;
            }
            finally
            {
                // Always shut the browser down, even if navigation or scrolling throws
                driver.Quit();
            }
        }
    }
}
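Two steps in the code above are fragile on real pages. The background-image check only matches background-image:url with no space after the colon, and GetImgName takes everything after the last dot, so a URL ending in ?x=1.5 would yield the extension .5. The sketch below shows one possible way to make both steps more tolerant. ImageNameHelper, ExtractCssUrl and BuildFileName are hypothetical names rather than part of the original program, and the five-character limit on extensions is an arbitrary assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageNameHelper
{
    // Matches url(...) with optional whitespace and optional matching quotes
    private static readonly Regex CssUrl =
        new Regex(@"url\(\s*(['""]?)(?<url>.*?)\1\s*\)", RegexOptions.IgnoreCase);

    // Accepts short alphanumeric extensions such as .png or .webp
    private static readonly Regex PlainExtension = new Regex(@"^\.[A-Za-z0-9]{1,5}$");

    /// <summary>
    /// Returns the contents of the first url(...) in a style attribute,
    /// without quotes, or null if there is none.
    /// </summary>
    public static string ExtractCssUrl(string style)
    {
        if (string.IsNullOrEmpty(style)) return null;

        Match match = CssUrl.Match(style);
        string url = match.Groups["url"].Value;
        return match.Success && url.Length > 0 ? url : null;
    }

    /// <summary>
    /// Builds "number + extension", ignoring any query string or fragment
    /// and falling back to .jpg when no plausible extension is found.
    /// </summary>
    public static string BuildFileName(string imageUrl, int imageNum)
    {
        // Drop ?query and #fragment so dots inside them are not mistaken for an extension
        int cut = imageUrl.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? imageUrl.Substring(0, cut) : imageUrl;

        // Only look at the last path segment
        string lastSegment = path.Substring(path.LastIndexOf('/') + 1);
        int dot = lastSegment.LastIndexOf('.');
        string extension = dot >= 0 ? lastSegment.Substring(dot) : "";

        return imageNum + (PlainExtension.IsMatch(extension) ? extension : ".jpg");
    }
}
```

With these helpers, a style value such as background-image: url("//img.example.com/bg.png") should give //img.example.com/bg.png, and BuildFileName("https://img.example.com/a/b.png?x=1.5", 3) should give 3.png. A URL that has no path at all (only a host name) is not handled specially here.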

Origin www.cnblogs.com/xueyubao/p/11465348.html