Teach you how to crawl Youku movie information (Part 2)

In the previous article, we implemented crawling of a single Youku page. As a brief review: using the HtmlAgilityPack library, a crawl is divided into three steps (a quick recap sketch follows the list below).

  • Crawler steps
    • Load the page
    • Parse the data
    • Save the data
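
As a quick recap, here is a minimal sketch of those three steps with HtmlAgilityPack. The URL is the category list page used later in this article, and the "save" step simply prints to the console for illustration:

        //Minimal recap of the three steps (requires: using System; using HtmlAgilityPack;)
        public static void CrawlOnePage()
        {
            //1. Load the page
            var web = new HtmlWeb();
            var doc = web.Load("http://list.youku.com/category/video/c_0.html");

            //2. Parse the data - here, the text of every category link on the page
            var links = doc.DocumentNode.SelectNodes("//*[@id='filterPanel']/div/ul/li/a");

            //3. Save the data - for illustration, simply print it to the console
            if (links != null)
                foreach (var link in links)
                    Console.WriteLine(link.InnerText);
        }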

Building on that first article, this one takes the crawler a step further. The implemented functions are mainly as follows:
1. Crawl the list of movie categories
2. Loop through each movie category and crawl its movie information page by page
3. Save the crawled data to the database

1. Crawl the list of movie categories

[Figure: Movie category page]

Open the page in Chrome, press F12, locate the element we want, and copy its XPath. The data we need are the movie category code and the movie category name.

Rule analysis:
The XPath is "//*[@id='filterPanel']/div/ul/li/a".
The category code is taken from the href attribute of the <a> tag (the segment between the last "/" and ".html"), and the category name is the InnerText of the <a> tag.

Code example

        //URL of the category list page
        private static readonly string _url = "http://list.youku.com/category/video/c_0.html";

        /// <summary>
        ///     Get all video categories
        /// </summary>
        public static List<VideoType> GetVideoTypes()
        {
            //Load the web page
            var web = new HtmlWeb();
            var doc = web.Load(_url);

            //Parse the content - get all category nodes
            var allTypes = doc.DocumentNode.SelectNodes("//*[@id='filterPanel']/div/ul/li/a").ToList();

            //Remove the [All] option from the category list
            var typeResults = allTypes.Where((u, i) => { return i > 0; }).ToList();

            var reList = new List<VideoType>();
            foreach (var node in typeResults)
            {
                var href = node.Attributes["href"].Value;
                reList.Add(new VideoType
                {
                    Code = href.Substring(href.LastIndexOf("/") + 1, href.LastIndexOf(".") - href.LastIndexOf("/") - 1),
                    Name = node.InnerText
                });
            }

            return reList;
        }
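
The VideoType model used above is not defined in this article; the actual class is in the repository linked at the end. A minimal sketch, inferred from the two properties the code assigns, could look like this:

        /// <summary>
        ///     Movie category model - a sketch inferred from usage, not the repository's actual class
        /// </summary>
        public class VideoType
        {
            //Category code, taken from the last segment of the href
            public string Code { get; set; }

            //Category name, taken from the InnerText of the <a> tag
            public string Name { get; set; }
        }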

2. Crawl the total number of pages in each category

Here code is the movie category code.
Page URL rule: http://list.youku.com/category/show/{code}.html
Crawl according to this rule (the CssSelect extension method used below comes from the ScrapySharp library, which extends HtmlAgilityPack):

        /// <summary>
        ///     Get the total number of pages for the current category
        /// </summary>
        public static int GetPageCountByCode(string code)
        {
            var web = new HtmlWeb();
            var doc = web.Load($"http://list.youku.com/category/show/{code}.html");

            //Pagination list
            var pageList = doc.DocumentNode.CssSelect(".yk-pages li").ToList();
            //Get the second-to-last item, which holds the last page number
            var lastsecond = pageList[pageList.Count - 2];
            return Convert.ToInt32(lastsecond.InnerText);
        }

3. Get the content of each movie category according to the page number

Looking at the paged URLs, the address contains both the category code and the page number, where code is the category code and pageIndex is the page number.
Page URL rule: http://list.youku.com/category/show/{code}_s_1_d_1_p_{pageIndex}.html
Crawl according to this rule:

        /// <summary>
        ///     Get the contents of the current category for the given page
        /// </summary>
        public static List<VideoContent> GetContentsByCode(string code, int pageIndex)
        {
            var web = new HtmlWeb();
            var doc = web.Load($"http://list.youku.com/category/show/{code}_s_1_d_1_p_{pageIndex}.html");

            var returnLi = new List<VideoContent>();
            var contents = doc.DocumentNode.CssSelect(".yk-col4").ToList();

            foreach (var node in contents)
                returnLi.Add(new VideoContent
                {
                    PageIndex = pageIndex.ToString(),
                    Code = code,
                    Title = node.CssSelect(".info-list .title a").FirstOrDefault()?.InnerText,
                    Hits = node.CssSelect(".info-list li").LastOrDefault()?.InnerText,
                    Href = node.CssSelect(".info-list .title a").FirstOrDefault()?.Attributes["href"].Value,
                    ImgHref = node.CssSelect(".p-thumb img").FirstOrDefault()?.Attributes["src"].Value
                });

            return returnLi;
        }
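
Likewise, the VideoContent model is not shown in this article. A minimal sketch inferred from the properties assigned above (the actual class is in the repository linked at the end):

        /// <summary>
        ///     Movie content model - a sketch inferred from usage, not the repository's actual class
        /// </summary>
        public class VideoContent
        {
            public string PageIndex { get; set; } //page number the item was crawled from
            public string Code { get; set; }      //category code
            public string Title { get; set; }     //movie title
            public string Hits { get; set; }      //hits / play count text
            public string Href { get; set; }      //link to the movie detail page
            public string ImgHref { get; set; }   //link to the cover image
        }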

4. Results of the test crawl


        /// <summary>
        ///     Print the crawled contents
        /// </summary>
        public static void PrintContent()
        {
            var count = 0;
            foreach (var node in GetVideoTypes())
            {
                var resultLi = new List<VideoContent>();
                //Get the total number of pages for the current category
                var pageCount = GetPageCountByCode(node.Code);
                //Loop through the pages and collect the contents
                for (var i = 1; i <= pageCount; i++) resultLi.AddRange(GetContentsByCode(node.Code, i));
                Console.WriteLine($"编码{node.Code} \t 页数{pageCount} \t 总个数{resultLi.Count}");
                count += resultLi.Count;
            }

            Console.WriteLine($"总个数为{count}");
        }
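
The third function listed at the top, saving the crawled data to the database, is implemented in the repository linked below. As one possible approach, here is a minimal sketch using ADO.NET; the table name VideoContents, its columns, and the SaveContents helper are assumptions for illustration, not the article's actual code:

        /// <summary>
        ///     Save the crawled contents - sketch only; assumes a SQL Server table named VideoContents
        /// </summary>
        public static void SaveContents(List<VideoContent> contents, string connectionString)
        {
            using (var conn = new System.Data.SqlClient.SqlConnection(connectionString))
            {
                conn.Open();
                foreach (var item in contents)
                {
                    using (var cmd = new System.Data.SqlClient.SqlCommand(
                        "INSERT INTO VideoContents (Code, PageIndex, Title, Hits, Href, ImgHref) " +
                        "VALUES (@Code, @PageIndex, @Title, @Hits, @Href, @ImgHref)", conn))
                    {
                        //Write NULL for any missing field
                        cmd.Parameters.AddWithValue("@Code", (object)item.Code ?? DBNull.Value);
                        cmd.Parameters.AddWithValue("@PageIndex", (object)item.PageIndex ?? DBNull.Value);
                        cmd.Parameters.AddWithValue("@Title", (object)item.Title ?? DBNull.Value);
                        cmd.Parameters.AddWithValue("@Hits", (object)item.Hits ?? DBNull.Value);
                        cmd.Parameters.AddWithValue("@Href", (object)item.Href ?? DBNull.Value);
                        cmd.Parameters.AddWithValue("@ImgHref", (object)item.ImgHref ?? DBNull.Value);
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }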

Code download address:

https://github.com/happlyfox/FoxCrawler/tree/master/%E5%AD%A6%E4%B9%A0%E7%A4%BA%E4%BE%8B/YouKuCrawler/YouKuCrawlerAsync
