[05] Web Scraper tutorial (with video walkthrough): crawling "58 city" (58.com) data

Copyright notice: if you reproduce this article, please credit the source "Wuhan AI Algorithm Study" and link to https://blog.csdn.net/qq_36931982/article/details/91414500

"Web Scraper Web Crawler course"  is my browser plug-in to Google Web Scraper tool for the reptiles, the combination of theory and practical tutorials.

If you have a data-crawling need, feel free to contact me through the WeChat public account; I can help crawl the data free of charge.

For more of my study notes, follow the WeChat public account "Wuhan AI Algorithm Study"; this tutorial series also displays better when read there!

58.com ("58 city") is a classified-information platform covering real estate, recruitment, and other mass classified listings. Data on such platforms is reasonably authentic, which makes it especially useful for big-data analysis, and in practice (for example, when writing papers) one often needs to collect data from platforms of this kind.

This article uses data from the 58.com platform as an example to demonstrate crawling with Web Scraper.

 

"demand"

1. Crawl the platform's rental-housing data, including key fields such as the listing title, lease type, apartment layout, residential compound, and rent.

2. Store the final crawled data in Excel (see the export sketch below).
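
Web Scraper exports scraped results as a CSV file. To produce a real Excel workbook from that export, a minimal Python sketch using pandas could look like this (both file names are hypothetical examples):

import pandas as pd

# Read the CSV exported by Web Scraper (hypothetical file name).
df = pd.read_csv("58_rentals.csv")

# Write the data to an .xlsx workbook; pandas delegates Excel writing
# to an engine such as openpyxl, which must be installed separately.
df.to_excel("58_rentals.xlsx", index=False)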

 

"demand analysis"

1. For requirement 1, crawling the platform's rental data: analysis of the site shows that rental listings are displayed page by page; clicking the pager loads the next page, and each individual listing links to a separate detail page with additional information.

 

"Web Scraper crawling operation."

Step 1: Handle the paged display

Because 58.com spreads its data across many pages, all of the data is unfolded by repeatedly clicking "Next page" (implemented by the Element Click selector, id "click", in the sitemap in step 3).

 

Step 2: Click through to the detail page

The list page only shows rough information about each house; clicking a listing opens its detail page, where the full details can be viewed (implemented by the Link selector, id "link", and the Text selectors nested under it).

 

Step 3: The complete implementation of "next page" plus "enter the detail page"

Sitemap code demo:
{"_id":"wuba","startUrl":["https://wh.58.com/chuzu/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d3090a7-0009-ed19-d8a2-26ba87ad5704&ClickID=2"],"selectors":[{"id":"click","type":"SelectorElementClick","selector":"li.house-cell","parentSelectors":["_root"],"multiple":true,"delay":"1000","clickElementSelector":"a.next span","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"link","type":"SelectorLink","selector":"a.strongbox","parentSelectors":["click"],"multiple":false,"delay":0},{"id":"aaa","type":"SelectorText","selector":"h1.c_333","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"bbbb","type":"SelectorText","selector":"p.house-update-info em","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"ccc","type":"SelectorText","selector":"div.house-desc-item li:nth-of-type(1) span:nth-of-type(2)","parentSelectors":["link"],"multiple":false,"regex":"","delay":0}]}
 

Why use Unique Text for the "Click element uniqueness" option?

Because throughout the crawl of 58.com, the text of the "next page" button never changes as we click through the pages, Unique Text works here; the other options could be used as well.

Why do many cells contain null when the crawler is paused midway?

As the crawl proceeds, these null cells are gradually filled in. Web Scraper simulates human page-clicking as closely as it can, and a human often clicks somewhat randomly rather than strictly row by row; this behavior is probably also meant to evade the site's server-side anti-crawler mechanisms.
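
If some null cells are still left after the crawl has completely finished (for example, when a detail page failed to load), they can be inspected and removed after export; a minimal pandas sketch, again with hypothetical file names:

import pandas as pd

# Load the exported data (hypothetical file name).
df = pd.read_csv("58_rentals.csv")

# Count the null cells per column to see which fields are incomplete.
print(df.isnull().sum())

# Drop the rows that are missing any field and save the cleaned result.
df.dropna().to_excel("58_rentals_clean.xlsx", index=False)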

 

 
