[04] Web Scraper tutorial: understanding the selectors of the Web Scraper plug-in

Copyright notice: if you reproduce this article, please indicate the source and credit "Wuhan AI algorithms study": https://blog.csdn.net/qq_36931982/article/details/91414349

"Web Scraper Web Crawler course" is my tutorial series on crawling with Web Scraper, a plug-in for the Google Chrome browser. It combines theory with hands-on practice.

If you need data crawled, you are welcome to contact me through my public account; I can help crawl the data for free.

For more of my study notes, you are welcome to follow the "Wuhan AI algorithms study" public account. This tutorial series displays better when read there.

In "Tutorial 03" we did a first crawl of data from a P2P site and learned that the selector is one of the most important concepts in Web Scraper: crawling page data with Web Scraper means creating selectors at different levels.

 

「Selectors」

Web Scraper includes many selectors, each suited to a different kind of selection. They fall into three main types:

Data extraction selectors

  1. Text selector
  2. Link selector
  3. Link popup selector
  4. Image selector
  5. Table selector
  6. Element attribute selector
  7. HTML selector
  8. Grouped selector

Link selectors

  1. Link selector
  2. Link popup selector

Element selectors

  1. Element selector
  2. Element scroll down selector
  3. Element click selector

 

「Text selectors」

The Text selector is used to select text. It extracts the text of the selected element and of its child elements. HTML tags are stripped and only the text is returned.

For example, suppose you are crawling a news site that shows one article per page. Each page contains the article, its title, its publication date and its author. A Link selector opens each article page in turn, and Text selectors then extract the title, date, author and article body. Leave the Multiple option unchecked for these Text selectors, because each page yields only a single record of title, date, author and article.
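Web Scraper is configured in the browser, so the Python below is not how the plug-in is implemented. It is only a minimal sketch of what "strip the HTML tags and return the text" means, using the standard library's HTMLParser. The markup in title_html and the names TextExtractor and extract_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags, including text
        # inside child elements such as <em>.
        self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result reads like plain text.
    return " ".join("".join(parser.parts).split())


# Hypothetical article title markup, made up for this illustration.
title_html = '<h1 class="title">Web <em>Scraper</em> tutorial</h1>'
print(extract_text(title_html))
```

The text inside the nested em tag is kept while the tags themselves disappear, which is the behaviour described for the Text selector above.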

 

「Link selectors」

The Link selector is used mainly to select links and jump to the pages they point to. If a Link selector has child selectors, Web Scraper automatically follows each selected URL and applies the child selectors on the page it leads to. If it has no child selectors, it only extracts the link and its address.
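To make "extracting the link address" concrete, here is a rough Python sketch, assuming a made-up page URL and made-up markup. It collects the href of every a tag and resolves it against the page URL, which is the address a crawler would need before it could jump to the linked page. LinkCollector is a hypothetical helper, not part of Web Scraper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the <a href="..."> tags it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative addresses are resolved against the page URL.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical page URL and markup, made up for this illustration.
page_url = "https://example.com/news/"
page_html = '<a href="article-1.html">One</a> <a href="/about">About</a>'

collector = LinkCollector(page_url)
collector.feed(page_html)
print(collector.links)
```

A Link selector without child selectors stops at this point. With child selectors, each collected URL would be opened and the child selectors applied to that page.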

 

「Link popup selectors」

The Link popup selector is similar to the Link selector, except that it is used when clicking a link opens a new window.

 

「Element selectors」

The Element selector selects elements that each contain several pieces of data. For example, it can be used to select the items in a product list on an e-commerce site. The selector returns each selected element as the parent element for its child selectors, and those child selectors extract data only from inside that parent element.

An Element selector needs child selectors to be useful, and whatever the child selectors pick must lie inside the element chosen by the Element selector.
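The parent and child relationship can be sketched in Python as follows. This is only an illustration of the scoping idea, assuming a small, well-formed product list that I made up. Each li plays the role of the element picked by the Element selector, and the two lookups inside the loop play the role of its child selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product list, made up for this illustration.
listing = """
<ul>
  <li class="item"><span class="name">Kettle</span><span class="price">19.90</span></li>
  <li class="item"><span class="name">Toaster</span><span class="price">24.50</span></li>
</ul>
""".strip()

root = ET.fromstring(listing)

records = []
# "Element selector": every matching <li> becomes one parent element.
for item in root.findall(".//li[@class='item']"):
    # "Child selectors": they search only inside the current parent,
    # so a name is never paired with the price of another item.
    records.append({
        "name": item.find("span[@class='name']").text,
        "price": item.find("span[@class='price']").text,
    })

print(records)
```

Each parent element produces exactly one record, which is why list pages are usually crawled with an Element selector at the top and Text selectors beneath it.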

 

「Element click selectors」

The Element click selector selects elements that appear after a click. It is mainly for pages where many elements are loaded only after something is clicked, such as the common "Click to load more" and page-number buttons. Web Scraper has to click these so that the page loads the new data, and only then crawls it.

 

「Element scroll selectors」

The Element scroll down selector is mainly used on pages that load more content as you scroll. Many microblog pages work this way: scrolling the mouse wheel loads more posts.

 

「Grouped selectors」

The Grouped selector groups text data from multiple elements into a single record. The extracted data is stored as JSON; in effect, the tool stitches several elements together into one value.
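As a rough illustration of "several elements stitched into one JSON value", the Python sketch below groups the items of a reference list into a single field of one record. The markup and the field names are made up for this illustration, and the exact JSON layout that Web Scraper writes may differ from the plain list used here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical article markup, made up for this illustration.
article = """
<div>
  <h1>Article title</h1>
  <ul class="references">
    <li>Reference A</li>
    <li>Reference B</li>
    <li>Reference C</li>
  </ul>
</div>
""".strip()

root = ET.fromstring(article)

# Gather the text of every matching element...
references = [li.text for li in root.findall(".//ul[@class='references']/li")]

# ...and store them together, as JSON, inside a single record.
record = {
    "title": root.find("h1").text,
    "references": json.dumps(references),
}
print(record)
```

The page still yields one record, even though three elements were selected.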

 

「Html selectors」

The HTML selector extracts the HTML and the text inside the selected element. Only the element's inner HTML is extracted.

 

「Element attribute selectors」

The Element attribute selector extracts the value of an attribute from an HTML element. For example, you can use it to extract the title attribute from this link:

<a href="#" title="my title">
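The same extraction can be sketched in Python with the standard library's HTMLParser. AttributeExtractor is a hypothetical helper written for this illustration, and the link is closed with some text and an end tag so that the fragment is complete.

```python
from html.parser import HTMLParser


class AttributeExtractor(HTMLParser):
    """Collects the value of one attribute from every matching tag."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            value = dict(attrs).get(self.attribute)
            if value is not None:
                self.values.append(value)


extractor = AttributeExtractor("a", "title")
extractor.feed('<a href="#" title="my title">link</a>')
print(extractor.values)
```

Passing "href" instead of "title" would return the link address, and an img tag with "src" would return an image URL.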

 

「Table selectors」

Many pages present their data in a table made up of a header and content rows. In that case the Table selector gives you simple batch extraction. While setting it up, you need to specify the header row and the data rows.
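The role of the header row and the data rows can be sketched in Python like this. It is a minimal illustration, assuming a small, well-formed table with made-up values: the first row supplies the column names, and every later row becomes one record.

```python
import xml.etree.ElementTree as ET

# Hypothetical table with made-up values, for illustration only.
table_html = """
<table>
  <tr><th>Name</th><th>Rate</th></tr>
  <tr><td>Platform A</td><td>8.5%</td></tr>
  <tr><td>Platform B</td><td>9.1%</td></tr>
</table>
""".strip()

table = ET.fromstring(table_html)
rows = table.findall("tr")

# Header row: its cells become the column names.
header = [cell.text for cell in rows[0].findall("th")]

# Data rows: each one is paired with the header to form a record.
table_records = [
    dict(zip(header, (cell.text for cell in row.findall("td"))))
    for row in rows[1:]
]
print(table_records)
```

If the wrong row is chosen as the header, every record ends up with the wrong keys, which is why the two kinds of rows are specified separately.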

 

「Image selectors」

The Image selector extracts the src attribute (the URL) of an image.

 

「Sitemap.xml selectors」

The Sitemap.xml link selector extracts URLs from the sitemap.xml file that a site publishes. A sitemap.xml exists mainly so that search engine crawlers can find the site's pages more easily. In most cases it contains the URLs of all the relevant pages on the site.
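To show what such a file looks like and how the URLs come out of it, here is a short Python sketch using the standard library. The two URLs are made up for this illustration. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, made up for this illustration.
sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/articles/1</loc></url>
</urlset>
""".strip()

sitemap_root = ET.fromstring(sitemap_xml)

# Every <url> entry carries its page address in a <loc> element.
urls = [loc.text for loc in sitemap_root.findall("sm:url/sm:loc", SITEMAP_NS)]
print(urls)
```

Each extracted URL can then be treated like a link to follow, so the crawl starts from the sitemap instead of from list pages.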

 
