Python crawler tutorial: crawling multiple types of pages, with examples

This article walks through an example of how a Python crawler can handle several types of content pages. Readers who need this can use it as a reference.
Unlike crawling a predefined set of pages, crawling every internal link on a site brings a challenge: you never know exactly what you will get. Fortunately, there are several basic ways to identify the page type.
By URL

All blog articles on a website may share a URL pattern (for example, http://example.com/blog/title-of-post).

By the presence or absence of particular fields on the page

If a page contains a date but no author name, you can classify it as a press release. If it has a title, a main image, and a price but no main content, it is probably a product page.

By the presence of a particular tag on the page

Even if you are not collecting the data inside a tag, you can still use the tag itself. Your crawler can look for an element that appears on only one kind of page and use it to identify that page type.
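The three heuristics above can be combined into one small classifier. The following is only a sketch using the standard library: the `/blog/` path prefix, the `related-products` id, and the `date`, `author`, `price`, and `body` class names are all assumptions made for illustration, and a real site will need its own markers.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class MarkerCollector(HTMLParser):
    """Records the id and class values of every element in a page."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if value is None:
                continue
            if key == "id":
                self.ids.add(value)
            elif key == "class":
                self.classes.update(value.split())


def classify_page(url, html):
    """Guess a page type from the URL, then a marker tag, then the fields."""
    # 1. By URL: assume blog posts live under /blog/ (hypothetical layout).
    if urlparse(url).path.startswith("/blog/"):
        return "article"

    collector = MarkerCollector()
    collector.feed(html)

    # 2. By a marker tag: a hypothetical id that only product pages carry.
    if "related-products" in collector.ids:
        return "product"

    # 3. By presence or absence of fields (hypothetical class names).
    has_date = "date" in collector.classes
    has_author = "author" in collector.classes
    has_price = "price" in collector.classes
    has_body = "body" in collector.classes
    if has_date and not has_author:
        return "press_release"
    if has_price and not has_body:
        return "product"
    return "unknown"


print(classify_page("http://example.com/blog/title-of-post", ""))
```

The checks run from cheapest to most expensive: the URL test needs no parsing at all, so the HTML is only parsed when the URL is inconclusive.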

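To keep track of several page types, you need several kinds of page objects in Python. If the pages are all similar (they contain basically the same kinds of content), you can add a pageType attribute to the existing page object:

class Website:
    """Common base class for all articles/pages"""

    def __init__(self, pageType, name, url, searchUrl, resultListing,
                 resultUrl, absoluteUrl, titleTag, bodyTag):
        # The search-related arguments are accepted but not stored
        # in this simplified example.
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag
        self.pageType = pageType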

If you store these pages in an SQL-like database, this pattern means that all of the pages would probably be stored in the same table, with an extra pageType column added.
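As a rough illustration of the single-table layout, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name `pages`, its columns other than pageType, and the sample rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        name TEXT,
        url TEXT,
        title TEXT,
        body TEXT,
        pageType TEXT
    )
    """
)

# Hypothetical sample rows: both page types share one table.
rows = [
    ("Example Store", "http://example.com/blog/title-of-post",
     "A post", "Post text", "article"),
    ("Example Store", "http://example.com/products/1",
     "A product", "", "product"),
]
conn.executemany(
    "INSERT INTO pages (name, url, title, body, pageType) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# The pageType column is what lets you pull out one kind of page.
article_urls = [
    url
    for (url,) in conn.execute(
        "SELECT url FROM pages WHERE pageType = ?", ("article",)
    )
]
print(article_urls)
```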
If the pages or content you are scraping differ enough from one another (they contain different types of fields), you may need to create a new object type for each page type. Of course, some things are common to all pages: they all have a URL, and they will probably have a name or page title. This is an ideal case for subclasses:

class Website:
    """Common base class for all articles/pages"""

    def __init__(self, name, url, titleTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag

This is not an object that your crawler uses directly, but one that your page-type objects refer to:

class Product(Website):
    """Information to scrape from a product page"""

    def __init__(self, name, url, titleTag, productNumberTag, priceTag):
        # Let the base class store the fields common to all pages.
        Website.__init__(self, name, url, titleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag


class Article(Website):
    """Information to scrape from an article page"""

    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        Website.__init__(self, name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag

The Product class extends the Website base class and adds the productNumberTag and priceTag attributes, which apply only to products. The Article class adds the bodyTag and dateTag attributes, which do not apply to products.
You can use these two classes to scrape a store website that contains blog posts or press releases in addition to products.
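One way a crawler might consume these page types is to ask each object which fields it should extract. The sketch below repeats compact versions of the classes so that it runs on its own. The `selectors_for` helper, the selector strings such as `span.sku`, and the product URL are hypothetical and only illustrate the idea.

```python
class Website:
    """Common base class for all page types."""

    def __init__(self, name, url, titleTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag


class Product(Website):
    def __init__(self, name, url, titleTag, productNumberTag, priceTag):
        super().__init__(name, url, titleTag)
        self.productNumberTag = productNumberTag
        self.priceTag = priceTag


class Article(Website):
    def __init__(self, name, url, titleTag, bodyTag, dateTag):
        super().__init__(name, url, titleTag)
        self.bodyTag = bodyTag
        self.dateTag = dateTag


def selectors_for(page):
    """Return a field-name -> selector mapping for the given page object."""
    # Every page type has a title.
    selectors = {"title": page.titleTag}
    if isinstance(page, Product):
        selectors["productNumber"] = page.productNumberTag
        selectors["price"] = page.priceTag
    elif isinstance(page, Article):
        selectors["body"] = page.bodyTag
        selectors["date"] = page.dateTag
    return selectors


# Hypothetical pages from one store site that also hosts a blog.
pages = [
    Product("Example Store", "http://example.com/products/1",
            "h1", "span.sku", "span.price"),
    Article("Example Store", "http://example.com/blog/title-of-post",
            "h1", "div.post-body", "time"),
]
for page in pages:
    print(type(page).__name__, selectors_for(page))
```

Because both classes inherit from Website, code that needs only the shared fields (name, url, titleTag) can treat every page the same way and branch only where the fields differ.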

Origin blog.csdn.net/haoxun09/article/details/104741566