A beginner's summary of web crawlers

1. Basic steps of crawling web pages

1.1 Determining the data to crawl

The role of a crawler is to extract the data you need from a massive number of web pages, so before anything else you have to determine what data you need. Take second-hand house prices as an example: if we want to run a regression analysis of second-hand house prices in Changsha, we need to find the factors associated with the price: the housing type (how many rooms and halls, whether it faces the sun, whether there is a balcony, whether it is newly renovated), the location, the price (yuan per square meter), the age of the property, and so on.

1.2 Determining the pages to crawl

A crawler is fast, but it cannot search the entire Internet, so to get as much of the needed data as possible you have to determine which sites that data appears on. Staying with the second-hand house price example: once we know what data we are looking for, we search for second-hand house prices on Baidu or Google and find a few sites that carry the data, such as Lianjia and other listing sites; our goal is then to crawl these commercial sites.

1.3 Analyzing the page

You should first analyze whether the page can be crawled at all (does it have anti-crawling measures?). Write a simple script that crawls the page automatically, with a random user-agent set at the start; the code does not need to be detailed at this stage, e.g. it can crawl only the title. If the site's anti-crawling is very aggressive (your IP gets banned after a few visits, or it requires login or a verification code), you can use cookies to keep a logged-in account so you do not have to re-enter the password, and the verification code can be cracked by simulation or by a paid auto-recognition service; in general, though, this still will not stop the site from banning your IP, and the best solution is to use proxies.

If you can crawl about a thousand pages without your IP being banned, the site's data can be crawled. The next step is to find the useful data in the page, but the data visible on the page is not necessarily written in the HTML: it may come from JSON or a database, so you have to look at the page source. If what "Inspect" shows differs from the raw source because some of it is rendered later, you must look at the HTML from "View page source" instead. Find which part of the page holds the useful data; pages of the same kind generally share the same structure, so you only need to choose a good root URL and traverse the pages starting from it.
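
As a rough sketch of this probing step (my own minimal example; the URL and the user-agent list are placeholders), something like the following is enough to see whether a page responds and whether the title is already in the raw HTML:

import random
import re
import requests

# a couple of desktop user-agents to rotate through (placeholders)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
]

def probe(url):
    """Fetch one page with a random user-agent and report whether it looks crawlable."""
    headers = {'user-agent': random.choice(USER_AGENTS)}
    try:
        response = requests.get(url, headers=headers, timeout=5)
    except requests.RequestException as e:
        print("request failed:", e)
        return
    print("status code:", response.status_code)
    # only extract the title -- enough to tell whether the data sits in the raw HTML
    # or is rendered later from JSON / JavaScript
    title = re.findall(r'<title>(.*?)</title>', response.text, re.S)
    print("title:", title[0].strip() if title else "(not found in raw HTML)")

probe('https://example.com/')  # replace with the page you want to test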

1.4 Crawling the pages

There are two ways for a crawler to traverse pages, which can be described as "breadth-first" and "depth-first". Starting from the root, the crawler is walking the web's page graph, and the traversal order determines the quality of the pages you get: if you set the number of pages to crawl to 1000, depth-first search may collect far too many pages on one topic and too few or none on others, while breadth-first search distributes the pages evenly across topics, so it is recommended to write the crawler breadth-first (a breadth-first crawler is sketched in section 2 below).
During the crawl a lot of junk pages have to be filtered out, or the crawl can be restricted to a certain type of page; this is generally done by fixing the allowed domain and filtering URLs with a regular expression.
After obtaining a URL, the page it points to has to be parsed. There are many parsing methods, such as regular expressions, BeautifulSoup and XPath, and they all serve the same purpose: regular expressions are the most basic and must be mastered, while the latter two parse the HTML/XML tree, and learning either one is enough. When parsing with BeautifulSoup or XPath, pay attention to robustness and use fewer hard-coded positional indices: a hard-coded index is effectively written in stone, and if an advertisement block is inserted in the middle of the page, the index shifts and the data can no longer be reached, as the sketch below shows.
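
For illustration, here is a small sketch (the HTML fragment is made up; it needs beautifulsoup4 and lxml installed) that extracts the same title with a regular expression, with BeautifulSoup and with XPath, matching on class attributes rather than on positions:

import re
from bs4 import BeautifulSoup   # pip install beautifulsoup4
from lxml import etree          # pip install lxml

# a made-up fragment standing in for a listing page
html = '''
<div class="ad">advertisement</div>
<div class="house">
  <h2 class="title">Some residential compound</h2>
  <span class="unitPrice">12345 yuan/m2</span>
</div>
'''

# 1) regular expression: quick, but tied to the exact markup
print(re.findall(r'<h2 class="title">(.*?)</h2>', html))

# 2) BeautifulSoup: select by class, not by position, so an inserted
#    advertisement block does not break the extraction
soup = BeautifulSoup(html, 'lxml')
print(soup.find('h2', class_='title').get_text())

# 3) XPath: same idea -- match on attributes rather than an index like div[2]
tree = etree.HTML(html)
print(tree.xpath("//h2[@class='title']/text()"))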

1.5 Saving the data

The crawled data has to be saved. In a database? As JSON? XLSX? CSV? Should the data be split into one folder per type? Should the files be written in binary mode? Which encoding? All of these need to be considered.
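
As one possible choice (an illustration only, with made-up rows), writing the records to a CSV file with an explicit encoding looks like this:

import csv

# hypothetical rows: (name, position, type, price) for a few listings
rows = [
    ('Compound A', 'Yuhua', '3 rooms 2 halls', '11000 yuan/m2'),
    ('Compound B', 'Tianxin', '2 rooms 1 hall', '9500 yuan/m2'),
]

# newline='' avoids blank lines on Windows; utf-8-sig keeps Excel happy with non-ASCII text
with open('second_hand_prices.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(('name', 'position', 'type', 'price'))  # header row
    writer.writerows(rows)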

2. An ordinary crawler (without a framework)

1. requests

import requests

# send the HTTP request; headers1 is the header dict defined in the next item
response = requests.get(url, timeout=1, headers=headers1)
2. Headers

headers1 = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'
}
3. Regular expressions

import re

response.encoding = 'utf-8'
html = response.text
# parse the page: collect every href and src link
urls = re.findall('href="(.*?)"', html)
urls += re.findall('src="(.*?)"', html)
4. Storage

# one text file per crawled page, numbered by cnt_page
dir_name = "blog_url/asdf" + str(cnt_page) + ".txt"
with open(dir_name, 'w', encoding='utf-8') as f:
    # first write the page's own url, then its filtered outgoing links
    f.write(url + " ")
    for url in urls:
        str1 = url.split('.')[-1]
        # keep only links that stay on blog.sina and look like pages
        if "blog.sina" in url and (('/' in str1) or 'html' in str1 or "cn" in str1):
            f.write(url + " ")
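
One caveat (my addition, not part of the original snippet): open() fails if the blog_url folder does not exist yet, so it is worth creating it once before the crawl:

import os

os.makedirs("blog_url", exist_ok=True)  # create the output folder if it is missing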

5. Queue && breadth-first de-duplication
from queue import Queue

cnt_page = 0          # number of pages crawled so far
is_history = set()    # URLs that have already been taken off the queue
url_queue = Queue()   # BFS frontier

def spide_xx(url):
    """Breadth-first crawl starting from a root page; pages are saved under dir_name."""
    global cnt_page

    url_queue.put(url)
    while not url_queue.empty() and cnt_page < 10000:
        try:
            cnt_page = cnt_page + 1
            cur_url = url_queue.get()
            is_history.add(cur_url)
            get_url_txt(cur_url)          # save the page text (helper defined elsewhere)
            urls = get_url_src(cur_url)   # extract the links on the page (helper defined elsewhere)

            for url in urls:
                suffix = url.split('.')[-1]
                # only follow blog.sina links that look like pages, and skip visited ones
                if "blog.sina" in url and (('/' in suffix) or 'html' in suffix or "cn" in suffix):
                    if url not in is_history:
                        print(url)
                        url_queue.put(url)
        except Exception:
            print("request timed out!")
    return
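
A minimal usage sketch, assuming the helper functions above (get_url_txt, get_url_src) are defined; the root URL is only an example:

spide_xx("http://blog.sina.com.cn/")  # breadth-first crawl starting from this root page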

3. The Scrapy framework

Installation and the like are not covered here; Scrapy is assumed to be installed already. Taking the Lianjia site as an example, the following crawls the basic information of second-hand housing in Changsha and introduces basic Scrapy usage.
Create a Scrapy project as the crawler's skeleton, then generate a CrawlSpider inside it:

scrapy startproject xxxx
cd xxxx
scrapy genspider -t crawl xxx domain

Then write a driver script main.py:

from scrapy import cmdline
cmdline.execute('scrapy crawl housing_price_crawl'.split())

scrapy crawl xxx starts the spider, where "xxx" is the spider's name; here it is housing_price_crawl (defined in housing_price_crawl.py).

(Screenshot of the generated project files omitted.)
Briefly, what each generated .py file does:
housing_price_crawl.py parses the pages and holds the crawl rules, such as where the crawl starts.
items.py packages the crawled data into a class so it can be passed around as objects, so the item fields are defined here in advance.
middlewares.py is where you define your own middleware, mainly used against anti-crawling measures, e.g. setting a random user-agent or a random proxy IP.
pipelines.py is the pipeline that stores the data after crawling; the storage rules for the item objects are defined here.
settings.py configures the Scrapy framework: whether to obey robots.txt, which middleware and pipelines to enable, the user-agent, how many pages to crawl before stopping, depth-first or breadth-first, and so on.
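
For the last two points, a sketch of the corresponding settings.py entries (based on Scrapy's documented options, not taken from this project) could look like:

# settings.py (excerpt)
CLOSESPIDER_PAGECOUNT = 1000   # stop the spider after roughly this many responses

# Scrapy crawls depth-first by default; switching the scheduler to FIFO queues
# gives a breadth-first crawl
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'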

Now write housing_price_crawl.py. Because the next-page URL on Lianjia is wrapped inside a div (stored as attributes rather than plain links), an accurate URL cannot be obtained directly, so LinkExtractor cannot pick up all the listing-page URLs of the current page; the spider therefore has to parse the next-page URLs in its own code. From each listing page it extracts the compound name, the location, the layout (how many rooms and halls), and the unit price. It is recommended to check the XPath or regular expressions in advance with scrapy shell.
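
For example, a quick check in the shell (using the spider's start URL and one of the XPaths from page_request below; output omitted):

scrapy shell 'https://cs.lianjia.com/ershoufang/yuhua/rs%E9%95%BF%E6%B2%99/'
>>> response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").get()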

# items.py
import scrapy

class ChangshahousingpriceItem(scrapy.Item):
    name = scrapy.Field()       # compound name
    position = scrapy.Field()   # location (district / area)
    type = scrapy.Field()       # layout, e.g. how many rooms and halls
    price = scrapy.Field()      # unit price (yuan per square meter)

# -*- coding: utf-8 -*-
# housing_price_crawl.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ChangShaHousingPrice.items import ChangshahousingpriceItem
import json
from scrapy.loader import ItemLoader

class HousingPriceCrawlSpider(CrawlSpider):
    name = 'housing_price_crawl'
    allowed_domains = ['cs.lianjia.com']
    start_urls = ['https://cs.lianjia.com/ershoufang/yuhua/rs%E9%95%BF%E6%B2%99/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/ershoufang/.*/rs长沙/'), callback='page_request', follow=True),
        #Rule(LinkExtractor(allow=r'https://cs.lianjia.com/ershoufang/\d+.html'), callback='parse_item', follow=False),
    )

    def page_request(self, response):
        # The pagination widget stores its URL template and the total page count
        # in the div's page-url / page-data attributes instead of plain <a> links.
        root_path = response.xpath("//div[@class='page-box house-lst-page-box']/@page-url").get()
        max_page = response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").get()
        if max_page is not None:
            max_page = json.loads(max_page)["totalPage"]
            root_path += '/'
            # fill the {page} placeholder to build every listing-index page URL
            for i in range(1, max_page + 1):
                path = root_path.replace('{page}', str(i))
                path = 'https://cs.lianjia.com' + path
                print(path)
                yield scrapy.Request(path, callback=self.page_info)

    def page_info(self, response):
        # on each listing-index page the individual listings are plain links,
        # so LinkExtractor works here
        link = LinkExtractor(allow=r'https://cs.lianjia.com/ershoufang/\d+.html')
        urls = link.extract_links(response)
        for url in urls:
            yield scrapy.Request(url.url, callback=self.parse_item)

    def parse_item(self, response):
        # collect the fields of one listing with an ItemLoader
        l = ItemLoader(item=ChangshahousingpriceItem(), response=response)
        l.add_xpath('name', "//div[@class='communityName']/a[@class='info ']/text()")
        l.add_value('position', " ".join(response.xpath("//div[@class='areaName']/span[@class='info']/a[@target='_blank']/text()").getall()))
        l.add_value('type', response.xpath("//div[@class='mainInfo']/text()").get())
        l.add_value('price', response.xpath("//span[@class='unitPriceValue']/text()").get() + response.xpath("//span[@class='unitPriceValue']/i/text()").get())
        # item=ChangshahousingpriceItem()
        # item['name']=response.xpath("//div[@class='communityName']/a[@class='info ']/text()").get()
        # item['position']=" ".join(response.xpath("//div[@class='areaName']/span[@class='info']/a[@target='_blank']/text()").getall())
        # item['type']=response.xpath("//div[@class='mainInfo']/text()").get()
        # item['price']=response.xpath("//span[@class='unitPriceValue']/text()").get()+response.xpath("//span[@class='unitPriceValue']/i/text()").get()
        return l.load_item()

Set a random user-agent in the downloader middleware (middlewares.py):

import random
from scrapy import signals


class ChangshahousingpriceDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2816.400',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
        'Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.8.1.11) Gecko/20080118 Firefox/2.0.0.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Deepnet Explorer 1.5.3; Smart 2x2; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.1)',
        'ELinks/0.9.3 (textmode; Linux 2.6.9-kanotix-8 i686; 127x41)',
        'Mozilla/5.0 (X11; U; Linux x86_64; it-it) AppleWebKit/534.26+ (KHTML, like Gecko) Ubuntu/11.04 Epiphany/2.30.6',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13',
        'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_5_8) AppleWebKit/537.3+ (KHTML, like Gecko) iCab/5.0 Safari/533.16',
        'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.13) Gecko/20100916 Iceape/2.0.8',
        'Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20121201 icecat/17.0.1',
        'Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20121202 Firefox/17.0 Iceweasel/17.0.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.21pre) Gecko K-Meleon/1.7.0',
        ]
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # pick a random user-agent for every outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Store the crawled items in a CSV file (pipelines.py):

import csv
import os

class ChangshahousingpricePipeline(object):
    def __init__(self):
        # location of the csv file; it does not need to exist in advance
        store_file = os.path.dirname(__file__) + '/spiders/长沙二手房价.csv'
        # open (create) the file
        self.file = open(store_file, 'w+', newline="", encoding='utf-8')
        # csv writer
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # write the fields of each item as one csv row
        self.writer.writerow((item['name'], item['position'], item['type'], item['price']))
        return item

    def close_spider(self, spider):
        self.file.close()

Configure settings.py:

BOT_NAME = 'ChangShaHousingPrice'

SPIDER_MODULES = ['ChangShaHousingPrice.spiders']
NEWSPIDER_MODULE = 'ChangShaHousingPrice.spiders'
ROBOTSTXT_OBEY = False   # do not obey robots.txt

# enable the random user-agent middleware and the CSV pipeline defined above
DOWNLOADER_MIDDLEWARES = {
   'ChangShaHousingPrice.middlewares.ChangshahousingpriceDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
   'ChangShaHousingPrice.pipelines.ChangshahousingpricePipeline': 300,
}

Result: (screenshot of the resulting CSV file omitted)
