1.1 The basic principle of crawlers

One What is a crawler

# 1, What is the Internet? 
    The Internet is a set of computers connected to one another by network devices (cables, routers, switches, firewalls, etc.), forming something like a net. 

# 2, What was the Internet built for? 
    The core value of the Internet is sharing/transferring data: data is stored on computers, and computers are connected to the Internet so that they can conveniently share/transfer data with each other; otherwise you would have to carry a USB drive around to copy data from someone else's computer. 

# 3, What is "surfing the Internet"? And what does a crawler do? 
    What we call surfing the Internet is the process of a client computer sending a request to a target computer and downloading the target computer's data to the local machine.
    # 3.1 When a user accesses network data: 
      the browser submits a request -> downloads the page code -> parses/renders it into a page.
    # 3.2 What a crawler does: 
      simulate a browser to send a request -> download the page code -> extract only the useful data -> store it in a database or file
    # The difference between 3.1 and 3.2: 
      our crawler extracts only the data in the page code that is useful to us 

# 4, Summing up crawlers 
    # 4.1 A metaphor for crawlers: 
      if we compare the Internet to a large spider web, then the data on the computers is the prey on the web, and the crawler is a little spider that crawls along the web to catch the prey/data it wants
    # 4.2 Definition of a crawler: 
      a program that initiates requests to websites and, after obtaining the resources, analyzes them and extracts useful data 
    # 4.3 The value of crawlers: 
      the most valuable thing on the Internet is data, such as product information on Tmall, housing listings on Lianjia, securities investment information on Xueqiu, and so on. These data represent real money in their industries. One could say that whoever masters first-hand data in an industry can become master of that industry. If we liken all the data on the Internet to a treasure, then this crawler course teaches you how to mine that treasure efficiently. Master the skill of crawling and you effectively stand behind every Internet information company; in other words, they all provide you with valuable data for free.

 

Two The basic flow of a crawler

# 1, Initiate a request 
Use an HTTP library to initiate a request to the target site, i.e., send a Request 
The Request contains: request headers, request body, etc. 

# 2, Get the response content 
If the server responds normally, you get a Response 
The Response contains: HTML, JSON, pictures, videos, etc. 

# 3, Parse the content 
Parsing HTML data: regular expressions, or third-party parsing libraries such as BeautifulSoup and pyquery 
Parsing JSON data: the json module 
Parsing binary data: write it to a file in 'b' (binary) mode 

# 4, Save the data 
Database 
File
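
Putting the four steps together, here is a minimal sketch (the URL and the regular expression are illustrative placeholders, not from the original post):

```
import re
import requests

# 1. initiate a request (the URL is a placeholder)
response = requests.get('http://example.com')

# 2. get the response content
html = response.text

# 3. parse the content (here, pull the <title> out with a regular expression)
titles = re.findall(r'<title>(.*?)</title>', html, re.S)

# 4. save the data to a file
with open('result.txt', 'w') as f:
    f.write(titles[0] if titles else '')
```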

 

Three Request and Response

# HTTP protocol: http://www.cnblogs.com/linhaifeng/articles/8243379.html 

# Request: the user sends their information to the server (socket server) via a browser (socket client) 

# Response: the server receives the request, analyzes the request information sent by the user, and then returns the data (the returned data may contain other links, such as images, JS, CSS, etc.) 

# PS: after receiving the Response, the browser parses its content and displays it to the user, whereas a crawler imitates the browser in sending the request and receiving the Response, and then extracts the useful data from it.
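
To make the Request/Response pair concrete, a small sketch with requests (http://example.com is just a placeholder):

```
import requests

response = requests.get('http://example.com')

print(response.request.headers)   # the Request headers our client sent
print(response.status_code)       # the status of the Response
print(response.headers)           # the Response headers the server returned
print(response.text[:200])        # the beginning of the returned data
```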

 

Four Request

# 1, Request methods: 
    common request methods: GET, POST 
    other request methods: HEAD, PUT, DELETE, OPTIONS 

```
PS: demonstrate the difference between GET and POST with a browser (demonstrate POST with a login) 

the parameters of both POST and GET requests are eventually spliced into this form: k1=xxx&k2=yyy&k3=zzz 
POST request parameters are placed in the request body: 
    viewed in a browser, they are stored in Form Data 
GET request parameters are placed directly after the URL 
```
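
A sketch of that splicing (the URLs are placeholders): urlencode produces exactly the k1=xxx&k2=yyy&k3=zzz form; requests appends it to the URL for GET and puts it into the body (Form Data) for POST.

```
from urllib.parse import urlencode
import requests

params = {'k1': 'xxx', 'k2': 'yyy', 'k3': 'zzz'}
print(urlencode(params))   # k1=xxx&k2=yyy&k3=zzz

# GET: the parameters end up after the URL
r = requests.get('http://example.com/s', params=params)
print(r.request.url)       # http://example.com/s?k1=xxx&k2=yyy&k3=zzz

# POST: the same parameters go into the request body (shown as Form Data in the browser)
r = requests.post('http://example.com/login', data=params)
print(r.request.body)      # k1=xxx&k2=yyy&k3=zzz
```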

# 2, Request url 
    url stands for Uniform Resource Locator; any resource, such as a web document, a picture, 
    or a video, can be uniquely identified by a url 

```
url encoding 
https://www.baidu.com/s?wd=图片 
"图片" (the non-ASCII search term) will be url-encoded (see the sample code) 
```

```
How a page loads: 
when loading a web page, the document file is usually loaded first; 
while parsing the document, each time a link such as an image hyperlink is encountered, a request is initiated to download that resource 
```
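
A rough imitation of that loading order, assuming a page with plain <img src=...> tags (the URL and regex are illustrative):

```
import re
import requests
from urllib.parse import urljoin

base_url = 'http://example.com'

# the document is loaded first
doc = requests.get(base_url).text

# then a request is initiated for every image link met while parsing the document
for src in re.findall(r'<img[^>]+src="(.*?)"', doc):
    img_url = urljoin(base_url, src)   # resolve relative links
    print('downloading', img_url)
    requests.get(img_url)
```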

# 3, Request headers 
    User-Agent: if the request headers carry no User-Agent configured by the client, 
    the server may treat you as an illegal user 
    Host 
    Cookie: cookies are used to store login information 

```
generally, a crawler should add request headers 
```

# 4, Request body 
    with GET, there is no request body 
    with POST, the request body is Form Data 

```
PS:
 1. for login windows, file uploads, etc., the information is attached to the request body
 2. log in with a wrong username and password, then submit, and you can see the POST; the page usually jumps right after a successful login, so the POST cannot be captured 
```
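
A hedged sketch of such a login POST (the endpoint and field names are hypothetical; real sites differ):

```
import requests

response = requests.post(
    'http://example.com/login',                          # hypothetical login endpoint
    data={'username': 'someone', 'password': 'wrong'},   # lands in the request body as Form Data
    allow_redirects=False,   # keep the POST response visible instead of following the jump
)
print(response.status_code)
```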

 

from urllib.parse import urlencode
import requests

headers={
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Cookie':'3mBTbA3JDYXOhjjRX56GhfO_0R3jsJKRy66jK4JKjHKet6vP; ispeed_lsm=0; H_PS_PSSID=1421_24558_21120_17001_24880_22072; BD_UPN=123253; H_PS_645EC=44be6I1wqYYVvyugm2gc3PK9PoSa26pxhzOVbeQrn2rRadHvKoI%2BCbN5K%2Bg; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598',
'Host':'www.baidu.com',
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

# response=requests.get('https://www.baidu.com/s?'+urlencode({'wd':'美女'}),headers=headers)
response=requests.get('https://www.baidu.com/s',params={'wd':'美女'},headers=headers) # params is urlencoded internally
print(response.text)

 

Five Response

# 1, Response status 
    200: success
    301: redirect
    404: file does not exist
    403: no permission
    502: server error 
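
In code the status is usually checked before the body is used; for example:

```
import requests

response = requests.get('http://example.com')
if response.status_code == 200:
    print('success')
else:
    print('got status', response.status_code)

# alternatively, let requests raise an exception for 4xx/5xx statuses
response.raise_for_status()
```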

# 2, Response headers 
    Set-Cookie: there may be more than one; it tells the browser to save the cookie 
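
With requests, the cookies set by a response are exposed on the response object, and a Session keeps them across requests the way a browser does; a minimal sketch:

```
import requests

session = requests.Session()        # remembers cookies like a browser would
response = session.get('http://example.com')
print(response.cookies.get_dict())  # cookies this response asked us to save
# later requests on the same session automatically send those cookies back
```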
    
# 3, Preview: the page source code 
    is the most important part; it contains the content of the requested resource, 
    such as the page html, pictures, 
    binary data, etc.
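
With requests the body is read in whichever form matches the content; for example:

```
import requests

response = requests.get('http://example.com')
print(response.text)      # decoded text, e.g. the page html
print(response.content)   # raw bytes, for pictures/videos/other binary data
# for a JSON response, response.json() parses it into Python objects
```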

 

Six Summary

# 1, Summary of the crawler flow: 
    crawl ---> parse ---> store 

# 2, Tools a crawler needs: 
    request libraries: requests, selenium 
    parsing libraries: regular expressions, BeautifulSoup, pyquery (see the sketch after this list) 
    storage: files, MySQL, MongoDB, Redis 
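
The examples below stick to regular expressions; for comparison, a minimal BeautifulSoup sketch (assuming beautifulsoup4 is installed; the URL is a placeholder):

```
from bs4 import BeautifulSoup
import requests

html = requests.get('http://example.com').text
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):          # every <a> tag in the page
    print(a.get('href'), a.get_text())
```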

# 3, common crawler framework: 
    scrapy

Crawling Xiaohua-net videos (sequential version):
import requests
import re
import time
import hashlib

def get_page(url):
    print('GET %s' %url)
    try:
        response=requests.get(url)
        if response.status_code == 200:
            return response.content
    except Exception:
        pass

def parse_index(res):
    obj=re.compile('class="items.*?<a href="(.*?)"',re.S)
    detail_urls=obj.findall(res.decode('gbk'))
    for detail_url in detail_urls:
        if not detail_url.startswith('http'):
            detail_url='http://www.xiaohuar.com'+detail_url
        yield detail_url

def parse_detail(res):
    obj=re.compile('id="media".*?src="(.*?)"',re.S)
    res=obj.findall(res.decode('gbk'))
    if len(res) > 0:
        movie_url=res[0]
        return movie_url

def save(movie_url):
    response=requests.get(movie_url,stream=False)
    if response.status_code == 200:
        m=hashlib.md5()
        m.update(('%s%s.mp4' %(movie_url,time.time())).encode('utf-8'))
        filename=m.hexdigest()
        with open(r'./movies/%s.mp4' %filename,'wb') as f:
            f.write(response.content)
            f.flush()

def main():
    index_url='http://www.xiaohuar.com/list-3-{0}.html'
    for i in range(5):
        print('*'*50,i)
        # crawl the index page
        index_page=get_page(index_url.format(i,))
        # parse the index page to get the video detail-page urls
        detail_urls=parse_index(index_page)
        # crawl each video detail page in turn
        for detail_url in detail_urls:
            # crawl the detail page
            detail_page=get_page(detail_url)
            # get the video url
            movie_url=parse_detail(detail_page)
            if movie_url:
                # save the video
                save(movie_url)

if __name__ == '__main__':
    main()

# concurrent crawling
from concurrent.futures import ThreadPoolExecutor
import queue
import requests
import re
import time
import hashlib
from threading import current_thread

p=ThreadPoolExecutor(50)

def get_page(url):
    print('%s GET %s' %(current_thread().getName(),url))
    try:
        response=requests.get(url)
        if response.status_code == 200:
            return response.content
    except Exception as e:
        print(e)

def parse_index(res):
    print('%s parse index ' %current_thread().getName())
    res=res.result()
    obj=re.compile('class="items.*?<a href="(.*?)"',re.S)
    detail_urls=obj.findall(res.decode('gbk'))
    for detail_url in detail_urls:
        if not detail_url.startswith('http'):
            detail_url='http://www.xiaohuar.com'+detail_url
        p.submit(get_page,detail_url).add_done_callback(parse_detail)

def parse_detail(res):
    print('%s parse detail ' %current_thread().getName())
    res=res.result()
    obj=re.compile('id="media".*?src="(.*?)"',re.S)
    res=obj.findall(res.decode('gbk'))
    if len(res) > 0:
        movie_url=res[0]
        print('MOVIE_URL: ',movie_url)
        with open('db.txt','a') as f:
            f.write('%s\n' %movie_url)
        # save(movie_url)
        p.submit(save,movie_url)
        print('%s download task submitted' %movie_url)
def save(movie_url):
    print('%s SAVE: %s' %(current_thread().getName(),movie_url))
    try:
        response=requests.get(movie_url,stream=False)
        if response.status_code == 200:
            m=hashlib.md5()
            m.update(('%s%s.mp4' %(movie_url,time.time())).encode('utf-8'))
            filename=m.hexdigest()
            with open(r'./movies/%s.mp4' %filename,'wb') as f:
                f.write(response.content)
                f.flush()
    except Exception as e:
        print(e)

def main():
    index_url='http://www.xiaohuar.com/list-3-{0}.html'
    for i in range(5):
        p.submit(get_page,index_url.format(i,)).add_done_callback(parse_index)

if __name__ == '__main__':
    main()


 

 
