What is a Python crawler? Understand crawlers in one article

0. Preface

The text and pictures in this article come from the Internet and are for learning and exchange purposes only, not for any commercial use. If you have any concerns, please contact us and we will handle them.



Take a small step every day and take a big step towards your goal.

A Python crawler mainly consists of three parts: capturing the data, parsing the data, and storing the data.

Simply put, a crawler returns the data the user needs directly from a specified URL, without having to operate a browser manually, step by step, to get it.

 

1. Capture data

Generally speaking, requesting a website URL returns data in one of two formats: HTML or JSON.

1) Without parameters

Most data is captured with GET requests, fetching it directly from the website's server. In Python, the standard library provides urllib (urllib2 in Python 2), and the third-party requests library is also widely used.

Here is an example using requests.


Requests:
  import requests

  response = requests.get(url)
  content = response.content          # raw response body (bytes)
  data = response.json()              # parsed result, if the response is JSON
  print("response headers:", response.headers)
  print("content:", content)

2) With parameters

In addition, data can be captured by passing parameters. The parameters are usually appended to the end of the URL: the first parameter is joined with "?", and each subsequent parameter is joined with "&".


data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
Requests: data is a dict (or JSON)
  import requests
  response = requests.get(url=url, params=data)
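
As a quick check that the parameters really do end up in the query string as described, here is a minimal sketch; https://httpbin.org/get is used only as a convenient echo endpoint, and the parameter values are placeholders.


import requests

# placeholder test endpoint and parameters, for illustration only
url = 'https://httpbin.org/get'
data = {'data1': 'XXXXX', 'data2': 'XXXXX'}

response = requests.get(url=url, params=data)
# requests builds the query string for us:
# https://httpbin.org/get?data1=XXXXX&data2=XXXXX
print(response.url)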

2. Handling logins

1) POST form login

The form data is first sent to the server with a POST request, and the cookie the server returns is then stored locally.


data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
Requests: data is a dict (or JSON)
  import requests
  response = requests.post(url=url, data=data)

2) Login with cookie

When you log in with a cookie, the server treats you as an already logged-in user and returns logged-in content. If a verification code is required, this approach is worth considering.


import requests
# a Session object keeps cookies across requests
requests_session = requests.session()
response = requests_session.post(url=url_login, data=data)
# later requests made through requests_session carry the login cookies
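
If you already have a cookie copied from a logged-in browser session, it can also be passed in directly; a minimal sketch, where the cookie name, value and URL are all placeholders.


import requests

# placeholder cookie copied from a logged-in browser session
cookies = {'sessionid': 'XXXXX'}
url = 'https://example.com/profile'   # placeholder page that requires login

response = requests.get(url=url, cookies=cookies)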

3. Handling anti-crawler mechanisms

We know that many websites now have an anti-crawler mechanism.

We have probably all run into this: the first crawl of a site works, the second one works too, but the third one fails with an IP restriction or a "too frequent access" error.

There are several ways to deal with this.

1) Use a proxy

This is mainly useful when your IP address has been restricted, and it can also help when frequent access starts triggering verification codes.

We can maintain a pool of proxy IPs. Many free proxies can be found online; pick the ones that suit your needs.


proxies = {'http':'http://XX.XX.XX.XX:XXXX'}
Requests:
  import requests
  response = requests.get(url=url, proxies=proxies)
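
A minimal sketch of rotating through a small proxy pool; the proxy addresses below are placeholders, and free proxies found online are often unreliable, so a real pool would also need to drop proxies that fail.


import random
import requests

# placeholder proxy addresses; replace with proxies you have verified
proxy_pool = [
    'http://XX.XX.XX.XX:XXXX',
    'http://YY.YY.YY.YY:YYYY',
]

def get_with_random_proxy(url):
    # pick a proxy at random for each request
    proxy = random.choice(proxy_pool)
    proxies = {'http': proxy, 'https': proxy}
    return requests.get(url, proxies=proxies, timeout=10)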

2) Limit the request rate

This addresses access being blocked because of overly frequent requests. The fix is simple: slow down between consecutive requests by adding a sleep.


import time
time.sleep(1)
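
In a crawl loop, a randomized delay looks a bit more like a human visitor than a fixed one; a minimal sketch, where the URL list is a placeholder.


import random
import time

import requests

# placeholder list of pages to crawl
urls = ['https://example.com/page/1', 'https://example.com/page/2']

for url in urls:
    response = requests.get(url)
    # wait 1-3 seconds between requests to avoid hammering the server
    time.sleep(random.uniform(1, 3))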

3) Disguising the request as a browser visit

Looking at crawler code, you will notice that GET requests often carry headers; this disguises the request as a browser visit and gets around anti-hotlinking checks.

Some websites check whether the visit really comes from a browser or from an automated program. In that case, add a User-Agent header to indicate that you are visiting with a browser.

Sometimes a site also checks whether you send a Referer header and whether it is legitimate; in that case, add a Referer as well.


headers = {'User-Agent':'XXXXX'}  # pretend to be a browser; works for sites that reject crawlers
headers = {'Referer':'XXXXX'}
headers = {'User-Agent':'XXXXX', 'Referer':'XXXXX'}
Requests:
  response = requests.get(url=url, headers=headers)

4) Reconnect after disconnection

Two approaches you can refer to are shown below.


def multi_session(session, *arg):
  retryTimes = 20
  while retryTimes > 0:
    try:
      return session.post(*arg)
    except:
      retryTimes -= 1

or

def multi_open(opener, *arg):
  retryTimes = 20
  while retryTimes > 0:
    try:
      return opener.open(*arg)
    except:
      retryTimes -= 1

In this way, we can use multi_session or multi_open to keep the session or opener used by the crawler alive across retries.
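
Alternatively, requests can retry at the transport level through urllib3's Retry class; a minimal sketch, where the URL is a placeholder.


import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry failed requests up to 5 times with exponential backoff,
# also retrying on common transient server errors
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('https://example.com/data')  # placeholder URL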

 

4. Multi-threaded crawling

When the amount of data to crawl is large, we can consider crawling in parallel. One approach, based on multiprocessing, is shown here; there are of course other ways to achieve it (a thread-pool sketch follows further below).


import multiprocessing as mp

def func(url):
  # crawl a single page here
  pass

if __name__ == '__main__':
  urls = ['https://example.com/page/1', 'https://example.com/page/2']  # placeholder URL list

  p = mp.Pool()
  p.map_async(func, urls)
  # close the pool so that it no longer accepts new tasks
  p.close()
  # block the main process until the worker processes have finished
  p.join()
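
Since the section title mentions multi-threading, here is an equivalent sketch using concurrent.futures from the standard library, which is often more convenient for I/O-bound crawling; the URL list is a placeholder.


from concurrent.futures import ThreadPoolExecutor

import requests

# placeholder list of pages to crawl
urls = ['https://example.com/page/%d' % i for i in range(1, 6)]

def fetch(url):
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=5) as executor:
    # fetch pages concurrently; results come back in the order of urls
    pages = list(executor.map(fetch, urls))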

5. Analysis

Generally, the data returned by the server comes in two main formats: HTML and JSON.
For HTML data, you can use BeautifulSoup, lxml, regular expressions, and so on.
For JSON data, you can use Python lists, the json module, regular expressions, and so on.
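
A minimal parsing sketch for both cases, assuming BeautifulSoup (bs4) is installed and using placeholder HTML and JSON strings.


import json

from bs4 import BeautifulSoup

# placeholder HTML, e.g. the text of a requests response
html = '<html><body><h1>Title</h1><p class="intro">Hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')    # or 'lxml' if lxml is installed
print(soup.h1.text)                          # -> Title
print(soup.find('p', class_='intro').text)   # -> Hello

# placeholder JSON, e.g. the text of an API response
raw = '{"items": [{"name": "a"}, {"name": "b"}]}'
data = json.loads(raw)
print([item['name'] for item in data['items']])  # -> ['a', 'b']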

In addition, we can use packages such as numpy, pandas, matplotlib and pyecharts for data analysis, visualization, and so on.

6. Storage

After the data has been captured and parsed, it usually needs to be stored. Common options are writing it to a database or to an Excel/CSV file; choose whichever fits your needs and shape the data into a form suitable for storage.
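
As an illustration, a minimal sketch that stores crawled rows either as a CSV file with pandas or in a SQLite database; the column names and file names are placeholders.


import sqlite3

import pandas as pd

# placeholder crawled data
rows = [{'title': 'a', 'price': 1.0}, {'title': 'b', 'price': 2.0}]
df = pd.DataFrame(rows)

# option 1: write to a CSV file (or df.to_excel('result.xlsx') if openpyxl is installed)
df.to_csv('result.csv', index=False)

# option 2: write to a table in a SQLite database
conn = sqlite3.connect('result.db')
df.to_sql('items', conn, if_exists='replace', index=False)
conn.close()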



Origin blog.csdn.net/pythonxuexi123/article/details/114981665