How to crawl webpage data with Python: a detailed tutorial on crawling web pages with Python

Hello everyone, this article explains how to use Python to crawl all the web pages of a website. Many people want to understand how to crawl webpage data with Python; to figure it out, you first need to understand the following topics.

1. How to crawl webpage content with a Python crawler?

In fact, the crawling process, abstracted from how a web crawler works, contains nothing more than the following steps:
Simulate requesting a web page: simulate a browser and open the target website.
Retrieve data: after opening the website, we can automatically obtain the website data we need.
Save data: after getting the data, it needs to be persisted to a storage medium such as a local file or a database.
So how do we use Python to write our own crawler program? Here I want to focus on one Python library: Requests.
Requests is a library for initiating HTTP requests in Python, and it is very convenient and simple to use.
Simulate sending an HTTP request
Sending a GET request
When we open the Douban homepage in a browser, the most basic request sent is a GET request:
import requests
res = requests.get('https://www.douban.com')  # the Douban homepage mentioned above
print(res)
print(type(res))
>>>
<Response [200]>
<class 'requests.models.Response'>
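
A few attributes of the Response object cover most everyday needs. The short sketch below is a minimal example under the same assumption as above (the Douban homepage as the target; some sites also expect a browser-like User-Agent header):

import requests

# Assumed example target: the Douban homepage mentioned above.
# Some sites expect a browser-like User-Agent, so we send one.
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://www.douban.com', headers=headers)

print(res.status_code)              # 200 if the request succeeded
print(res.headers['Content-Type'])  # e.g. text/html; charset=utf-8
print(res.text[:200])               # first 200 characters of the HTML body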

2. How to crawl web pages with python

# coding=utf-8
import re
import urllib.request
# Baidu Post Bar URL: https://tieba.baidu.com/index.html
# Get the HTML content of the web page at the given URL
def getHtmlContent(url):
    page = urllib.request.urlopen(url)
    return page.read().decode('utf-8')
# Parse the URLs of all jpg images out of the HTML
# The jpg tags in the HTML look like <img ... src="xxx.jpg" ...>
def getJPGs(html):
    # Regular expression for jpg image URLs
    jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)"')
    # Parse out the list of jpg URLs
    jpgs = re.findall(jpgReg, html)
    return jpgs
# Download the picture at the given URL and save it under the specified file name
def downloadJPG(imgUrl, fileName):
    urllib.request.urlretrieve(imgUrl, fileName)
# Download pictures in batches, saving them to the current directory by default
def batchDownloadJPGs(imgUrls, path='./'):
    # Number the images sequentially
    count = 1
    for url in imgUrls:
        downloadJPG(url, ''.join([path, '{0}.jpg'.format(count)]))
        print("Downloaded image number:", count)
        count += 1
# Wrapper: download the pictures from a Baidu Tieba web page
def download(url):
    html = getHtmlContent(url)
    jpgs = getJPGs(html)
    batchDownloadJPGs(jpgs)
def main():
    url = "http://www.xiaofamao.com/dongman/"
    download(url)
if __name__ == '__main__':
    main()

3. How to use Python to crawl websites that require login

Recently I needed to scrape some pages from a website that requires a login. It was not as easy as I thought it would be, so I decided to write a tutorial for it.

In this tutorial, we will scrape a list of items from our bitbucket account.

The code for the tutorial can be found on my Github.

We will proceed as follows:

  • Extract the details needed to log in
  • Perform site login
  • Crawl the required data

In this tutorial, I used the following packages (found in requirements.txt):

requests
lxml

Step One: Research the Site

Open the login page

Go to the following page: "bitbucket.org/account/signin". You will see the login form (log out first in case you are already logged in).

Check which details we need to extract for login purposes

In this section, we'll create a dictionary to hold the login details:

1. Right-click the "Username or email" field and select "View Element". We will use the value of the input box whose "name" attribute is "username". "username" would be the key, and our username/email would be the corresponding value (on other sites these keys might be "email", "user_name", "login", etc.).

2. Right-click the password field and select "View Element". In the script we need to use the value of the input box whose "name" attribute is "password". "password" will be the dictionary key, and the password we enter will be the corresponding value (on other websites this key may be "userpassword", "loginpassword", "pwd", etc.).

3. In the page source, look for a hidden input tag called "csrfmiddlewaretoken". "csrfmiddlewaretoken" will be the key, and the corresponding value will be this hidden input's value (on other websites this may be a hidden input named "csrftoken" or "authenticationtoken"). For example: "Vy00PE3Ra6aISwKBrPn72SFml00IcUV8".

In the end we will end up with a dictionary like this:

payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>"
}

Remember, this is a specific case for this site. While this login form is simple, other sites may require us to check the browser's request log and find the relevant key and value that should be used in the login step.

Step 2: Perform the login

For this script, we only need to import the following:

import requests
from lxml import html

First, we create the session object. This object will allow us to persist the login session across all our requests.

session_requests = requests.session()

Second, we want to extract the csrf token to be used at login from the page. In this example, we use lxml and XPath to extract it; we could also use regular expressions or some other method.

login_url = "https://bitbucket.org/account/signin/?next=/"
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

More information about XPath and lxml can be found here.

Next, we perform the login phase. At this stage, we send a POST request to the login URL, using the payload created in the previous steps as the data. We can also use a header for the request and add a referer key pointing to this same URL in the header.

result = session_requests.post(
    login_url,
    data = payload,
    headers = dict(referer=login_url)
)

Step 3: Crawl content

Now that we are logged in successfully, we will perform the actual scraping from the bitbucket dashboard page.

url = 'https://bitbucket.org/dashboard/overview'
result = session_requests.get(
    url,
    headers = dict(referer = url)
)

To test the above, we scrape the list of projects from the bitbucket dashboard page. We'll use XPath again to find the target elements, strip the text of newlines and extra whitespace, and print the results. If everything worked, the output should be the list of buckets / projects in your bitbucket account.

tree = html.fromstring(result.content)
bucket_elems = tree.findall(".//span[@class='repo-name']")
bucket_names = [bucket.text_content().replace("\n", "").strip() for bucket in bucket_elems]
print(bucket_names)

You can also verify these requests by checking the status code returned from each one. It won't always tell you whether the login phase was successful, but it can be used as an indicator.

For example:

result.ok  # will tell us if the last request was successful
result.status_code  # will return the status code of our last request
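
For reference, the pieces above can be combined into one minimal end-to-end sketch. The credentials are placeholders, and the URLs and XPath expressions are the same assumptions used in the snippets above; on a different site all of them will differ:

import requests
from lxml import html

LOGIN_URL = "https://bitbucket.org/account/signin/?next=/"
DASHBOARD_URL = "https://bitbucket.org/dashboard/overview"

session_requests = requests.session()

# Step 1: fetch the login page and pull out the csrf token.
result = session_requests.get(LOGIN_URL)
tree = html.fromstring(result.text)
csrf_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

# Step 2: post the login form with placeholder credentials.
payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": csrf_token,
}
result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))

# Step 3: fetch the dashboard with the logged-in session and list the repositories.
result = session_requests.get(DASHBOARD_URL, headers=dict(referer=DASHBOARD_URL))
tree = html.fromstring(result.content)
names = [e.text_content().strip() for e in tree.findall(".//span[@class='repo-name']")]
print(names)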


4. What is the best tutorial for Python crawlers?

You can see this tutorial: Web link
This tutorial uses three crawler cases to help students understand the Scrapy framework, its architecture, and each of its modules.
The general content of this tutorial:
1. Introduction to Scrapy.
Main knowledge points: Scrapy's architecture and operating process.
2. Building a development environment.
Main knowledge points: installing Scrapy in Windows and Linux environments.
3. Using the Scrapy Shell and Scrapy Selectors.
4. Using Scrapy to crawl website information.
Main knowledge points: creating a Scrapy project (scrapy startproject), defining the structured data to extract (Item), writing a Spider that crawls the website and extracts the structured data (Item), and writing Item Pipelines to store the extracted Items (that is, the structured data).
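
To make the workflow in point 4 above concrete, here is a minimal sketch of a Scrapy Item and Spider. The practice site quotes.toscrape.com and the CSS selectors are assumptions chosen purely for illustration; the project scaffolding itself would come from scrapy startproject.

import scrapy

class QuoteItem(scrapy.Item):
    # Structured data we want to extract (the "Item" step).
    text = scrapy.Field()
    author = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    # The "Spider" step: crawl the site and yield Items.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            yield item  # Items would then be handled by an Item Pipeline

Running scrapy crawl quotes -o quotes.json from inside the project would export the yielded Items, or a custom Item Pipeline could store them in a database instead.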

5. How to get started with Python crawlers

The reason so many people are keen on crawler technology is that crawlers can help us do many things, such as search engines, data collection, ad filtering, and so on. Taking Python as an example, Python crawlers play a huge role in data analysis and data capture.
But this does not mean that simply mastering the Python language lets you pick up crawler technology by analogy. There is still a lot of knowledge and many conventions to learn, including but not limited to HTML, the basics of the HTTP/HTTPS protocol, regular expressions, databases, the use of common packet-capture tools, the use of crawler frameworks, and so on. When it comes to large-scale crawlers, you also need to understand distributed systems, message queues, common data structures and algorithms, caching, and even applications of machine learning. Large-scale systems are supported by many technologies.
How do you learn crawler technology from scratch? For confused beginners, the most important thing in the initial learning stage is to clarify the learning path and work out a learning method. Only then, with good learning habits in place, will later systematic study give you more results for less effort.
To write a crawler in Python, you first need to know Python: understand the basic syntax and know how to use functions, classes, and the common methods of data structures such as list and dict. For an entry-level crawler, you also need to understand the basic principles of the HTTP protocol. The full HTTP specification cannot be covered in one book, but the deeper content can be read slowly later; combining theory and practice makes later learning much easier. Regarding the concrete steps of learning to crawl, I have roughly listed the following parts for reference:
Basic knowledge of web crawlers:
Definition of a crawler
Functions of a crawler
The HTTP protocol
Basic use of a packet-capture tool (Fiddler)
Implementing crawlers with Python modules:
General functions of the urllib3, requests, lxml and bs4 modules
Using the requests module's get method to obtain static page data (a short sketch follows this list)
Using the requests module's post method to obtain static page data
Using the requests module to obtain Ajax dynamic page data
Using the requests module to simulate logging in to a website
Using Tesseract for verification-code (captcha) recognition
The Scrapy framework and Scrapy-Redis:
General description of the Scrapy crawler framework
The Scrapy Spider class
Scrapy Items and Pipelines
The Scrapy CrawlSpider class
Implementing a distributed crawler with Scrapy-Redis
Crawling data with automated testing tools and browsers:
Selenium + PhantomJS description and simple examples
Selenium + PhantomJS for website login
Selenium + PhantomJS for crawling dynamic page data
Crawler project practice:
Distributed crawler + Elasticsearch to build a search engine
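
As referenced in the list above, here is a minimal sketch of fetching a static page with requests and parsing it with BeautifulSoup (bs4). The target URL is an assumed example; any static HTML page would do.

import requests
from bs4 import BeautifulSoup

# Assumed example target: any static HTML page.
url = "https://quotes.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0"}  # some sites expect a browser-like User-Agent

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Collect the text and href of every link on the page.
for a in soup.find_all("a", href=True):
    print(a.get_text(strip=True), "->", a["href"])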

6. How to get started with Python crawlers

Personally, I think the following is enough for a beginner learning Python to crawl web pages (the fourth one is really difficult to use, and of course a few special cases may still be beyond these libraries):
1. Opening web pages and downloading files: urllib
2. Parsing web pages: BeautifulSoup; those familiar with jQuery can use PyQuery
3. Using Requests to submit various types of requests, with support for redirects, cookies, and so on
4. Using Selenium to simulate a browser, submit user-like operations, and handle pages dynamically generated by JavaScript (see the sketch below)
Each of these libraries has its own role; working together, they can crawl and parse all kinds of web pages. For specific usage, check their official manuals (linked above).
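
For item 4 above, here is a minimal Selenium sketch. The original text mentions PhantomJS, which is no longer maintained, so this example assumes headless Chrome with a matching chromedriver instead; the target URL is a JavaScript-rendered practice page chosen for illustration.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # run without opening a browser window
driver = webdriver.Chrome(options=options)  # assumes Chrome and chromedriver are installed

driver.get("https://quotes.toscrape.com/js/")  # page content is rendered by JavaScript
# After the JS has run, the rendered DOM can be queried like a normal page.
for elem in driver.find_elements(By.CSS_SELECTOR, "span.text"):
    print(elem.text)

driver.quit()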
You need a goal to stay motivated. If you have nothing in particular to scrape, you can start with this checkpoint-style practice website. It is currently updated to the fifth checkpoint; after passing the first four, you should have mastered the basic operations of these libraries.
If you really cannot get through, look at the solutions here. The fourth level requires parallel programming (completing it with serial code takes far too much time), and the fourth and fifth levels only have the problems, with no solutions released yet.
After learning these basics, it will be easier to pick up Scrapy, a powerful crawler framework. Here is its Chinese introduction.
This is my answer on Zhihu. Some links are not valid. You can read the original version here.

7. Does Python need to open the webpage to crawl its content?

Yes. To crawl a webpage's content with Python, the page has to be opened (requested) first, because the content only becomes available once the page has been fetched; the crawler therefore opens the webpage and then extracts the corresponding data from it.

8. How to use Python as a crawler

1) First of all, you need to understand how crawlers work.
Imagine that you are a spider, placed on the "web" of the Internet, and you need to read every page on it. How do you do it? No problem, you can start anywhere, for example the home page of the People's Daily; this is called the initial page, let's denote it by $.
On the front page of the People's Daily, you see various links leading away from that page, so you happily crawl over to the "domestic news" page. Great, you have now crawled two pages (the front page and domestic news)! For the moment, don't worry about how to process the pages you have crawled; just imagine that you copied each page into an HTML file and kept it with you.
Suddenly you find that on the domestic news page there is a link back to the "homepage". As a smart spider, you surely know that you don't have to crawl back, because you have already seen it. So you need to use your brain to remember the addresses of the pages you have already seen. Then, every time you see a new link that might need to be crawled, you first check whether you have already visited that address; if you have, don't go.
Well, in theory, if all pages can be reached from the initial page, then it can be proved that you can definitely crawl all the pages.
So how do we implement this in Python?
It's very simple:
from queue import Queue

initial_page = "Initialization page"  # placeholder for the initial page

url_queue = Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep going until everything has been crawled
    if url_queue.qsize() > 0:
        current_url = url_queue.get()  # take the first url out of the queue
        store(current_url)             # store the web page this url points to
        for next_url in extract_urls(current_url):  # extract the urls linked from this url
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break
This is already written in pseudocode.
The backbone of every crawler is here. Now let's analyze why crawlers are actually very complicated things: search engine companies usually have an entire team to maintain and develop them.
2) Efficiency
If you take the above code and run it directly, it will take you a whole year to crawl the entire content of Douban, never mind that search engines like Google need to crawl the content of the whole web.
Where is the problem? There are too many web pages to crawl, and the above code is too slow. Suppose the whole network has N websites; then the complexity of de-duplication is N*log(N), because every web page has to be traversed once, and checking the set each time costs log(N). OK, I know Python's set is implemented with a hash table, but this is still too slow, or at least not memory-efficient.
What is the usual way to check for duplicates? A Bloom Filter. Simply put, it is still a hashing method, but its characteristic is that it uses a fixed amount of memory (which does not grow with the number of URLs) to determine, in O(1) time, whether a URL is already in the set. Unfortunately there is no free lunch: the only problem is that if the URL is not in the set, the BF can be 100% sure it has not been seen, but if the URL is in the set, it will tell you: this URL should have appeared already, but I have 2% uncertainty. Note that the uncertainty can become very small when the memory you allocate is large enough. A simple tutorial: Bloom Filters by Example
Notice this property: if a URL has been seen, it may (with small probability) be fetched again, which doesn't matter much, since seeing it once more costs little; but if it has not been seen, it will definitely be visited (this is very important, otherwise we would miss some pages!). [IMPORTANT: there is a problem with this paragraph, please skip it for now]
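To make the idea concrete, here is a minimal, self-contained Bloom filter sketch in pure Python. The bit-array size, the number of hash functions, and the use of salted MD5 digests are illustrative assumptions, not a production design:

import hashlib

class SimpleBloomFilter:
    # Fixed-size Bloom filter: no false negatives, small false-positive rate.
    def __init__(self, size_in_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions from salted MD5 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

bf = SimpleBloomFilter()
bf.add("http://example.com/page1")
print("http://example.com/page1" in bf)  # True
print("http://example.com/page2" in bf)  # False (almost certainly)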
OK, now we are close to the fastest way of handling de-duplication. But there is another bottleneck: you only have one machine. No matter how big your bandwidth is, as long as the speed at which your machine downloads web pages is the bottleneck, you can only make that speed faster. If one machine is not enough, use many! Of course, we assume each machine has already reached maximum efficiency by using multi-threading (for Python, multi-processing).
3) Clustered crawling
When crawling Douban, I used more than 100 machines in total, running around the clock for a month. Imagine using only one machine: you would have to run it for 100 months...
So, assuming you have 100 machines available now, how do you implement a distributed crawling algorithm in Python?
Call the 99 machines with less computing power slaves, and the one more powerful machine the master. Looking back at the url_queue in the code above, if we can put this queue on the master machine, then all the slaves can communicate with the master over the network. Whenever a slave finishes downloading a web page, it asks the master for a new page to grab, and every time a slave captures a new web page, it sends all the links on that page to the master's queue. Likewise, the Bloom Filter lives on the master, but now the master only sends URLs that have not been visited to the slaves. The Bloom Filter sits in the master's memory, while the visited URLs are kept in Redis running on the master, so that all operations are O(1) (at least amortized O(1); for Redis access efficiency see: LINSERT – Redis).
Consider how to implement this with Python:
install scrapy on each slave, so that each machine becomes a slave capable of crawling, and install Redis and rq on the master to serve as the distributed queue.
The code is then written roughly as:
#slave.py
current_url = request_from_master()
to_send = []
for next_url in extract_urls(current_url):
    to_send.append(next_url)
store(current_url)
send_to_master(to_send)

#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()
initial_pages = ""
while True:
    if request == 'GET':
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request == 'POST':
        bf.put(request.url)
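As a rough illustration of how the master's shared queue could be backed by Redis, here is a minimal sketch using the redis-py client. The host, queue name, and set name are placeholders chosen for illustration, and for simplicity it uses a plain Redis set for de-duplication instead of a Bloom Filter; real projects would more likely reach for scrapy-redis, as noted below.

import redis

# Assumed connection: Redis running on the master at a placeholder host/port.
r = redis.Redis(host="master.example.com", port=6379)

QUEUE_KEY = "url_queue"  # shared queue of URLs still to crawl
SEEN_KEY = "seen_urls"   # set of URLs already handed out

def master_push(urls):
    # Enqueue only URLs that have not been seen before (set add is O(1)).
    for url in urls:
        if r.sadd(SEEN_KEY, url):   # returns 1 if the url was new
            r.lpush(QUEUE_KEY, url)

def slave_pop(timeout=5):
    # Block until a URL is available, then hand it to the slave.
    item = r.brpop(QUEUE_KEY, timeout=timeout)
    return item[1].decode("utf-8") if item else None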
OK, in fact, as you can imagine, someone has already written what you need: darkrho/scrapy-redis · GitHub
4) Outlook and post-processing
Although the word "simple" has been used a lot above, actually implementing a crawler that is usable at commercial scale is not an easy task. The code above can crawl an entire website with few major problems.
But if you need follow-up processing such as:
Effective storage (how should the database be laid out?)
Effective de-duplication (here meaning de-duplication of page content: we don't want to crawl both the People's Daily and a "Damin Daily" that plagiarized it)
Effective information extraction (for example, extracting all the street addresses on a page, such as "Fenjin Road, Chaoyang District, Zhonghua Road"); search engines usually do not need to store all the information, for example, why would I keep the images...
Timely updates (predicting how often a page will be updated)
As you can imagine, each of these points could keep many researchers busy for a decade or more. Even so,
"the road ahead is long and far; I will seek high and low."
So, don't ask how to get started, just hit the road :)

Origin blog.csdn.net/aifans_bert/article/details/128902352