9. Writing a Web Crawler with Python: Conclusion

   Foreword

This is the final article in the Python web crawler series, so let's wrap things up with a summary. Read it and cherish it!

    So far, the crawling techniques introduced in the previous chapters have been applied to custom-built example websites, which helped us focus on learning specific skills. In this chapter, we will analyze a few real websites to see how those techniques are applied in practice. First we use Google to demonstrate a real search form, then Facebook as a JavaScript-dependent site, then Gap as a typical online store, and finally the official BMW website with its map interface. Since these are live sites, there is a risk that they will have changed by the time you read this. That is fine, though, because the purpose of these examples is to show how to apply the techniques you have learned so far, not how to crawl one specific website. Before running an example, first check whether the site's structure has changed since the example was written, and whether the site's current terms and conditions prohibit crawlers.

9.1 Google search engine

    According to the Alexa data mentioned in the 4th article, google.com is one of the most popular websites in the world, and, conveniently, its search page is simple and easy to crawl.
    The image below shows the Google Search homepage with its form elements inspected in Firebug.
    You can see that the search query is stored in the input parameter q, and the form is submitted to the search path set by its action attribute. We can test this by submitting test as the search query, which redirects to a URL like https://www.google.com/search?q=test&oq=test&es_sm=93&ie=UTF-8. The exact URL depends on your browser and geographic location. Also note that if Google Instant is enabled, the search results are loaded dynamically with AJAX instead of submitting the form. Although the URL contains many parameters, only the query parameter q is required; the URL https://www.google.com/search?q=test produces the same result, as shown in the figure below.
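    As a minimal, illustrative sketch, the results page for a query can be fetched with only the q parameter, using nothing but the standard library. Treat this as a stand-in for the download helper developed in the earlier articles; Google may still block or redirect automated requests.

```python
# Minimal sketch: fetch the Google results page for the query "test".
from urllib.parse import urlencode
from urllib.request import Request, urlopen

query = 'test'
url = 'https://www.google.com/search?' + urlencode({'q': query})
# A browser-like User-Agent reduces the chance of an immediate block.
request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(request).read().decode('utf-8', errors='replace')
```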

 The structure of the search results can be inspected using Firebug, as shown in the image below.

 

    As can be seen from the figure, the search results appear as links whose parent element is an <h3> tag with the class "main". To grab the search results, we can use the CSS selectors introduced in the second article.
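    A sketch of that extraction is shown below, assuming the page HTML has already been downloaded as above. The h3.main selector follows the class name described in the figure; Google changes its markup frequently, so the selector may need adjusting.

```python
# Extract the result links with a CSS selector via lxml (requires the
# cssselect package). The class name "main" is an assumption taken from
# the figure above and may have changed on the live site.
import lxml.html

tree = lxml.html.fromstring(html)
links = [a.get('href') for a in tree.cssselect('h3.main a')]
print(links)
```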

    So far, we've downloaded Google's search results and used lxml to extract links from them. In the image above, we can see that the real website URL in each link is accompanied by a string of additional parameters, which are used to track clicks. Below is the first link.

 

     The content we need here is http://www.speedtest.net/, which can be parsed from the query string using the urlparse module.
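    Here is a small sketch of that parsing step, assuming the redirect-style link stores the target URL in its q parameter as described above (in Python 3 the module lives in urllib.parse).

```python
# Recover the real target URL from the first redirect-style link.
from urllib.parse import parse_qs, urlparse

link = links[0]                          # first link extracted earlier
query_string = urlparse(link).query      # the link's query string
real_url = parse_qs(query_string).get('q', [''])[0]
print(real_url)                          # expected: http://www.speedtest.net/
```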

    This query string parsing method can be used to extract all links. 
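    Applied to every link, the same idea looks roughly like this.

```python
# Parse the tracking query string of every extracted link.
from urllib.parse import parse_qs, urlparse

results = []
for link in links:
    targets = parse_qs(urlparse(link).query).get('q', [])
    if targets:
        results.append(targets[0])
print(results)
```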

    It worked! The links from the Google search results have been successfully scraped. The complete source code for this example can be obtained from https://bitbucket.org/wswp/code/src/tip/chapter09/google.py.
    One difficulty encountered when crawling Google search results is that if your IP exhibits suspicious behavior, such as downloading too quickly, a CAPTCHA image like the one in the figure below will appear.

    We can use the technique introduced in Chapter 7 to solve this CAPTCHA, but a better approach is to slow down the download rate, or, when a high download rate is required, to use proxies, so as to avoid arousing Google's suspicion.
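    A very simple way to slow down is to sleep between requests, as sketched below; search_google() is a hypothetical wrapper around the download-and-parse code above, and the delay value is arbitrary.

```python
# Throttle consecutive searches with a fixed delay to look less suspicious.
import time

DELAY_SECONDS = 5  # arbitrary pause between requests

def polite_search(queries):
    for query in queries:
        results = search_google(query)   # hypothetical wrapper around the code above
        print(query, len(results))
        time.sleep(DELAY_SECONDS)        # wait before the next request
```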

9.2 Facebook

    Currently, Facebook is one of the largest social networks in the world in terms of monthly active users, so its user data is very valuable.

9.2.1 Website

    The image below shows the Packt Publishing Facebook page at https://www.facebook.com/PacktPub.
    When you view the page's source code, you will find that only the first few posts are present; later posts are loaded via AJAX as the browser scrolls. Facebook also provides a mobile interface which, as discussed in Chapter 1, is usually easier to crawl. The mobile version of the page is at https://m.facebook.com/PacktPub, as shown in the figure below.

    When we interact with the mobile site and view it with Firebug, we see that the interface uses a similar structure as before for handling AJAX events, so this approach doesn't actually simplify crawling. While these AJAX events can be reverse engineered, different types of Facebook pages use different AJAX calls, and in my experience Facebook often changes the structure of these calls, so scraping these pages requires ongoing maintenance. Therefore, as described in Chapter 5, unless performance is critical, it's best to use the browser's rendering engine to execute JavaScript events and then access the resulting HTML page.
    The following code snippet automates Facebook login using Selenium and redirects to a given page URL.
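    Below is a sketch of such a function. The element IDs for the login form ('email', 'pass', 'loginbutton') are assumptions based on Facebook's form at the time and may have changed since.

```python
# Sketch: log in to Facebook with Selenium, then open the requested page.
# The element IDs below are assumptions and may no longer match the live site.
from selenium import webdriver
from selenium.webdriver.common.by import By

def facebook(username, password, url):
    driver = webdriver.Firefox()
    driver.get('https://www.facebook.com')
    driver.find_element(By.ID, 'email').send_keys(username)
    driver.find_element(By.ID, 'pass').send_keys(password)
    driver.find_element(By.ID, 'loginbutton').click()
    driver.implicitly_wait(30)   # allow time for the login to complete
    driver.get(url)              # navigate to the page we want to scrape
    return driver
```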

    Then, you can call this function to load the Facebook page you are interested in and scrape the resulting HTML page.
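    For example, with the sketch above the rendered HTML can be handed straight to lxml (the credentials here are obviously placeholders).

```python
# Example usage of the facebook() sketch above.
import lxml.html

driver = facebook('your_email', 'your_password', 'https://m.facebook.com/PacktPub')
tree = lxml.html.fromstring(driver.page_source)  # scrape the rendered HTML
driver.quit()
```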

9.2.2 API

    As mentioned in Chapter 1, scraping a website is a last resort when its data is not available in a structured format. Facebook does provide an API for some of its data, so before crawling we should check whether the access Facebook provides meets our needs. Below is a code example that uses Facebook's Graph API to extract data from the Packt Publishing page.
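    The call is sketched below using the standard library; note that newer versions of the Graph API require an access token, so an unauthenticated request like this may no longer work as-is.

```python
# Sketch: query the Graph API for the PacktPub page and parse the JSON reply.
# Recent Graph API versions require an access token, so this may need one.
import json
from urllib.request import urlopen

response = urlopen('https://graph.facebook.com/PacktPub')
data = json.loads(response.read().decode('utf-8'))   # JSON -> Python dict
print(data.get('name'), data.get('website'))
```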

    This API call returns the data in JSON format, which we can parse into a Python dict using the json module. We can then extract some useful features from it, such as the company name, details, and website.
    The Graph API provides many other calls to access user data, and its documentation is available from Facebook's developer page at https://developers.facebook.com/docs/graph-api. However, most of these API calls are designed for Facebook apps interacting with authorized Facebook users, so they are not very useful in extracting other people's data. To get more detailed information, such as user logs, crawlers are still needed.

9.3 Gap

    Gap has a well-structured website, and a Sitemap helps web crawlers locate its latest content. If we investigate the site using the techniques we learned in Chapter 1, we find that the robots.txt file at http://www.gap.com/robots.txt contains a link to the sitemap.
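    Locating that link programmatically is straightforward, as the quick sketch below shows.

```python
# Read robots.txt and collect any Sitemap directives it contains.
from urllib.request import urlopen

robots = urlopen('http://www.gap.com/robots.txt').read().decode('utf-8')
sitemaps = [line.split(':', 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith('sitemap:')]
print(sitemaps)
```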

     Here's what's in the linked Sitemap file.

    As shown above, the content of that Sitemap link is simply an index, which in turn contains links to other Sitemap files. These other Sitemap files contain links to thousands of product categories, such as http://www.gap.com/products/blue-long-sleeve-shirts-for-men.jsp, as shown in the figure below.
    There's a lot to crawl here, so we will use the multithreaded crawler developed in Chapter 4. You may recall that the crawler supports an optional callback parameter that defines how each downloaded page should be parsed. The following is a callback function for crawling the Sitemap links on the Gap website.
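    Here is a sketch of such a callback, assuming the Chapter 4 crawler passes each (url, html) pair to the callback and queues whatever links the callback returns.

```python
# Sketch of a Sitemap-aware callback for the multithreaded crawler.
from urllib.parse import urlsplit
from lxml import etree

def scrape_callback(url, html):
    if urlsplit(url).path.endswith('.xml'):
        # Sitemap (or Sitemap index) file: return every <loc> entry,
        # ignoring the XML namespace prefix.
        if isinstance(html, str):
            html = html.encode('utf-8')
        tree = etree.fromstring(html)
        return [element.text for element in tree.iter()
                if isinstance(element.tag, str) and element.tag.endswith('loc')]
    else:
        # Category page: scraping the product data is left unimplemented here.
        return []
```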

 

    This callback function first checks the extension of the downloaded URL. If the extension is .xml, the download is treated as a Sitemap file, which is then parsed with lxml's etree module to extract the links it contains. Otherwise, the URL is treated as a category page, although scraping the category data is not implemented in this example. Now we can use this callback in the multithreaded crawler to crawl gap.com.
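    A call along the following lines would wire the callback into the crawler; both the module name and the threaded_crawler() signature are assumptions here, not code reproduced from Chapter 4.

```python
# Hypothetical wiring of the callback into the Chapter 4 crawler.
from threaded_crawler import threaded_crawler   # assumed module/function name

seed_url = sitemaps[0]   # the Sitemap link discovered in robots.txt earlier
threaded_crawler(seed_url, scrape_callback=scrape_callback)
```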

    As expected, the Sitemap file is downloaded first, followed by the clothing category. 

9.4 BMW

    The official BMW website has a search tool for finding local dealers at https://www.bmw.de/de/home.html?entryType=dlo; the interface is shown in the figure below.

    The tool takes a geographic location as an input parameter, and then displays nearby dealership locations on a map. For example, in the image below, Berlin is used as a search parameter.

     Using Firebug, we can see that the search triggers the following AJAX request.

    Here, the maxResults parameter is set to 99. However, using the technique described in Chapter 1, we can increase this value so that all dealer locations are downloaded in a single request. Below is the output when maxResults is increased to 1000.
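    The request can be repeated from Python along the lines below; the endpoint URL and the lat/lng parameter names are placeholders to be replaced with the actual dealer-search request captured in Firebug.

```python
# Sketch: repeat the dealer-search AJAX request with a larger maxResults.
# AJAX_URL and the lat/lng parameter names are placeholders; copy the real
# request observed in Firebug.
from urllib.parse import urlencode
from urllib.request import urlopen

AJAX_URL = 'PASTE-THE-DEALER-SEARCH-URL-FROM-FIREBUG-HERE'
params = {'lat': 52.52, 'lng': 13.40, 'maxResults': 1000}   # Berlin coordinates
jsonp = urlopen(AJAX_URL + '?' + urlencode(params)).read().decode('utf-8')
```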

 

    The AJAX request returns data in JSONP format, where JSONP stands for "JSON with padding". The padding is usually a function call whose argument is the pure JSON data; in this case, the call wraps the data in a function named callback. To parse the data with Python's json module, we first need to strip off this padding.
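    Stripping the padding is just a matter of keeping the text between the outermost parentheses, as sketched below.

```python
# Remove the callback(...) wrapper so the payload becomes plain JSON.
import json

start = jsonp.index('(') + 1     # first character after the opening parenthesis
end = jsonp.rindex(')')          # last closing parenthesis
data = json.loads(jsonp[start:end])
```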

    Now that we have loaded all the BMW dealerships in Germany into a JSON object, we can see that there are currently 731 dealerships in total. Below is the data for the first dealer.

 

    Now it's time to save the data we're interested in. The code snippet below writes dealer names, latitude and longitude to a spreadsheet.
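    A sketch of that snippet follows; the 'pois' list and the 'name'/'lat'/'lng' keys are assumptions about the JSON layout, so adjust them to match the actual response.

```python
# Sketch: write each dealer's name and coordinates to bmw.csv.
# The key names below are assumptions about the JSON structure.
import csv

with open('bmw.csv', 'w', newline='', encoding='utf-8') as fp:
    writer = csv.writer(fp)
    writer.writerow(['Name', 'Latitude', 'Longitude'])
    for dealer in data['data']['pois']:   # assumed location of the dealer list
        writer.writerow([dealer['name'], dealer['lat'], dealer['lng']])
```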
    After running this example, the resulting bmw.csv file contains rows similar to the following.

The complete source code for scraping data from the BMW website can be obtained from https://bitbucket.org/wswp/code/src/tip/chapter09/bmw.py.

9.5 Chapter Summary

    This chapter analyzed several well-known websites and demonstrated how the techniques covered in this book can be applied to them. We used CSS selectors when scraping Google's results page, tested a browser rendering engine and an API when scraping Facebook pages, used a Sitemap when crawling Gap, and made an AJAX call to a map interface to scrape all BMW dealers.
