Quick guide: How to create a Python-based crawler

The use of web scraping is growing rapidly, especially among large e-commerce companies, which use it to collect data for competitive analysis, competitor monitoring, and new product research. Web scraping is a method of extracting information from a website. In this article, you will learn how to create a Python-based scraper and dig into the code to see how it works.


In today's big-data world, it is hard to keep track of everything that is happening, and the situation is even more complicated for companies that need large amounts of information to succeed. Before they can use that data, they have to collect it somehow, which means dealing with thousands of sources.

There are two main ways to collect data. The first is to use an API provided by the website itself; this is the best way to get the data, and APIs are usually easy to use. Unfortunately, not every website offers one. That leaves the second method: web crawling.

What is web crawling?

Web crawling is a method of extracting information from a website. An HTML page is nothing but a collection of nested tags. The tags form a kind of tree whose root is the <html> tag, and they divide the page into different logical parts. Each tag can have its own descendants (children) and a parent.

For example, the HTML page tree can look like this:

[Image: example of an HTML page tag tree]

To process this HTML, you can work with the raw text or with the tag tree. Traversing this tree is what web crawling is about: out of all that markup, we find only the nodes we need and pull the information out of them. The goal is to turn unstructured HTML into structured, easy-to-use data that can go into a database or spreadsheet. Scraping requires a bot that collects the information by connecting to the Internet over HTTP or through a web browser. In this guide, we will use Python to create such a scraper.
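
To make the tree idea concrete, here is a minimal sketch (the HTML string and tag names are invented purely for illustration) of how Beautiful Soup, which we will use later, exposes the parents and children of a tag:

# A tiny, made-up HTML document used only to illustrate the tag tree
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div id="scores">
      <p>Team A: 3</p>
      <p>Team B: 1</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='scores')   # find the node we need
print(div.parent.name)                # 'body' - its parent in the tree
for p in div.find_all('p'):           # iterate over its children
    print(p.text)                     # 'Team A: 3', 'Team B: 1'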

What do we need to do:

  • Get the URL of the page we want to scrape data from

  • Copy or download the HTML content of this page

  • Process this HTML content and get the required data

This sequence lets us open the required URL, fetch the HTML, and then process it to extract the data we need. Sometimes, however, we have to sign in to the website first and only then go to a specific URL for the data. In that case we must add one more step: logging in to the website.
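
As a rough sketch of those first three steps when no login is involved (the URL is a placeholder and the requests library is an assumption here; the rest of this article uses Selenium precisely because a login step is needed):

# Sketch: fetch a page and parse it when no login is required
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/some-page'               # step 1: the URL to scrape
response = requests.get(url)                        # step 2: download the HTML
soup = BeautifulSoup(response.text, 'html.parser')  # step 3: process the HTML...
for cell in soup.find_all('td'):
    print(cell.text)                                # ...and pull out the data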

The libraries

We will use the Beautiful Soup library to parse the HTML content and pull out all the necessary data. It is an excellent Python package for parsing HTML and XML documents.

The Selenium library will help the crawler log in to the website and navigate to the required URL within a single session. With Selenium for Python you can automate actions such as clicking buttons and entering text.

Let's dive into the code

First, let's import the libraries we will use.

# Import the libraries
from selenium import webdriver
from bs4 import BeautifulSoup

Next, we need to point Selenium at the browser driver it uses to launch the web browser (we will use Google Chrome here). If we do not want the bot to display the browser's graphical interface, we add the "headless" option in Selenium.

A headless browser (one without a graphical interface) can work with web pages in an environment very similar to that of the popular browsers, except that everything is driven from the command line or over the network.

# Path to the Chrome driver
chromedriver = '/usr/local/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # open a headless browser
browser = webdriver.Chrome(executable_path=chromedriver,
                           chrome_options=options)

With the browser set up, the libraries installed, and the environment ready, we can start working with the HTML. Let's go to the login page and find the id, class, or name attributes of the fields where the user must enter the email address and password.

# Go to the login page
browser.get('http://playsports365.com/default.aspx')

# Find the form fields by their name attribute
email = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_UserName')
password = browser.find_element_by_name('ctl00$MainContent$ctlLogin$_Password')
login = browser.find_element_by_name('ctl00$MainContent$ctlLogin$BtnSubmit')

Then we send the login credentials into these fields and press the submit button to send the data to the server.

# Enter the login credentials
email.send_keys('********')
password.send_keys('*******')
# Click the submit button
login.click()

After logging in successfully, we go to the desired page and collect its HTML content.

# After a successful login, go to the "OpenBets" page
browser.get('http://playsports365.com/wager/OpenBets.aspx')
# Get the HTML content
requiredHtml = browser.page_source

Now that we have the HTML content, the only thing left is to process the data. We will do this with the help of Beautiful Soup and the html5lib parser.

html5lib is a Python package that implements the HTML5 parsing algorithm used by modern web browsers. Once the content has been parsed into a standardized structure, we can search for data in any child element of the HTML markup. The information we are looking for sits in a table tag, so that is what we look for.

soup = BeautifulSoup(requiredHtml, 'html5lib')
table = soup.findChildren('table')
my_table = table[0]

We find the parent tag once, then traverse its child tags and print out the values.

# Collect the cells and print their values
rows = my_table.findChildren(['th', 'tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.text
        print(value)
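
If, instead of printing, you want the spreadsheet-friendly output mentioned earlier, a minimal sketch could write each row to a CSV file (the filename here is arbitrary):

# Sketch: save the scraped rows to a CSV file (the filename is arbitrary)
import csv

with open('open_bets.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in my_table.findChildren('tr'):
        values = [cell.text.strip() for cell in row.findChildren('td')]
        if values:                  # skip rows that contain no <td> cells
            writer.writerow(values)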

To run this program, you will need to install Selenium, Beautiful Soup, and html5lib using pip. Once the libraries are installed, the program is started like this:

# python <program name>
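
For reference, the installation step mentioned above would look roughly like this (the standard PyPI package names are assumed, since the article does not list them):

# pip install selenium beautifulsoup4 html5lib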

The values will be printed to the console, and that is essentially how you scrape any website.

If we crawl a website with frequently updated content (for example, a sports score sheet), we should create a cron task to start the program at specific time intervals.
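
For example, a crontab entry along these lines (the interval and paths are placeholders) would run the scraper every 30 minutes:

# Example crontab entry: run the scraper every 30 minutes (paths are placeholders)
*/30 * * * * /usr/bin/python3 /path/to/scraper.py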

Great: everything works, the content is scraped, and the data is collected. Everything would be fine, except for one thing: the number of requests we have to send to get that data.


At some point the server gets tired of receiving a pile of requests from the same person and simply bans them; like people, servers have limited patience.

In that case, you have to disguise yourself. The most common symptoms of a ban are a 403 error or a blocked IP after too many requests: the server is up and perfectly able to process the request, but for reasons of its own it refuses to do so. The first problem is easy to solve. We can pretend to be human by sending a fake user agent with each request, that is, a random combination of operating system, platform, and browser (a package such as fake-useragent can generate these). In most cases, this is enough to quietly collect the information you are interested in.
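
As a sketch of that idea (the fake-useragent package is an assumption; the code earlier in this article never sets a user agent), the Chrome options from the setup step could be extended like this:

# Sketch: launch Chrome with a randomized user agent
# (assumes the fake-useragent package: pip install fake-useragent)
from fake_useragent import UserAgent
from selenium import webdriver

ua = UserAgent()
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument(f'--user-agent={ua.random}')  # pretend to be a random browser
browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver',
                           chrome_options=options)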

But sometimes putting a time.sleep() in the right place and filling in the request headers is not enough, and then you need a reliable way of changing your IP address. To scrape large amounts of data, you can:

– Develop your own IP address infrastructure;

– Use Tor (a topic that deserves several big articles of its own, and one that has already been covered at length elsewhere);

– Use a commercial proxy network.

For anyone just starting out with web scraping, the easiest option is to work with a proxy provider such as Infatica. They will help you set up proxies and take care of all the difficulties of proxy-server management. Collecting large amounts of data takes a lot of resources, so there is no need to "reinvent the wheel" by building your own internal proxy infrastructure. Even many of the largest e-commerce companies outsource proxy management to proxy-network services, because for most companies the first priority is the data, not proxy management.
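
For illustration, routing the Selenium session through a proxy can look roughly like this (the proxy address is a placeholder, not a real endpoint):

# Sketch: route the headless browser through a proxy (the address is a placeholder)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--proxy-server=http://proxy.example.com:8080')
browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver',
                           chrome_options=options)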
