A step-by-step guide to getting started with Python web crawlers!

From setting up the environment, to learning the basics, to writing a real crawler, this article walks you step by step through getting started with Python web crawlers.

This article is aimed at beginners. If you are looking to go deeper into crawling, it will be of limited help; its main purpose is to introduce web crawling to interested readers in a simple way and in plain language.

There are already plenty of web crawler tutorials on the Internet, but they all follow roughly the same routine, built around the following topics:

Web page knowledge such as HTML/CSS
requests or urllib
BeautifulSoup or regular expressions
Selenium or Scrapy

For me, crawling is a tool for obtaining data, not the main content of my work, so I don't have much time to spend on systematically learning all of the above. Each of those topics involves a large body of knowledge; after a period of study it is easy to get lost in the fog and lose interest. Without an overall picture or a clear focus, learning efficiency ends up very low.

This article does not explain in detail what CSS/HTML is or how to use requests or urllib. Its main purpose is to show how to crawl a website and fetch the resources we need, touching on one or several of the modules above only as far as the functions we actually use; there is no need to learn them from beginning to end. I hope this approach gives readers interested in crawlers an overall picture of the technique, enough to meet everyday data-acquisition needs. If you want to study the subject in depth, you can follow up with a systematic course and work through the modules above carefully and in detail.

Preparation

Many tools are used or mentioned in web crawler tutorials. This article uses the following:
1. A web browser (Google Chrome)
2. BeautifulSoup4
3. requests

The web browser is mainly used to view a page's HTML source and to inspect page elements. There are many browsers, such as Chrome, Firefox, and IE; everyone has different preferences and can choose according to their daily habits. This article uses Google Chrome as the example.

BeautifulSoup4 is an HTML and XML parser. It makes it easy to parse web pages and pull out the elements and information we want, saving us the trouble of filtering text by hand, and it provides methods for iterating, searching, and modifying the parse tree. When matching page content, BeautifulSoup is no faster than regular expressions, and is often slower, but its biggest advantage is simplicity and convenience, which makes it one of the must-have tools in many crawler projects.

Install
$ pip install beautifulsoup4
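
As a quick illustration, here is a minimal sketch (with a made-up HTML snippet) of how BeautifulSoup4 parses a string of HTML and finds a tag:

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet just to show the API
html = '<div><a href="/item/demo" target="_blank">demo</a></div>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")   # first <a> tag in the document
print(link.text)        # demo
print(link["href"])     # /item/demo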

Requests is the work of Python developer Kenneth Reitz. It is a third-party library for making network requests. Python ships with the urllib module for accessing network resources, but it is relatively cumbersome to use; requests is much more convenient and quicker to work with, so this article uses requests for network requests.

Install
$ pip install requests
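
As a quick check that the installation works, fetching a page takes only a couple of lines (a minimal sketch; the URL is the encyclopedia home page used later in this article):

import requests

# Fetch a page and show the status code and the start of the HTML
response = requests.get("https://baike.baidu.com/")
print(response.status_code)
print(response.text[:200])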

Hands-on

Many tutorials choose to crawl Qiushibaike (the "Encyclopedia of Embarrassing Things") or pictures from web pages. This article goes in a different direction and crawls the widely used Baidu Encyclopedia, which is more intuitive and easier to understand.

If you browse the web often and pay attention to details, you will notice that a web address mainly consists of two parts: a base part and a suffix for the specific entry. For example, the encyclopedia entry used below consists of the base part https://baike.baidu.com and the suffix item/Lin Chiling/172898?fr=aladdin. So before crawling a website, we first need to obtain a URL.
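
In code, the base part and the entry suffix can simply be spliced back together; a small sketch using the example suffix above, where quote() percent-encodes the Chinese characters the way the browser does:

from urllib.parse import quote

base = "https://baike.baidu.com/"
# example entry suffix from above; quote() percent-encodes the Chinese characters
suffix = "item/" + quote("林志玲") + "/172898?fr=aladdin"
print(base + suffix)
# https://baike.baidu.com/item/%E6%9E%97%E5%BF%97%E7%8E%B2/172898?fr=aladdin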

The first step is to determine a goal: what data do you want to crawl? Many people will think this is a trivial question, but I think it is very important. Only with a purpose can you work efficiently; without a goal driving you, there is no problem or pressure to focus on, the work becomes aimless, and efficiency drops. So the most important thing is to first decide what data you want to crawl, for example:


Music
Pictures
Materials on the web
...

This article uses two goals as examples: crawling the internal links of a Baidu Encyclopedia entry and downloading its pictures.

In the second step, we need a base URL. The base URL of Baidu Encyclopedia is

https://baike.baidu.com/

The third step is to open a starting page. We use Lin Chiling's Baidu Encyclopedia entry as the starting page for the crawl.

The fourth step is to view the source code. Many people know that F12 opens the developer tools, in Google Chrome as well as in IE, but after pressing F12 the first reaction is often "What is all this?", and it is easy to feel lost.

Of course, we could work through the source code step by step, learn HTML, and then use regular expressions to match the information we want element by element, but that is too complicated. I personally recommend using the browser's inspection tool instead.

To crawl internal links, we first need to locate the elements we care about.

Right-click -> Inspect quickly locates the element in question.

I think this is enough. The simplest web crawler just repeats the following two steps over and over:
1. Inspect and locate the elements and attributes we want.
2. Use BeautifulSoup4 to match and extract the information.
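
Put together as a rough sketch, the two steps become a simple loop over pages to visit. The fetch and extract_links callables here are hypothetical placeholders for the requests and BeautifulSoup code shown below:

from collections import deque

def crawl(start_url, fetch, extract_links, max_pages=50):
    # fetch(url) returns HTML text, extract_links(html) returns an iterable of URLs;
    # both are placeholders for the requests/BeautifulSoup code that follows.
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        html = fetch(url)                 # step 1: fetch the page we located
        for link in extract_links(html):  # step 2: match the links we want
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen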

Using the inspection tool, you can see that the source code of the internal-link part of the encyclopedia entry looks like this:
Element 1:

<a target="_blank" href="/item/%E5%87%AF%E6%B8%A5%E6%A8%A1%E7%89%B9%E7%BB%8F%E7%BA%AA%E5%85%AC%E5%8F%B8/5666862" data-lemmaid="5666862">凯渥模特经纪公司</a>

Element 2:

<a target="_blank" href="/item/%E5%86%B3%E6%88%98%E5%88%B9%E9%A9%AC%E9%95%87/1542991" data-lemmaid="1542991">决战刹马镇</a>

Element 3:

<a target="_blank" href="/item/%E6%9C%88%E4%B9%8B%E6%81%8B%E4%BA%BA/10485259" data-lemmaid="10485259">月之恋人</a>
Element 4:
<a target="_blank" href="/item/AKIRA/23276012" data-lemmaid="23276012">AKIRA</a>

As can be seen from the four elements above, the internal links we want are inside <a> tags, and each tag has the following attributes:

target: this attribute specifies where the linked document is opened. _blank means the browser always loads the target document, i.e. the document pointed to by href, in a new tab.

href: as mentioned several times already, the href attribute specifies the target of the hyperlink. When the user clicks the content inside the <a> tag, the browser tries to open and display the document at the link given by href.

data-*: HTML5 custom data attributes, used to store user-defined data on an element.
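
To make this concrete, here is a small sketch that parses Element 4 above on its own and reads these attributes through BeautifulSoup's dictionary-style access:

from bs4 import BeautifulSoup

html = '<a target="_blank" href="/item/AKIRA/23276012" data-lemmaid="23276012">AKIRA</a>'
tag = BeautifulSoup(html, "html.parser").a

print(tag["target"])            # _blank
print(tag["href"])              # /item/AKIRA/23276012
print(tag.get("data-lemmaid"))  # 23276012
print(tag.text)                 # AKIRA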

It is clear that the information we want, the internal link of the entry, is in href. The goal of our crawler is therefore very clear: parse out these href hyperlinks. At this point the browser's inspection tool has done its job, and the question becomes how to parse the href out of the <a> tags. This is where BeautifulSoup4 comes in handy. Use BeautifulSoup4 to parse the HTML we fetched from the web page:

soup = BeautifulSoup(response.text, 'html.parser')

Seeing this, you may wonder: what is html.parser?

It is an HTML parser. BeautifulSoup can work with several parsers, each with its own trade-offs: the built-in html.parser needs no extra installation and has decent speed and tolerance; lxml is faster but requires an extra dependency; html5lib parses pages the way a browser does and is the most tolerant, but also the slowest.

To sum up, we choose html.parser. Having chosen the parser, we can start matching the elements we want. Looking at the HTML, however, there are many <a> tags on the page; which ones should we match?

<a target="_blank" href="/item/AKIRA/23276012" data-lemmaid="23276012">AKIRA</a>

Looking carefully, you will notice the pattern: target="_blank", and the href attribute starts with /item. So we have our matching conditions:

{"target": "_blank", "href": re.compile("/item/(%.{2})+$")}

Use these conditions to match the <a> tags whose target and href meet our requirements:

sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
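
To see what this pattern keeps and what it drops, here is a small check (the first two hrefs are taken from the page above, the third is Element 4); BeautifulSoup applies such a regex by searching the attribute value, so only hrefs ending in purely percent-encoded segments survive:

import re

pattern = re.compile("/item/(%.{2})+$")
hrefs = [
    "/item/%E5%88%BA%E9%99%B5",                                      # percent-encoded suffix only -> matches
    "/item/%E5%86%B3%E6%88%98%E5%88%B9%E9%A9%AC%E9%95%87/1542991",   # trailing numeric id -> no match
    "/item/AKIRA/23276012",                                          # plain-text suffix -> no match
]
for href in hrefs:
    print(href, bool(pattern.search(href)))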

The complete code, with the necessary imports and constants, is:

import re
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://baike.baidu.com/"
START_PAGE = "item/%E6%9E%97%E5%BF%97%E7%8E%B2/172898"  # the Lin Chiling entry used as the starting page
sessions = requests.Session()

def main():
    url = BASE_URL + START_PAGE
    response = sessions.get(url)  # a plain GET is enough to fetch the page
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "html.parser")
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
    for sub_url in sub_urls:
        print(sub_url)
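
To run it as a script, add the usual entry-point guard:

if __name__ == "__main__":
    main()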


The output is:

<a href="/item/%E5%B9%B8%E7%A6%8F%E9%A2%9D%E5%BA%A6" target="_blank">幸福额度</a>
<a href="/item/%E5%8C%97%E4%BA%AC%C2%B7%E7%BA%BD%E7%BA%A6" target="_blank">北京·纽约</a>
<a href="/item/%E5%A4%9A%E4%BC%A6%E5%A4%9A%E5%A4%A7%E5%AD%A6" target="_blank">多伦多大学</a>
<a href="/item/%E5%88%BA%E9%99%B5" target="_blank">刺陵</a>
<a href="/item/%E5%86%B3%E6%88%98%E5%88%B9%E9%A9%AC%E9%95%87" target="_blank">决战刹马镇</a>
<a href="/item/%E5%8C%97%E4%BA%AC%C2%B7%E7%BA%BD%E7%BA%A6" target="_blank">北京·纽约</a>
<a href="/item/%E5%BC%A0%E5%9B%BD%E8%8D%A3" target="_blank">张国荣</a>
<a href="/item/%E5%A5%A5%E9%BB%9B%E4%B8%BD%C2%B7%E8%B5%AB%E6%9C%AC" target="_blank">奥黛丽·赫本</a>
<a href="/item/%E6%9E%97%E5%81%A5%E5%AF%B0" target="_blank">林健寰</a>
<a href="/item/%E6%96%AF%E7%89%B9%E7%BD%97%E6%81%A9%E4%B8%AD%E5%AD%A6" target="_blank">斯特罗恩中学</a>
<a href="/item/%E5%A4%9A%E4%BC%A6%E5%A4%9A%E5%A4%A7%E5%AD%A6" target="_blank">多伦多大学</a>
<a href="/item/%E5%8D%8E%E5%86%88%E8%89%BA%E6%A0%A1" target="_blank">华冈艺校</a>
<a href="/item/%E5%94%90%E5%AE%89%E9%BA%92" target="_blank">唐安麒</a>
<a href="/item/%E6%97%A5%E6%9C%AC%E5%86%8D%E5%8F%91%E7%8E%B0" target="_blank">日本再发现</a>
<a href="/item/%E4%BA%9A%E5%A4%AA%E5%BD%B1%E5%B1%95" target="_blank">亚太影展</a>
<a href="/item/%E6%A2%81%E6%9C%9D%E4%BC%9F" target="_blank">梁朝伟</a>
<a href="/item/%E9%87%91%E5%9F%8E%E6%AD%A6" target="_blank">金城武</a>
......

Then simply filter with the attribute field sub_url["href"] to get:

/item/%E5%B9%B8%E7%A6%8F%E9%A2%9D%E5%BA%A6
/item/%E5%8C%97%E4%BA%AC%C2%B7%E7%BA%BD%E7%BA%A6
/item/%E5%A4%9A%E4%BC%A6%E5%A4%9A%E5%A4%A7%E5%AD%A6
/item/%E5%88%BA%E9%99%B5
/item/%E5%86%B3%E6%88%98%E5%88%B9%E9%A9%AC%E9%95%87
/item/%E5%8C%97%E4%BA%AC%C2%B7%E7%BA%BD%E7%BA%A6
/item/%E5%BC%A0%E5%9B%BD%E8%8D%A3
......

This gives us the suffix part of each internal link; splicing it together with the base URL yields the complete internal link address.
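
For example (a small sketch continuing the loop above), urljoin handles the splicing cleanly:

from urllib.parse import urljoin

for sub_url in sub_urls:
    full_url = urljoin(BASE_URL, sub_url["href"])
    print(full_url)  # e.g. https://baike.baidu.com/item/%E5%88%BA%E9%99%B5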

In the same way, you can crawl other content, such as jokes from Qiushibaike, materials from professional websites, or other Baidu Encyclopedia entries. Some text is fairly messy, so the process may need an extra filtering step, for example using regular expressions to pick out the valuable information in a block of text; the approach is the same as above.
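
As a tiny illustration of that filtering step (the text here is made up), re.findall can pull structured bits out of messy text:

import re

# Made-up messy text; pull out anything that looks like a four-digit year
text = "Born in 1974, debuted in 2000, and won an award in 2011."
years = re.findall(r"(?:19|20)\d{2}", text)
print(years)  # ['1974', '2000', '2011']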

Download pictures


Just as with crawling internal links, make good use of the browser's inspection tool to find the links to the pictures inside the entry.

<img class="picture" alt="活动照" src="https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/s%3D220/sign=85844ee8de0735fa95f049bbae500f9f/dbb44aed2e738bd49d805ec2ab8b87d6267ff9a4.jpg" style="width:198px;height:220px;">

The image link is stored in the src attribute of the <img> tag, so the complete link to each image can be matched using the same method as above.

# Reuses BASE_URL, START_PAGE, and sessions defined above
url = BASE_URL + START_PAGE
response = sessions.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "html.parser")
image_urls = soup.find_all("img", {"class": "picture"})
for image_url in image_urls:
    print(image_url["src"])

The output is as follows,

https://gss2.bdstatic.com/9fo3dSag_xI4khGkpoWK1HF6hhy/baike/s%3D220/sign=36dbb0f7e1f81a4c2232ebcbe7286029/a2cc7cd98d1001e903e9168cb20e7bec55e7975f.jpg
https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/s%3D220/sign=85844ee8de0735fa95f049bbae500f9f/dbb44aed2e738bd49d805ec2ab8b87d6267ff9a4.jpg

Then use requests to fetch the image data and write it to a local file.

headers = {"User-Agent": "Mozilla/5.0"}  # a simple User-Agent header; adjust as needed

for image_url in image_urls:
    url = image_url["src"]
    response = requests.get(url, headers=headers)
    # use the last 10 characters of the URL as a crude local file name
    with open(url[-10:], "wb") as f:
        f.write(response.content)

In addition to requests, you can also use urllib.request.urlretrieve to download images. urlretrieve is a little more convenient, but for large files requests can stream the response and write it in chunks, which is an advantage.
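
A sketch of that chunked download, assuming the same image_urls and headers as above:

for image_url in image_urls:
    url = image_url["src"]
    # stream=True avoids loading the whole file into memory at once
    with requests.get(url, headers=headers, stream=True) as response:
        with open(url[-10:], "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)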

The method introduced above is a relatively simple one. If you have the time, you can also try Selenium or Scrapy. Both tools are very powerful, especially Selenium: it was originally an automated testing tool, but it turns out to work well for crawling too. It lets the browser fetch data for you, browsing pages and capturing data the same way a user does, which can be very handy. If you are interested, give it a try.
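
For reference, a minimal Selenium sketch (assuming Chrome and a matching chromedriver are installed; the URL is the AKIRA entry from Element 4 above):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # requires chromedriver to be available on PATH
driver.get("https://baike.baidu.com/item/AKIRA/23276012")
soup = BeautifulSoup(driver.page_source, "html.parser")  # hand the rendered HTML to BeautifulSoup
print(soup.title.text)
driver.quit()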

