White quickly experience the reptiles crawl Sina trending

First, there must be some preparation, of course, the premise is the need to understand the basics of python.

Installation language environment and tools needed:
1, python3.6.5 Python I'm using version
2, three library requests
the installation command: install requests PIP3
3, Beautiful Soup
installation command: PIP3 install BS4
4, lxml
installation command: pip3 install lxml
5, Pycharm
a python of IDE official website address: https: //www.jetbrains.com/pycharm/
course also be directly encoded in the terminal

The code:
create pycharm with a python project, and then create a python file, such as test.py, and then paste the following code to run after.

import requests
from bs4 import BeautifulSoup

mheaders = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh-TW;q=0.7,zh;q=0.6",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Host": "s.weibo.com",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"
}

targetUrl = 'https://s.weibo.com/top/summary?cate=realtimehot'

response = requests.get(targetUrl, headers=mheaders).text
soup = BeautifulSoup(response, 'lxml')
sort = 0
for hot_td in soup.find_all('td', class_="td-02"):
    # 标题
    hotTitle = hot_td.find('a').string
    sort += 1
    print('第%s位  %s ' % (sort, hotTitle))

Enter the result:
Here Insert Picture Description
a simple explanation:
1, Requests tripartite network is a python library that provides a simple http get \ post requests and other methods. requests.get (targetUrl, headers = mheaders) get here is to get representative mode request, is provided to the request headers disguised browser request, the request is intercepted to avoid off.

2, BeautifulSoup can use regular expressions instead of html tags to find us want to crawl. soup.find_all ( 'td', class _ = "td-02") on behalf Find all class = "td-02" of the td tag. View page source to see the contents of the target site can be crawled look as follows:

<td class="td-02">
   <a href="/weibo?q=%23%E6%B8%85%E6%98%8E%E8%BF%BD%E6%80%9D%E5%AE%B6%E5%9B%BD%E6%B0%B8%E5%BF%B5%23&Refer=new_time" target="_blank">清明追思家国永念
   </a>
</td>
...等等....

Similarly, hot_td.find ( 'a'). String is to find a label in the td tag takes it contains content that we want to crawl the content of the hot search.

Finally:
grab the library there are many real projects but also consider a lot, such as how to grab the next page, change ip, data warehousing and so on, here it is when the self-study after the python, crawling experience a small demo.

Today is 2020 April 4, Ching Ming Festival, people across the country to fight the epidemic heroes sacrifice of silence today! Do not say, in my heart! Come motherland!

Released seven original articles · won praise 14 · views 20000 +

Guess you like

Origin blog.csdn.net/u010823943/article/details/105308201