[Python][Crawler 02] Requests+BeautifulSoup Example: Grab NetEase Cloud Playlist

>Spider

    As mentioned before, a complete spider needs two capabilities: fetching web pages, and parsing the fetched pages to extract the data we want.

    The previous article used the standard library's urllib plus regular expressions to extract some data from the Bilibili home page; this article uses the requests and BeautifulSoup libraries introduced there to implement another crawler.


>Environment Setup

    lxml is far faster than BeautifulSoup, and BeautifulSoup's development seems to have slowed. But BeautifulSoup is still very popular and a great many tutorials depend on it, so even though lxml could replace it, BeautifulSoup remains worth learning and using. Moreover, BeautifulSoup can use lxml as its underlying parsing engine, which is why lxml is installed here as well.

Installation:

pip install beautifulsoup4
pip install requests
pip install lxml
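
    As a quick sanity check that everything is installed, and to show what "using lxml as the engine" means in practice, here is a minimal sketch (the sample HTML mirrors the f-hide song list we will scrape below):

from bs4 import BeautifulSoup

html = '<ul class="f-hide"><li><a href="/song?id=1">Song A</a></li></ul>'

# "lxml" tells BeautifulSoup to delegate parsing to the fast lxml engine;
# "html.parser" would fall back to the slower pure-Python parser.
soup = BeautifulSoup(html, "lxml")
print(soup.a.string)  # -> Song A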


>Crawler Implementation

    Open any playlist, for example http://music.163.com/#/playlist?id=2208261223, and try scraping it naively: the response turns out to be mostly JS scripts, because the whole page is loaded asynchronously, so there is little value in scraping this URL directly. (Not totally worthless: you could use PhantomJS to simulate a browser, execute the JS, and get the data that way, as sketched below.)
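
    For reference, a minimal sketch of that browser-simulation route with Selenium driving PhantomJS (this assumes an older Selenium release that still ships the PhantomJS driver and a phantomjs binary on your PATH; it is not needed for the rest of this article):

from selenium import webdriver

# PhantomJS is a headless browser: it executes the page's JS for us,
# so page_source holds the fully rendered DOM after async loading.
driver = webdriver.PhantomJS()
driver.get('http://music.163.com/#/playlist?id=2208261223')
html = driver.page_source
driver.quit()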

    Press F12 to open the browser's developer tools and locate the request playlist?id=2208261223: its response contains exactly the data we need.


    Look at the headers and the actual URL of this request: http://music.163.com/playlist?id=2208261223. It differs from the address-bar URL only by the missing #, yet the content returned is completely different. The reason is that everything after # is a URL fragment, which the browser handles locally and never sends to the server.
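
    A quick check with the standard library makes the fragment split visible:

from urlparse import urlparse  # Python 2; use urllib.parse in Python 3

parts = urlparse('http://music.163.com/#/playlist?id=2208261223')
print(parts.path)      # -> /
print(parts.fragment)  # -> /playlist?id=2208261223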


    So we mock the corresponding headers and request this URL directly:

import re
import urllib2

nn163_cloud_url = 'http://music.163.com/playlist?id=2208261223'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

# Fake a browser User-Agent so the server does not reject the request.
request = urllib2.Request(nn163_cloud_url)
request.add_header('User-Agent', user_agent)
res = urllib2.urlopen(request, timeout=10)

# The song list lives in a hidden <ul class="f-hide"> element.
root = re.search(r'<ul class="f-hide">(.*?)</ul>', res.read(), re.S | re.M).group(1)
musics = re.findall(r'<a href="/song.*?">(.*?)</a>', root, re.S | re.M)
for i in musics:
    print i
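
    (Side note: in Python 3, urllib2 was merged into urllib.request, and read() returns bytes; a rough Python 3 equivalent of the snippet above would be:)

import re
import urllib.request

nn163_cloud_url = 'http://music.163.com/playlist?id=2208261223'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

request = urllib.request.Request(nn163_cloud_url, headers={'User-Agent': user_agent})
res = urllib.request.urlopen(request, timeout=10)
html = res.read().decode('utf-8')  # bytes in Python 3, so decode first

root = re.search(r'<ul class="f-hide">(.*?)</ul>', html, re.S | re.M).group(1)
for name in re.findall(r'<a href="/song.*?">(.*?)</a>', root, re.S | re.M):
    print(name)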

    Both versions above still extract the information with urllib2/urllib plus regular expressions. With requests and BeautifulSoup we can do it far more conveniently:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
}
nn163_cloud_url = 'http://music.163.com/playlist?id=2208261223'

# A session reuses the underlying TCP connection across requests.
s = requests.session()
bs = BeautifulSoup(s.get(nn163_cloud_url, headers=headers).content, "lxml")
# bs.ul is the first <ul> in the document -- here, the hidden song list.
for i in bs.ul.children:
    print i.string

    Much more concise, isn't it?
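
    One caveat: bs.ul simply grabs the first <ul> in the document, which happens to be the hidden song list on this page. A slightly more defensive sketch selects the list by its class explicitly (the same f-hide class the regex above matched):

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
}
res = requests.get('http://music.163.com/playlist?id=2208261223', headers=headers)
bs = BeautifulSoup(res.content, "lxml")

# Select the song list by class instead of relying on document order.
for a in bs.find('ul', class_='f-hide').find_all('a'):
    print(a.string)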


>Help Documentation

requests

BeautifulSoup
