A First Taste of Python Crawlers (with a bonus)

I watched a tutorial video on NetEase Cloud Classroom; now let's consolidate that knowledge:
1. First decide which website you want to crawl. Take the Sina News site as an example:

import requests  # importing a library in Python, much like importing a package in Java
res = requests.get('http://news.sina.com.cn/china/')  # fetch the page content
res.encoding = 'utf-8'  # set the encoding of the fetched page to avoid mojibake
print(res.text)  # print the page content as text
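Before trusting `res.text`, it helps to see what `res.encoding = 'utf-8'` actually does. This offline sketch builds a `requests` Response object by hand (the bytes are my own example, not a fetched page) to show how the wrong encoding produces mojibake and the right one fixes it:

```python
import requests

# Build a Response by hand: the body is the UTF-8 bytes of some Chinese text.
resp = requests.models.Response()
resp._content = '中文新闻'.encode('utf-8')

# requests falls back to ISO-8859-1 for text/* pages with no charset header,
# which garbles UTF-8 bytes:
resp.encoding = 'ISO-8859-1'
print(resp.text)  # mojibake

# Setting the correct encoding re-decodes the same bytes properly:
resp.encoding = 'utf-8'
print(resp.text)  # 中文新闻
```

In real code you can also use `res.apparent_encoding`, which guesses the encoding from the body bytes.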


2. A quick introduction to BeautifulSoup

from bs4 import BeautifulSoup
a = '<a href="#" name=abc age=123>i am a link</a>'
soup = BeautifulSoup(a, 'html.parser')  # html.parser is the parser to use
print(soup.select('a')[0])          # the whole a tag
print(soup.select('a')[0]['href'])  # the href attribute of the a tag
print(soup.select('a')[0]['name'])  # the name attribute of the a tag

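The class-vs-id distinction in `select` is worth one more tiny example. This snippet is my own made-up HTML (not from the course), showing that `.xxx` matches `class="xxx"` while `#xxx` matches `id="xxx"`:

```python
from bs4 import BeautifulSoup

html = '''
<div class="news-item"><h2>Headline A</h2></div>
<div id="top-story"><h2>Headline B</h2></div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('.news-item')[0].h2.text)  # class selector -> Headline A
print(soup.select('#top-story')[0].h2.text)  # id selector -> Headline B
```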

3. Now scrape the Sina News site for real:

  • First observe the structure of the Sina news page; you can see that the news title, time, and link all sit under .news-item
  • grab
import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for news in soup.select('.news-item'): # .news-item is a class; for an id, you would use '#news-item' instead
    print(news)

Observe the scraped content, then crawl one level deeper:

import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:
        h2 = news.select('h2')[0].text  # the text inside the h2 tag
        time = news.select('.time')[0].text
        a = news.select('a')[0]['href']  # the link inside the a tag
        print(time, h2, a)  # print the scraped fields

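Printing is fine for a quick check, but the scraped fields are easier to reuse if saved to a file. Here is a sketch of one way to write them to CSV, run against a hand-written snippet that mimics the `.news-item` structure (so it works offline; swap in the real `res.text` to use it on the live page):

```python
import csv
from bs4 import BeautifulSoup

# A hand-written stand-in for the Sina page, with the same .news-item structure.
html = '''
<div class="news-item"><h2><a href="http://news.sina.com.cn/a.shtml">Story one</a></h2><div class="time">May 1</div></div>
<div class="news-item"><h2><a href="http://news.sina.com.cn/b.shtml">Story two</a></h2><div class="time">May 2</div></div>
'''
soup = BeautifulSoup(html, 'html.parser')
with open('news.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['time', 'title', 'link'])
    for news in soup.select('.news-item'):
        if news.select('h2'):  # skip items without a headline
            writer.writerow([news.select('.time')[0].text,
                             news.select('h2')[0].text,
                             news.select('a')[0]['href']])
```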
That's it for crawling the Sina page for now; I'll write up the rest after I learn more.

Now for the bonus: this is a Python crawler I saw on Zhihu, from the post "50 lines of Python crawling code to open up a new world on Zhihu the right way!"
I have only just learned Python and wanted to use it for practice. The original post includes the source code, but I trimmed it down a bit, cutting the amount of code by more than half, haha.

  • The first step is to observe the structure of the web page.
    You can see that the pictures are all in the src attribute of the img inside the noscript tag.

Corrected code. PS: to run this, you must download the original Zhihu code; it's best to read the source-code analysis first (it's on Zhihu) and then look at mine. There is a folder structure in it that you can build yourself, then replace the original code with the code below. Note that it is Python 3.

from selenium import webdriver
# import time
import urllib.request
from bs4 import BeautifulSoup

def main():
    # *********  Open the browser driver and load the page you want to view ***********************

    driver = webdriver.Firefox()   # launch a Firefox browser

    # List the pages whose images you want to download

    # driver.get("https://www.zhihu.com/question/35931586") # What does your everyday outfit look like?
    # driver.get("https://www.zhihu.com/question/61235373") # What is it like to be a girl with nice legs but a flat chest?
    # driver.get("https://www.zhihu.com/question/28481779") # What is it like to have long legs?
    # driver.get("https://www.zhihu.com/question/19671417") # How do you pose nicely for photos?
    # driver.get("https://www.zhihu.com/question/20196263") # What troubles and inconveniences come with having a large chest?
    # driver.get("https://www.zhihu.com/question/46458423") # How can short-haired girls take sexy photos?
    driver.get("https://www.zhihu.com/question/26037846") # What is it like to have a great figure?

    # ****************   Prettify the HTML file and store the raw data  *****************************************

    result_raw = driver.page_source  # the raw HTML of the page
    result_soup = BeautifulSoup(result_raw, 'html.parser')
    count = 0
    with open("./output/rawfile/img_meta.txt", 'w') as img_meta:  # the ./output/rawfile/ directory must already exist
        for rich in result_soup.select('noscript'):
            img_url = rich.select('img')[0]['src']  # the download link of the image
            line = str(count) + "\t" + img_url + "\n"
            img_meta.write(line)  # writing this metadata line is optional
            urllib.request.urlretrieve(img_url, "./output/image/" + str(count) + ".jpg")  # download the images one by one; I haven't studied urllib in detail yet
            count += 1
            count += 1

    print("Store meta data and images successfully!!!")


if __name__ == '__main__':
    main()
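One weakness of the loop above: `urlretrieve` raises an exception when a single image fails to download, which aborts the whole run. A small helper (the name `download_image` is my own, not from the original post) that logs and skips failures instead:

```python
import urllib.request

def download_image(img_url, dest_path):
    """Download one image; return True on success, False on failure."""
    try:
        urllib.request.urlretrieve(img_url, dest_path)
        return True
    except Exception as err:  # bad URL, 404, network error, ...
        print("skipped", img_url, "->", err)
        return False
```

Inside the loop you would then call `download_image(img_url, "./output/image/" + str(count) + ".jpg")` and carry on regardless of the result.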
