Scraping 人人词典 (91dict) for Anki deck words with BeautifulSoup

I recently started preparing for the postgraduate entrance exam and promptly got hooked on Anki. I downloaded a deck of vocabulary from past exam papers, but plain rote review soon felt dull. That led me to contextual memorization, and from there, naturally, to American TV shows: how great would it be if every word came with a matching line from a show, plus its audio and translation!

That is exactly what 人人词典 does! So without delay I exported the word list from the deck and set out to scrape the matching content from the site. I had worried 人人词典 would be hard to deal with, but it turned out the basic crawler libraries were all I needed…

Source code:

import pandas as pd
from urllib.parse import urlencode
import requests
from bs4 import BeautifulSoup as bs
import re
import urllib.request
import csv
import time

# Column names for the exported Anki word list
# (named `columns` so it is not shadowed by the HTTP headers dict below)
columns = ['words', 'english', 'chinese', 'source']
file = pd.read_csv('./anki.csv', names=columns)
file.head()
words english chinese source
0 contend NaN NaN NaN
1 perceptive NaN NaN NaN
2 lameness NaN NaN NaN
3 mobilize NaN NaN NaN
4 plead NaN NaN NaN
# Deduplicate on the word column before crawling
file.drop_duplicates('words', inplace=True)
wordslist = file['words']
len(wordslist)
771
# Seed the results DataFrame with one word's row so there is a frame to concat onto
new_words = file[file['words'] == 'contend']
words_91dict = pd.DataFrame(new_words)
words_91dict
words english chinese source
0 contend NaN NaN NaN
url = 'http://www.91dict.com/words?'

def get_page(keyword):
    # Build the 91dict query URL for a word
    url_words = urlencode({'w': keyword})
    target_url = url + url_words
    return target_url
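
For example, get_page('contend') returns http://www.91dict.com/words?w=contend.
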
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}

def get_content(target_url):
    # Fetch the page with a browser User-Agent and parse it with lxml
    response = requests.get(target_url, headers=headers).content
    content = bs(response, 'lxml')
    return content
def get_target(content):
    # The first slide of the example carousel holds the top matching clip
    target = content.select("#flexslider_2 > ul.slides.clearfix > li:nth-of-type(1)")
    words = {}
    # Note the \s in each pattern: the page source has stray newlines around
    # the content to be captured, so the patterns must consume them
    # (see the pitfall note below)
    image_pattern = re.compile(r"<img src=\"(.*?)\"/>").findall(str(target))
    audio_pattern = re.compile(r"<audio src=\"(.*?)\">").findall(str(target))
    source_pattern = re.compile(r'''</audio>\s(.*)\s</div>''').findall(str(target))
    english_pattern = re.compile(r'''<div class="mBottom">\s(.*?)</div>''').findall(str(target))
    chinese_pattern = re.compile(r'''<div class="mFoot">\s(.*?)</div>''').findall(str(target))

    words['image'] = image_pattern[0]
    words['audio'] = audio_pattern[0]
    words['source'] = source_pattern[0]
    words['english'] = english_pattern[0]
    words['chinese'] = chinese_pattern[0]
    return words
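
As an aside, the same fields can be read without regexes by using BeautifulSoup accessors on the selected slide directly. A minimal sketch, assuming the markup that the patterns above imply (one <img>, one <audio>, and mBottom/mFoot divs inside the first slide; get_target_bs is my own name for it):

def get_target_bs(content):
    # Same fields as get_target, read with tag lookups instead of regexes
    li = content.select_one("#flexslider_2 > ul.slides.clearfix > li:nth-of-type(1)")
    words = {}
    words['image'] = li.find('img')['src']
    words['audio'] = li.find('audio')['src']
    # Assumes the source caption is the text node right after the <audio> tag
    words['source'] = li.find('audio').next_sibling.strip()
    # get_text(strip=True) sidesteps the stray-newline pitfall described below
    words['english'] = li.find('div', class_='mBottom').get_text(strip=True)
    words['chinese'] = li.find('div', class_='mFoot').get_text(strip=True)
    return words
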

There was one enormous pitfall here! When extracting the data, at first neither my CSS selectors nor my regular expressions would match anything! After digging for ages I found that the content I wanted has stray newline characters around it in the HTML source, hence the "\s" in every pattern!

So when nothing matches no matter what you try, remember to check for these tiny hidden traps!
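
A minimal reproduction of the trap (the HTML snippet here is made up for illustration):

html = '<div class="mBottom">\nYou are perceptive.</div>'
# '.' does not match '\n', so the naive pattern finds nothing:
re.findall(r'<div class="mBottom">(.*?)</div>', html)    # []
# Consuming the newline with \s (or compiling with re.DOTALL) fixes it:
re.findall(r'<div class="mBottom">\s(.*?)</div>', html)  # ['You are perceptive.']
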

for i in wordslist:
    try:
        target_url = get_page(i)
        content = get_content(target_url)
        words = get_target(content)
        # Save the clip's audio and still image locally
        urllib.request.urlretrieve(words['audio'], 'E:/91dict/' + i + '.mp3')
        urllib.request.urlretrieve(words['image'], 'E:/91dict/' + i + '.jpg')
        # .copy() avoids pandas' SettingWithCopyWarning on the slice
        new_words = file[file['words'] == i].copy()
        new_words['english'] = words['english']
        new_words['chinese'] = words['chinese']
        new_words['source'] = words['source']
        print(new_words)
        words_91dict = pd.concat([words_91dict, new_words])
        time.sleep(5)  # be polite to the server
    except Exception as e:
        print(i, 'failed:', e)
     words                                            english  \
0  contend  In Syria, the governor has invading Parthians ...   

                chinese               source  
0  叙利亚得总督要对付的可是入侵的帕提亚人呢  来自《公元:《圣经故事》后传 第7集》  
        words                                            english      chinese  \
1  perceptive  Oh, top marks, like I said, you are <em>percep...  一点不错 果然有洞察力   

         source  
1  来自《X战警:逆转未来》  
words_91dict.drop_duplicates(inplace=True)
words_91dict.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 706 entries, 0 to 1350
Data columns (total 4 columns):
words      706 non-null object
english    705 non-null object
chinese    705 non-null object
source     705 non-null object
dtypes: object(4)
memory usage: 16.5+ KB
words_91dict.to_csv('./words91dict.csv')
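
One further refinement, not part of the original run: growing a DataFrame with pd.concat inside the loop re-copies every accumulated row on each pass. The usual pandas idiom is to collect plain dicts and build the frame once at the end; a sketch under that assumption (media downloads omitted for brevity):

rows = []
for i in wordslist:
    try:
        words = get_target(get_content(get_page(i)))
        rows.append({'words': i,
                     'english': words['english'],
                     'chinese': words['chinese'],
                     'source': words['source']})
        time.sleep(5)
    except Exception as e:
        print(i, 'failed:', e)

words_91dict = pd.DataFrame(rows, columns=columns)
words_91dict.to_csv('./words91dict.csv')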

Reposted from blog.csdn.net/weixin_42616808/article/details/80926911