Crawling the Maoyan ("Cat's Eye") Movies Top 100 with Python

1 Inspect the page structure

(1) Decide which fields to crawl

Movie title

Starring actors

Release date

Rating

(2) Analyze the page structure

Press F12 to open the browser's developer tools, click the element-picker icon in the toolbar, then click each required field on the page to inspect its HTML (tag name and class attributes).
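The post never shows how the page HTML is fetched. A minimal sketch using the standard library's urllib is below; the board URL and the User-Agent header are assumptions (many sites reject requests that use Python's default User-Agent), so adjust them to the actual target page.

```python
from urllib.request import Request, urlopen

# Assumed board URL for the Maoyan Top 100; adjust if the site differs.
BASE_URL = "https://maoyan.com/board/4?offset={}"
# A browser-like User-Agent; sites often block the default Python one.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def get_page(offset=0):
    """Fetch one page of the board and return its HTML as text."""
    req = Request(BASE_URL.format(offset), headers=HEADERS)
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")

# Usage (requires network access):
# html = get_page(0)
```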

(3) Parse the source with BeautifulSoup and set the filter attributes

soup = BeautifulSoup(html, 'lxml')

Movie_name = soup.find_all('div', class_='movie-item-info')

Movie_Score1 = soup.find_all('p', class_='score')
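To see how these two filters behave without fetching the live site, here is a tiny inline snippet run through BeautifulSoup. The class names (`movie-item-info`, `score`) come from the post; the surrounding markup is a simplified assumption of the real page, and `html.parser` is used so no lxml install is needed.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real page markup (assumed structure).
snippet = """
<div class="movie-item-info">
  <p class="name"><a>霸王别姬</a></p>
  <p class="star">Starring: 张国荣</p>
  <p class="releasetime">Release: 1993-01-01</p>
</div>
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
"""

soup = BeautifulSoup(snippet, 'html.parser')  # stdlib parser, no extra install
Movie_name = soup.find_all('div', class_='movie-item-info')
Movie_Score1 = soup.find_all('p', class_='score')

print(Movie_name[0].find('a').text)             # 霸王别姬
print(Movie_Score1[0].find_all('i')[0].string)  # 9.
```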

(4) Debug to verify that the filter attributes are correct

(5) Extract the corresponding fields

for cate, score in zip(Movie_name, Movie_Score1):
    data = {}
    movie_name1 = cate.find('a').text.strip('\n')  # movie title
    data['title'] = movie_name1
    movie_actor = cate.find_all("p")[1].text.replace("\n", " ").strip()  # starring actors
    data['actors'] = movie_actor
    movie_time = cate.find_all("p")[2].text.strip('\n').strip()  # release date
    data['date'] = movie_time
    movie_score1 = score.find_all("i")[0].string  # integer part of the rating
    movie_score2 = score.find_all("i")[1].string  # fractional part of the rating
    movie_score = movie_score1 + movie_score2
    data['score'] = movie_score
    name = movie_name1 + "\t" + movie_actor + "\t" + movie_time + "\t" + movie_score
    DATA.append(name)
    with open('Movie1.txt', 'a+') as f:
        f.write("\n{}".format(name))

(6) Crawl multiple pages

Following steps (1)-(3) above, find the pagination pattern in the page URLs: the offset parameter increases by 10 per page (offset=0, offset=10, ...).
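The pagination above can be sketched as a simple loop over offsets. The fetch and parse steps are the placeholders shown earlier; only the URL construction is executed here, and the board URL is an assumption.

```python
# Top 100 movies, 10 per page -> offsets 0, 10, ..., 90.
offsets = [i * 10 for i in range(10)]
urls = ["https://maoyan.com/board/4?offset={}".format(o) for o in offsets]

for url in urls:
    # html = get_page_html(url)  # hypothetical fetch helper
    # ... parse with BeautifulSoup and extract fields as in step (5) ...
    pass

print(urls[1])  # https://maoyan.com/board/4?offset=10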

2 Store the results in Excel

i = 0  # current worksheet row
for datas in DATA:
    fields = datas.split('\t')  # fields were joined with '\t' during extraction
    for j in range(len(fields)):  # write each tab-separated field to its own column
        sheet1.write(i, j, fields[j])
    i = i + 1
f.save("d.xls")  # save the workbook (f is the xlwt Workbook)
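The loop above relies on an xlwt Workbook (`f`) and worksheet (`sheet1`) created beforehand, which the post does not show; the assumed setup is given in comments below. To make the row/column logic checkable without xlwt installed, a minimal in-memory stub with the same `write(row, col, value)` signature stands in for the worksheet, and `DATA` holds one sample record in the format produced by step (5).

```python
# Assumed xlwt setup (not shown in the post):
#   import xlwt
#   f = xlwt.Workbook()
#   sheet1 = f.add_sheet('movies')

class StubSheet:
    """In-memory stand-in for xlwt's Worksheet, for illustration only."""
    def __init__(self):
        self.cells = {}
    def write(self, row, col, value):  # same call shape as Worksheet.write
        self.cells[(row, col)] = value

DATA = ["霸王别姬\t张国荣\t1993-01-01\t9.5"]  # sample record from step (5)
sheet1 = StubSheet()

i = 0  # current worksheet row
for datas in DATA:
    fields = datas.split('\t')  # fields were joined with '\t' earlier
    for j in range(len(fields)):
        sheet1.write(i, j, fields[j])
    i += 1

print(sheet1.cells[(0, 3)])  # 9.5
```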

3 Results

 


Origin www.cnblogs.com/lanjianhappy/p/11930088.html