What to do if one line of Python code is too long

Crawling the Douban movie rankings in 30 lines of Python

Today we will crawl the Douban movie rankings. We want the crawler to collect information about each movie and write it to a document.

Implementation process

# Import libraries
import requests
from lxml import etree

The code above requires the requests and lxml libraries, both of which can be installed with pip (pip install requests lxml).

headers={
      'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
      }
j=1  # counter used to number the movies
fl=open("电影排行.doc","w",encoding='utf-8')

The code above sets up the HTTP request headers and opens the output file.

for i in range(0,250,25):  # i is the start offset; each page lists 25 movies
      url='https://movie.douban.com/top250?start='+str(i)+'&filter='
      response=requests.get(url,headers=headers)
      text=response.text
      html=etree.HTML(text.encode('utf-8'))
      ul=html.xpath('//div[@class="info"]')
      for div in ul:
            t="".join(div.xpath('./div[@class="hd"]/a/span[@class="title"]/text()'))
            s="".join(div.xpath('./div[@class="hd"]/a/span[@class="other"]/text()'))
            title=t+s
            d="".join(div.xpath('./div[@class="bd"]/p/text()'))
            director=d.replace("              ","").replace("\n","").replace("主演","\n主演")
            quote="".join(div.xpath('./div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()'))
            content=str(j)+"."+title+"\n"+director+"\n主题:"+quote+"\n"
            fl.write(content)
            j=j+1
fl.close()

The code above crawls the rankings.
Note this line in particular:

url='https://movie.douban.com/top250?start='+str(i)+'&filter='

Paging through the leaderboard shows that every page's URL follows this pattern, with the start value increasing by 25 from one page to the next.
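The URL pattern can be checked without touching the network. The sketch below rebuilds the ten page URLs; the helper name build_page_urls and its parameters are made up for illustration and are not part of the original script.

```python
def build_page_urls(total=250, per_page=25):
    """Return the ranking page URLs, one per block of `per_page` movies."""
    base = 'https://movie.douban.com/top250?start={}&filter='
    # start is the zero-based index of the first movie on each page
    return [base.format(start) for start in range(0, total, per_page)]


urls = build_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Building the list up front keeps the URL logic separate from the request loop, which makes it easy to fetch a single page while testing.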

response=requests.get(url,headers=headers)
text=response.text
html=etree.HTML(text.encode('utf-8'))
ul=html.xpath('//div[@class="info"]')
for div in ul:
      t="".join(div.xpath('./div[@class="hd"]/a/span[@class="title"]/text()'))
      s="".join(div.xpath('./div[@class="hd"]/a/span[@class="other"]/text()'))
      title=t+s
      d="".join(div.xpath('./div[@class="bd"]/p/text()'))
      director=d.replace("              ","").replace("\n","").replace("主演","\n主演")
      quote="".join(div.xpath('./div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()'))

This part of the code locates the text content on the page. The XPath expressions come from inspecting the page's HTML: right-click the page, choose Inspect, and read the element structure.
Each XPath query returns a list of text fragments, so this section uses str.join() to concatenate them into one string.
str.join() works as follows:

# str.join() concatenates the elements of a sequence into a new string,
# using the string it is called on as the separator.
sep = "-"
seq = ("a", "b", "c")  # a sequence of strings
print(sep.join(seq))
# Output: a-b-c
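To see the XPath lookups and join() working together without requesting the live site, the sketch below parses a small inline document. The markup in SAMPLE_HTML is invented to match the element structure that the XPath expressions in this article expect; it is not copied from the real page.

```python
from lxml import etree

# Made-up markup that mimics the structure the XPath expressions expect
SAMPLE_HTML = (
    '<html><body>'
    '<div class="info">'
    '<div class="hd"><a href="#">'
    '<span class="title">Movie A</span>'
    '<span class="other"> / Alias A</span>'
    '</a></div>'
    '<div class="bd"><p>Director: X</p>'
    '<p class="quote"><span class="inq">A one-line quote.</span></p></div>'
    '</div>'
    '</body></html>'
)

html = etree.HTML(SAMPLE_HTML)
entries = []
for div in html.xpath('//div[@class="info"]'):
    # xpath() returns a list of text nodes; join() merges them into one string
    t = "".join(div.xpath('./div[@class="hd"]/a/span[@class="title"]/text()'))
    s = "".join(div.xpath('./div[@class="hd"]/a/span[@class="other"]/text()'))
    quote = "".join(div.xpath('./div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()'))
    entries.append((t + s, quote))

print(entries)
```

Because join() on an empty list returns an empty string, an entry with no quote element simply yields "" instead of raising an error.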

This section also uses str.replace() to clean up the text content.
str.replace() works as follows:

# str.replace(old, new[, count]) returns a copy of the string with each
# occurrence of old replaced by new. If the optional third argument count
# is given, only the first count occurrences are replaced.
text = "this is string example....wow!!! this is really string"
print(text.replace("is", "was"))
print(text.replace("is", "was", 3))
# Output:
# thwas was string example....wow!!! thwas was really string
# thwas was string example....wow!!! thwas is really string
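The chained replace() calls used for the director line can be tried on their own. The sketch below is only an illustration: the fragments are made up, PAD is a stand-in for the run of spaces the script strips, and clean_credits is a hypothetical helper name.

```python
PAD = " " * 14  # stand-in for the indentation run that the script strips


def clean_credits(fragments, pad=PAD):
    """Join text fragments, drop padding and newlines, then break before 主演."""
    d = "".join(fragments)
    return d.replace(pad, "").replace("\n", "").replace("主演", "\n主演")


# Made-up fragments, shaped like the list an XPath text() query returns
fragments = [
    "\n" + PAD + "导演: Director A 主演: Actor B ",
    "\n" + PAD + "1994 / Drama\n",
]
cleaned = clean_credits(fragments)
print(cleaned)
```

Each replace() returns a new string, so the calls run left to right: padding goes first, then the original newlines, and only then is a single line break inserted before 主演.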

Full code

import requests
from lxml import etree
headers={
      'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
      }
j=1  # counter used to number the movies
fl=open("电影排行.doc","w",encoding='utf-8')
for i in range(0,250,25):  # i is the start offset; each page lists 25 movies
      url='https://movie.douban.com/top250?start='+str(i)+'&filter='
      response=requests.get(url,headers=headers)
      text=response.text
      html=etree.HTML(text.encode('utf-8'))
      ul=html.xpath('//div[@class="info"]')
      for div in ul:
            t="".join(div.xpath('./div[@class="hd"]/a/span[@class="title"]/text()'))
            s="".join(div.xpath('./div[@class="hd"]/a/span[@class="other"]/text()'))
            title=t+s
            d="".join(div.xpath('./div[@class="bd"]/p/text()'))
            director=d.replace("              ","").replace("\n","").replace("主演","\n主演")
            quote="".join(div.xpath('./div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()'))
            content=str(j)+"."+title+"\n"+director+"\n主题:"+quote+"\n"
            fl.write(content)
            j=j+1
fl.close()
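One possible variation on the numbering and file handling is sketched below. It is not the original script: format_entry is a hypothetical helper, the records are made up, and the output goes to a temporary file rather than 电影排行.doc. It assumes the same record layout as the content string above.

```python
import os
import tempfile


def format_entry(rank, title, director, quote):
    """Build one record in the same layout as the script's content string."""
    return f"{rank}.{title}\n{director}\n主题:{quote}\n"


# Made-up records standing in for scraped (title, director, quote) values
movies = [("Movie A", "导演: X", "Quote A"), ("Movie B", "导演: Y", "Quote B")]

path = os.path.join(tempfile.mkdtemp(), "ranking.txt")
# The with block closes the file even if a write raises an exception
with open(path, "w", encoding="utf-8") as fl:
    # enumerate(..., start=1) replaces the manual j=j+1 counter
    for rank, (title, director, quote) in enumerate(movies, start=1):
        fl.write(format_entry(rank, title, director, quote))

with open(path, encoding="utf-8") as fl:
    written = fl.read()
print(written)
```

Using a with block means a failed request or parse partway through the crawl does not leave the output file open.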

Result: running the script writes the numbered movie entries to 电影排行.doc.

Origin blog.csdn.net/mynote/article/details/132261560