Getting bolder and bolder: use Python to collect this article's data and save it as a PDF

foreword

Hello everyone, this is the Demon King~

Materials needed this time:

  • wkhtmltopdf [software]
  • the sample code (shown below)

Third Party Libraries:

  • requests >>> pip install requests
  • parsel >>> pip install parsel
  • pdfkit >>> pip install pdfkit

Development environment:

  • Version: Python 3.8
  • Editor: PyCharm

To install the third-party libraries, press Win + R, type cmd, and press Enter to open a terminal, then run pip install followed by the module name. If a red error message appears, it may be because the network connection timed out; in that case, switch to a domestic mirror source.
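For example, using the Tsinghua mirror (one commonly used domestic source; any mirror you trust works the same way):

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple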

Collection process:

1. Analyze the data you want and figure out where it can be obtained

Capture and analyze the traffic with the browser's developer tools. The analysis shows that to get the data content we want, we can simply request the page's URL address directly (the one shown in the navigation bar); no separate interface is needed.

2. Code implementation steps:

Get the content of multiple articles (i.e. collect all the article URL addresses):

  1. Send a request to the article directory page
  2. Get the response data, i.e. the web page's source code as text
  3. Parse the data and extract the article URL addresses (see the sketch after this list)
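
A minimal sketch of these three steps, using the same directory URL and CSS selector as the full code further below:

import requests  # data request module
import parsel    # data parsing module

# request headers to disguise the script as an ordinary browser
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36'
}
# 1. send a request to the article directory page
response = requests.get('https://blog.csdn.net/fei347795790/article/list/4', headers=headers)
# 2. get the web page's source code as text and wrap it in a Selector
selector = parsel.Selector(response.text)
# 3. parse out every article's URL address
urls = selector.css('#articleMeList-blog > div.article-list > div > h4 > a::attr(href)').getall()
print(urls)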

Get the content of each article:

  1. Send a request to each article's URL address
  2. Get the response data, i.e. the web page's source code
  3. Parse the data and extract the article content
  4. Save the data: first save it as an HTML file, then convert the HTML file to a PDF (a pdfkit sketch follows this list)
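
Step 4 is worth a closer look: the full code below skips the intermediate HTML file and calls wkhtmltopdf.exe directly, so here is a minimal sketch of the save-as-HTML-then-convert route using pdfkit. The wkhtmltopdf path, file names, and HTML content are placeholders for illustration; point the path at your own installation.

import pdfkit

# tell pdfkit where the wkhtmltopdf executable lives (adjust to your installation)
config = pdfkit.configuration(
    wkhtmltopdf=r'C:\01-Software-installation\wkhtmltopdf\bin\wkhtmltopdf.exe'
)

# step 4a: save the extracted article content as an HTML file first
html = '<html><body><h1>some title</h1><p>article content ...</p></body></html>'
with open('article.html', mode='w', encoding='utf-8') as f:
    f.write(html)

# step 4b: convert the HTML file to a PDF
pdfkit.from_file('article.html', 'article.pdf', configuration=config)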

code

import requests    # data request module
import parsel      # data parsing module
import subprocess  # used to call the wkhtmltopdf.exe program

for page in range(4, 6):
    url = f'https://blog.csdn.net/fei347795790/article/list/{page}'  # directory page to request
    # headers: request headers, mainly used to disguise the Python script
    # so the server does not identify it as a program
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36'
    }
    # send the request with the get method of the requests module
    response = requests.get(url=url, headers=headers)
    # print(response)  # <Response [200]> response object; 200 means the request succeeded
    selector = parsel.Selector(response.text)  # wrap the HTML source in a Selector object
    # css is one of the parsing methods: extract data via tag attributes;
    # a::attr(href) gets the href attribute of the <a> tag
    href = selector.css('#articleMeList-blog > div.article-list > div > h4 > a::attr(href)').getall()
    # print(href)
    for index in href:
        try:
            print(index)
            # request each article page and extract its title to use as the file name
            html_data = requests.get(url=index, headers=headers).text
            selector_1 = parsel.Selector(html_data)
            title = selector_1.css('#articleContentId::text').get()
            # call wkhtmltopdf.exe directly to render the article URL into a PDF
            cmd = f'C:\\01-Software-installation\\wkhtmltopdf\\bin\\wkhtmltopdf.exe {index} pdf_1\\{title}.pdf'
            subprocess.run(cmd, shell=True)
        except Exception as e:
            print(e)


As a bonus, the following snippet uses the same requests approach to post a comment to, or send a like for, a CSDN article. To run it, replace the cookie below with your own logged-in CSDN cookie (copy it from the browser's developer tools).

import requests

# endpoint for submitting a comment and endpoint for liking an article
url = 'https://blog.csdn.net/phoenix/web/v1/comment/submit'
like_url = 'https://blog.csdn.net/phoenix/web/v1/article/like'
headers = {
    # 'cookie' carries your login session; replace it with your own
    'cookie': 'uuid_tt_dd=10_29360288410-1640936706807-857482; __gads=ID=1a4feb23074a3469-22da76a196cf0001:T=1640936708:RT=1640936708:S=ALNI_MawGCakjM400IbVY204TvKfKLhDlg; Hm_lvt_e5ef47b9f471504959267fd614d579cd=1645514550; __gpi=UID=0000049689281fe2:T=1649317424:RT=1649317424:S=ALNI_MYlX9R83NQ5EzlFY5UgNF09G45dPw; c_dl_prid=-; c_dl_rid=1650090830371_447095; c_dl_fref=https://so.csdn.net/so/search; c_dl_fpage=/download/qq_43651710/10848772; c_dl_um=distribute.pc_search_result.none-task-blog-2%7Evipall%7Efirst_rank_ecpm_v1%7Erank_v31_ecpm-1-114898691.nonecase; dc_session_id=10_1650262926080.949004; c_first_ref=www.baidu.com; c_first_page=https%3A//blog.csdn.net/fei347795790/category_11731395.html; c_segment=10; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1650090803,1650095679,1650112607,1650262927; firstDie=1; hide_login=1; dc_sid=70fca81ac8fa563314905c0e38f533b9; unlogin_scroll_step=1650263871780; c_pref=default; SESSION=eb13b53e-41e8-43e0-aa6c-54811bb65d0c; c_ref=https%3A//blog.csdn.net/fei347795790/article/details/110070943; ssxmod_itna=Qqfx2DB7D=DQexCq0LpO8D9i8DORYYrQN7Yd7DlOiQxA5D8D6DQeGTTRdY=T1zCep+uDQDRgyfKlFpO2GWKk7YawWsUnO4GLDmKDyKA=ueDxOq0rD74irDDxD39D7PGmDiWZRD72=1lSgK8DWKDKx0kDY5Dw=AGDiPD7gFeCB9w1g911pBGd4D1qCvxKBKD9x0CDlPxf9GkDDyf69isyo3EDmb3A1BhDCKDjg71s6YDUeysgaFU/j0aAnT5YQxxLQi4Kg0Dt=2DK2GYGQpN1nredjDxfsrFTnTqDDpxpywx4D===; ssxmod_itna2=Qqfx2DB7D=DQexCq0LpO8D9i8DORYYrQN7YdD6h8iQD0vxLx03qKru2d+UOqcnUg8xhCDRoHKH1SQqrUY0iFWAxm=RhDFIOD8xod7VS8Bv0+m23mlQcq+912jIp1r/8bM1z9ZgSyzg5CKBhHsmH8BeHiq8wHMDp1prTH5eoO5FE83p976COKCP57q35OWchz=iuDVBi5KB4GeDIbenWenPaKBYrmQWWek4qqcAFWKnxt0/M=u0pK0nDH5M+rPa1eVQQxRaZDREMbBYBbi5mb17K13xzFV+en8OpHAqw+pp5dK4=R7caLRTTSb5K91ea5UFt8D4QRiIhqRrfRvY+eu3qEY9QQR0z44fK=RGxd4eDPiR+10hu+FCIxaBe1Ue=QB7YnpQc/FwEWvP=mO+4sAHn95OQwbC9H/p+mTa9E/lIP2bcWFk+mwB9N/Ej9ID2xYE+aLSiPkWWT=iiK+aT0bWKAsYGdWnDDgDcIQr4ORGCBGmQPG7O2Y7VmmARgGWWKoqszEmiwB0m7gWRz91N+QE4wXTt78wCo3LWZRxCkoO7m1KT4rmvfKxZ+NITqbgw/hrixDKd9D7=DYFqeD===; UserName=weixin_43239784; UserInfo=b58cf84406a84acebf2c3f36442f1c59; UserToken=b58cf84406a84acebf2c3f36442f1c59; UserNick=%E6%97%A0%E9%9B%A8%E0%B8%88%E0%B8%B8%E0%B9%8A%E0%B8%9A; AU=1D5; UN=weixin_43239784; BT=1650268841955; p_uid=U010000; c_page_id=default; dc_tos=raizmv; log_Id_pv=153; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1650268904; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22uid_%22%3A%7B%22value%22%3A%22weixin_43239784%22%2C%22scope%22%3A1%7D%7D; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_29360288410-1640936706807-857482!5744*1*weixin_43239784; log_Id_view=478; log_Id_click=110',
    'origin': 'https://blog.csdn.net',
    'referer': 'https://blog.csdn.net/fei347795790/article/details/110070943',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-tingyun-id': 'im-pGljNfnc;r=268943811',
}
data = {
    'commentId': '',
    'content': '自游老师真帅',  # the comment text to post ("Teacher Ziyou is really handsome")
    'articleId': '124196275',
}
like_data = {
    'articleId': '110070943'
}
# response = requests.post(url=url, data=data, headers=headers)  # post the comment
response = requests.post(url=like_url, data=like_data, headers=headers)  # send the like
print(response)  # <Response [200]> only means the request itself succeeded
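
Whether the like actually registered is reported in the JSON body the server sends back; a quick way to inspect it (the exact field names are CSDN's and may change):

result = response.json()  # parse the returned JSON body
print(result)             # typically contains a code/message pair indicating success or failure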

epilogue

Well, that's the end of this article!

If you have more suggestions or questions, feel free to comment or private message me! Let's work hard together (ง •_•)ง

If you liked this article, follow the blogger, or leave a like and a comment!!!

Origin: blog.csdn.net/python56123/article/details/124253397