Python script for selectively crawling AAAI-19 articles

Matching articles are downloaded as PDFs to the specified directory.

# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import os
import re


def recorrect_title(title):
    """Sanitize *title* so it can be used as a filename.

    Windows forbids the characters / \\ : * ? " < > | in file names;
    every occurrence is replaced with an underscore.
    """
    forbidden = set(r'/\:*?"<>|')
    return ''.join('_' if ch in forbidden else ch for ch in title)


# Where downloaded PDFs are stored, and what to search for in titles.
save_path = 'E://文档//AAAI2019//'
url = 'http://www.aaai.org/Library/AAAI/aaai19contents.php'
find_text = 'Segmentation'

# Fetch the AAAI-19 contents page and isolate the article listing.
resp = requests.get(url)
html_doc = resp.text
soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.find(class_='content')
soup1 = BeautifulSoup(content.prettify(), 'html.parser')
text_arr = soup1.findAll(class_='left')

# Keep only entries whose text mentions the search term.
find_text_arr = [x for x in text_arr if x.text.find(find_text) != -1]

# For each match collect [sanitized title, authors, last link (the PDF URL)].
down_url_arr = [[recorrect_title(x.find('a').text.replace('\n', '').strip()),
                 x.find('i').text.replace('\n', '').strip(),
                 x.find_all('a')[-1].get('href')] for x in find_text_arr]
print(down_url_arr)

for i in tqdm(down_url_arr):
    pdf_path = save_path + i[0] + '.pdf'
    # Check for an existing file BEFORE downloading/opening.
    # The original checked os.path.exists() inside `open(..., "wb")`,
    # but "wb" creates/truncates the file first, so the check was
    # always true and nothing was ever written; it also issued the
    # HTTP request even for files it meant to skip.
    if os.path.exists(pdf_path):
        continue
    r = requests.get(i[-1])
    with open(pdf_path, "wb") as code:
        code.write(r.content)

Published 163 original articles · received 117 likes · 210,000+ views

You may also like

Origin blog.csdn.net/u010095372/article/details/102949595