Getting Started with a Python Web Crawler

import requests
import re

# Simple crawler: download every chapter of a novel and persist it to a .txt file.
# NOTE(review): string literals below were reconstructed from a garbled scrape
# (spaces had been injected inside quotes) — verify URLs/patterns against the site.

# 1. Download the novel's table-of-contents page.
novel_url = 'http://www.jingcaiyuedu.com/book/15205/list.html'
response = requests.get(novel_url)
# The server does not declare the charset, so set the encoding explicitly
# before reading .text to avoid mojibake.
response.encoding = 'utf-8'
html = response.text  # page HTML as a str
# print(html)

# 2. Extract the title and the chapter links (non-greedy matches).
title = re.findall(r'<meta name="keywords" content="《(.*?)》', html)[0]
# The page contains two <dl id="list"> blocks; the second holds the chapter list.
# re.S makes '.' match newlines in case the <dl> spans several lines.
dl = re.findall(r'<dl id="list">.*?</dl>', html, re.S)[1]
# Each match is a (relative_url, chapter_title) tuple.
chapter_info_list = re.findall(r'<a.*?href="(.*?)".*?>(.*?)</a>', dl)
# print(chapter_info_list)

# Data persistence: write to a txt file named after the novel.
# The context manager guarantees the file is closed (original never closed it).
with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
    # 3. Loop through each chapter and extract its content.
    for chapter_url, chapter_title in chapter_info_list:
        # Handle relative URLs by prefixing the site root.
        if 'http' not in chapter_url:
            chapter_url = 'http://www.jingcaiyuedu.com%s' % chapter_url
        # Download the chapter page.
        chapter_response = requests.get(chapter_url)
        chapter_response.encoding = 'utf-8'
        chapter_html = chapter_response.text
        # Extract the body text between the a1()/a2() <script> markers;
        # re.S because the content spans multiple lines.
        chapter_content = re.findall(
            r'<script>a1\(\);</script>(.*?)<script>a2\(\);</script>',
            chapter_html, re.S)[0]
        # Clean the data: strip leftover HTML entities/tags and spaces.
        # NOTE(review): the scrape mangled the first replace target; '&nbsp;'
        # is the conventional cleanup here — confirm against the live page.
        chapter_content = chapter_content.replace('&nbsp;', '')
        chapter_content = chapter_content.replace('<br/>', '')
        chapter_content = chapter_content.replace('<br>', '')
        chapter_content = chapter_content.replace(' ', '')
        # Write one chapter: title line, then content line.
        fb.write(chapter_title)
        fb.write('\n')
        fb.write(chapter_content)
        fb.write('\n')
        print(chapter_url)  # progress indicator

 

Recommended reading

Origin http://43.154.161.224:23101/article/api/json?id=324482846&siteId=291194637