Life is short: I crawled Meituan's data for every city with Python, and would not sell it even when a friend bid 5000

Disclaimer: This is the blogger's original article, released under the CC 4.0 BY-SA license. If you reproduce it, please attach the link to the original source and this statement.
Original link: https://blog.csdn.net/weixin_45523154/article/details/102750379

Recently, in a Python crawler group, I noticed that a lot of people are very interested in Meituan's online data, and the prices some of them offered were impressive: 5000 for crawling Meituan's data? I was baffled at the time, but once I had crawled all of the data, 5000 felt like too little!

Crawler approach

There are many crawler frameworks. I used the following rough approach to achieve incremental crawling.

  • Crawl the business data with requests (or selenium);

  • Check whether each crawled record already exists in the database;

  • Save the new records in a DataFrame object;

  • Insert them into the database.
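
The check-then-insert steps above can be sketched roughly as follows. This is only an illustration, not the author's code: it assumes sqlite3 from the standard library as a stand-in database, a plain list of dicts in place of the DataFrame, and a hypothetical `shops` table keyed by a hypothetical `shop_id` field.

```python
import sqlite3


def insert_new_records(conn, records):
    """Insert only records whose shop_id is not stored yet; return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shops "
        "(shop_id TEXT PRIMARY KEY, name TEXT, category TEXT)"
    )
    # Check which ids already exist in the database.
    existing = {row[0] for row in conn.execute("SELECT shop_id FROM shops")}

    # Collect the new rows in memory (a DataFrame could play this role).
    seen = set()
    new_rows = []
    for record in records:
        shop_id = record["shop_id"]
        if shop_id in existing or shop_id in seen:
            continue  # already stored, or repeated within this batch
        seen.add(shop_id)
        new_rows.append(record)

    # Insert only the new rows.
    conn.executemany(
        "INSERT INTO shops (shop_id, name, category) "
        "VALUES (:shop_id, :name, :category)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)


# Placeholder records, for illustration only.
conn = sqlite3.connect(":memory:")
first_batch = [
    {"shop_id": "a1", "name": "Example Hotel", "category": "hotel"},
    {"shop_id": "b2", "name": "Example Cinema", "category": "cinema"},
]
second_batch = [
    {"shop_id": "b2", "name": "Example Cinema", "category": "cinema"},  # seen before
    {"shop_id": "c3", "name": "Example Restaurant", "category": "food"},
]
added_first = insert_new_records(conn, first_batch)
added_second = insert_new_records(conn, second_batch)
```

Running the crawl a second time then only adds records that were not stored before, which is what makes it incremental. Loading every stored id into a set is simple, but for a very large table a per-batch lookup may be the better choice.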

After obtaining the URLs of all the businesses, we come to the last step, but note that different types of pages hold different data. For example, hotels

So each page type needs its own parsing function. Finally, do not chase speed when crawling: Meituan's restrictions are very strict, so it is best to space multi-threaded requests a few seconds apart and then let the crawler run slowly.
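
One way to organize per-type parsing and slow requests is a lookup table of parsers plus a random pause between fetches. The sketch below is an assumption-laden illustration: the field names (`lowest_price`, `address`), the `fetch` callable, and the 2 to 5 second delay range are all hypothetical, and it runs sequentially rather than multi-threaded to keep it short.

```python
import random
import time


def parse_hotel(raw):
    # Hypothetical hotel fields; the real page structure is not shown in the article.
    return {"type": "hotel", "name": raw["name"], "price": raw.get("lowest_price")}


def parse_cinema(raw):
    # Hypothetical cinema fields.
    return {"type": "cinema", "name": raw["name"], "address": raw.get("address")}


# One parsing function per page type.
PARSERS = {"hotel": parse_hotel, "cinema": parse_cinema}


def parse_page(page_type, raw):
    """Dispatch to the parser registered for this page type."""
    try:
        parser = PARSERS[page_type]
    except KeyError:
        raise ValueError(f"no parser registered for page type {page_type!r}") from None
    return parser(raw)


def crawl_slowly(pages, fetch, delay_range=(2.0, 5.0), sleep=time.sleep):
    """Fetch and parse (page_type, url) pairs, pausing a few seconds between requests.

    `fetch` is any callable that takes a URL and returns the page data as a dict.
    """
    results = []
    for index, (page_type, url) in enumerate(pages):
        if index > 0:
            # Random pause so requests are not sent at a fixed, fast rhythm.
            sleep(random.uniform(*delay_range))
        results.append(parse_page(page_type, fetch(url)))
    return results
```

Passing `sleep` as a parameter makes the pause easy to replace in tests. In a real run, `fetch` would wrap the actual HTTP request and decoding step.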

Basic environment configuration

Version: Python 3.6

System: Windows

Modules: csv, time, requests, json

Part of the code
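
The following is not the author's code, only a minimal sketch of how the json and csv modules listed above might work together: a JSON response body is parsed and its rows are appended to a CSV file. The `data` key and the column names in `FIELDS` are hypothetical, and the response text is assumed to come from a requests call that is not shown here.

```python
import csv
import json
import os

# Hypothetical column names; the real fields depend on the page type.
FIELDS = ["name", "address", "score"]


def rows_from_response(text):
    """Parse a JSON response body and return the list stored under its 'data' key."""
    payload = json.loads(text)
    return payload.get("data", [])


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that are not listed in FIELDS.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Appending instead of overwriting means a long crawl can be stopped and resumed without losing the rows that were already written.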

The crawl results are divided into four categories:

Cinemas: 8,195

Hotels: 211,129

Food: 490,928

Life services: 432,803


About 1.15 million records in total

Seeing this much data, I suddenly felt that even 5K was too little!
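
As a quick check on that total, the four category counts listed earlier can be added up. The sketch assumes the space-separated figures are thousands groupings (so "211 129" means 211,129).

```python
# Category counts as listed in the results above.
counts = {
    "cinema": 8195,
    "hotel": 211129,
    "food": 490928,
    "life": 432803,
}

total = sum(counts.values())
print(f"{total:,}")  # 1,143,055
```

That is roughly 1.14 million records, in line with the rounded figure of about 1.15 million quoted in the article.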
