Recently, in a Python crawler chat group, I saw that a lot of people were very interested in Meituan's online data, and some were offering impressive prices for it: one person bid 5,000 for a crawl of Meituan data. I was dumbfounded at the time, but once I had crawled all the data, 5,000 felt like too little!
Crawler approach
There are many crawler frameworks. I used the following rough approach to achieve incremental crawling.
- Crawl the data with requests (or selenium);
- Check whether the crawled data already exists in the database;
- Save the new data in a DataFrame object;
- Insert it into the database.
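The four steps above can be sketched roughly as follows. This is only a minimal illustration, not the code used for the actual crawl: the `shops` table, its columns, and the `fetch_page` callable are hypothetical names, SQLite stands in for whatever database you use, and a plain list of rows stands in for the DataFrame so that the sketch has no third-party dependencies.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per business, keyed by its ID.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shops ("
        "shop_id TEXT PRIMARY KEY, name TEXT, category TEXT)"
    )


def already_stored(conn, shop_id):
    # Step 2: check whether this record is already in the database.
    row = conn.execute(
        "SELECT 1 FROM shops WHERE shop_id = ?", (shop_id,)
    ).fetchone()
    return row is not None


def crawl_incrementally(conn, pages, fetch_page):
    """Crawl each page and insert only records not stored yet.

    fetch_page(page) is an assumed callable that returns a list of dicts
    with the keys shop_id, name and category (for example a thin wrapper
    around requests.get plus JSON decoding).
    """
    inserted = 0
    for page in pages:
        records = fetch_page(page)  # Step 1: crawl
        seen = set()
        new_rows = []  # Step 3: in-memory batch (the DataFrame's role)
        for record in records:
            shop_id = record["shop_id"]
            if shop_id in seen or already_stored(conn, shop_id):
                continue
            seen.add(shop_id)
            new_rows.append((shop_id, record["name"], record["category"]))
        # Step 4: insert the new rows into the database.
        conn.executemany("INSERT INTO shops VALUES (?, ?, ?)", new_rows)
        conn.commit()
        inserted += len(new_rows)
    return inserted
```

Running the same crawl twice inserts nothing the second time, which is what makes the crawl incremental: an interrupted run can simply be restarted.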
After obtaining the URLs of all the businesses, we come to the last step. Note that the page data differs from one type of business to another, for example for hotels.
So for different types you need to write different parsing functions. Finally, do not chase speed when crawling: Meituan's restrictions are very strict, so even with multiple threads it is best to send only one request every few seconds. Then just let it run slowly.
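One way to organize per-type parsing and slow requests is sketched below. The category names, the JSON field names (`name`, `lowestPrice`, `title`, `avgPrice`) and the `Throttle` helper are all assumptions made for illustration; the real Meituan pages use their own structure, which you need to inspect yourself.

```python
import json
import threading
import time


def parse_hotel(payload):
    # Hypothetical hotel fields.
    return {"name": payload["name"], "lowest_price": payload.get("lowestPrice")}


def parse_food(payload):
    # Hypothetical restaurant fields.
    return {"name": payload["title"], "avg_price": payload.get("avgPrice")}


# One parsing function per business type.
PARSERS = {"hotel": parse_hotel, "food": parse_food}


def parse_page(category, raw_json):
    parser = PARSERS.get(category)
    if parser is None:
        raise ValueError(f"no parser registered for category {category!r}")
    return parser(json.loads(raw_json))


class Throttle:
    """Allow at most one request per min_interval seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Holding the lock while sleeping makes other threads queue up,
        # so the rate limit is shared rather than per thread.
        with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                time.sleep(delay)
                now = time.monotonic()
            self._next_allowed = now + self.min_interval
```

Each worker thread would call `throttle.wait()` on a shared `Throttle(3)` (or a similar interval of a few seconds) right before sending its request, then pass the response text to `parse_page` with the matching category.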
Basic environment configuration
Version: Python 3.6
System: Windows
Modules: csv, time, requests, json
Part of the code
The crawl results fall into four categories:
- Cinemas: 8,195
- Hotels: 211,129
- Food: 490,928
- Life services: 432,803
That is about 1.15 million records in total.
Seeing this much data, I suddenly felt that even 5K was too little!