Crawling "Get" App eBook Information with Python

Foreword

The text and images in this article come from the internet and are intended for learning and exchange only, not for any commercial use. The copyright belongs to the original author. If there is any problem, please contact us so that it can be dealt with.

Author: Cui Qingcai (静觅)


Crawling "Get" App eBook Information with mitmdump

"Get" App is Luo Ji thinking produced a fragmented time learning App, there are a lot of learning resources within the App. But "get" App does not correspond to the web version, so the information must only be available through App. This time we use mitmdump to practice crawling through its App.

1. Crawling Target

Our goal is to crawl the eBook information in the App's eBook section and save that information to MongoDB, as shown in the figure below.


We want to crawl the title, introduction, cover, and price of each book. However, the focus this time is still on understanding how to use the mitmdump tool, so automated crawling is not involved for now; the App is still operated manually. mitmdump is responsible for capturing the responses and for extracting and saving the data.

2. Preparation

Make sure you have correctly installed mitmproxy and mitmdump, that your phone and PC are on the same local area network, and that the mitmproxy CA certificate has been configured on the phone. Also install MongoDB and start its service, and install the PyMongo library. For the specific configuration, refer to the instructions in Chapter 1.
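
Before writing the crawling script, it can be worth confirming that PyMongo can actually reach the local MongoDB service. The following is a minimal sanity check, assuming MongoDB is running on localhost with its default port 27017:

import pymongo

# Connect to the local MongoDB service; server_info() raises an exception
# if the service is not reachable.
client = pymongo.MongoClient('localhost', 27017)
print(client.server_info()['version'])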

3. Crawl Analysis

First, let's explore the URL of the current page and the content it returns. We write a script like this:

def response(flow):
    # Print the request URL and the response body of every captured flow
    print(flow.request.url)
    print(flow.response.text)

 

Here we only output the request URL and the response body, that is, the request link and the response content, which are the two most critical parts. Save the script as script.py.

Next, run mitmdump with the following command:

mitmdump -s script.py

 

Open the eBook page of the "Get" App, and you will see corresponding output in the console on the PC. Then scroll the page to load more eBooks; the new output appearing in the console is the new loading request sent by the App, which contains the eBook content of the next page. An example of the console output is shown in the figure below.

You can see an interface whose URL is https://dedao.igetget.com/v3/discover/bookList, followed by a sign parameter. From the name of the URL we can tell that this is the interface for getting the eBook list. Below the URL is the response content, which is a JSON-formatted string; after formatting it, it looks like the figure below.


The formatted content contains a c field and a list field, and each element of list contains the price, title, description, and so on. The first returned result is the eBook The Lover (《情人》), which is exactly the eBook currently shown in the App; the description and price also match perfectly. The App page is shown in the figure below.


This shows that the current interface is indeed the one that returns the eBook information, and we only need to fetch content from this interface, then parse the returned result and save it to the database.
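
Based on the fields described above and those extracted later in this section, the response body roughly takes the shape sketched below. This is only an illustrative sketch; the concrete values are hypothetical placeholders, not real API output:

# Rough sketch of the response structure; all values here are hypothetical.
response_body = {
    'c': {
        'list': [
            {
                'operating_title': '情人',                 # book title
                'cover': 'https://example.com/cover.jpg',  # cover image URL
                'other_share_summary': '...',              # book description
                'price': 0,                                # price (type depends on the API)
            },
            # ... more book entries
        ]
    }
}

# The book entries can then be reached with:
books = response_body.get('c').get('list')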

4. Data Crawling

Next, we need to filter the interfaces so that only the interface analyzed above is captured, and then extract the corresponding fields from the result.

Here we modify the script as follows:

import json
from mitmproxy import ctx


def response(flow):
    url = 'https://dedao.igetget.com/v3/discover/bookList'
    # Only handle responses from the eBook list interface
    if flow.request.url.startswith(url):
        text = flow.response.text
        data = json.loads(text)
        # The book entries live under the 'c' -> 'list' fields
        books = data.get('c').get('list')
        for book in books:
            ctx.log.info(str(book))

 

Scroll the eBook page again and observe the output in the PC console, as shown in the figure below.

Console output

Now the complete information of the books is output, and each book corresponds to one piece of JSON-formatted data.

5. Extraction and Saving

Next, we need to extract the information and save it to the database. For convenience, we choose MongoDB.

We can add the extraction and saving parts to the script, modifying the code as follows:

import json
import pymongo
from mitmproxy import ctx

# Connect to the local MongoDB service and select the database and collection
client = pymongo.MongoClient('localhost')
db = client['igetget']
collection = db['books']


def response(flow):
    global collection
    url = 'https://dedao.igetget.com/v3/discover/bookList'
    if flow.request.url.startswith(url):
        text = flow.response.text
        data = json.loads(text)
        books = data.get('c').get('list')
        for book in books:
            # Keep only the fields we care about
            data = {
                'title': book.get('operating_title'),
                'cover': book.get('cover'),
                'summary': book.get('other_share_summary'),
                'price': book.get('price')
            }
            ctx.log.info(str(data))
            collection.insert_one(data)

 

Scroll the page again, and the console will output the information, as shown in the figure below.


Each item output now is the extracted content, containing the eBook's title, cover, description, and price.

At the very beginning we declared the MongoDB connection; after the information is extracted, we simply call the collection's insert_one() method to insert the data into the database.

After scrolling through a few pages, you will find that all the book information has been saved to MongoDB, as shown in the figure below.


So far, we have used a very simple script to save the eBook information of the "Get" App.
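
If you want to double-check what has been stored, you can query the collection back with PyMongo. The snippet below is a minimal sketch, assuming the same local MongoDB instance and the igetget database / books collection used above:

import pymongo

# Open the same database and collection that the mitmdump script writes to
client = pymongo.MongoClient('localhost')
collection = client['igetget']['books']

print(collection.count_documents({}))    # number of books saved so far
for book in collection.find().limit(3):  # peek at a few saved records
    print(book['title'], book['price'])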

Complete Code

import json
import pymongo
from mitmproxy import ctx

# Connect to the local MongoDB service and select the database and collection
client = pymongo.MongoClient('localhost')
db = client['igetget']
collection = db['books']


def response(flow):
    global collection
    url = 'https://dedao.igetget.com/v3/discover/bookList'
    if flow.request.url.startswith(url):
        text = flow.response.text
        data = json.loads(text)
        books = data.get('c').get('list')
        for book in books:
            # Keep only the fields we care about
            data = {
                'title': book.get('operating_title'),
                'cover': book.get('cover'),
                'summary': book.get('other_share_summary'),
                'price': book.get('price')
            }
            ctx.log.info(str(data))
            collection.insert_one(data)

 
