Python Data Analysis: Chapter 4

1. Scraping Web Page Data

name	age
Mon	22
LIlt	223

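import urllib.request

from bs4 import BeautifulSoup

# Open the local sample page and read its raw bytes
response = urllib.request.urlopen('file:///D:/nodepad/Notepad++/uc.html')
html = response.read()
html

# Parse the HTML; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(html, 'html.parser')
soup

# First <tr> tag only
soup.find('tr')

# All <tr> tags
soup.find_all('tr')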
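The steps above can also be tried without a local file. The following is a minimal sketch that parses a small table from an in-memory string. The markup is an assumption about what uc.html contains, reconstructed from the name/age table at the start of this section, and the `html.parser` backend is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for uc.html, built from the sample table above.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Mon</td><td>22</td></tr>
  <tr><td>LIlt</td><td>223</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching tag; find_all() returns every match.
first_row = soup.find("tr")
all_rows = soup.find_all("tr")

# Collect the text of each cell, row by row.
table = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in all_rows
]
print(table)
```

Each inner list corresponds to one `<tr>`, so the header row and the data rows can be separated by position.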

2. A Brief Introduction to JSON

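JSON stands for JavaScript Object Notation. It is a syntax for storing and exchanging text data, and it is compact, fast to process, and easy to parse.

JSON differs from HTML: HTML is mainly used to display data, whereas JSON is mainly used to transfer data, so it commonly serves as the response format of data query interfaces.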

{
    "employees": [
        {"firstName": "Bill", "lastName": "Gates"},
        {"firstName": "George", "lastName": "Bush"},
        {"firstName": "Thomas", "lastName": "Carter"}
    ]
}

import json
import urllib.request

# Open the local JSON file
response = urllib.request.urlopen('file:///C:/Users/zxysnowy/Desktop/json.json')
response

# Read the raw bytes
jsonString = response.read()
jsonString

# Decode the bytes and parse them; the result is a dict
jsonObject = json.loads(jsonString.decode())

jsonObject['employees']
jsonObject['employees'][0]
jsonObject['employees'][0]['lastName']

3. Parsing Web Pages

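HTML parsing functions:

BeautifulSoup(html)
find(name, id=id, attrs={})
find_all(name, attrs={})
getText()

Parameters:

html: the document in HTML format;
name: the name of the tag to search for;
id: the id attribute of the tag to search for;
attrs: attribute values of the HTML tag, which can be used to filter the matches.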
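As a rough illustration of how these calls combine, the sketch below looks up one tag by id, filters its children by an attribute, and extracts the text. The markup, the class names, and the values are all hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
SAMPLE_HTML = """
<div id="detail">
  <dl class="spec"><dt>Brand</dt><dd>ExampleBrand</dd></dl>
  <dl class="spec"><dt>Color</dt><dd>Black</dd></dl>
  <dl class="other"><dt>Note</dt><dd>ignored</dd></dl>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look up a single tag by its id attribute.
detail = soup.find(id="detail")

# Restrict the search to <dl> tags whose class attribute is "spec".
spec_lists = detail.find_all("dl", attrs={"class": "spec"})

# getText() returns the text inside a tag with the markup removed.
pairs = [(dl.find("dt").getText(), dl.find("dd").getText()) for dl in spec_lists]
print(pairs)
```

Searching from `detail` rather than from `soup` limits the matches to tags nested inside that element.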

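JSON parsing functions:

json.loads(jsonString)
jsonList[index]
jsonObject['propertyName']

Parameters:

jsonString: the JSON data as a string; if it is still bytes, call decode() first;
index: the index into a JSON array, starting from 0; the total length is available through len();
propertyName: the name of a property of a JSON object, used to access that property's value.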
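The three access patterns can be sketched together on an in-memory payload. The bytes literal below mirrors the employees sample from section 2 and stands in for what `response.read()` would return. The variable names are chosen for this example only.

```python
import json

# Raw bytes, as response.read() would return them.
raw = (
    b'{"employees": ['
    b'{"firstName": "Bill", "lastName": "Gates"}, '
    b'{"firstName": "George", "lastName": "Bush"}, '
    b'{"firstName": "Thomas", "lastName": "Carter"}]}'
)

# Decode the bytes to str, then parse into Python objects (dict and list).
json_object = json.loads(raw.decode("utf-8"))

# Access by property name: the value here is a list.
employees = json_object["employees"]

# len() gives the number of items in the JSON array.
count = len(employees)

# Access by index, then by property name.
last_names = [employees[i]["lastName"] for i in range(count)]

print(count, last_names)
```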

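Appending rows to a data frame:

data = DataFrame(columns=['Feature', 'Property'])
data = data.append(Series([f, p], index=['Feature', 'Property']), ignore_index=True)

Parameters:

columns: the columns of the data frame;
index: the labels of the Series; giving them the same names as the data frame's columns lets the Series be appended as a row;
ignore_index: whether to discard the original index; usually set to True so that the rows are renumbered.

Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above only runs on older versions. On current versions, pandas.concat serves the same purpose.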
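On pandas 2.0 and later, `DataFrame.append` is no longer available. One common alternative, sketched here rather than taken from the original text, is to collect the rows in a plain list and build the data frame once at the end. The feature/property pairs are hypothetical.

```python
import pandas as pd

# Hypothetical (feature, property) pairs used only for illustration.
pairs = [("Brand", "ExampleBrand"), ("Color", "Black")]

# Accumulate one dict per row; the keys match the target column names.
rows = []
for f, p in pairs:
    rows.append({"Feature": f, "Property": p})

# Build the data frame in one step; rows are numbered 0, 1, 2, ...
data = pd.DataFrame(rows, columns=["Feature", "Property"])

print(data)
```

Building the frame once avoids copying the whole table on every added row, which is what a row-by-row append or concat does.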

4. Case Study

import json
import urllib.request

from pandas import DataFrame
from bs4 import BeautifulSoup

# Download the product page
response = urllib.request.urlopen('http://item.jd.com/1185291.html')
html = response.read()

soup = BeautifulSoup(html, 'html.parser')

# The element with id="detail"
divSoup = soup.find(id="detail")

data = DataFrame(columns=['Feature', 'Property'])

# Each <dl> holds feature names in <dt> tags and their values in <dd> tags
dls = divSoup.find_all('dl')
for dl in dls:
    dts = dl.find_all('dt')
    dds = dl.find_all('dd')
    # zip pairs each name with its value and stops at the shorter list
    for dt, dd in zip(dts, dds):
        f = dt.getText()
        p = dd.getText()
        # Add one row at the next integer label
        # (DataFrame.append was removed in pandas 2.0)
        data.loc[len(data)] = [f, p]

len(data)

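# How to get the price: it is loaded asynchronously, so it is requested separately
response = urllib.request.urlopen('http://p.3.cn/prices/get?skuid=J_5712532')
jsonString = response.read()
jsonObject = json.loads(jsonString.decode())

# The response is a list; the 'p' field of its first item is read here
jsonObject[0]['p']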

Result: [image]

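Note:

The HTML of the page is static, while the price is loaded asynchronously, so the price does not appear in the downloaded page. To find where it comes from, open the page in Google Chrome, choose Inspect, search for "price", and locate the corresponding id.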
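Once the price request has been located, its response can be handled like any other JSON payload. The sketch below assumes the response is a JSON array of objects with a `p` field, which is inferred from the `jsonObject[0]['p']` access in the case study. The helper name `parse_price`, the `id` field, and the sample values are hypothetical.

```python
import json


def parse_price(raw_bytes):
    """Return the 'p' field of the first item in a price response, or None."""
    items = json.loads(raw_bytes.decode("utf-8"))
    if not items:
        # An empty array means no price information was returned.
        return None
    return items[0].get("p")


# Hypothetical response body; the values are made up for illustration.
sample = b'[{"id": "J_5712532", "p": "99.00"}]'
print(parse_price(sample))
```

Keeping the parsing in a small function makes it possible to check the logic on a fixed sample without sending a network request.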
