Web crawler data parsing (JSON and JsonPath)

JSON

Definition
  • JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate. It suits data-exchange scenarios such as communication between a website's front end and back end.
  • JSON fills the same role as XML and is often compared with it.
Object {}: JSONObject
  • Object: in JavaScript an object is written inside {} as a set of key-value pairs, {key: value, key: value, …}. In object-oriented terms, each key is an attribute of the object and each value is that attribute's value. A value is read with object.key and can be a number, a string, an array, or another object.
Array []: JSONArray
  • Array: in JavaScript an array is written inside square brackets [], for example ["Python", "javascript", "C++", …]. As in most languages, an element is read by its index, and each element can be a number, a string, an array, or an object.
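
The key and index access described above can be sketched in Python, where a parsed JSON object becomes a dict (read with `doc["key"]` rather than `object.key`) and an array becomes a list. The sample document and its field names (`name`, `languages`, `address`) are made up for illustration.

```python
import json

# Hypothetical sample document: an object holding a string, an array and a nested object.
text = '{"name": "crawler", "languages": ["Python", "javascript", "C++"], "address": {"city": "Beijing"}}'

doc = json.loads(text)

# A JSON object becomes a dict: look values up by key.
name = doc["name"]
city = doc["address"]["city"]

# A JSON array becomes a list: look values up by zero-based index.
first_language = doc["languages"][0]

print(name, city, first_language)
```
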
Methods
  • load: read JSON from a file

    # Read the JSON text stored in a file and convert it to a Python object
    import json

    with open('book.json', 'r', encoding='utf-8') as f:
        obj = json.load(f)
    print(type(obj))
    
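  • loads: parse a JSON string

    # Decode a JSON-formatted string into a Python object.
    # The JSON-to-Python type mapping is listed in the type-mapping table below.
    import json

    with open('./book.json', mode='r', encoding='utf-8') as f:
        json_string = f.read()
    # Convert the JSON string into a Python object
    obj = json.loads(json_string)
    print(type(obj))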
    
  • Sample JSON string

    json_string = '''{"has_more": false, "message": "success", "data": [{"single_mode": true, "abstract": "\u8c22\u8c22\u5927\u5bb6\u559c\u6b22\u6bcf\u65e5\u64b8"}]}'''
    
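  • dumps

    # Encode a Python object as a JSON string; the return value is a str.
    # The Python-to-JSON type mapping is listed in the type-mapping table below.

    import json
    json_string = '''{"has_more": false, "message": "success", "data": [{"single_mode": true, "abstract": "\u8c22\u8c22\u5927\u5bb6\u559c\u6b22\u6bcf\u65e5\u64b8"}]}'''

    # Decode first so that dumps receives a Python object; passing the raw
    # string would only wrap it in quotes as a single JSON string value.
    obj = json.loads(json_string)
    print(type(obj))
    print(json.dumps(obj, ensure_ascii=False))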
    
  • dump

    # Serialize a Python built-in type to JSON and write it to a file

    import json

    dictStr = {"city": "北京", "name": "大刘", "info": "\u8c22\u8c22\u5927"}
    # Serialize ``obj`` as a JSON formatted stream to ``fp``
    with open("dictStr.json", "w", encoding="utf-8") as f:
        json.dump(dictStr, f, ensure_ascii=False)
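
The effect of `ensure_ascii` and the file-based pair `dump`/`load` can be checked with a small round trip. This is a minimal sketch: the `record` dict, the file name `record.json` and the use of a temporary directory are assumptions made for the example, not part of the original code.

```python
import json
import os
import tempfile

record = {"city": "北京", "name": "大刘"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters readable.
readable = json.dumps(record, ensure_ascii=False)

# Both strings decode back to the same dict.
assert json.loads(escaped) == json.loads(readable) == record

# dump/load do the same through a file object; a temporary directory
# (an assumption of this sketch) keeps the working directory clean.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "record.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(escaped)
print(readable)
print(restored == record)
```

Both forms are valid JSON, so the choice only affects how readable the output is to a person, which is why `ensure_ascii=False` should be paired with an explicit `encoding='utf-8'` when writing to a file.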
    
Mapping between JSON and Python data types
JSON            Python
object          dict
array           list (a tuple is also encoded as an array)
string          str
number (int)    int
number (real)   float
true            True
false           False
null            None
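
The mapping in the table can be verified by decoding a document that contains one value of each JSON type. The sample text and its keys are hypothetical and chosen only to touch every row of the table.

```python
import json

# Hypothetical document touching every JSON type in the table.
text = '{"obj": {}, "arr": [1, 2], "s": "hi", "i": 3, "r": 1.5, "t": true, "f": false, "n": null}'
decoded = json.loads(text)

# Print the Python type that each JSON value was decoded into.
for key, value in decoded.items():
    print(key, type(value).__name__)

# Encoding direction: a tuple is written as a JSON array,
# so it comes back as a list after a round trip.
round_tripped = json.loads(json.dumps({"pair": (1, 2)}))
print(round_tripped)
```
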

JsonPath

Definition
  • JsonPath is a query language, with accompanying libraries, for extracting specified information from JSON documents. Implementations are available in several languages, including JavaScript, Python, PHP and Java.

  • JsonPath is to JSON what XPath is to XML.

Syntax comparison between JsonPath and XPath
XPath    JsonPath    Description
/        $           Root node
.        @           Current node
/        . or []     Child node
..       n/a         Parent node; JsonPath does not support this
//       ..          Recursive descent: select matching nodes at any depth
*        *           Wildcard: match all element nodes
@        n/a         Attribute access; not needed because JSON is a recursive key-value structure without attributes
[]       []          Subscript operator (array indices, selection by content, and similar simple operations)
|        [,]         Union: select several names or indices at once
[]       ?()         Filter expression
n/a      ()          Script expression
()       n/a         Grouping; JsonPath does not support this
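
To make the `..` (recursive descent) operator from the table concrete, here is a minimal pure-Python sketch of what an expression of the form `$..key` selects. The helper `find_all` is hypothetical and written only for illustration; it is not how the `jsonpath` library is implemented, and it covers just this one operator. The `store` dict is a reduced version of the sample data used later in this article.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any depth (like `$..key`)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending into the value, whatever its key was.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


store = {
    "store": {
        "book": [
            {"author": "李白", "price": 8.95},
            {"author": "杜甫", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

print(find_all(store, "author"))
print(find_all(store, "price"))
```
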

Syntax examples compared

XPath                   JsonPath                        Result
/store/book/author      $.store.book[*].author          Authors of all books in the store
//author                $..author                       All authors
//store/*               $.store.*                       Everything in the store: all books and the bicycle
/store//price           $.store..price                  The price of everything in the store
//book[3]               $..book[2]                      The third book
//book[last()]          $..book[(@.length-1)]           The last book
//book[position()<3]    $..book[0,1] or $..book[:2]     The first two books
//book[isbn]            $..book[?(@.isbn)]              All books that have an isbn
//book[price<10]        $..book[?(@.price<10)]          All books whose price is less than 10
//*                     $..*                            All elements
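
As a way to read the index, slice and filter rows of the table, the sketch below writes the same selections in plain Python over a list of books. It stands in for JsonPath rather than calling any JsonPath library; the `books` list is copied from the sample store data used later in this article, with the `category` field dropped for brevity.

```python
books = [
    {"author": "李白", "title": "Sayings of the Century", "price": 8.95},
    {"author": "杜甫", "title": "Sword of Honour", "price": 12.99},
    {"author": "白居易", "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99},
    {"author": "苏轼", "title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99},
]

third_book = books[2]                            # $..book[2]
last_book = books[len(books) - 1]                # $..book[(@.length-1)]
first_two = books[:2]                            # $..book[:2]
with_isbn = [b for b in books if "isbn" in b]    # $..book[?(@.isbn)]
cheap = [b for b in books if b["price"] < 10]    # $..book[?(@.price<10)]

print(third_book["title"])
print(last_book["title"])
print([b["title"] for b in first_two])
print([b["title"] for b in with_isbn])
print([b["title"] for b in cheap])
```
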
Basic use of JsonPath
import json
# JsonPath is the JSON counterpart of XPath
# pip install jsonpath  (a library dedicated to querying JSON data)
import jsonpath
import requests

# s = '''{"key":"Hello","dict":"World"}'''
# print(type(s))
# json_obj = json.loads(s)
# print(json_obj)
# print(type(json_obj))
# print(json_obj['key'])
url = 'https://www.lagou.com/lbs/getAllCitySearchLabels.json'
if __name__ == '__main__':
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Cookie': '<session cookie copied from your browser>',
        'Host': 'www.lagou.com',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    }

    # response = requests.get(url=url, headers=headers)
    # data = response.text
    # json_obj = json.loads(data)
    # print(json_obj['content']['data']['allCitySearchLabels']['B'][0]['name'])
    # result = jsonpath.jsonpath(json_obj,'$..[name,id,code]')
    # print(result)
    # # Returns a list that contains a single item
    # # data = jsonpath.jsonpath(json_obj,'$.content[data]')
    # data = jsonpath.jsonpath(json_obj,'$.content.data')
    # print(data)
    # print(len(data))

    data = '''{ "store": {
    "book": [ 
      { "category": "reference",
        "author": "李白",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "杜甫",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "白居易",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "苏轼",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}'''
    json_obj = json.loads(data)

    # print(jsonpath.jsonpath(json_obj,'$.store.book[*].author'))
    # print(jsonpath.jsonpath(json_obj,'$..author'))
    # print(jsonpath.jsonpath(json_obj,'$.store.book[?(@.price>12)]'))
    # jsonpath indices start at 0
    # print(jsonpath.jsonpath(json_obj,'$.store.book[0]'))
    # @ is the current node; its length minus 1 is the index of the last object
    # print(jsonpath.jsonpath(json_obj,'$.store.book[(@.length -1)]'))
    print(jsonpath.jsonpath(json_obj, '$.store.book[?(@.isbn)]'))

Examples
  • Example 1: Lagou city data

    # Using the Lagou city JSON file http://www.lagou.com/lbs/getAllCitySearchLabels.json as an example, get all cities

    import requests
    import jsonpath
    import json

    url = 'https://www.lagou.com/lbs/getAllCitySearchLabels.json/'
    response = requests.get(url, verify=False)
    html = response.text

    # Convert the JSON string into a Python object
    jsonobj = json.loads(html)

    # Starting from the root node, match every name node
    citylist = jsonpath.jsonpath(jsonobj, '$..name')

    # Encode the Python object as a JSON string
    content = json.dumps(citylist, ensure_ascii=False)

    with open('city.json', 'wb') as f:
        f.write(content.encode('utf-8'))
    
    • cities = jsonpath.jsonpath(jsonobj, '$..[name,id,code]')
    • content = jsonpath.jsonpath(jsonobj, '$.content[data]')
  • Example 2: book JSON data

    import json
    import jsonpath
    
    obj = json.load(open('book.json', 'r', encoding='utf-8'))
    print(type(obj))
    
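    # Use jsonpath through the function below
    # Argument 1: the parsed JSON object; argument 2: the JsonPath expression
    # $ stands for the root node
    # . is similar to / in XPath
    # Path meaning: walk step by step from the root to the author of a book.
    # * selects every book and an index selects one specific book (indices start at 0),
    # so finding the authors of all books requires *
    # Each call below overwrites ret, so only the last result is printed.
    ret = jsonpath.jsonpath(obj, '$.store.book[*].author')
    ret = jsonpath.jsonpath(obj, '$..author')
    ret = jsonpath.jsonpath(obj, '$.store..price')
    # Find the last book
    ret = jsonpath.jsonpath(obj, '$..book[(@.length-1)]')
    # Find the first two books
    ret = jsonpath.jsonpath(obj, '$..book[:2]')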
    print(ret)
    

Origin blog.csdn.net/qq_42546127/article/details/106404136