How to use an API to collect data, and how does it differ from a web crawler?

Background

Many Internet companies maintain databases that store their user information, and the data in those databases has usually already been cleaned by the engineering side (crawling and low-level data parsing are mainly the work of the development department or the data collection team). As a result, many business data analysts can easily pull the massive amounts of data they need with tools such as HSQL.

However, there are also many small and medium-sized B2B companies on the market. Unlike B2C companies, they cannot rely solely on their own user data for product iteration or business growth. Instead, the engineering department parses third-party data and hands the results to the product team for business platform construction and visualization. Obtaining this third-party data requires calling third-party API interfaces, which is why many of these companies ask candidates for analyst positions to have API-calling experience: even if the company has its own database, that database only contains what the engineering team has selectively crawled and organized, so you may face the dilemma that it does not have the data you want. In that case you need to collect the data yourself in a similar way.

API Introduction

The process of calling an API is also a kind of crawling. In the crawler context, "API" usually comes up in two places: the library API and the data API.

Library API

A library API usually refers to the interface that developers provide when they release a library (such as a Python library) so that users can call it. It is a bit like picking up a package from a parcel locker: we must enter the correct information to retrieve our package, and that information is what the API interface gives us, helping us accurately locate the library and call it.
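As a trivial illustration (using Python's built-in math module, chosen here only for illustration), calling a library means using the interface it documents, without caring about the implementation behind it:

import math

# math.sqrt is part of the library's public API: we only need to know its name
# and what argument it takes, not how the square root is computed internally.
print(math.sqrt(16))  # -> 4.0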

Data API

In product or web development, the data API is like a data line running from the back end to the front end. The back-end staff organize the data they want to display and only need to pass it along this data line to the front-end developers, who can then visualize it as required. This kind of interface can also be opened up for outside use.
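As a rough sketch, consuming a data API usually looks like this (the endpoint and field names below are hypothetical, just to show the shape of a typical JSON response):

import requests

# Hypothetical data API endpoint that already returns structured JSON
resp = requests.get('https://api.example.com/v1/posts', params={'user_id': '123'})
data = resp.json()
for post in data.get('data', []):
    print(post['id'], post['created_time'])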

Compared with a web crawler, the data API is simpler and more efficient by design: the interface already holds the data that people need, so we do not have to spend much effort parsing web pages. Crawling web pages, by contrast, often puts pressure on the server, and if your code does not browse pages at a reasonable, human-like frequency, your IP risks being blocked.
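If you do end up crawling pages directly, one common mitigation is simply pausing between requests (a minimal sketch; the URLs are placeholders):

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs
for url in urls:
    resp = requests.get(url)
    # ... parse resp.text here ...
    time.sleep(2)  # wait a couple of seconds so the request rate looks more human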

However, the data API also has some disadvantages. Although there are many API products on the market that are open to outside users, many free interfaces place tight limits on how much data you can fetch; if your needs are large, you have to pay.

Simple API crawler example

Below I will use the Facebook Graph API as an example to walk through a typical data API call process.

  1. Locate the interface: the API provides a call address (usually in URL format). This address is like telling us which row and column of the parcel locker our package sits in.
  2. Request the data: use the HTTP protocol to request the data transfer, usually by calling the get function of the requests package in Python.
  3. Set the request parameters: tell the API interface what kind of information you want. In this example, I need created_time (post time), post_id, and other fields (a short sketch of these three steps follows this list).
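A minimal sketch of the three steps above (the page id, field names and token value are placeholders, not the actual call used later in this post):

import requests

# 1. The call address (URL) of the interface
url = 'https://graph.facebook.com/PAGE_ID/posts'

# 3. The request parameters: which fields we want, plus authentication
params = {'fields': 'created_time,id', 'access_token': 'YOUR_ACCESS_TOKEN'}

# 2. Send the HTTP GET request and parse the JSON response
response = requests.get(url, params=params)
print(response.json())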

Facebook Graph API introduction document

Although the get function can send the HTTP request for us, requesting such a URL often requires identity authentication. For example, the following code will report an error:

import requests
r = requests.get('https://facebook.com/user_id')
r.json()

It returns a message saying that identity permission is required. Therefore, when using any API interface, it is best to read the website's usage documentation in advance to see exactly what information the request protocol needs.

{'documentation_url': 'https://facebook.com/user_id/#get-the-access-token',
 'message': 'Requires authentication'}

The official Facebook API documentation also states that if I want to get the post information of a Facebook account, I need to obtain that account's access token. If you want information on accounts other than your own, you also need to get their tokens in advance. There is a corresponding guide -> how to get the access token of a Facebook account.

After obtaining the token, testing the above request again returns normal results.
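Roughly, the token is passed as a request parameter (a sketch; the token string and user id below are placeholders):

import requests

fb_token = {'access_token': 'YOUR_ACCESS_TOKEN'}  # placeholder token
r = requests.get('https://graph.facebook.com/USER_ID', params=fb_token)
print(r.json())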

Code example

Goal: for each video post published between 2020-03-01 and 2020-03-07, obtain the number of views and the view time by country, gender and age group, and analyze how this blogger's video content performed during the week.

1. First, get the ID (post_id) of every post published between 3.1 and 3.7
2. Filter out the post IDs that belong to videos only
3. Get the view count of each video post and the view time in each country
4. Convert the JSON-formatted data into a dataframe
5. Visualize the view time

1. Import the required packages

import json
import requests
import numpy as np
import pandas as pd
from datetime import date, timedelta, datetime
from pandas.core.frame import DataFrame

2. Create the function that fetches the data

def get_list_of_fb_post_ids(fb_page_id, fb_token, START, END):

    '''
    Function to get all the post_ids from a given Facebook page during a certain time range
    '''

    posts = []  # stores the post_id of every post
    graph_output = requests.get('https://graph.facebook.com/'+fb_page_id+'?fields=posts.limit(10){created_time}', params=fb_token).json()
    posts += graph_output['posts']['data']

    graph_output = requests.get(graph_output['posts']['paging']['next']).json()
    posts += graph_output['data']

    while True:  # keep reading the next page until the oldest record on it is earlier than the start date
        try:
            graph_output = requests.get(graph_output['paging']['next']).json()
            posts += graph_output['data']
            if graph_output['data'][-1]['created_time'] <= START:
                break
        except KeyError:  # no further page to read
            break

    df_posts = pd.DataFrame(posts)
    df_posts = df_posts[(df_posts.created_time >= START) & (df_posts.created_time <= (datetime.strptime(END, "%Y-%m-%d") + timedelta(days=1)).isoformat())]
    df_posts.columns = ["timestamp", "post_id"]
    return df_posts

3. Assign the variables and fetch the post_ids

fb = 'EAAIkwQUa1WoBAFWmq90xbMfLHecpRga****************'  # access token (masked)
fb_token = {'access_token': fb}
user_id = '1404******'  # Facebook page id (masked)
output = get_list_of_fb_post_ids(user_id, fb_token, '2020-03-01', '2020-03-07')

The function returns a table with one row per post, containing its timestamp and post_id.

4. Get the relevant data for each video post

## Specify which metrics we want for each video post
Fields = '?metric='+'post_video_views,post_video_view_time_by_region_id'
list_metrics = ['post_video_views', 'post_video_view_time_by_region_id']

def get_video_insights(output, fb_token):
    final_output = pd.DataFrame()
    for i in output.index.values:
        post_id = output['post_id'][i]
        Type = requests.get('https://graph.facebook.com/'+post_id+'?fields=type', params=fb_token).json()
        if Type['type'] == 'video':  # keep only the posts that are videos
            try:
                insights_output = requests.get('https://graph.facebook.com/'+post_id+'/insights{}&period=lifetime'.format(Fields), params=fb_token).json()
                list1 = list_metrics
                metrics = []
                for j in range(0, len(list_metrics)):
                    metrics.append(insights_output['data'][j]['values'][0]['value'])
                metrics = DataFrame(metrics).T
                metrics.columns = list1
                col_name = metrics.columns.tolist()
                col_name.insert(col_name.index('post_video_view_time_by_region_id'), 'timestamp')
                col_name.insert(col_name.index('timestamp'), 'post_id')
                metrics = metrics.reindex(columns=col_name).reset_index()
                metrics['post_id'] = output['post_id'][i]
                metrics['timestamp'] = output['timestamp'][i]
                final_output = final_output.append(metrics)
            except:
                pass
    return final_output
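The later steps operate on a variable called metrics, so presumably the function's output was assigned to it, roughly like this (a usage sketch):

metrics = get_video_insights(output, fb_token)
metrics.head()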

The output is a table with one row per video post, containing the post_id, timestamp, and the two requested metrics (post_video_views and post_video_view_time_by_region_id).

5. Convert the JSON-formatted data into a dataframe
From the above results, we can see that the blogger published 6 video posts during this week, and the most viewed one, posted on 3.3, was played more than 400,000 times. However, the view-time-by-region metric is still in JSON format, so we need to process it separately.

First, look at the data structure of the region column: each row is a dictionary whose keys are region names and whose values are view times.

As can be seen from the above, we need to traverse each row of this column, build a new dataframe with one column per region, and fill in the corresponding values one by one.

# Traverse every row of the post_video_view_time_by_region_id column
Region = pd.DataFrame()
metrics.index = range(len(metrics))  # reset the index
metrics['post_video_view_time_by_region_id'] = metrics['post_video_view_time_by_region_id'].astype('str')
for j in metrics.index.values:
    e = eval(metrics['post_video_view_time_by_region_id'][j])  # evaluate the string, which directly returns a dict
    single_graph = []
    for i in e.keys():
        single_graph.append(e[i])
    single = pd.DataFrame(single_graph).T
    single.columns = list(e.keys())
    Region = Region.append(single)
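As a side note, eval on externally supplied strings is generally risky; ast.literal_eval does the same dict-literal conversion more safely (a sketch, with a made-up example value):

import ast

s = "{'region-1': 1200, 'region-2': 900}"  # made-up example of one row's string value
d = ast.literal_eval(s)                    # parses the dict literal without executing arbitrary code
print(d['region-1'])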

The result is a dataframe with one column per region and one row per video post.

6. Visualize the view time
First, we need to process the data in the Region table by splitting each column name into a country part and a region part. This makes it easier to map the data onto a map of the United States and see at a glance which states' users prefer the blogger's video content.

o = Region.mean().to_dict()  # compute the average view time for each region
dataset = pd.DataFrame(pd.Series(o), columns=['view time'])
dataset = dataset.reset_index().rename(columns={'index': 'region'})
dataset['state'] = dataset['region'].map(lambda x: x.split('-')[0])    # split the region name into its two parts
dataset['country'] = dataset['region'].map(lambda x: x.split('-')[1])  # split the region name into its two parts

The processed result is a table with one row per region, containing the average view time plus the split-out state and country columns.

Finally, we only need to filter out the United States regions and plot the data on a map of the United States, and we are done.
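The original post only shows the finished map, but this last step might look roughly like the sketch below (plotly is my own choice here, and the sketch assumes the country column holds 'US' and the state column holds two-letter state codes; adjust the filter to whatever the actual region format is):

import plotly.express as px

us = dataset[dataset['country'] == 'US']  # keep only United States regions (assumed code 'US')
fig = px.choropleth(us,
                    locations='state',     # assumed to be two-letter state abbreviations
                    locationmode='USA-states',
                    color='view time',
                    scope='usa',
                    title='Average video view time by state')
fig.show()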

From the map, we can see that the blogger's videos are especially popular with viewers in Texas and California. Besides continuing to tailor video strategies to users in these two states, we need to investigate further why the view time in other states is far shorter. Based on this, the blogger can give later videos a stronger regional character, and can use Facebook's regional distribution feature to push different kinds of video content to different regions, catering more efficiently to users' tastes and optimizing the blogger's own media marketing.

The above are the steps for collecting data with the Facebook Graph API. If anything is inadequate or explained incorrectly, I welcome criticism and corrections. :)


Origin blog.csdn.net/weixin_43944997/article/details/105502469