Several Ways to Handle the Cartesian Product with PyODPS DataFrame

PyODPS provides a DataFrame API for large-scale data analysis and preprocessing, similar to pandas. This article describes how to perform Cartesian product operations with the PyODPS interface.

The most common scenario for a Cartesian product is a pairwise comparison or operation between any two records. Take calculating geographic distance as an example: suppose a large table Coordinates1 stores the latitude and longitude of target points, M rows in total, and a small table Coordinates2 stores the latitude and longitude of starting points, N rows in total. We now need to find, for every target point, the starting point nearest to it. For each target point, we must compute its distance to every starting point and then take the minimum, so the whole process generates M * N intermediate records, which is a Cartesian product problem.

The Haversine Formula

First, some brief background: given the latitude and longitude coordinates of two points, the distance between them can be computed with the haversine formula, written in Python as follows:

def haversine(lat1, lon1, lat2, lon2):
    # lat1, lon1: latitude and longitude of point 1
    # lat2, lon2: latitude and longitude of point 2
    import numpy as np

    dlon = np.radians(lon2 - lon1)
    dlat = np.radians(lat2 - lat1)
    a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean radius of the Earth in kilometers
    return c * r
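
As a quick sanity check of the formula (an illustrative example, not from the original article; the coordinates are rough values for Beijing and Shanghai):

# approximate Beijing (39.90, 116.41) to Shanghai (31.23, 121.47)
print(haversine(39.90, 116.41, 31.23, 121.47))  # roughly 1067 km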

MapJoin

The most recommended approach is mapjoin. Using mapjoin in PyODPS is very simple: just specify mapjoin=True when joining two DataFrames, and the right-hand table will be map-joined when the job executes.

In [3]: df1 = o.get_table('coordinates1').to_df()

In [4]: df2 = o.get_table('coordinates2').to_df()

In [5]: df3 = df1.join(df2, mapjoin=True)

In [6]: df1.schema
Out[6]:
odps.Schema {
  latitude    float64
  longitude   float64
  id          string
}

In [7]: df2.schema
Out[7]:
odps.Schema {
  latitude    float64
  longitude   float64
  id          string
}

In [8]: df3.schema
Out[8]:
odps.Schema {
  latitude_x    float64
  longitude_x   float64
  id_x          string
  latitude_y    float64
  longitude_y   float64
  id_y          string
}

As seen above, duplicate column names get _x and _y suffixes by default; custom suffixes can be defined by passing a two-element tuple through the suffixes parameter of join, as in the sketch below.
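
A minimal sketch of custom suffixes (same tables as above; suffixes is the join parameter mentioned in the text, and the suffix names here are illustrative):

df3 = df1.join(df2, mapjoin=True, suffixes=('_target', '_start'))
# overlapping columns are now latitude_target / latitude_start, and so on

Once the joined table is available, the distance can be computed directly with the built-in functions of PyODPS DataFrame, which is both concise and efficient: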

In [9]: r = 6371
   ...: dis1 = (df3.latitude_y - df3.latitude_x).radians()
   ...: dis2 = (df3.longitude_y - df3.longitude_x).radians()
   ...: a = (dis1 / 2).sin() ** 2 + df3.latitude_x.radians().cos() * df3.latitude_y.radians().cos() * (dis2 / 2).sin() ** 2
   ...: df3['dis'] = 2 * a.sqrt().arcsin() * r
                                                                                                                                                                                                        
In [12]: df3.head(10)                                                                                                                        
Out[12]: 
    latitude_x  longitude_x id_x  latitude_y   longitude_y id_y       dis
0   76.252432    59.628253    0   84.045210     6.517522    0  1246.864981
1   76.252432    59.628253    0   59.061796     0.794939    1  2925.953147
2   76.252432    59.628253    0   42.368304    30.119837    2  4020.604942
3   76.252432    59.628253    0   81.290936    51.682749    3   584.779748
4   76.252432    59.628253    0   34.665222   147.167070    4  6213.944942
5   76.252432    59.628253    0   58.058854   165.471565    5  4205.219179
6   76.252432    59.628253    0   79.150677    58.661890    6   323.070785
7   76.252432    59.628253    0   72.622352   123.195778    7  1839.380760
8   76.252432    59.628253    0   80.063614   138.845193    8  1703.782421
9   76.252432    59.628253    0   36.231584    90.774527    9  4717.284949

In [13]: df1.count()                                                                                                                         
Out[13]: 2000

In [14]: df2.count()                                                                                                                         
Out[14]: 100

In [15]: df3.count()                                                                                                                         
Out[15]: 200000

df3 now contains the M * N records. To find the minimum distance to each target point, simply call groupby followed by the min aggregation function on df3.


In [16]: df3.groupby('id_x').dis.min().head(10)                                                                                              
Out[16]: 
       dis_min
0   323.070785
1    64.755493
2  1249.283169
3   309.818288
4  1790.484748
5   385.107739
6   498.816157
7   615.987467
8   437.765432
9   272.589621

DataFrame Custom Functions

If we also need to know which starting point, that is, which id in the small table, yields the minimum distance, one option is to run MapReduce after the mapjoin; but there is another way: the DataFrame apply method. To run a custom function over one row of data at a time, use apply with axis=1, which indicates a row-wise operation.
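
A minimal sketch of the call shape (the lambda and output name are illustrative; the column names follow the coordinates tables above, and reduce=True collapses the output to a single column whose type is given by types):

# row-wise apply: the function receives one row of df1 per call
df1.apply(lambda row: row.latitude + row.longitude,
          axis=1, reduce=True, types='float64').rename('lat_plus_lon')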

Table Resources

Note that the UDF passed to apply is executed on the server side, so expressions like df=o.get_table('table_name').to_df() cannot be used inside the function to fetch table data; for the underlying principles, see "Where does the PyODPS DataFrame code run?". In this article's example, every record of table 1 must be computed against every record of table 2, so table 2 needs to be used as a table resource, which the custom function then references. Using table resources in PyODPS is also very convenient: simply pass a collection to the resources parameter. Each resource in the collection is an iterable, not a DataFrame object, so DataFrame interfaces cannot be called on it directly; each iteration yields a namedtuple, whose values can be accessed by field name or by offset.

## use a DataFrame UDF

df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()

def func(collections):
    import pandas as pd

    # resources are passed in as a list; the first item is the table resource
    collection = collections[0]

    ids = []
    latitudes = []
    longitudes = []
    # iterate over the resource table once and load it into a pandas DataFrame
    for r in collection:
        ids.append(r.id)
        latitudes.append(r.latitude)
        longitudes.append(r.longitude)

    df = pd.DataFrame({'id': ids, 'latitude': latitudes, 'longitude': longitudes})

    def h(x):
        # distance from the current row to every starting point,
        # then the id of the nearest one
        df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
        return df.iloc[df['dis'].idxmin()]['id']
    return h

df1[df1.id, df1.apply(func, resources=[df2], axis=1, reduce=True, types='string').rename('min_id')].execute(
    libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])

In the custom function, the resource table is loaded into a pandas DataFrame through the loop; pandas's idxmin then makes it easy to find the row with the minimum distance, and thus the id of the nearest starting point. Furthermore, if third-party packages (such as pandas in this example) are needed inside the custom function, refer to this article.

Global Variables

When the amount of data in the small table is very small, we can even use the small table's data as a global variable in the custom function.

df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()
# download the small table to the client as a pandas DataFrame
df = df2.to_pandas()

def func(x):
    # df is captured as a global variable and pickled into the UDF
    df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
    return df.iloc[df['dis'].idxmin()]['id']

df1[df1.id, df1.apply(func, axis=1, reduce=True, types='string').rename('min_id')].execute(
    libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])

When the function is uploaded, the global variables it uses (df in the code above) are pickled into the UDF along with it. Note, however, that this approach has very limited applicability: ODPS limits the size of uploaded resources, so if the data is too large, the resource generated for the UDF cannot be uploaded. It is also best to keep the client-side third-party package versions consistent with the server side, or serialization problems may arise. This approach is therefore recommended only when the amount of data is very small.

Summary

Solving Cartesian product problems with PyODPS mainly comes down to two approaches. One is mapjoin: intuitive and high-performing; in general, a problem that mapjoin can solve should be solved with mapjoin, and the computation is best done with built-in functions for maximum efficiency, though this is less flexible. The other is the DataFrame custom function: more flexible, but with roughly the opposite performance profile (numpy or pandas can be used to win some of it back); by using a table resource, the small table is passed into the DataFrame custom function, and the Cartesian product operation is thereby completed.

Original article: yq.aliyun.com/articles/705184