PyODPS provides a DataFrame API for large-scale data analysis and preprocessing, similar to pandas. This article describes how to perform Cartesian product operations with the PyODPS DataFrame interface.
The most common scenario for a Cartesian product is a pairwise comparison or computation between any two rows. Take geographical distance as an example: suppose a large table Coordinates1 stores the latitude and longitude of target points (M rows in total), and a small table Coordinates2 stores the latitude and longitude of starting points (N rows in total). We now need to find, for every target point, the nearest starting point. For each target point, we must compute its distance to every starting point and then take the minimum, so the whole process generates M * N intermediate rows, which makes this a Cartesian product problem.
Haversine formula
First, some background: given the latitude and longitude of two points, the distance between them can be computed with the haversine formula. In Python it can be expressed as follows:
def haversine(lat1, lon1, lat2, lon2):
    # lat1, lon1: latitude and longitude of point 1
    # lat2, lon2: latitude and longitude of point 2
    import numpy as np
    dlon = np.radians(lon2 - lon1)
    dlat = np.radians(lat2 - lat1)
    a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # mean Earth radius in kilometers
    return c * r
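As a quick sanity check, the formula can be run locally without any ODPS connection. One degree of longitude along the equator is roughly 111.19 km with the 6371 km mean Earth radius used above:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # great-circle distance (km) between two latitude/longitude points
    dlon = np.radians(lon2 - lon1)
    dlat = np.radians(lat2 - lat1)
    a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return c * 6371  # mean Earth radius in kilometers

# one degree of longitude along the equator
print(round(float(haversine(0.0, 0.0, 0.0, 1.0)), 2))  # → 111.19
```

Because the body uses only NumPy ufuncs, the same function works unchanged on scalars, NumPy arrays, and PyODPS DataFrame columns, which is relied on later in this article.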
MapJoin
The recommended approach is mapjoin. Using mapjoin in PyODPS is very simple: just pass mapjoin=True when joining the two DataFrames, and the right-hand table will be mapjoined during execution.
In [3]: df1 = o.get_table('coordinates1').to_df()
In [4]: df2 = o.get_table('coordinates2').to_df()
In [5]: df3 = df1.join(df2, mapjoin=True)
In [6]: df1.schema
Out[6]:
odps.Schema {
latitude float64
longitude float64
id string
}
In [7]: df2.schema
Out[7]:
odps.Schema {
latitude float64
longitude float64
id string
}
In [8]: df3.schema
Out[8]:
odps.Schema {
latitude_x float64
longitude_x float64
id_x string
latitude_y float64
longitude_y float64
id_y string
}
As shown, duplicate column names get _x and _y suffixes by default; custom suffixes can be set by passing a two-element tuple as the suffixes parameter of join. Once the joined table is available, the distance can be computed with PyODPS DataFrame's built-in functions, which is both concise and efficient:
In [9]: r = 6371
...: dis1 = (df3.latitude_y - df3.latitude_x).radians()
...: dis2 = (df3.longitude_y - df3.longitude_x).radians()
...: a = (dis1 / 2).sin() ** 2 + df3.latitude_x.radians().cos() * df3.latitude_y.radians().cos() * (dis2 / 2).sin() ** 2
...: df3['dis'] = 2 * a.sqrt().arcsin() * r
In [12]: df3.head(10)
Out[12]:
latitude_x longitude_x id_x latitude_y longitude_y id_y dis
0 76.252432 59.628253 0 84.045210 6.517522 0 1246.864981
1 76.252432 59.628253 0 59.061796 0.794939 1 2925.953147
2 76.252432 59.628253 0 42.368304 30.119837 2 4020.604942
3 76.252432 59.628253 0 81.290936 51.682749 3 584.779748
4 76.252432 59.628253 0 34.665222 147.167070 4 6213.944942
5 76.252432 59.628253 0 58.058854 165.471565 5 4205.219179
6 76.252432 59.628253 0 79.150677 58.661890 6 323.070785
7 76.252432 59.628253 0 72.622352 123.195778 7 1839.380760
8 76.252432 59.628253 0 80.063614 138.845193 8 1703.782421
9 76.252432 59.628253 0 36.231584 90.774527 9 4717.284949
In [13]: df1.count()
Out[13]: 2000
In [14]: df2.count()
Out[14]: 100
In [15]: df3.count()
Out[15]: 200000
df3 now contains M * N rows. To get the minimum distance, simply call groupby on df3 followed by the min aggregation function, which yields the minimum distance for each target point.
In [16]: df3.groupby('id_x').dis.min().head(10)
Out[16]:
dis_min
0 323.070785
1 64.755493
2 1249.283169
3 309.818288
4 1790.484748
5 385.107739
6 498.816157
7 615.987467
8 437.765432
9 272.589621
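The M × N shape of this computation can be mirrored locally with NumPy broadcasting. The sketch below uses hypothetical coordinates (M=3 targets, N=2 starting points) rather than the tables above, but it makes the intermediate distance matrix and the per-target minimum explicit:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # great-circle distance (km); works element-wise on arrays
    dlon = np.radians(lon2 - lon1)
    dlat = np.radians(lat2 - lat1)
    a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
    return 2 * np.arcsin(np.sqrt(a)) * 6371

# hypothetical data: M=3 target points (as a column) and N=2 starting points (as a row)
target_lat = np.array([[0.0], [10.0], [20.0]])  # shape (3, 1)
target_lon = np.array([[0.0], [10.0], [20.0]])
start_lat = np.array([0.0, 21.0])               # shape (2,)
start_lon = np.array([0.0, 21.0])

# broadcasting produces the full M x N distance matrix -- the Cartesian product
dis = haversine(target_lat, target_lon, start_lat, start_lon)
print(dis.shape)              # → (3, 2)
min_dis = dis.min(axis=1)     # per-target minimum, like groupby('id_x').dis.min()
nearest = dis.argmin(axis=1)  # index of the nearest starting point per target
```

The mapjoin above materializes the same M × N matrix as rows of a table, then the groupby collapses it back to M rows.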
DataFrame custom function
If we also need to know which starting point (that is, which id in Coordinates2) yields the minimum distance, we could call MapReduce after the mapjoin, but another option is DataFrame's apply method. To run a custom function over each row of the data, use apply with axis=1, which means the function operates row-wise.
Table Resources
Note that the apply UDF executes on the server side, so an expression like df=o.get_table('table_name').to_df() cannot be used inside the function to fetch table data; for the underlying reasons, refer to "Where does PyODPS DataFrame code run". In this article's example, every record of table 1 must be computed against every record of table 2, so table 2 must be declared as a resource table and then referenced inside the custom function. Using table resources in PyODPS is also very convenient: simply pass a collection to the resources parameter. The collection is an iterable, not a DataFrame object, so DataFrame interfaces cannot be called on it directly; each iteration yields a namedtuple whose values can be accessed by field name or by offset.
## use dataframe udf
df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()

def func(collections):
    import pandas as pd

    collection = collections[0]

    ids = []
    latitudes = []
    longitudes = []
    for r in collection:
        ids.append(r.id)
        latitudes.append(r.latitude)
        longitudes.append(r.longitude)

    df = pd.DataFrame({'id': ids, 'latitude': latitudes, 'longitude': longitudes})

    def h(x):
        df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
        return df.iloc[df['dis'].idxmin()]['id']

    return h

df1[df1.id, df1.apply(func, resources=[df2], axis=1, reduce=True, types='string').rename('min_id')].execute(
    libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])
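The func-returns-h pattern above is worth noting: the outer function is called once per worker with the resources to build local state, and the inner function it returns is then applied row by row. A pure-Python simulation (hypothetical coordinates, math instead of pandas) illustrates the mechanics:

```python
import math
from collections import namedtuple

Point = namedtuple('Point', ['id', 'latitude', 'longitude'])

def haversine(lat1, lon1, lat2, lon2):
    # scalar great-circle distance in kilometers
    dlon = math.radians(lon2 - lon1)
    dlat = math.radians(lat2 - lat1)
    a = math.sin(dlat / 2) ** 2 + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2
    return 2 * math.asin(math.sqrt(a)) * 6371

def func(collections):
    # called once per worker: materialize the resource table into local state
    points = list(collections[0])
    def h(x):
        # called per row: return the id of the nearest starting point
        return min(points, key=lambda p: haversine(x.latitude, x.longitude, p.latitude, p.longitude)).id
    return h

# simulate what PyODPS does: build h from the resource, then apply it per row
starts = [Point('a', 0.0, 0.0), Point('b', 21.0, 21.0)]
h = func([starts])
targets = [Point('t0', 1.0, 1.0), Point('t1', 20.0, 20.0)]
print([h(t) for t in targets])  # → ['a', 'b']
```

This is only a local sketch of the call protocol; on the server side the per-row argument is a namedtuple built from the table schema, exactly as in the real code above.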
In the custom function, the resource table is read into a pandas DataFrame via the loop; pandas idxmin then easily locates the row with the minimum distance, giving the id of the nearest starting point. In addition, if third-party packages (such as pandas in this example) are needed inside the custom function, refer to this article.
Global Variables
When the small table's data volume is tiny, we can even use the small table's data as a global variable inside the custom function.
df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()
df = df2.to_pandas()

def func(x):
    df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
    return df.iloc[df['dis'].idxmin()]['id']

df1[df1.id, df1.apply(func, axis=1, reduce=True, types='string').rename('min_id')].execute(
    libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])
When the function is uploaded, the global variables it uses (df in the code above) are pickled into the UDF. Note, however, that this approach has very limited applicability: ODPS limits the size of uploaded resource files, so too much data makes the generated UDF resource too large to upload. It is also best to ensure that the client-side versions of third-party packages match the server side, or serialization problems may arise. This approach is therefore recommended only when the amount of data is very small.
Summary
There are two main ways to solve Cartesian product problems with PyODPS. The first is mapjoin: intuitive and performant, it can solve most Cartesian product problems and is the approach we recommend, ideally combined with built-in functions for the highest efficiency, though it is less flexible. The second is a DataFrame custom function: more flexible but generally slower (pandas or numpy can recover some performance), where the small table is passed into the custom function as a table resource to complete the Cartesian product computation.