A method for reading billion-row Doris tables | JD Cloud technical team

  1. At work, we often need to synchronize online Doris data to the data mart. Reading from Doris is essentially the same as reading from regular MySQL, because Doris accepts MySQL-protocol clients. If the table has fewer than ten million rows, it is simplest to connect directly, then read and store the data on a single node. A Python example:
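def get_data(sql, host='', port=2000, user='', password='', db=''):
    # Works with Doris, which accepts MySQL-protocol clients
    import pymysql
    connect = pymysql.connect(host=host, port=port, user=user, password=password, db=db, charset='utf8')
    try:
        cursor = connect.cursor()
        cursor.execute('SET query_timeout = 216000;')  # unit: seconds
        cursor.execute(sql)
        result = cursor.fetchall()  # loads the whole result set into memory
        for row in result:
            pass  # process or store each row in whatever format you need
        cursor.close()
    finally:
        connect.close()  # release the connection even if the query fails
    return result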
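
Because fetchall() holds the entire result in memory, a single-node read can also be done in batches. The following is a minimal sketch, not the author's code: the helper name iter_batches and the batch size are assumptions, and it relies only on the standard DB-API fetchmany call, which pymysql cursors provide. With pymysql, passing cursorclass=pymysql.cursors.SSCursor to pymysql.connect should additionally keep the driver from buffering all rows on the client.

```python
def iter_batches(cursor, sql, batch_size=10000):
    """Yield rows from a DB-API cursor in chunks of at most batch_size.

    iter_batches is a hypothetical helper, not part of pymysql.
    """
    cursor.execute(sql)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:  # empty list/tuple means the result set is exhausted
            break
        yield rows
```

Each yielded batch can be written to storage before the next one is fetched, so peak memory is bounded by batch_size rather than by the table size.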
  2. If the data volume is larger, exceeding 10 million or even 100 million rows, single-node reading runs into timeouts and the data arrives too late to be useful. In that case you can use spark.read.jdbc to distribute the read across multiple nodes concurrently. spark.read.jdbc supports two ways of splitting the read.

Main parameters:

read.jdbc(url=url,table=remote_table,column='item_sku_id',numPartitions=50,lowerBound=lowerBound, upperBound=upperBound,properties=prop)

url: The JDBC connection string, in a format such as 'jdbc:mysql://**.jd.com:2000/database name?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true&failOverReadOnly=false&zeroDateTimeBehavior=convertToNull&useSSL=false&serverTimezone=Asia/Shanghai'

table: Either a table name or a SQL query, so conditional queries are supported. A query must be wrapped in parentheses and given an alias, for example "(SELECT count(*) sku FROM rule_price_result where dt='2023-05-10') AS tmp"
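
The wrapping rule for a query passed as table can be captured in a small helper. This is only an illustrative sketch: as_subquery is a hypothetical name, and the default alias tmp simply mirrors the example above.

```python
def as_subquery(sql, alias="tmp"):
    """Wrap a SELECT statement so it can be passed as the `table` argument.

    as_subquery is a hypothetical helper; it drops a trailing semicolon,
    which is not allowed inside a parenthesized subquery.
    """
    inner = sql.strip().rstrip(";").strip()
    return f"({inner}) AS {alias}"
```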

numPartitions: Controls the number of partitions, and therefore the maximum number of concurrent connections reading from Doris

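Choose either lowerBound+upperBound (together with column) or predicates to control the range of data each partition reads. properties carries the connection settings, such as user and password, in both cases.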

lowerBound+upperBound mode: Specify the lowest and highest values of the partition column to be read. Spark combines the number of partitions with these two bounds and mechanically divides the range into equal-width slices, one per partition.
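
As a rough illustration of this mechanical split, the sketch below imitates in plain Python how column, lowerBound, upperBound and numPartitions become one WHERE clause per partition. The helper name range_predicates is hypothetical, and the stride arithmetic is a simplification; the exact rounding differs between Spark versions. Note that the bounds only set the stride: rows below lowerBound or above upperBound are not filtered out, they land in the first and last partitions.

```python
def range_predicates(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses Spark derives from bounds.

    Hypothetical helper for illustration only; assumes an integer column
    and at least two partitions.
    """
    if num_partitions < 2 or upper <= lower:
        raise ValueError("need num_partitions >= 2 and upper > lower")
    stride = (upper - lower) // num_partitions
    clauses = []
    current = lower
    for i in range(num_partitions):
        lo = f"{column} >= {current}" if i > 0 else None
        current += stride
        hi = f"{column} < {current}" if i < num_partitions - 1 else None
        if hi is None:
            clauses.append(lo)  # last slice is open-ended upward
        elif lo is None:
            # first slice is open-ended downward and also picks up NULLs
            clauses.append(f"{hi} OR {column} IS NULL")
        else:
            clauses.append(f"{lo} AND {hi}")
    return clauses


for clause in range_predicates("item_sku_id", 0, 100, 4):
    print(clause)
```

If most item_sku_id values cluster in one slice, that partition does most of the work, which is the skew problem the predicates mode addresses.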

predicates mode: If the data distribution is skewed, you can define the range yourself by passing a predicates list, in which each entry is the WHERE condition for one partition.
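
A minimal sketch of the predicates approach follows, assuming you already know split points that balance the row counts (for example from a GROUP BY on the partition column). The helper names predicates_from_boundaries and read_with_predicates are hypothetical; the only Spark call used is spark.read.jdbc with its url, table, predicates and properties arguments. The predicates must be mutually exclusive and together cover every row, otherwise rows are duplicated or silently dropped.

```python
def predicates_from_boundaries(column, boundaries):
    """Build non-overlapping WHERE clauses from sorted, hand-picked split points.

    Hypothetical helper: N split points produce N + 1 partitions.
    """
    if not boundaries:
        raise ValueError("at least one boundary is required")
    clauses = [f"{column} < {boundaries[0]} OR {column} IS NULL"]
    for lo, hi in zip(boundaries, boundaries[1:]):
        clauses.append(f"{column} >= {lo} AND {column} < {hi}")
    clauses.append(f"{column} >= {boundaries[-1]}")
    return clauses


def read_with_predicates(spark, url, table, predicates, properties):
    """Read one partition per predicate; `spark` is an existing SparkSession."""
    return spark.read.jdbc(url=url, table=table,
                           predicates=predicates, properties=properties)
```

With this mode, the number of partitions equals the length of the predicates list, so narrow ranges can be used where the data is dense and wide ranges where it is sparse.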

Author: JD Retail Zhao Qimeng

Source: JD Cloud Developer Community

Origin my.oschina.net/u/4090830/blog/10085066