In day-to-day development we sometimes need customized insert behavior. For example, we may want to write a new record only if the table does not already contain it, or to insert the record if it is missing and update it otherwise. We call the former TryInsert and the latter InsertOrUpdate (also called upsert). Most ORM frameworks ship with functions like these, but when you need to insert data in bulk, the ORM's built-in functions are often not good enough. So let us implement TryInsert and InsertOrUpdate by hand, at the SQL level.
The two most popular open-source RDBMSs are relatively slow to adopt the SQL standard, and earlier versions of the standard had no syntax for this area at all. We therefore split the article into a MySQL part and a Postgres part, and solve the two problems above in each system's own dialect.
MySQL part
How it works
insert ignore into
If the insert fails (because of a duplicate primary key or Unique key), the error is downgraded to a warning and the number of affected rows returned is 0. This can be used to implement TryInsert().
replace into
replace has basically the same syntax as insert. It is a MySQL extension and serves as MySQL's built-in InsertOrUpdate. The basic logic of a replace statement is as follows:
ok := Insert()
if !ok {
    if duplicateKey { // on a duplicate key, delete the old row and insert again
        Delete()
        Insert()
    }
}
From this we can see how many rows a replace statement reports as affected: if the row is simply inserted, 1 row is affected; if it is an update, that is, a delete followed by an insert, 2 rows are affected.
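To make the affected-row counts concrete, here is a toy in-memory model of the logic above. It is plain Python, not MySQL itself; the function name replace_into and the use of a dict as the "table" are illustrative assumptions.

```python
def replace_into(table, key, row):
    """Toy model of REPLACE for a table with one unique key.

    `table` is a dict mapping the unique key to the stored row.
    Returns the affected-row count described in the text.
    """
    affected = 0
    if key in table:
        del table[key]   # duplicate key: delete the old row first
        affected += 1
    table[key] = row     # then insert the new row
    affected += 1
    return affected


users = {}
first = replace_into(users, 1, {"username": "alice"})    # plain insert
second = replace_into(users, 1, {"username": "alice2"})  # delete + insert
```

Because the old row is deleted rather than modified, any column that the new row does not supply falls back to its default value.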
Insert into ... on duplicate key update
This is also a MySQL extension. From the caller's point of view, the logic of ... on duplicate key update is almost the same as replace, although the existing row is updated in place instead of being deleted and re-inserted. The visible difference is that if the new values are identical to the old values, the number of affected rows returned is 0 by default. In other words, the row is only touched when the old and new values differ.
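As a rough sketch, a bulk statement of this form can be generated as follows. The helper name build_mysql_upsert is hypothetical, and the %s placeholders assume a Python MySQL driver that uses the format parameter style (PyMySQL and mysqlclient do). VALUES(col) refers to the value that would have been inserted for that column.

```python
def build_mysql_upsert(table, columns, rows):
    """Return (sql, params) for a bulk INSERT ... ON DUPLICATE KEY UPDATE.

    `rows` is a list of dicts keyed by column name. Table and column
    names are interpolated as-is, so they must come from trusted code.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    row_ph = "(" + ",".join(["%s"] * len(columns)) + ")"
    values = ",".join([row_ph] * len(rows))
    # On a key conflict, overwrite every column with the incoming value.
    updates = ",".join("{0}=VALUES({0})".format(c) for c in columns)
    sql = "INSERT INTO {0}({1}) VALUES {2} ON DUPLICATE KEY UPDATE {3}".format(
        table, ",".join(columns), values, updates)
    # Flatten the rows into one parameter list, in column order.
    params = [row[c] for row in rows for c in columns]
    return sql, params


sql, params = build_mysql_upsert(
    "table_name",
    ["user_id", "username"],
    [{"user_id": 1, "username": "a"}, {"user_id": 2, "username": "b"}],
)
```

The resulting pair can be passed to a cursor's execute(sql, params).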
Sample code
The following example is written in Go and uses gorm:
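import (
    "fmt"
    "strings"
)

// DB is assumed to be a package-level *gorm.DB that is initialized elsewhere.

type User struct {
    UserID   int64  `gorm:"column:user_id"`
    Username string `gorm:"column:username"`
    Password string `gorm:"column:password"`
    Address  string `gorm:"column:address"`
}

// BulkTryInsert inserts all rows in one statement; rows that hit a duplicate key are skipped.
func BulkTryInsert(data []*User) error {
    if len(data) == 0 {
        return nil // an empty VALUES list would be invalid SQL
    }
    str := make([]string, 0, len(data))
    param := make([]interface{}, 0, len(data)*4) // 4 columns per row
    for _, d := range data {
        str = append(str, "(?,?,?,?)")
        param = append(param, d.UserID, d.Username, d.Password, d.Address)
    }
    stmt := fmt.Sprintf("INSERT IGNORE INTO table_name(user_id,username,password,address) VALUES %s", strings.Join(str, ","))
    return DB.Exec(stmt, param...).Error
}

// BulkUpsert inserts all rows in one statement; rows that hit a duplicate key replace the old row.
func BulkUpsert(data []*User) error {
    if len(data) == 0 {
        return nil // an empty VALUES list would be invalid SQL
    }
    str := make([]string, 0, len(data))
    param := make([]interface{}, 0, len(data)*4) // 4 columns per row
    for _, d := range data {
        str = append(str, "(?,?,?,?)")
        param = append(param, d.UserID, d.Username, d.Password, d.Address)
    }
    // The only difference from BulkTryInsert is the SQL on this line.
    stmt := fmt.Sprintf("REPLACE INTO table_name(user_id,username,password,address) VALUES %s", strings.Join(str, ","))
    return DB.Exec(stmt, param...).Error
}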
Postgres part
How it works
Insert into ... on conflict (...) do nothing
on conflict is followed by the conflict target, that is, the columns of a primary key or Unique constraint. (The target may be omitted for do nothing, but it is required for do update.) The SQL means exactly what it says: when a conflict occurs on the given key, do nothing. That is TryInsert.
Insert into ... on conflict (...) do update set (...)
This SQL is more complicated. On the surface the Postgres syntax offers more freedom than MySQL's, but in practice it is verbose and bulky, and less pragmatic than MySQL's. The set clause means that you must specify which columns are updated on conflict. This is mandatory, and every field has to be listed explicitly, which is really unfriendly. It has to be written roughly like this, where EXCLUDED refers to the row that was proposed for insertion:
INSERT INTO ... on conflict (user_id, address) do update set password=EXCLUDED.password, username=EXCLUDED.username
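Since every updated column must be spelled out, it is less tedious to generate the clause than to write it by hand. A minimal sketch follows; the helper name build_pg_insert and its parameters are hypothetical, and the %s placeholders assume a driver such as psycopg2.

```python
def build_pg_insert(table, columns, conflict_cols, n_rows, update=True):
    """Build a bulk INSERT ... ON CONFLICT statement for n_rows rows.

    Table and column names are interpolated as-is, so they must come
    from trusted code. Values are left as %s placeholders.
    """
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    row_ph = "(" + ",".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {0} ({1}) VALUES {2} ON CONFLICT ({3})".format(
        table, ",".join(columns), ",".join([row_ph] * n_rows),
        ",".join(conflict_cols))
    # Only non-key columns need updating; the key columns already match.
    set_cols = [c for c in columns if c not in conflict_cols]
    if update and set_cols:
        sql += " DO UPDATE SET " + ",".join(
            "{0}=EXCLUDED.{0}".format(c) for c in set_cols)
    else:
        sql += " DO NOTHING"  # TryInsert, or nothing left to update
    return sql


cols = ["user_id", "username", "password", "address"]
upsert_sql = build_pg_insert("db.user", cols, ["user_id", "address"], 2)
try_insert_sql = build_pg_insert("db.user", cols, ["user_id", "address"], 1,
                                 update=False)
```

The same helper covers both cases from this part: update=False yields the TryInsert form, and update=True yields the upsert form.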
Sample code
This time let us imagine a practical scenario. Python is often used for scientific computing, and pandas is our favorite computing package. The io section of pandas provides foolproof functions for reading and writing files and databases, such as to_sql for writing to a database. But this function has limitations: it can append rows (roughly TryInsert) or empty the table and then insert, but it is powerless when it comes to upsert. For now, we can only implement that by hand.
As the analysis above shows, each table needs a UniqueConstraint (or a primary key) before this syntax can be used. An example is given below:
# Uses SQLAlchemy (1.4 or later, for exec_driver_sql)
from sqlalchemy import BigInteger, Column, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()


# Split a list into chunks of at most n items each
def chunks(a, n):
    return [a[i:i + n] for i in range(0, len(a), n)]


class DBUser(Base):
    __tablename__ = 'user'
    # The table needs at least one of UniqueConstraint and PrimaryKey
    __table_args__ = (UniqueConstraint('user_id', 'address'),
                      {'schema': 'db'})
    # Assumed auto-generated key, added so that the ORM mapper has a primary key
    id = Column(BigInteger, primary_key=True)
    user_id = Column(BigInteger)
    username = Column(String(200))
    password = Column(String(200))
    address = Column(String(200))

    def dtype(self):  # the dtype mapping that pandas needs
        d = {c.name: c.type for c in self.__table__.c}
        if 'id' in d:
            # id is usually auto-generated, so leave it out of the dtype given to pandas
            del d['id']
        return d

    def fullname(self):
        return self.__table_args__[-1]['schema'] + '.' + self.__tablename__

    # If DBUser also exposed the list of Unique Constraint columns,
    # the two methods below could be written as generic functions.
    # This is only an example, so we stop here.
    def bulk_upsert(self, engine, data):
        col = list(self.dtype().keys())
        col_str = '(' + ','.join(col) + ')'
        update_col = ['{0}=EXCLUDED.{0}'.format(c) for c in col]
        value_str = []
        value_args = []
        for d in data:
            # one "(%s,%s,...)" group per row
            value_str.append('(' + ','.join(['%s'] * len(col)) + ')')
            for k in col:
                value_args.append(d[k])
        stmt = ('insert into ' + self.fullname() + col_str +
                ' values ' + ','.join(value_str) +
                ' on conflict (user_id, address) do update set ' +
                ','.join(update_col))
        # %s is the psycopg2 placeholder style; a tuple is one parameter set
        with engine.begin() as conn:
            conn.exec_driver_sql(stmt, tuple(value_args))

    def bulk_insert_chunk(self, engine, data, n=1000):
        # Upsert in batches of n rows to keep each statement bounded
        for a in chunks(data, n):
            self.bulk_upsert(engine, a)
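To try the statement shape without a Postgres server, the sketch below swaps in SQLite from the Python standard library. This is a substitution, not the setup used above: SQLite (3.24 or later) accepts an on conflict ... do update clause modeled on the Postgres one, but it uses ? placeholders and has no schema prefix here. The table layout mirrors DBUser, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (user_id INTEGER, username TEXT, password TEXT, "
    "address TEXT, UNIQUE (user_id, address))"
)


def upsert(rows):
    """Bulk upsert of (user_id, username, password, address) tuples."""
    sql = (
        "INSERT INTO user (user_id, username, password, address) VALUES "
        + ",".join(["(?,?,?,?)"] * len(rows))
        + " ON CONFLICT (user_id, address) DO UPDATE SET "
        "username=excluded.username, password=excluded.password"
    )
    # Flatten the row tuples into one positional parameter list.
    params = [v for row in rows for v in row]
    conn.execute(sql, params)


upsert([(1, "alice", "pw1", "addr-a"), (2, "bob", "pw2", "addr-b")])
# Row 1 conflicts on (user_id, address) and is updated; row 3 is new.
upsert([(1, "alice2", "pw9", "addr-a"), (3, "carol", "pw3", "addr-c")])

rows = conn.execute(
    "SELECT user_id, username, password FROM user ORDER BY user_id"
).fetchall()
```

After the second call the table holds three rows, with the first one carrying the updated username and password.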