Original link: https://blog.thinkeridea.com/201907/go/csv_like_data_logs.html
Our business records a large volume of log data every day, and this data matters: it is the main basis for revenue settlement and one of the primary data sources available to the data analysis department. For logs this important, written at this frequency, we need a high-performance, safe logging component that guarantees the format integrity of every line. To that end we designed a CSV-like log assembly component; the code lives in datalog.
It guarantees the integrity of every log line while joining fields efficiently, supports arbitrary column and row separators, and also supports array fields, so one-to-many data can be captured without writing multiple log rows. Its result is []byte
data, which is convenient to hand to other components that write to a log file or the network.
Instructions for use
API list
NewRecord(len int) Record
Create a fixed-length log Record.
NewRecordPool(len int) *sync.Pool
Create a pool of fixed-length log Records.
ToBytes(sep, newline string) []byte
Join the Record with sep and append newline at the end.
ArrayJoin(sep string) string
Join the Record with sep; the result is used as the value of an array field.
ArrayFieldJoin(fieldSep, arraySep string) string
Join the Record with fieldSep; the result is used as one element of an array.
Clean()
Empty all elements of the Record; when using sync.Pool, call this before putting the Record back into the Pool, to avoid memory leaks.
UnsafeToBytes(sep, newline string) []byte
Join the Record with sep and append newline at the end; separators found in the data are replaced in place, which mutates the strings the fields refer to.
UnsafeArrayFieldJoin(fieldSep, arraySep string) string
Join the Record with fieldSep as one element of an array; separators found in the data are replaced in place, which mutates the strings the fields refer to.
The underlying type is Record []string
: a slice of strings representing one log line or one array field. It should be used with a fixed length, because data logs are usually formatted and each column has its own meaning. Create one with NewRecord(len int) Record
or NewRecordPool(len int) *sync.Pool
. I recommend NewRecordPool
: create the pool when the program initializes, then fetch a Record
from it at runtime, which is more efficient. But each time a Record goes back into the Pool
you need to call Clean
to empty the Record
, so the strings it references can be garbage-collected and you avoid a memory leak.
Practice
We need every column of a log line to keep its meaning, so we create a fixed-length Record
. But how do we keep the columns consistent? Go's enumerated constants guarantee this nicely. For example, we define constants for the log columns:
const (
    LogVersion = "v1.0.0"
)
const (
    LogVer = iota
    LogTime
    LogUid
    LogUserName
    LogFriends
    LogFieldNumber
)
LogFieldNumber
is the number of columns in the log, i.e. the length of the Record
. We then use NewRecordPool
to create the pool and use the constant names as subscripts when writing fields, so there is no risk of the log columns ending up in the wrong order through carelessness.
var w bytes.Buffer // a log writer component
var pool = datalog.NewRecordPool(LogFieldNumber) // create a buffer pool
func main() {
    r := pool.Get().(datalog.Record)
    r[LogVer] = LogVersion
    r[LogTime] = time.Now().Format("2006-01-02 15:04:05")
    // check whether the user data exists
    //if user != nil {
    r[LogUid] = "Uid"
    r[LogUserName] = "UserNmae"
    //}
    // join one line of log data
    data := r.Join(datalog.FieldSep, datalog.NewLine)
    r.Clean()   // empty the Record
    pool.Put(r) // put it back into the pool
    // write to the log
    if _, err := w.Write(data); err != nil {
        panic(err)
    }
    // print the log data
    fmt.Println("'" + w.String() + "'")
}
Running the program produces the output below. Since the real delimiters are invisible characters, the output uses , in place of the field separator, ;\n in place of the newline, / in place of the array field separator, and - in place of the array separator.
'v1.0.0,2019-07-18,11:39:09,Uid,UserNmae,;\n'
Even though we did not record the LogFriends
column, it still holds a place in the log line. And if user
is nil
, LogUid
and LogUserName
need no special handling: we simply don't write them, their positions stay occupied, and the log cannot get scrambled because of it.
Using a pool makes good use of memory and avoids excessive allocation. Each field of the Record is a string, and a plain assignment carries very little overhead: it only copies the string header, which points at the existing data, so no extra memory is allocated. For details, see the article on string mistakes and optimization suggestions.
Record.Join
joins a log line efficiently, which helps us write log files quickly; the design section below explains Join
in detail.
Logs containing arrays
Sometimes a column does not hold a single value. For example, LogFriends above records information about the friends related to the current record, which may be a set of data. datalog provides some simple helper functions that can be combined to handle this, as in the following example:
// define the columns of the LogFriends array
const (
    LogFriendUid = iota
    LogFriendUserName
    LogFriendFieldNumber
)
var w bytes.Buffer // a log writer component
var pool = datalog.NewRecordPool(LogFieldNumber) // pool for whole log lines
var frPool = datalog.NewRecordPool(LogFriendFieldNumber) // pool for the LogFriends array field
func main() {
    // at runtime
    r := pool.Get().(datalog.Record)
    r[LogVer] = LogVersion
    r[LogTime] = time.Now().Format("2006-01-02 15:04:05")
    // check whether the user data exists
    //if user != nil {
    r[LogUid] = "Uid"
    r[LogUserName] = "UserNmae"
    //}
    // build an array field; its length is not fixed
    r[LogFriends] = GetLogFriends(rand.Intn(3))
    // join one line of log data
    data := r.Join(datalog.FieldSep, datalog.NewLine)
    r.Clean()   // empty the Record
    pool.Put(r) // put it back into the pool
    // write to the log
    if _, err := w.Write(data); err != nil {
        panic(err)
    }
    // print the log data
    fmt.Println("'" + w.String() + "'")
}
// a function that builds the LogFriends field
func GetLogFriends(friendNum int) string {
    // create a Record sized to the array; the element count is usually not fixed,
    // and since the whole result becomes a single field of the log line, it does not break the row layout
    fs := datalog.NewRecord(friendNum)
    // get a single instance from the pool; it can be reused repeatedly
    fr := frPool.Get().(datalog.Record)
    for i := 0; i < friendNum; i++ {
        // fr.Clean() — if not every field is assigned, clean the Record before or after each use so it can be reused safely
        fr[LogFriendUid] = "FUid"
        fr[LogFriendUserName] = "FUserName"
        // join the fields of one element into a single array unit
        fs[i] = fr.ArrayFieldJoin(datalog.ArrayFieldSep, datalog.ArraySep)
    }
    fr.Clean()    // empty the Record
    frPool.Put(fr) // put it back into the pool
    // join the array units and return a string used as one column of the log line
    return fs.ArrayJoin(datalog.ArraySep)
}
Running the program produces the output below. Since the real delimiters are invisible characters, the output uses , in place of the field separator, ;\n in place of the newline, / in place of the array field separator, and - in place of the array separator.
'v1.0.0,2019-07-18,11:39:09,Uid,UserNmae,FUid/FUserName-FUid/FUserName;\n'
When parsing, the array can be parsed as a single field, which greatly increases the flexibility of the data log. I do not recommend nesting too many levels, though: a data log should be clear and concise, but one level of nesting can be useful in some special scenarios.
Best Practices
ToBytes
and ArrayFieldJoin
replace any separators found inside field data (with a space) before joining. For this purpose datalog defines four separators, all invisible characters that rarely appear in real data; but when the data does contain them, they must be replaced to avoid corrupting the log structure.
Although the components support arbitrary separators, to avoid corrupting data we should pick rare, invisible, single-byte characters as delimiters. The newline is a special case, because most log-reading components use \n
as the line separator. If \n
is very rare in your data you can simply use \n
; datalog defines \x03\n
as its line break, which stays compatible with ordinary log readers: with a little extra work they can parse the log correctly.
UnsafeToBytes
and UnsafeArrayFieldJoin
perform better but, as their names say, they are not safe: they use exbytes.Replace to substitute separators in place, which mutates the original string data the fields point to. Unless your log data contains a great many separators that need replacing, I do not recommend them, because they only improve performance when a replacement actually happens.
I use UnsafeToBytes
and UnsafeArrayFieldJoin
heavily in my services. I always write the log at the very end of a request, when I can be sure none of the referenced data will be used again, so I don't have to worry about the in-place replacement silently changing data elsewhere. That may be an acceptable practice, but I still do not recommend these functions by default.
Design explanation
datalog does not offer many constraining features; it contains only a practice and a set of auxiliary tools, and before using it we need to understand that practice.
It helps us create a fixed-length log line, or a sync.Pool
of them; we combine it with enumerated constants to record data, and its helpers join each column into the required log format.
The helper methods it provides have been proven in real projects, take many details into account, and treat high performance as the core design goal; using it greatly reduces the cost of developing such components. The rest of this section analyzes its parts.
What I find most worth mentioning is the Join
method it provides, which saves two memory allocations compared with strings.Join
. Let's start the analysis there.
// Join joins the Record with sep and appends suffix at the end
// It is similar to strings.Join, but avoids the extra allocation caused by appending a suffix (usually the line break) after joining
// It returns the needed []byte directly, avoiding a type conversion and the allocation cost that comes with it
func (l Record) Join(sep, suffix string) []byte {
    if len(l) == 0 {
        return []byte(suffix)
    }
    n := len(sep) * (len(l) - 1)
    for i := 0; i < len(l); i++ {
        n += len(l[i])
    }
    n += len(suffix)
    b := make([]byte, n)
    bp := copy(b, l[0])
    for i := 1; i < len(l); i++ {
        bp += copy(b[bp:], sep)
        bp += copy(b[bp:], l[i])
    }
    copy(b[bp:], suffix)
    return b
}
Log-writing components usually take []byte
as input, and Join returns []byte
directly, whereas strings.Join
returns a string built from its internal buf
, so a type conversion, and thus an extra allocation, would be needed at the end. We not only need to join the columns of a row with the field separator, but also append the line separator at the end; Join's suffix
parameter does this for us, so we don't have to concatenate the line separator after Join
returns, which saves another allocation.
This is precisely the essence of datalog's design: rather than reaching for standard-library methods, it designs methods that fit the scenario, achieving higher performance and a better experience.
// ToBytes joins the Record with sep and appends the newline line break at the end
// Note: occurrences of sep and newline inside the fields are replaced (with a space)
func (l Record) ToBytes(sep, newline string) []byte {
    for i := len(l) - 1; i >= 0; i-- {
        // check up front whether special characters are present, so the replacement can be skipped
        if strings.Index(l[i], sep) < 0 && strings.Index(l[i], newline) < 0 {
            continue
        }
        b := []byte(l[i]) // this reallocates, so the in-place replacement cannot modify the referenced string
        b = exbytes.Replace(b, exstrings.UnsafeToBytes(sep), []byte{' '}, -1)
        b = exbytes.Replace(b, exstrings.UnsafeToBytes(newline), []byte{' '}, -1)
        l[i] = exbytes.ToString(b)
    }
    return l.Join(sep, newline)
}
ToBytes
is a very important function, the one used most frequently in the component. Before joining, it replaces field and line separators found inside each field. A check is made up front: only if a field actually contains a separator does []byte(l[i])
copy the column's data, after which exbytes.Replace performs a high-performance in-place replacement. Because the input is a freshly allocated copy, the in-place replacement cannot affect other data.
It then joins the columns with the Join
method introduced above. Doing the same thing with strings.Join
would look like []byte(strings.Join([]string(l), sep) + newline)
, which multiplies the memory allocations; the component's design sidesteps these extra costs and improves performance.
// UnsafeToBytes joins the Record with sep and appends the newline line break at the end
// Note: occurrences of sep and newline in the fields are replaced in place, which mutates every string referencing the same data
// You must understand what this does, or the results will surprise you; in exchange it greatly reduces allocations and improves performance
// I use it heavily in my projects: I always log at the very end of a request, so the referenced strings are never accessed again
func (l Record) UnsafeToBytes(sep, newline string) []byte {
    for i := len(l) - 1; i >= 0; i-- {
        b := exstrings.UnsafeToBytes(l[i])
        b = exbytes.Replace(b, exstrings.UnsafeToBytes(sep), []byte{' '}, -1)
        b = exbytes.Replace(b, exstrings.UnsafeToBytes(newline), []byte{' '}, -1)
        l[i] = exbytes.ToString(b)
    }
    return l.Join(sep, newline)
}
UnsafeToBytes
is similar to ToBytes
but skips the separator check, because exbytes.Replace already checks internally, and it uses exstrings.UnsafeToBytes to convert each string to []byte
without copying, which is very efficient. It must not be used on string literals (mutating them would crash), but log data is almost always assigned at runtime. And if a literal does sneak in, there is little to worry about as long as the delimiters are rare special characters: program literals seldom contain them, and when exbytes.Replace performs no replacement the call is safe.
// Clean empties all elements of the Record; when using sync.Pool, call it before putting the Record back into the Pool, to avoid memory leaks
// It has very little overhead and can be used freely: it only assigns the empty string to each field, which is handled at compile time with no extra allocation
func (l Record) Clean() {
    for i := len(l) - 1; i >= 0; i-- {
        l[i] = ""
    }
}
The Clean
method is simpler still: it just sets every column to the empty string. The empty string is a special case, resolved at compile time, and all of them point to the same memory, so there is no extra overhead.
// ArrayJoin joins the Record with sep; the result is used as the value of an array field
func (l Record) ArrayJoin(sep string) string {
    return exstrings.Join(l, sep)
}
// ArrayFieldJoin joins the Record with fieldSep; the result is used as one element of an array
// Note: occurrences of fieldSep and arraySep inside the fields are replaced (with a space)
func (l Record) ArrayFieldJoin(fieldSep, arraySep string) string {
    for i := len(l) - 1; i >= 0; i-- {
        // check up front whether special characters are present, so the replacement can be skipped
        if strings.Index(l[i], fieldSep) < 0 && strings.Index(l[i], arraySep) < 0 {
            continue
        }
        b := []byte(l[i]) // this reallocates, so the in-place replacement cannot modify the referenced string
        b = exbytes.Replace(b, exstrings.UnsafeToBytes(fieldSep), []byte{' '}, -1)
        b = exbytes.Replace(b, exstrings.UnsafeToBytes(arraySep), []byte{' '}, -1)
        l[i] = exbytes.ToString(b)
    }
    return exstrings.Join(l, fieldSep)
}
ArrayFieldJoin
replaces separators found in the element's fields and then joins them directly with exstrings.Join. exstrings.Join is an improved version of strings.Join
that performs only one memory allocation, saving allocations compared with strings.Join
; if you are interested, take a look at its source.
Summary
datalog provides a practice plus some helpers that let us record data logs quickly and focus on the data itself. The performance-sensitive parts can be left to datalog, which guarantees the performance.
I plan to offer an efficient log-reading component later, for reading logs during data analysis; it will be more efficient and convenient than general-purpose file reading, optimized specifically for log analysis. Stay tuned.
Reprint notice:
Author: Qi Yin (thinkeridea)
Link to this article: https://blog.thinkeridea.com/201907/go/csv_like_data_logs.html
Copyright: Unless otherwise noted, all articles on this blog are licensed under the CC BY 4.0 CN agreement. Please credit the source when reprinting!