Up to one billion rows per second on a single machine: eBay open-sources its data processing framework Accelerator

Curated by | Natalie
Translated by | ignorant
Edited by | Emily
AI Frontline overview: eBay has recently open-sourced Accelerator, a proven data processing framework that provides fast data access, parallel execution, and automatic organization of source code, input data, and results. It can be used for everyday data analysis as well as in real-time recommender systems built on hundreds of thousands of large data files.

Accelerator runs on anything from laptops to rack-mounted servers, comfortably handles datasets of billions of rows, and keeps thousands of input files, computations, and results organized.

Accelerator's data throughput is typically millions of rows per second; on fast hardware, simple tasks can reach billions of rows per second.


Originally developed by the Swedish artificial intelligence company Expertmaker, Accelerator was first released in 2012, and since then it has been a core tool in numerous research projects and real-time recommender systems. In 2016, Expertmaker was acquired by eBay, which has now released the Expertmaker Accelerator as open source under the Apache License, Version 2.0.

Design goals

The main design goals of the Accelerator are as follows:

  • Simplify parallel processing of data across multiple CPUs.

  • Data throughput should be as fast as possible; even a small laptop should comfortably handle millions of rows of data.

  • Whenever possible, reuse computation results rather than recompute them. Likewise, sharing results among multiple users should be effortless.

  • A data science project may have many (hundreds of thousands of) input files and large amounts of source code and intermediate results.

  • Accelerator should remove the need to manually manage and document the data files, computations, results, and the relationships between them.

Main functionality

The Accelerator's main atomic operation is creating a job: executing a program with input data and parameters, and storing the result (the output) on disk in a job directory, together with all the information needed to reproduce the computation.

Jobs can be simple or complex computations, or containers for large datasets. Jobs can be linked, so a new job may depend on one or more older jobs.

Key Features

Accelerator provides two key functions, result reuse and data flow.

Result reuse

Before creating a new job, Accelerator checks to see if the same job has been run before. Accelerator will not create the job if it already exists, but will return a link to the existing job. This not only saves execution time, but also helps to share results among users. More importantly, it provides visibility and certainty.

Accelerator provides a mechanism to save job information from a session to a database, which helps manage jobs and their relationships to each other.

Data flow

Transferring a continuous stream of data from disk to the CPU is more efficient than performing random lookups in a database; streaming is the best way to achieve high bandwidth from disk to CPU. It requires no caching layer and makes good use of the operating system's RAM-based disk buffers.

Overall structure

Now let's look at the overall architecture of the Accelerator.

Accelerator is a client/server based application. There is a runner client and two servers, called daemon and urd, where urd is optional. The runner starts jobs on the daemon server by executing build scripts. The daemon server runs the jobs and stores the information and results of every job it has executed, using a file-system based database of workdirs. Meanwhile, information about all jobs built in a build script session is stored by the urd server in its own log-file based job database. urd is responsible for managing jobs, including storing and retrieving sessions, or lists, of previously executed related jobs.

Jobs

Jobs are created by executing small programs called methods. Methods are written in Python 2 or Python 3, and sometimes partly in C.

The simplest job: "Hello, World"

We illustrate how to create a job (method) with a simple "Hello World" program:

def synthesis():
  return "hello world"

This program needs no input parameters; it just returns a string and exits. To execute it, we also need a build script, like this one:

def main(urd):
  jid = urd.build('hello_world')

After executing this method, the user gets a link called jobid. jobid points to the directory where the execution results are stored, along with all the information needed to run the job.

If we try to execute the job again, it will not run; instead, the Accelerator returns the jobid of the previous run, because it remembers that an identical job has been executed before. To execute the job again, we would have to change the source code or the input parameters.
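For example, if the hello_world method declared an option, changing the value passed from the build script would produce a new job, while repeating the same call would return the existing one. A minimal sketch, assuming the method is extended with an option named greeting (a made-up name, following the options mechanism described in the user manual):

def main(urd):
  # Identical calls return the jobid of the already-built job;
  # changing an option (or the method's source code) creates a new job.
  jid_a = urd.build('hello_world', options=dict(greeting='hello world'))
  jid_b = urd.build('hello_world', options=dict(greeting='hello world'))  # reused: jid_b == jid_a
  jid_c = urd.build('hello_world', options=dict(greeting='hej'))          # new job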

Linking jobs

Let's assume that the hello_world job we just created is very computationally intensive and has already returned the results we want. For simplicity, we demonstrate how this works by creating a method called print_result that simply reads the result of the previous job and prints the result to stdout.

import blob

# Declare that this method takes a job reference named "hello_world_job";
# the build script supplies the actual jobid.
jobids = ('hello_world_job',)

def synthesis():
  # Load the stored return value of the hello_world job and print it.
  x = blob.load(jobid=jobids.hello_world_job)
  print(x)

To create this job, we need to extend the build script:

def main(urd):
  jid = urd.build('hello_world') 
  urd.build('print_result', jobids=dict(hello_world_job=jid))

When the build script is executed, only the print_result job is created because the hello_world job has been created before.

Job execution flow and result delivery

So far, we've seen how to create, link, and execute simple jobs. Now we turn our focus to methods. When a method executes, the Accelerator calls three functions: prepare(), analysis(), and synthesis(). A method may implement all three, but it must implement at least one of them.

The return values of all three functions can be stored in the job's directory and used by other jobs.
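A minimal sketch of this flow, with parameter names following the user manual's conventions (prepare_res and analysis_res) and the computation itself made up for illustration:

def prepare():
  # Runs once; the return value is passed to analysis() and synthesis() as prepare_res.
  return dict(factor=2)

def analysis(sliceno, prepare_res):
  # Forked once per slice; sliceno identifies the slice this process handles.
  return sliceno * prepare_res['factor']

def synthesis(analysis_res):
  # Runs once after all analysis() processes are done; analysis_res iterates
  # over their return values. Whatever is returned here becomes the job's result.
  return sum(analysis_res)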

Datasets

Datasets are Accelerator's default storage type, designed for parallel processing and high performance. Datasets are built on top of jobs, so datasets are created with methods and stored in the job directory. A single job can contain any number of datasets.

Internally, the data in a dataset is stored in a column-oriented format. Each column can be accessed independently, avoiding unnecessary reads. The data is also divided into a fixed number of slices, enabling parallel access. Datasets may be hashed: a hash function on one column decides which slice each row is stored in, so rows with the same hash value end up in the same slice.
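Conceptually, hashing picks the slice from the value of a single column, along these lines (an illustration only, not the Accelerator's actual hash function):

def slice_for_row(hash_column_value, num_slices):
  # All rows with the same value in the hashed column land in the same slice,
  # so later per-slice processing can work on them without cross-slice traffic.
  return hash(hash_column_value) % num_slices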

Import Data

Let's look at the common operation of importing a file (creating a dataset). The csvimport method handles many different file types: it can parse a wide range of CSV-formatted files and store the data as a dataset. The created dataset is stored in the resulting job. By default the dataset is named after the jobid plus the string default, but a custom name can be used.
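A build-script sketch of such an import; csvimport is one of the Accelerator's bundled methods, but the exact option names used here should be read as assumptions based on the user manual:

def main(urd):
  # Import a CSV file into a dataset. The dataset lives in the resulting
  # job and can be referenced as <jobid> or <jobid>/default.
  imp = urd.build('csvimport', options=dict(filename='file0.txt'))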

Linking datasets

Just like jobs, datasets can be linked to each other. Since datasets are built on jobs, linking datasets is simple. For example, let's say we have just imported file0.txt into imp-0, and that more data is stored in file1.txt. We can import the latter file and provide a link to the previous dataset. Because the datasets are linked, all data from both imports can now be accessed through the imp-1 (or imp-1/default) dataset reference.

Links are handy when working with data that grows over time, such as log data. We can expand a dataset with more rows by chaining, which is a very lightweight operation.
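Continuing the import sketch above, the second file can be imported with a link back to the first import. The previous parameter follows the chaining convention in the user manual and should be treated as an assumption:

def main(urd):
  # imp-0: import the first file.
  imp0 = urd.build('csvimport', options=dict(filename='file0.txt'))
  # imp-1: import the second file and chain it to the first, so iterating
  # over imp-1's dataset chain covers the rows from both files.
  imp1 = urd.build('csvimport',
                   options=dict(filename='file1.txt'),
                   datasets=dict(previous=imp0))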

Adding a new column to a dataset

Adding columns is a common operation, and Accelerator handles new columns through chaining.

The principle is simple. Say we have a "source" dataset and want to add a new column. We create a new dataset containing only the new column, and when creating it we let the Accelerator link all of the source dataset's columns into the new dataset.
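A rough sketch of this pattern using a dataset writer, where the parent argument links the source dataset's columns into the new dataset. The column names and types are made up, and the writer interface is an assumption based on the user manual:

from dataset import DatasetWriter

datasets = ('source',)

def prepare():
  # A new dataset holding only the new column; parent= links in all of
  # the source dataset's existing columns.
  dw = DatasetWriter(parent=datasets.source)
  dw.add('doubled', 'float64')
  return dw

def analysis(sliceno, prepare_res):
  dw = prepare_res
  for value in datasets.source.iterate(sliceno, 'value'):
    dw.write(value * 2)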

Parallel execution

Accelerator is designed for parallel processing, primarily through the combination of sliced datasets and parallel analysis() calls.

Iteration happens inside the analysis() function, which is forked once for each slice of the dataset. The return values of all analysis() processes are passed as input to the synthesis() function. We can merge the results explicitly, but analysis_res comes with a rather magical method, merge_auto(), which merges the results from all slices into one based on their data type.
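A sketch of this pattern, counting rows per user in parallel and merging the per-slice results (the dataset and column names are made up):

from collections import defaultdict

datasets = ('source',)

def analysis(sliceno):
  # Forked once per slice: count rows per user within this slice only.
  counts = defaultdict(int)
  for user in datasets.source.iterate(sliceno, 'user'):
    counts[user] += 1
  return counts

def synthesis(analysis_res):
  # Merge the per-slice dictionaries into one, based on their type.
  return analysis_res.merge_auto()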

urd

We've seen how Accelerator keeps track of jobs that have already been created and reuses them when possible. This saves time and links related computations together. There is, however, another layer on top of this that further improves visibility and job reuse: the urd server.

urd stores lists of jobs and their dependencies in a log-file based database. Everything that happens in a build script can be logged to urd. To do this, we need a list to store the information in, a key, and in most cases a date, to make lookup easy later.
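A build-script sketch of an urd session; the begin()/finish() calls and their list and timestamp arguments follow the user manual, while the list name and date are made up:

def main(urd):
  # Everything built between begin() and finish() is recorded in the
  # urd list "import", keyed by this date for later lookup.
  urd.begin('import', '2018-04-25')
  imp = urd.build('csvimport', options=dict(filename='file0.txt'))
  urd.finish('import')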

Performance Testing

New jobs start in fractions of a second. Below are the processing times for some different job types.

Preparing the data: importing, casting, and hashing

The sample data file is 1.1 TB (280 GB compressed) and contains 6.3 billion rows and 14 columns. Accelerator ran on a large machine with 72 cores and fast disks.

The values above are for the full dataset.

  • Import job (A) imports a gzip-compressed file. Interestingly, the import ran 30% faster than a plain zcat file.gz > /dev/null (on FreeBSD, zcat is faster).

  • Type conversion job (B) converts 5 JSON-list, 5 number, 2 date, and 2 unicode columns, averaging 172 bytes per row. The job reads more than half a gigabyte per second while writing nearly the same amount back to disk, so total disk bandwidth exceeds one gigabyte per second.

  • Hashing job (C): since hashing speed depends on the column being hashed, the value shown is the average of four hashing jobs.

Data processing

To compute Σ(a×b×c), one method reads the three columns, multiplies their values, and writes the result to a new column. A second job then sums the values in the new column.

As you can see, multiplying three float64 columns and writing the result back to disk is fast: 77 million rows per second. Summing the values is even faster: over a billion values per second. Doing the same in plain Python takes 6 seconds.
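The summing job can be sketched as a method much like the parallel example above, while the multiplication job follows the same pattern as the column-adding sketch earlier (the column name product is made up):

datasets = ('source',)

def analysis(sliceno):
  # Sum the product column within this slice.
  return sum(datasets.source.iterate(sliceno, 'product'))

def synthesis(analysis_res):
  # Add the per-slice sums into the final value.
  return sum(analysis_res)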

Conclusion

Accelerator is a tool for fast data processing. On a single machine it processes millions of rows per second, and for simple tasks it reaches one billion rows per second. Besides being fast, Accelerator reduces the work of manually managing source files, data files, computations, and the results they produce. It has been used successfully in several projects and has now been open-sourced by eBay.

Related Links:

ExpertMaker Accelerator code repository (https://github.com/eBay/accelerator)

Installer repository (https://github.com/eBay/accelerator-project_skeleton)

Accelerator User Reference Manual (https://berkeman.github.io/pdf/acc_manual.pdf)

