(Repost) How did Mushroom Street solve its data annotation problem by building a platform?

Academic and industrial community efforts have produced open datasets in many areas, from the entry-level MNIST to the famous ImageNet, covering common scenarios. But real business often involves niche domains with no open datasets, such as clothing types and styles, which means building your own training dataset: annotating in-house, outsourcing or hiring temporary staff (providing your own tools), or even fully entrusting the work to a professional annotation company (no need to provide tools, but at high cost). Mushroom Street has a large volume of annotation needs, so for overall cost and efficiency we built a unified annotation platform that supports many businesses. A rough diagram is shown below.

 

Common annotation scenarios

From a domain perspective, Mushroom Street's machine-learning business falls into three categories: CTR (click-through rate), computer vision, and NLP. CTR-related businesses such as recommendation ranking collect their data through event tracking. Most computer vision and NLP training tasks, by contrast, need manually annotated datasets.

Looking beyond Mushroom Street to the wider industry, common annotation scenarios can be divided into the following two categories (audio scenarios are relatively rare and out of scope here):

Computer vision

Classification: image and video classification, such as classifying clothing by color or type.

Segmentation: splitting regions out of an image, such as segmenting the road from a traffic image, or separating trousers, jackets, and other garments from a fashion image.

Object detection: marking the target object with a rectangular bounding box and a label, such as boxing clothes and shoes in a fashion image, or cars in a traffic image.

NLP

Category: text classification, such as sentiment classification.

Entity recognition: extracting entities with specific meanings from text, such as labeling product names and descriptive adjectives in a product description.

Translation: conversion between different languages, such as English to Chinese.

Open-source annotation tools

GitHub hosts many open-source annotation tools, covering computer vision, NLP, and other areas. In computer vision especially, many good projects have sprung up, some from well-known communities and commercial companies such as OpenCV and Microsoft. I have compiled the popular, well-maintained annotation projects in computer vision, NLP, and other fields in the table below.


 

These excellent open-source projects each target a specific annotation niche; after installation and configuration, users can start annotating, and some even support generating training samples. This suits small teams with a single algorithm business well. But for a company with complex algorithm operations, large data volumes, and special scenarios, building directly on these tools can bring huge maintenance and management costs:

1) Service management: with many annotators (especially off-site outsourced ones), desktop tools bring serious deployment and maintenance problems, and distributing and allocating large amounts of data is complex and error-prone. An annotation service should therefore provide a unified web portal: annotators, wherever they are and whatever operating system they use, simply log in to the web interface to work; for R&D staff, maintaining one unified front end and back end is far less work than maintaining multiple annotation tools.

2) Data management: some tools save data to local XML files, others to MySQL or NoSQL, and data formats differ significantly across projects, leading to high management costs and risks. Data should be stored uniformly and reliably; this not only streamlines maintenance but also benefits the downstream sample-building module.

3) User management: user and permission management is an important requirement for multi-person annotation, yet it is missing from most annotation tools. It ensures security while also recording who annotated and who reviewed each item, which facilitates traceability and payment settlement.

For a team with many annotation tasks, then, it is worth building a unified annotation platform: it standardizes processes and styles, provides an easy-to-use service for outsourced annotators, and reduces maintenance costs; meanwhile, centralized data storage and standardized data structures lay a good foundation for the downstream sample-generation module.

Design of the Mushroom Street annotation platform

This section focuses on the design of the Mushroom Street annotation platform. Our goal is a unified, scalable, easy-to-use web annotation platform that supports annotation and review by both staff and outsourced workers.

Design Points

Process-oriented vs. business-oriented

At first we tried to abstract around the business, hoping to provide a unified front-end and back-end framework. In-depth research showed, however, that different annotation scenarios differ enormously in front-end technology stack and data structure, making such an abstraction difficult and of low feasibility. From the perspective of the process, on the other hand, all annotation tasks are very similar and can be organized as follows:


 

All annotation tasks follow the procedure above. The parts of the flow that are exactly the same share a common set of code and logic, while the parts that differ are implemented independently by each business, for example data parsing during import and the annotation and review pages. To onboard a new annotation business, one only needs to implement the data-parsing, annotation, and review-page logic. For the front end in particular, since each business's pages are independent, each can choose its own framework as needed and conveniently migrate code from open-source projects, giving good scalability.
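As a sketch of this split between the shared flow and per-business hooks (the class and method names below are illustrative, not the platform's actual code), each project could implement a small interface that the common pipeline calls:

```python
import json
from abc import ABC, abstractmethod


class AnnotationProject(ABC):
    """Per-business hooks; the shared pipeline handles everything else."""

    @abstractmethod
    def parse_import(self, raw_line: str) -> dict:
        """Turn one line of an imported file into a data record."""

    @abstractmethod
    def serialize_label(self, label) -> str:
        """Encode an annotation result for storage in the DB."""

    @abstractmethod
    def deserialize_label(self, text: str):
        """Decode a stored annotation result for display or training."""


class ClothingClassification(AnnotationProject):
    """Hypothetical project: image classification by clothing type."""

    def parse_import(self, raw_line: str) -> dict:
        # Assume import files contain one image URL per line.
        return {"data": raw_line.strip(), "label": None}

    def serialize_label(self, label) -> str:
        return json.dumps({"category": label})

    def deserialize_label(self, text: str):
        return json.loads(text)["category"]
```

Onboarding a new business would then mean adding one such subclass (plus its pages), while import, task assignment, and review flow stay in shared code.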

Data Management

Annotation data includes images, videos, NLP text, and metadata (including annotation results). Images and videos occupy the most storage: tens of millions of images take terabytes of space, so cloud object storage is a better choice than local storage. NLP text, annotation results, and other metadata take little space and are strongly structured, making them suitable for a database.

Implementation

Framework selection

Most open-source web annotation tools are built on Django, so using Django makes it easier to port code from those Python projects. In addition, algorithm engineers tend to know Python better than other languages, which makes it easy for them to contribute parts of the logic, such as data parsing.

Whenever a new annotation project is onboarded, the following command creates a Django application, and the project's differentiated parts of the flow live in the corresponding directory:

python manage.py startapp {mark_app}

Taking the Mushroom Street annotation platform as an example, these differentiated parts of the flow fall into the following aspects:

Parsing the data files to be imported: raw data differs greatly across annotation tasks; images are usually URLs on a CDN, while NLP data is plain text, so each project implements its own parsing logic.

Serializing and deserializing annotation results: annotation results differ greatly across tasks, so each project implements the serialization and deserialization used when writing to and reading from the DB.

Annotation and review web pages: pages differ greatly across annotation tasks, and each project can choose an appropriate front-end framework for its implementation.

Data Management

We store all images and videos in cloud object storage, which guarantees high reliability and gives each picture or video a globally unique URL, so importing data to be annotated only requires importing URLs. During annotation and review, the front end downloads the data from the CDN by URL and displays it, which is convenient and efficient.

Metadata is stored in MySQL in two tables. One is the user table, for managing users and permissions. The other is the annotation table, which records the basic information of each data item, including the raw data (data), the annotation result (label), the annotator, the reviewer, the status, and other project-related information. Its key columns are as follows:

`id`         INT(11)
`data`       MEDIUMTEXT   # URL of the image or video, NLP text, or a mix of image and text
`label`      MEDIUMTEXT   # annotation result, usually JSON, serialized/deserialized per project
`project`    VARCHAR(60)  # the annotation project this row belongs to
`importuser` VARCHAR(60)  # who imported the data to be annotated
`markuser`   VARCHAR(60)  # annotator
`checkuser`  VARCHAR(60)  # reviewer
...                       # others, such as time and status

The contents of the data and label fields vary widely across businesses. data is generally the URL of an image or video, NLP text, or a combination of image and text; label differs even more, with simple structures for classification projects and complex ones for segmentation projects. We therefore made both data and label text fields, and let each project define its own data structure and implement serialization and deserialization, which greatly improves scalability. Since all annotation data shares this one table, we avoid maintaining a separate table per project, reducing onboarding and maintenance costs.
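As an illustration (the exact JSON schemas below are hypothetical, since each project defines its own), the same text label column can hold structures of very different complexity:

```python
import json

# A classification project's label: a single category string.
cls_label = json.dumps({"category": "dress"})

# A detection project's label: a list of boxes, each with coordinates and a class.
det_label = json.dumps({
    "boxes": [
        {"x": 10, "y": 20, "w": 100, "h": 150, "class": "shoe"},
        {"x": 200, "y": 40, "w": 80, "h": 120, "class": "bag"},
    ]
})

# Both round-trip through the same MEDIUMTEXT column; only the
# project-specific deserializer needs to know the structure.
restored = json.loads(det_label)
print(len(restored["boxes"]))  # 2
```

Because the column itself is schema-free, adding a new label structure never requires a schema migration, only a new serializer in the project's code.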

Deployment Architecture

The platform architecture is simple: data lives in MySQL and object storage, and the service is deployed as a StatefulSet in Kubernetes, which ensures high availability. Users access the Kubernetes Service, which load-balances requests to the Pods.
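A minimal sketch of such a deployment (the names, image, and ports are placeholders, not the platform's actual manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: annotation-web
spec:
  selector:
    app: annotation-web
  ports:
    - port: 80
      targetPort: 8000        # Django app port (assumed)
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: annotation-web
spec:
  serviceName: annotation-web
  replicas: 3
  selector:
    matchLabels:
      app: annotation-web
  template:
    metadata:
      labels:
        app: annotation-web
    spec:
      containers:
        - name: web
          image: registry.example.com/annotation-web:latest  # placeholder image
          ports:
            - containerPort: 8000
```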

We also provide a sample-building module: it fetches the basic data from MySQL, downloads the images or videos from the corresponding CDN, and finally generates training samples in formats such as TFRecords.
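A sketch of this sample-building step (the fetch and download functions are stand-ins; a real version would query MySQL and write the pairs out with TensorFlow's TFRecord API):

```python
import json


def fetch_annotated_rows():
    """Stand-in for a MySQL query: rows with a data URL and serialized label."""
    return [
        {"data": "https://cdn.example.com/img1.jpg",
         "label": json.dumps({"category": "dress"})},
        {"data": "https://cdn.example.com/img2.jpg",
         "label": json.dumps({"category": "shoe"})},
    ]


def download(url: str) -> bytes:
    """Stand-in for a CDN download; a real version would fetch the URL."""
    return b"\xff\xd8fake-jpeg-bytes"


def build_samples():
    """Pair raw image bytes with deserialized labels, ready to be written
    out in a training format such as TFRecords."""
    samples = []
    for row in fetch_annotated_rows():
        image_bytes = download(row["data"])
        label = json.loads(row["label"])["category"]
        samples.append((image_bytes, label))
    return samples
```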

 

 

To ensure data quality, algorithm engineers usually do some review work. But because the annotated data volume is so large, even a 10% spot-check rate often means tens of thousands of items to review per project. For some annotation projects, our platform therefore supports training a model on the early batches of data, using it to flag outliers in the subsequent annotation results, and then doing a second round of manual review on just those outliers, which improves review efficiency.
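One simple way to realize this model-assisted spot check (a sketch under assumptions; predict stands in for the model trained on the early batches) is to flag items where the model disagrees with, or is unsure about, the human label:

```python
def predict(item):
    """Stand-in for a model trained on early annotation batches:
    returns (predicted_label, confidence)."""
    fake_model = {"img1": ("dress", 0.95),
                  "img2": ("shoe", 0.55),
                  "img3": ("bag", 0.90)}
    return fake_model[item]


def flag_for_review(annotations, min_confidence=0.7):
    """Flag items whose human label disagrees with the model, or where
    the model itself is unsure; only these go to the second manual round."""
    flagged = []
    for item, human_label in annotations:
        pred, conf = predict(item)
        if pred != human_label or conf < min_confidence:
            flagged.append(item)
    return flagged


# img3 was labeled "dress" by a human but the model predicts "bag" -> flagged;
# img2 agrees with the model, but the confidence is low -> flagged.
annotations = [("img1", "dress"), ("img2", "shoe"), ("img3", "dress")]
```

Instead of reviewing a flat 10% random sample, reviewers then concentrate on the flagged subset, where labeling errors are most likely.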

In image segmentation and NLP, some open-source annotation tools already include intelligent assistance algorithms that help annotators work more efficiently. I think this is the trend for future annotation projects; especially for projects with high annotation cost and large data volume, intelligent assisted annotation is worth considering to reduce cost and improve efficiency.

 




Origin www.cnblogs.com/chenyusheng0803/p/12240930.html