Is training data outsourced too? This company takes on "contract" work for large volumes of training data annotation, and it turns out it originally did ......

 Author | Lionbridge AI

Translator | Him Zebian | Xu Veyron

Cover image | CSDN, downloaded from Visual China

Produced by | AI Technology Base Camp (ID: rgznai100)

In machine learning, preparing training data is one of the most important and time-consuming tasks. In fact, many data scientists say that a large part of data science is preprocessing, and some studies have shown that the quality of the training data matters more than the type of algorithm you use.

As a result, more and more companies are entering the artificial intelligence market to help meet the demand for training data.

How do you get machine learning training data?

There are three main ways to get training data:

  • Open-source datasets: found online through Kaggle, Google Dataset Search, or dataset aggregator sites.

  • Build your own dataset: collect or create the data and annotate it in-house.

  • Outsource data collection and annotation: use a training data service provider.

For personal projects or school work, open datasets can sometimes provide enough data for what you need to accomplish. However, when you are building and training artificial intelligence solutions for commercial use, open-source datasets often do not fit your use case, or cannot be used for commercial profit.

In addition, when you have thousands of pieces of data and only a small number of employees, collecting and annotating training data in-house is often inefficient. This leaves a third option: outsourced training data services.

 

Machine learning training data services

 

Lionbridge helps customers improve their models through a variety of machine learning training data services.

One company currently doing this type of work is Lionbridge. Looking into it, we found that some of its core services are as follows:

  • Data collection: speech and text data, handwriting data, chatbot training phrases.

  • Image and video annotation: bounding boxes, polygons, circles, straight lines, key points.

  • Text annotation: sentiment, entities, entity linking, classification.

  • Audio annotation: verbatim transcription, intelligent verbatim transcription, audio classification.

  • Content rating: ad evaluation, search evaluation, location data evaluation.

From translation to training data

Lionbridge draws on the knowledge of its global community of data scientists, computational linguists, professional translators, and annotators to create machine learning training data for a variety of use cases.

Why would a translation company do data annotation?

Lionbridge, for example, realized that its global community is an ideal workforce for data annotation.

Especially for natural language processing (NLP), professional linguists are ideal annotators for language-based annotation projects such as entity extraction and search query classification. After thorough testing and training, these same workers can also readily perform a variety of image annotation tasks for computer vision.

 

Is translation quality the same as training data quality?

        

Not necessarily. However, the quality assurance process for translation is very similar to the quality inspection protocols for artificial intelligence training data.

For example, one quality inspection step in a localization project is editorial review. During translation, one or more editors are usually needed to review the translators' output. Similarly, in many of our artificial intelligence projects, we have multiple contributors annotate the same data to check whether their annotations are consistent.

In many cases, quality management means managing the contributors. To ensure accuracy, your data must go through many processes.

  • Output Management       


The community needs many protocols in place to ensure that each contributor does their best work. One example is checking agreement between annotators to make sure each annotation is correct. This process can also help verify that the data itself is clear and the task is simple. For some projects, up to five contributors annotate the same data. You can also implement self-agreement checks to make sure each contributor's work is consistent with their own earlier work.
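As a rough illustration of the agreement check described above, here is a minimal Python sketch. It assumes each item is labeled independently by several contributors and accepts the majority label only when agreement reaches a threshold. The function name `majority_label`, the 60% threshold, and the sample labels are all hypothetical and are not taken from Lionbridge's actual process.

```python
from collections import Counter


def majority_label(labels, min_agreement=0.6):
    """Return (label, agreement, accepted) for one item's annotations.

    agreement is the share of contributors who chose the most common label.
    min_agreement is a hypothetical threshold chosen for illustration.
    """
    if not labels:
        raise ValueError("at least one annotation is required")
    winner, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    return winner, agreement, agreement >= min_agreement


# Hypothetical example: five contributors label the same two sentences.
annotations = {
    "item-1": ["positive", "positive", "positive", "negative", "positive"],
    "item-2": ["neutral", "negative", "positive", "neutral", "negative"],
}

results = {item: majority_label(labels) for item, labels in annotations.items()}

for item, (label, agreement, accepted) in results.items():
    status = "accept" if accepted else "send back for review"
    print(f"{item}: {label} ({agreement:.0%} agreement) -> {status}")
```

Items with low agreement are often the ones where the data is ambiguous or the guidelines are unclear, which is why the same check doubles as a test of the task itself.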

A good example of quality assurance for machine learning training data is their speech data collection process:

  • First, sound engineers make sure each contributor says the phrase correctly: that the contributor has not missed any words and speaks in a natural tone (as opposed to a monotonous reading voice).

  • Next, the audio files are sent to native speakers of each language, who check the sound clips against the script.

  • Finally, the files go through audio quality checks to ensure that noise and other customer-required criteria fall within a certain threshold.

This is only part of the quality inspection measures they have implemented, and these measures continue to be refined.
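The three stages above can be thought of as a sequence of gates that a clip must pass in order. The Python sketch below is only an illustration under stated assumptions: the `ClipReview` record, its field names, and the -50 dB noise limit are hypothetical, and the first two stages are human judgments that are simply recorded as booleans here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipReview:
    """Hypothetical record of the review results for one audio clip."""
    clip_id: str
    engineer_ok: bool        # stage 1: phrase complete, natural tone
    native_speaker_ok: bool  # stage 2: clip matches the script
    noise_level_db: float    # stage 3: measured background noise


def first_failed_stage(review: ClipReview, max_noise_db: float = -50.0) -> Optional[str]:
    """Return the name of the first stage the clip fails, or None if it passes all.

    max_noise_db is an assumed customer threshold, not a real standard.
    """
    if not review.engineer_ok:
        return "sound engineer check"
    if not review.native_speaker_ok:
        return "native speaker check"
    if review.noise_level_db > max_noise_db:
        return "audio quality check"
    return None


# Hypothetical review results for three clips.
reviews = [
    ClipReview("clip-001", True, True, -62.0),
    ClipReview("clip-002", True, False, -60.0),
    ClipReview("clip-003", True, True, -35.0),
]

failures = {r.clip_id: first_failed_stage(r) for r in reviews}
print(failures)
```

Stopping at the first failed stage means a clip that needs re-recording is not sent on to later reviewers.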

 

Data quality is subjective

After all, the definition of data quality depends on the project. "When it comes to the quality of training data, there is no objective definition. It depends on what the user is trying to do," said Cedric Wagrez, Lionbridge's director of AI services in Japan. "Quality depends on the user's ultimate goal and on various factors, such as the user's KPIs, precision, and custom use case."

High-quality training data is data that has been collected, annotated, and calibrated so that machine learning can help users achieve their goals.

Before quality management begins, we must first understand what the user wants.

  • Pilot projects

Before a project starts, a free consultation is provided to explain the best way to collect or annotate the data.

Next, tests and pilot projects are run to align with client expectations. Suppose you have 10,000 pieces of data to annotate. To ensure that everyone is on the same page, they take the first 100 pieces, set up the project in the system, and let the community label them. If the result is exactly what you had in mind, they continue with the remaining data. If changes are needed, the project is recalibrated according to your feedback.

Importantly, data quality is not just about clear images and tight bounding boxes. You also have to consider the people chosen to label the data, the guidelines provided, and the environment in which the data is collected.
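The pilot workflow can be sketched in a few lines of Python. This is a minimal sketch, assuming the data is a simple list and that annotation and client approval are supplied as callables. The names `split_pilot`, `run_pilot`, `annotate`, and `client_approves` are hypothetical placeholders for work that in practice is done by people.

```python
def split_pilot(items, pilot_size=100):
    """Split items into a pilot batch (the first pilot_size items) and the rest."""
    if pilot_size < 0:
        raise ValueError("pilot_size must not be negative")
    return items[:pilot_size], items[pilot_size:]


def run_pilot(items, annotate, client_approves, pilot_size=100):
    """Annotate the pilot batch first; continue only if the client approves.

    Returns (pilot_labels, remaining_labels). remaining_labels is None when
    the pilot is rejected and the project needs to be recalibrated.
    """
    pilot, remaining = split_pilot(items, pilot_size)
    pilot_labels = [annotate(item) for item in pilot]
    if not client_approves(pilot_labels):
        return pilot_labels, None
    return pilot_labels, [annotate(item) for item in remaining]


# Hypothetical data set of 10,000 items with a toy labeling rule.
data = list(range(10_000))
pilot_batch, rest = split_pilot(data)
print(len(pilot_batch), len(rest))

pilot_labels, remaining_labels = run_pilot(
    data,
    annotate=lambda item: "even" if item % 2 == 0 else "odd",
    client_approves=lambda labels: len(labels) == 100,
)
```

Labeling only the pilot batch before approval keeps the cost of a misunderstanding to about 1% of the data in this example.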

Data collection and annotation tools for text, audio, images, and video

       

Do you have employees who can label your data but need a platform to label it on? Today this need is being met as well: some service providers release their data annotation platforms as products for customers.

The AI industry is expected to add $1.5 trillion to the world economy over the next decade. As the market continues to grow, the demand for training data will keep growing too. We may therefore see even more companies of this kind enter the machine learning training data industry.

Everything is still developing, and the industry will only grow richer and more worth looking forward to!

Original:

https://hackernoon.com/get-machine-learning-training-data-using-the-lionbridge-method-a-how-to-guide-ay4f32xi

【end】

Origin blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/104831794