Bilibili video (subtitles are auto-generated, but they are mostly adequate)
Why care about data? According to a poll, algorithm engineers spend most of their time and energy on data.
The main content includes: Sources/Labeling/Storage/Versioning/Processing
1. Sources
How do you find sources of training data?
Most deep learning tasks now rely on a large amount of labeled data.
Of course, there are exceptions, such as RL, GANs, and semi-supervised learning, but these techniques are generally not yet widely used in industry.
The data sources are as follows:
Public datasets: since everyone can use them, models trained on them confer no competitive advantage, but they are useful as starting data for a project.
It usually costs money or time to label the data.
Data flywheel: after the application launches, continuously collect data (use product design to get users to help with annotation), annotate, and improve the model.
Semi-supervised learning
Data augmentation
Simulation-generated data (common in robotics and autonomous driving)
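To make the data-augmentation idea concrete, here is a minimal sketch (not from the lecture) of generating extra training samples from one image array with NumPy; the transforms chosen are illustrative:

```python
import numpy as np

def augment(image: np.ndarray) -> list:
    """Return simple augmented variants of an H x W x C image array."""
    return [
        np.fliplr(image),              # horizontal flip
        np.flipud(image),              # vertical flip
        np.rot90(image),               # 90-degree rotation
        np.clip(image * 1.2, 0, 255),  # brightness scaling
    ]

# A dummy 4x4 RGB "image" just to show the shapes
img = np.arange(48, dtype=np.float32).reshape(4, 4, 3)
augmented = augment(img)
print(len(augmented))  # 4 extra samples derived from one original
```

In practice libraries such as torchvision or albumentations provide richer, randomized pipelines; the point is only that each original sample can yield several label-preserving variants.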
2. Labeling
User interface (seemingly the core feature of annotation tools)
Typically supports bounding boxes, segmentation, keypoints, and 3D cuboids
Typically supports complex classification schemes
It is very important to train annotators to ensure the quality of annotations
Who will label
Hire annotators yourself (part-time): expensive, hard to recruit, and requires people management, but high quality, secure, and fast (once the team is stable)
Crowdsourcing (e.g. Mechanical Turk): cheap and easy to scale up, but less secure and requires a lot of quality-control work
Find a data labeling company. When looking for a company, you need to pay attention to:
You need to label some of the data yourself as a gold standard.
Contact several companies and compare their labeling examples on the same type of data.
Ensure that both parties have a unified understanding of the labeling standards.
Full outsourcing is more expensive; some companies provide only annotation software (no labor included)
Suggest:
Find outsourcing if you have enough money
If the budget is tight, at least use existing annotation software
Rather than crowdsourcing at scale, it is better to hire part-time annotators yourself
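The gold-standard idea above can be sketched in a few lines: mix items you labeled yourself into each annotator's queue, then score each annotator's agreement with your labels. This is an illustrative sketch, not a tool from the lecture; all names and labels are made up:

```python
# Gold-standard labels produced by you (the task owner)
gold = {"img_001": "cat", "img_002": "dog", "img_003": "cat"}

# Labels submitted by two hypothetical annotators
annotator_labels = {
    "worker_a": {"img_001": "cat", "img_002": "dog", "img_003": "cat"},
    "worker_b": {"img_001": "cat", "img_002": "cat", "img_003": "cat"},
}

def gold_accuracy(labels: dict, gold: dict) -> float:
    """Fraction of gold items the annotator labeled correctly."""
    hits = sum(labels.get(k) == v for k, v in gold.items())
    return hits / len(gold)

scores = {w: gold_accuracy(lbls, gold) for w, lbls in annotator_labels.items()}
print(scores)  # worker_b disagrees on img_002, so a score below 1.0 flags them
```

A low score flags an annotator (or a vendor) for retraining or removal; this is the "lot of quality inspection work" that crowdsourcing requires.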
3. Storage
How should data be stored? The main topics: filesystem, object storage, database, data lake
File system (filesystem): the foundation of the storage system
The basic unit of storage is a file (text or binary); files are easily overwritten, and there is no version control.
The basic form is a local hard disk.
Can be extended to a network file system to share data among multiple machines on a LAN (e.g. NFS)
Can be extended to a distributed file system, e.g. HDFS
Pay attention to the access pattern: fast, but not parallel
Object storage
An API layered on top of the filesystem, mainly supporting create, read, and delete (PUT/GET/DELETE) operations.
The basic unit of storage is an object, generally a binary blob such as an image, audio, or video file.
Supports versioning and possibly redundant storage
Supports parallel reads, but is not especially fast
Common choices are AWS S3 and self-hosted Ceph
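Since object stores support parallel reads, fetching many objects concurrently is the usual way to hide per-request latency. A minimal sketch of the pattern: a real S3 client (e.g. boto3's `get_object`) would replace the local-file `read_object` below, which stands in so the example is self-contained:

```python
import concurrent.futures
import os
import tempfile

def read_object(path: str) -> bytes:
    """Stand-in for an object-store GET; reads a local file."""
    with open(path, "rb") as f:
        return f.read()

# Create a few dummy "objects" on disk
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(8):
    p = os.path.join(tmpdir, f"obj_{i}.bin")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 16)
    paths.append(p)

# Fetch all objects concurrently; order of results matches order of paths
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    blobs = list(pool.map(read_object, paths))

print(len(blobs))  # 8 objects fetched in parallel
```

Each individual GET is slow-ish, but throughput scales with the number of concurrent readers, which is exactly the trade-off the notes describe.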
Database: fast, scalable, supports indexed retrieval; persistent storage for structured data
Mental model: all data lives in RAM (swapped rapidly between memory and disk), while the database guarantees that all data and logs are persisted to the local disk.
The basic data unit is a row with a unique ID; data is organized into columns.
Not meant for storing binary data; intended for data that is accessed repeatedly
Recommend to use Postgres
It is recommended to know SQL
"Data Laker": The concept of a data lake has not been understood before. The guess is to store everything in it, and then take it out as needed.
Which type of data belongs in which storage system:
Binary data (images, audio, compressed text) are saved as objects
Metadata is stored in the database
If you need capabilities that a database cannot provide, use the data lake
The data during training either exists locally or in NFS
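The "binaries as objects, metadata in the database" split above can be sketched with the standard library's sqlite3: each database row holds the searchable fields plus the object-store key where the binary actually lives. The schema, URIs, and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        object_key  TEXT NOT NULL,  -- where the binary lives, e.g. an S3 URI
        upload_time TEXT,
        title       TEXT,
        location    TEXT
    )
""")

# Metadata goes in the row; the image bytes themselves stay in object storage
conn.execute(
    "INSERT INTO photos (object_key, upload_time, title, location)"
    " VALUES (?, ?, ?, ?)",
    ("s3://my-bucket/photos/123.jpg", "2021-03-01T12:00:00", "sunset", "Lisbon"),
)

# Query the metadata to find which object to fetch
row = conn.execute(
    "SELECT object_key FROM photos WHERE title = ?", ("sunset",)
).fetchone()
print(row[0])
```

The query returns the object key, and a separate object-store GET retrieves the actual bytes; the database never has to hold large binaries.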
Recommended materials for further study: this book
4. Versioning
How to version-control a self-built dataset
Level 0: directly stored in the file system without version control.
Strongly discouraged.
Since the data is not version controlled, the model is also not version controlled.
The accuracy of the historical model cannot be reproduced.
Level 1: take a snapshot of the data at each training run
Still strongly discouraged.
The model can be version controlled.
Level 2: data is versioned as a mixture of assets and code
Recommended way
Large raw files stay on the filesystem or object store
Training data is described via JSON or similar: only labels, sample locations, user activity, etc. are stored
The JSON files may get large, but they can be version-controlled with git.
This enables lazy materialization: the underlying data is fetched or regenerated only when needed.
The git signature (commit hash) serves as the dataset version; extra information can go in the commit message
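The Level 2 scheme above can be sketched as building a small JSON manifest: labels, sample locations, and a content hash per sample, committed to git so the commit hash identifies the dataset version. Field names and paths are illustrative, and the hash here is computed over the path as a stand-in for the real file contents so the example runs on its own:

```python
import hashlib
import json

def content_hash(data: bytes) -> str:
    """SHA-256 hex digest; in practice, hash the file's bytes."""
    return hashlib.sha256(data).hexdigest()

samples = [
    {"path": "s3://bucket/img_001.jpg", "label": "cat"},
    {"path": "s3://bucket/img_002.jpg", "label": "dog"},
]

# Attach a content hash so a changed file changes the manifest (and the commit)
for s in samples:
    s["sha256"] = content_hash(s["path"].encode())

manifest = json.dumps(samples, indent=2, sort_keys=True)
print(manifest)  # commit this file with git; the commit hash = dataset version
```

The manifest is small enough for git even when the dataset is huge, and checking out an old commit tells you exactly which files (by location and hash) made up that version.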
Level 3: Professional data version control tool
If you understand these tools, use them; if you don't, don't adopt them blindly.
Common ones include DVC, Pachyderm, Dolt, etc.
5. Processing
Data preprocessing generates training data.
Why does this need arise?
In some scenarios, a new model needs to be trained every night to predict the popularity of photos.
For each photo, the training data includes metadata (upload time, title, shooting location, etc.)
Some features come from running classifiers over the photos.
Some information needs to be read from logs.
Tasks have dependencies: some tasks rely on the output of others.
Execution can be distributed across machines.
Airflow: a workflow scheduling framework; tasks can run across multiple machines.
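The core idea behind an Airflow DAG, that each task runs only after the tasks it depends on have finished, can be sketched with the standard library's `graphlib.TopologicalSorter`. This is not Airflow code; the task names below mirror the photo-popularity example and are purely illustrative:

```python
from graphlib import TopologicalSorter

# Each key is a task; its set contains the tasks it depends on
deps = {
    "download_logs": set(),
    "extract_metadata": {"download_logs"},
    "run_classifier": {"download_logs"},
    "build_training_set": {"extract_metadata", "run_classifier"},
    "train_model": {"build_training_set"},
}

# A valid execution order: every task appears after its dependencies
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Airflow adds scheduling, retries, and distribution on top, but the dependency resolution is the same topological ordering shown here.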
Keep pipelines simple. An interesting example was given:
Tallying results stored in a terabyte-scale text file: Hadoop took 26 minutes, while a plain Linux command-line pipeline took only 18 (or 70) seconds.
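The point above can be illustrated with a minimal Unix pipeline: a single streaming pass over a text file has near-zero startup overhead, which is where the speedup over a cluster job comes from. The file and pattern below are made up for the sketch:

```shell
# Create a toy results file (stand-in for the large file in the example)
printf 'win\nloss\nwin\ndraw\nwin\n' > results.txt

# Count lines matching a result in one streaming pass
grep -c 'win' results.txt
```

For a real workload the same shape (`cat file | grep ... | awk ... | wc -l`) streams the whole file once; Hadoop only wins when the data no longer fits on one machine.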