CS194 Full Stack Deep Learning (3): Data Management

0. Preface


1. Sources

  • Where does training data come from?
    • Most deep learning tasks today rely on large amounts of labeled data.
    • There are exceptions, such as RL, GANs, and semi-supervised learning, but these techniques are not yet widely used in industry.
  • Common data sources:
    • Public datasets: since everyone can use them, they provide no competitive advantage, but they are a good starting point for a project.
    • Labeling your own data usually costs money or time.
    • Data flywheel: after the application is launched, continuously collect data (design the product so that users help with annotation), label it, and improve the model.
    • Semi-supervised learning
    • Data augmentation (see the sketch after this list)
    • Simulated data (common in robotics and autonomous driving)
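
As a concrete illustration of data augmentation, here is a minimal sketch using torchvision. The PyTorch/torchvision stack, the CIFAR-10 dataset, and the specific transforms are assumptions made for the example, not something prescribed by the lecture.

```python
# Minimal data-augmentation sketch (assumes PyTorch + torchvision; the
# dataset and transform choices are illustrative, not from the lecture).
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

# Random crops, flips, and color jitter are common label-preserving
# transforms that effectively enlarge a small labeled dataset.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])

train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
```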

2. Labeling

  • User interface (the basic feature set of annotation tools)
    • Typically supports bounding boxes, segmentation masks, keypoints, and 3D cuboids
    • Typically supports complex class hierarchies and attributes
    • Training the annotators is essential to ensure annotation quality
  • Who does the labeling
    • Hiring annotators yourself (part-time): expensive, hard to find people, and requires management, but gives high quality, security, and speed (once the team is stable)
    • Crowdsourcing (e.g. Mechanical Turk): cheap and easy to scale, but less secure and requires a lot of quality-control work (see the quality-check sketch at the end of this section)
    • Hiring a data-labeling company. When evaluating companies:
      • Label some data yourself first to serve as a gold standard.
      • Contact several companies and compare their labeling samples on the same type of data.
      • Make sure both sides share the same understanding of the labeling guidelines.
      • Full-service outsourcing is more expensive; some companies only provide the annotation software (without the labor).
  • Suggestions:
    • Outsource to a full-service company if you can afford it
    • If the budget does not allow that, at least use existing annotation software
    • Hiring part-time annotators yourself usually works better than trying to make crowdsourcing work
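
To make the gold-standard idea concrete, here is a minimal sketch of checking external annotations against a small set you labeled yourself. The file names and the JSON label format are hypothetical.

```python
# Quality-check sketch: per-annotator accuracy on gold-standard samples.
# The file names and {"sample_id", "annotator", "label"} schema are hypothetical.
import json
from collections import defaultdict

def accuracy_vs_gold(annotations_path: str, gold_path: str) -> dict:
    """Return each annotator's accuracy on samples that also appear in the gold set."""
    with open(gold_path) as f:
        gold = {row["sample_id"]: row["label"] for row in json.load(f)}

    correct = defaultdict(int)
    total = defaultdict(int)
    with open(annotations_path) as f:
        for row in json.load(f):
            sample_id, annotator = row["sample_id"], row["annotator"]
            if sample_id in gold:
                total[annotator] += 1
                correct[annotator] += int(row["label"] == gold[sample_id])

    return {a: correct[a] / total[a] for a in total}
```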

3. Storage

  • How should data be stored? The main options are: file system, object storage, database, and data lake.
    • File system: the foundation of the storage stack
      • The basic unit is a file, text or binary; files have no version control and are easily overwritten.
      • The simplest form is an ordinary local hard drive.
      • Can be networked (NFS): data is shared between multiple machines on a LAN.
      • Can be distributed, e.g. HDFS.
      • Pay attention to the access pattern: fast, but not built for parallel access.
    • Object storage
      • An API on top of the file system: essentially adding, deleting, and retrieving objects (PUT/GET/DELETE).
      • The basic unit is an object, usually a binary file such as an image, audio clip, or video.
      • Supports versioning and often redundant storage.
      • Supports parallel reads, but is not especially fast.
      • Common choices: AWS S3 (cloud) and Ceph (on-premises).
    • Database: fast, scalable, indexed, persistent storage of structured data
      • Mental model: all the data lives in RAM (swapped between memory and disk as needed), while the database software guarantees that everything is logged and persisted to the local disk.
      • The basic unit is a row with a unique ID; attributes are organized into columns, and rows can reference each other.
      • Not meant for binary data; suited to data that is queried again and again.
      • Recommended: PostgreSQL.
      • Learning SQL is recommended.
    • "Data Laker": The concept of a data lake has not been understood before. The guess is to store everything in it, and then take it out as needed.
  • What goes where (see the sketch after this list):
    • Binary data (images, audio, compressed text) is stored as objects.
    • Metadata is stored in the database.
    • If you need other data or features that cannot live in the database, use a data lake.
    • At training time, copy the data you need onto a local disk or NFS.
  • Recommended further reading: the book linked in the original lecture notes.
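
A minimal sketch of the recommended split, assuming boto3 for S3 and psycopg2 for PostgreSQL; the bucket, table, and column names are hypothetical.

```python
# Store the binary blob in object storage (S3) and the frequently queried
# metadata in PostgreSQL. Bucket/table/column names are hypothetical.
import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect("dbname=ml user=ml")

def store_image(image_id: str, image_bytes: bytes, label: str) -> None:
    key = f"images/{image_id}.jpg"
    # Binary data is written once to object storage...
    s3.put_object(Bucket="my-training-data", Key=key, Body=image_bytes)
    # ...while the metadata row (label + pointer to the object) goes in the database.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO images (image_id, s3_key, label) VALUES (%s, %s, %s)",
            (image_id, key, label),
        )
```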

4. Versioning

  • How to version control datasets you build yourself
  • Level 0: data lives on the file system with no version control.
    • Strongly discouraged.
    • Since the data is not versioned, the models trained on it are effectively unversioned as well.
    • Results of historical models cannot be reproduced.
  • Level 1: each training run saves a snapshot of the data.
    • Still not recommended.
    • At least the model can be tied to a data version.
  • Level 2: version data as a mix of assets and code (see the sketch after this list)
    • The recommended approach.
    • Large raw files stay in the file system or object storage.
    • The training data itself is described by JSON (or similar) files that store only the labels, pointers to the raw samples, user activity, and so on.
    • These JSON files may be large, but they can still be versioned with git.
    • Lazy materialization becomes possible: the actual dataset is assembled only when it is needed.
    • The git commit hash is the version of the dataset; additional context can be recorded in the commit message.
  • Level 3: specialized data version control tools
    • If you understand these tools and need them, use them; if not, do not add the complexity.
    • Common ones include DVC, Pachyderm, and Dolt.
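
A minimal sketch of the Level 2 idea: only a small JSON-lines manifest (label plus a pointer to the raw asset) is committed to git, and the heavy data is fetched lazily when training actually needs it. The manifest path, bucket name, and record schema are hypothetical; S3 access via boto3 is an assumption.

```python
# "Level 2" versioning sketch: git versions the small manifest file,
# not the raw assets. All names and the record schema are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

def load_manifest(path: str = "data/train_manifest.jsonl"):
    """Each line: {"s3_key": "...", "label": "..."} -- this file is what git versions."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def fetch_sample(entry: dict, bucket: str = "my-training-data") -> bytes:
    """Lazily materialize the raw bytes only when a sample is actually needed."""
    obj = s3.get_object(Bucket=bucket, Key=entry["s3_key"])
    return obj["Body"].read()

# The git commit hash of the manifest acts as the dataset version;
# extra context (labeling round, filters applied) can go in the commit message.
```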

5. Processing

  • Data processing turns raw data into training data.
  • Why this step is needed (motivating example from the lecture):
    • Suppose a new model must be trained every night to predict the popularity of photos.
    • For each photo, the training data includes metadata (upload time, title, shooting location, etc.).
    • Some features come from running classifiers over the photos.
    • Some features must be extracted from logs.
  • Tasks depend on one another: some cannot start until others have finished.
  • The work can be distributed across multiple machines.
  • Airflow: a workflow scheduling framework that expresses tasks as a DAG and can run them across machines (see the sketch at the end of this section).
  • Keep it as simple as possible. The lecture gives an interesting example:
    • computing a simple aggregation over roughly a terabyte of text files;
    • Hadoop took 26 minutes, while a plain Linux command-line pipeline took only 18 to 70 seconds.
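
A minimal Airflow DAG sketch for the nightly photo-popularity example above. It assumes Airflow 2.x; the task names and bodies are placeholders, not the lecture's actual pipeline.

```python
# Nightly photo-popularity pipeline sketch (assumes Airflow 2.x;
# task names and contents are hypothetical placeholders).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def collect_metadata(): ...   # placeholder: pull upload time, title, location
def run_classifiers(): ...    # placeholder: compute content/style features
def parse_logs(): ...         # placeholder: extract user-activity features
def train_model(): ...        # placeholder: train tonight's popularity model

with DAG(
    dag_id="nightly_photo_popularity",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_meta = PythonOperator(task_id="collect_metadata", python_callable=collect_metadata)
    t_clf = PythonOperator(task_id="run_classifiers", python_callable=run_classifiers)
    t_logs = PythonOperator(task_id="parse_logs", python_callable=parse_logs)
    t_train = PythonOperator(task_id="train_model", python_callable=train_model)

    # Training depends on all three feature-building tasks finishing first.
    [t_meta, t_clf, t_logs] >> t_train
```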

Origin blog.csdn.net/irving512/article/details/114721417