Manage AI datasets like Git! Free data-hosting tools are now online, letting models train on "live" datasets...

Guhao Nan, from Aofei Temple
Qubit report | Public account QbitAI

Many people working in machine learning suffer from data-management pain.

Either the dataset is outdated and has to be revised by hand.

Or there are many versions of the same dataset, each adjusted by someone else, and there is no way to tell where to start.

Or there is simply no suitable dataset at all, and you have to build one yourself.

Overseas, a programmer named Simon Lousky finally had enough and put together a set of data version control tooling (Data Version Control, DVC) for machine learning.

Pull a dataset with one command, view its edit history with another... Most importantly, behind the DVC tool sits a data-hosting community much like GitHub.

Bringing datasets "to life"

Back when Simon Lousky was working on projects as a student, he already felt the pain of managing machine learning datasets.

At the time, his model needed a dataset of plants and flowers for training, and no open-source dataset produced reasonable results no matter what he tried.

So he spent several hours correcting, one by one, a large number of outdated and unreasonable annotations in the dataset, and the training results turned out very satisfactory.

Beyond that project, he later did plenty of dataset revision, augmentation, and creation. He called these time-consuming, laborious processes "dataset trial and error" and began deliberately recording his operation history.

He gradually noticed that in his projects the data management was always a mess, while the code, hosted on GitHub, always stayed organized.

So why not build a tool similar to GitHub that specializes in data management?

DVC was born.

DVC is an installable tool library that implements functions such as pulling datasets and viewing the history of operations on them.

Its arrival changes the old way of training models on local, "dead" datasets.

You can link a project to a dataset (or any file) hosted online, establishing a live, accurate connection. Any update or change to the dataset can be picked up in time, which keeps the project moving.

For example, suppose there is a Repository A: a "live" dataset in which small metadata files point to the real large files stored on a dedicated server.
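
The mechanism behind those metadata files is content-addressable storage: version control tracks only a small checksum stub, while the payload lives in a cache keyed by that checksum. Here is a minimal sketch of the idea in Python; this is an illustration of the concept, not DVC's actual code, and all names are invented:

```python
import hashlib
import tempfile
from pathlib import Path

def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def cache_file(path: Path, cache_dir: Path) -> str:
    """Copy a file into an MD5-addressed cache (similar in spirit to
    .dvc/cache) and return the checksum recorded in the metadata stub."""
    digest = file_md5(path)
    # Shard by the first two hex characters to keep directories small.
    dest = cache_dir / digest[:2] / digest[2:]
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(path.read_bytes())
    return digest

# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "annotations.json"
    data.write_text('{"label": "flower"}')
    digest = cache_file(data, root / "cache")
    # Git would track only this 32-character checksum;
    # the payload itself sits in the cache.
    print(len(digest))  # 32
```

Restoring a file is the reverse lookup: given the checksum from the stub, copy the cached blob back into the working directory.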

Users can organize the dataset files into directories and add code files with utility functions to make the data easy to consume.

Alongside it there is a Repository B, corresponding to a machine learning project. The project code contains instructions that import the dataset via DVC.

As long as a data registry is created, the connection between A and B can be established:

mkdir my-dataset && cd my-dataset
git init
dvc init

At this point, the dataset directory contains only the housekeeping files created by Git and DVC (a `.git/` directory, a `.dvc/` directory, and a `.dvcignore` file); the dataset itself, say `annotations/` and `images/` directories, is copied in alongside them.

To start tracking the dataset files, enter the commands:

dvc add annotations
dvc add images
git add . && git commit -m "Starting to manage my dataset"

For each tracked directory, DVC writes a small metadata file (`annotations.dvc`, `images.dvc`) that Git commits in place of the data, while the data itself is moved into DVC's local cache.
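
For reference, the metadata file that `dvc add` produces is a small YAML stub along these lines (the checksum and sizes here are invented placeholders; a `.dir` suffix marks a directory checksum):

```yaml
# annotations.dvc, committed to Git in place of the data itself
outs:
- md5: 3863d0e317d9dbe4e4f3f0e3b9e9f6a1.dir
  size: 1048576
  nfiles: 120
  path: annotations
```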

Then users only need to push the code and data to a hosted repository, after which they can access it from anywhere and share it with others.
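
Pushing requires a DVC remote. With DAGsHub, the remote URL is typically the repository URL with a `.dvc` suffix; after running `dvc remote add`, the project's `.dvc/config` would contain something like the following (the user and repository names are placeholders):

```ini
['remote "origin"']
    url = https://dagshub.com/<user>/my-dataset.dvc
[core]
    remote = origin
```

With that in place, `git push` uploads the metadata files and `dvc push` uploads the data itself.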

Of course, for DVC to shine, the DAGsHub platform behind it is indispensable.

DAGsHub is a data-management counterpart of GitHub, composed of three parts: a Git repository, DVC, and the machine learning workflow platform MLflow.

Users can submit their own projects; DAGsHub automatically scans each submission, extracts useful information such as experiment parameters, data files, and model links, and combines it all into a simple interface.

On DAGsHub you can browse and compare code, data, models, and experiments without downloading anything.

In addition, it can generate visual data pipelines, keep a history of data operations, and record model performance, all automatically and with a clean presentation.

How to use "live" datasets in machine learning projects

To use DAGsHub, just register and log in.

Install DVC with the following command:

pip3 install dvc

Suppose you find a dataset on DAGsHub; how do you use it in your own model?

First, import directories from the hosted dataset and treat them as the raw files:

mkdir -p data/raw
dvc import -o data/raw/images \
    https://dagshub.com/Simon/baby-yoda-segmentation-dataset \
    data/images
dvc import -o data/raw/annotations \
    https://dagshub.com/Simon/baby-yoda-segmentation-dataset \
    data/annotations

The images and annotations are then downloaded into your own project, with their history information preserved.
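
The history is preserved because each `dvc import` writes a stub that records exactly where the data came from and which revision was used. It looks roughly like this (the checksums and the pinned commit hash below are invented placeholders):

```yaml
# data/raw/images.dvc
md5: 1f2e3d4c5b6a79881726354493827160
frozen: true
deps:
- path: data/images
  repo:
    url: https://dagshub.com/Simon/baby-yoda-segmentation-dataset
    rev_lock: 0123456789abcdef0123456789abcdef01234567
outs:
- md5: 9a8b7c6d5e4f30211203f4e5d6c7b8a9.dir
  path: images
```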

When the upstream dataset changes and you want to pull in the updates, just run:

dvc update data/raw/images.dvc data/raw/annotations.dvc

The updated results are saved back to the project's default directories.

Convenient, right?

By the way, DVC is open source and DAGsHub is free to use, so come give them a try.

Portal:

DVC tutorial: https://dagshub.com/docs/experiment-tutorial/2-data-versioning/
DAGsHub homepage: https://dagshub.com/

- End -

This article is original content from the NetEase News / NetEase Account Featured Content Incentive Program account [Qubit]. Unauthorized reproduction is prohibited.


Origin blog.csdn.net/QbitAI/article/details/112914030