How to develop a data acquisition solution?

Data acquisition remains a major bottleneck for teams building artificial intelligence (AI). The reasons vary: there may be insufficient data for the use case, newer machine learning (ML) techniques such as deep learning may require more data, or the team may not have established a proper process for obtaining the required data. Regardless, the need for accurate and scalable data solutions continues to grow.

 

Best practices for high-quality data collection

As an AI practitioner, you need to ask the right questions when planning your data acquisition.

What kind of data do I need?

The problem you choose to solve determines the type of data you need. For example, for a speech recognition model, you need to collect speech data that is representative of all expected customers: voice data that covers all of the languages, accents, ages, and other characteristics of your target customers.

Where can I get the data from?

First, understand what data is already available internally and whether it can be used to solve the problem at hand. If you need more data, there are publicly available online datasets, or you can work with data partners to generate data through crowdsourcing. Synthetic data can also be created to fill gaps in a dataset. Keep in mind that you will need a stable data source long after the model goes into production, so that data can continue to flow in for ongoing model optimization.

How much data do I need?

The amount of data depends on the problem you want to solve and on your budget, but in general, more data is better. When you first start building a machine learning model, you usually won't have much data, so make sure you collect enough to cover all of your model's potential use cases, including edge cases.

How can I ensure my data is of high quality?

Before using a dataset to train a model, clean it first. That is, the first step is to remove irrelevant or incomplete data (after checking that you really don't need it). Next, label the data accurately. Many companies turn to crowdsourcing to gain access to large numbers of annotators; the more diverse the annotators, the more inclusive the annotations. If your data labeling requires domain-specific knowledge, find experts in that field to label your data. Once you have clear answers to the questions above, you can start building data pipelines that let you efficiently collect high-quality, accurately labeled data. Finally, make your data pipelines repeatable and consistent to help you scale.
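The first cleaning step described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline; the record fields ("text", "speaker_age") are hypothetical examples:

```python
# Minimal sketch: drop incomplete records before sending the rest
# for annotation. Field names here are hypothetical.
records = [
    {"text": "turn on the lights", "speaker_age": 34},
    {"text": "", "speaker_age": 52},                   # incomplete: empty text
    {"text": "play some jazz", "speaker_age": None},   # incomplete: missing age
    {"text": "set a timer", "speaker_age": 27},
]

def is_complete(record):
    """Keep only records where every field has a usable value."""
    return all(value not in (None, "") for value in record.values())

clean = [r for r in records if is_complete(r)]
print(len(clean))  # 2 records survive cleaning
```

A real pipeline would also deduplicate records and validate field ranges, but the principle is the same: filter before you label, so annotator time is not wasted on unusable data.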

Where responsible AI comes in

Approach data collection from a responsible AI perspective, because building ethical AI starts with data. Clean data provenance should be a top priority, which means sourcing your data ethically. This is especially true when dealing with sensitive and confidential information, such as medical or financial records. Comply with the data protection regulations in your region and industry, and when selecting data partners, verify that they comply with these regulations as well. Your data partners should work with you to develop security protocols that ensure customer data is treated respectfully and responsibly.

Expert Insights from David Brudenell, Vice President, Solutions and Advanced Research Group

Inclusion is better than bias

Over the past 18 months, Appen has seen a dramatic shift in the way customers interact with us. As AI continues to evolve and become more ubiquitous, gaps in how it is built have become apparent. Training data plays an important role in reducing bias in AI. We recommend that customers assemble a representative and inclusive group of annotators to collect data and build faster, better, and more cost-effective AI models. Since almost all training data is collected by humans, we recommend that customers put inclusiveness first when designing samples. This increases the workload and the number of experimental designs, but the return on investment improves significantly compared to a simple sample design. In short, you get more diverse and accurate ML/AI models that serve more specific demographics, and in the long run this is much better, and cheaper, than trying to "fill in the blanks" by removing bias from an ML/AI model after the fact.

Prioritize users

A well-designed data acquisition scheme has several components. While an inclusive sample framework is fundamental, the key to driving throughput and data quality is bringing a user-centric approach to the entire engagement: project invitations, qualification, onboarding (including trust and safety), and the experiment experience itself. Too often, teams forget about the people who complete these projects. If you do, you will end up with poor uptake and poor data because of a badly written experiment and a bad user experience. When designing experiments and user flows, ask yourself whether you would be willing to do the work. Also, always test your experiment end-to-end yourself. If you get stuck or the results aren't what you expected, make improvements.

Interlocking quotas - from 6 to 60,000

Take the US Census as an example and build an experiment around 6 data points: age, gender, state, race, and cell phone ownership. You could have over 60,000 quotas to manage. This is the effect of interlocking quotas: the number of interviews/participants required in an experiment falls into cells defined by multiple characteristics at once. Using the US Census example above, one cell might require n users with the following characteristics: male, 55+, Wyoming, African American, owns a 2021-generation Android smartphone. This is an extreme, low-incidence example, but by building your own interlocking matrix before pricing, writing an experiment, or going into the field, you can uncover hard-to-fill or nonsensical combinations of characteristics that could impact a project's success.
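The quota explosion above is just a product of category counts: every combination of characteristics is one cell. Here is a sketch with hypothetical category counts (the real census categories differ; these values are chosen only to show how quickly the product approaches 60,000):

```python
from math import prod

# Hypothetical category counts per quota characteristic
# (illustrative only -- real census breakdowns differ).
characteristics = {
    "age_bracket": 7,       # e.g. 18-24, 25-34, ...
    "gender": 2,
    "state": 51,            # 50 states + DC
    "race": 6,
    "phone_ownership": 14,  # device brand / model-year buckets
}

# Each interlocking quota cell is one combination of all
# characteristics, so the cell count is the product.
cells = prod(characteristics.values())
print(cells)  # 7 * 2 * 51 * 6 * 14 = 59976
```

Enumerating the cells this way before fieldwork is exactly the "interlocking matrix" step: low-incidence or impossible cells show up as soon as you look at the combinations rather than the characteristics in isolation.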

Compensation matters more than ever

Finally, and most importantly, review what you pay users for completing experiments. It's common to weigh commercial considerations when designing data collection experiments, but don't do so by cutting user incentives. Users are the most important part of the team providing you with suitable, high-quality data. If you choose to pay your users less, you'll end up with lower uptake and poor data, and you'll pay more in the long run. If you're on a tight budget, look into global purchasing power parity (PPP): can your money go further in other regions of the world? Or reduce your quota requirements: can you combine the 24-40 year olds into one group instead of two? These are just some of the approaches you can take to get the most business value from your project.

Originally published at blog.csdn.net/Appen_China/article/details/131944556