Ethical issues throughout the stages of the AI lifecycle: data labeling

As AI becomes more widely adopted in the market and applied as a tool across a growing range of use cases, more challenges arise. AI projects keep running into a long-standing, critical issue: ethical AI and the handling of bias in data. In the early days of AI development, this problem was not obvious. Data bias occurs when an element of a dataset is over- or under-represented. Using biased data to train AI or machine learning models can lead to skewed, unfair, and inaccurate results. Appen is examining what ethical AI data looks like at each stage of the AI lifecycle. At every step of the data journey there is the potential for common errors that introduce data bias; thankfully, there are ways to avoid these pitfalls. In this series of articles, we explore data bias in the following four phases of the AI lifecycle:

  • Data collection
  • Data preparation
  • Model training and deployment
  • Model evaluation by humans

Not all datasets are created equal, but we want to help you navigate the complex issues of data ethics in the AI lifecycle so you can create the best, most useful, and most reliable datasets for your AI models.

Bias in Data Preparation

Before data can be used to train AI models, it must be readable and usable. The second stage of the AI data lifecycle is data preparation: taking a set of raw data and sorting, labeling, cleaning, and reviewing it. Appen provides customers with data preparation services such as manual labeling and AI-assisted automatic data labeling; the combination of the two delivers high-quality data with the lowest possible bias. In the data preparation stage, each piece of data is first reviewed by annotators and given labels or annotations. Depending on the data type, labeling may involve any of the following methods (a sketch of what one such annotation record might look like follows the list):

  • Add bounding boxes around objects in an image
  • Transcribe audio files
  • Translate written text from one language to another
  • Annotate text or image files
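
To make the labeling step concrete, here is a minimal, hypothetical sketch of what a single bounding-box annotation record might look like once an annotator has labeled an image. The field names and structure are illustrative assumptions, not a specific Appen or tool-vendor schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BoundingBox:
    """One labeled object in an image: pixel coordinates plus a class label."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    label: str

@dataclass
class ImageAnnotation:
    """One annotated image, stored as a single record in a labeled dataset."""
    image_path: str
    annotator_id: str
    boxes: List[BoundingBox] = field(default_factory=list)

# Hypothetical example: an annotator marks two objects in a street scene.
record = ImageAnnotation(
    image_path="images/street_001.jpg",
    annotator_id="annotator_42",
    boxes=[
        BoundingBox(34, 50, 210, 300, label="pedestrian"),
        BoundingBox(250, 80, 520, 310, label="car"),
    ],
)
print(record)
```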

Once our human annotators around the world have finished labeling the data, it moves on to the next stage of data preparation: quality assurance. The quality assurance process uses both human annotators and machine learning models to check the accuracy of the data. Data is removed from the dataset if it is not suitable for the project or if it is mislabeled. At the end of the data preparation phase, the dataset moves into the model training phase; before it can enter this stage, it must be consistent, complete, and clean. High-quality data creates high-quality AI models. Bias can be introduced into the data preparation process in a number of ways, creating ethical issues that are then carried into AI models. The most common types of data bias in data preparation include the following (a minimal sketch of the basic QA filtering step described above appears after this list):

  • Data gaps
  • Poorly trained data labelers
  • Inconsistent labeling
  • Personal bias
  • Too much or too little data
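
Before looking at each of these in turn, here is a minimal sketch of the kind of automated filtering pass a quality assurance step might apply to labeled records. The field names and the set of allowed labels are assumptions made for the example.

```python
# Hypothetical labeled records; in practice these would come from the labeling tool.
ALLOWED_LABELS = {"pedestrian", "car", "bicycle"}

records = [
    {"image_path": "images/street_001.jpg", "label": "car"},
    {"image_path": "images/street_002.jpg", "label": "dog"},   # not in scope for this project
    {"image_path": "images/street_003.jpg", "label": None},    # missing label
    {"image_path": "images/street_004.jpg", "label": "pedestrian"},
]

def passes_qa(record: dict) -> bool:
    """Keep a record only if it has a label and that label is in scope for the project."""
    return record["label"] is not None and record["label"] in ALLOWED_LABELS

clean, rejected = [], []
for record in records:
    (clean if passes_qa(record) else rejected).append(record)

print(f"kept {len(clean)} records, flagged {len(rejected)} for removal or re-labeling")
```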

There are gaps in the data

One of the most common instances of bias lurking in AI datasets is data gaps, that is, underrepresentation in the data. If certain groups or types of data are missing from a dataset, it can lead to bias in the data and in the resulting AI model's output. Common data gaps include underrepresentation of minority groups; they can also take the form of underrepresentation of certain classes of data or of rare use-case examples. Data gaps are often unintentional, so it is essential to check the data during the preparation phase to detect them. If data gaps are not addressed by adding more representative data, the gaps will remain in the data used to train AI models, and the models will, in turn, produce less accurate results.
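
One simple way to surface this kind of gap during preparation is to count how often each class or group actually appears in the labeled data. Below is a minimal sketch, assuming each record carries a class label and a group attribute; both field names and the 20% threshold are hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical labeled records; "group" stands in for any attribute whose
# representation you want to check (region, dialect, age band, and so on).
records = [
    {"label": "approved", "group": "group_a"},
    {"label": "approved", "group": "group_a"},
    {"label": "denied",   "group": "group_a"},
    {"label": "approved", "group": "group_a"},
    {"label": "denied",   "group": "group_a"},
    {"label": "approved", "group": "group_b"},
]

def representation_report(records, key, min_share=0.2):
    """Print the share of records for each value of `key`, flagging underrepresented ones."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    for value, count in counts.most_common():
        share = count / total
        flag = "  <-- possible data gap" if share < min_share else ""
        print(f"{key}={value}: {count} records ({share:.0%}){flag}")

representation_report(records, key="group")
representation_report(records, key="label")
```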

Data annotators are not well trained

Another common situation in which bias is introduced during data preparation is the use of untrained data annotators. If data labelers are undertrained and don't understand the importance of their work, there is a greater chance of labeling errors or of corners being cut during the labeling process. Providing data annotators with thorough training and ongoing supervision can limit the number of errors that occur during data preparation. During the labeling process, untrained data labelers can introduce bias in several ways, including inconsistent labeling and personal bias.

Inconsistent labeling

If multiple annotators annotate a dataset, it is important to train all annotators to be consistent in annotating each data point. When similar types of data are labeled inconsistently, recall bias occurs, leading to reduced accuracy of AI models.
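
One common way to quantify labeling consistency is an inter-annotator agreement score. Below is a minimal sketch computing Cohen's kappa for two annotators who labeled the same items; the labels and annotator data are made-up examples, and kappa is only one of several agreement measures that could be used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both annotators labeled at random,
    # each with their own observed label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same ten images.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "cat", "dog", "cat"]
annotator_2 = ["cat", "dog", "dog", "dog", "cat", "dog", "cat", "cat", "cat", "cat"]

print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```

A low score is a signal that annotators are interpreting the labeling guidelines differently and may need retraining or clearer instructions.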

personal bias

Another way data annotators introduce bias during the annotation process is by bringing in their own personal biases. Each of us has a unique set of biases and a unique understanding of the world around us. While annotators' individual perspectives can help them annotate data, they can also introduce bias. For example, when labeling images of facial expressions with emotion categories, annotators from two different countries may label the same expression differently. Such biases are inherent to data preparation, but they can be controlled through comprehensive quality assurance processes. In addition, companies can provide unconscious-bias training for data labelers to reduce the impact of personal bias on data labeling.

Use only human annotation or only machine annotation

In the past, the only way to label data was to manually examine each piece of data and annotate it with a label. More recently, machine learning programs have become capable of labeling data and creating training datasets. The debate around the two annotation methods remains fierce: which is better? Our view is that a two-pronged approach works best: use human annotators to label the data, and use machine learning programs to run quality assurance checks on those annotations. Doing so allows you to build top-quality datasets.
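
A minimal sketch of that two-pronged idea: a model's predicted labels are compared against the human-provided labels, and any disagreements are flagged for a second human review. The item names, labels, and data structures are illustrative assumptions.

```python
# Hypothetical human-provided labels and model predictions for the same items.
human_labels = {
    "img_001": "cat",
    "img_002": "dog",
    "img_003": "cat",
    "img_004": "dog",
}
model_predictions = {
    "img_001": "cat",
    "img_002": "cat",   # disagreement -> send back for human review
    "img_003": "cat",
    "img_004": "dog",
}

flagged_for_review = [
    item_id
    for item_id, human_label in human_labels.items()
    if model_predictions.get(item_id) != human_label
]

print(f"{len(flagged_for_review)} of {len(human_labels)} items flagged for re-review:")
for item_id in flagged_for_review:
    print(f"  {item_id}: human={human_labels[item_id]}, model={model_predictions[item_id]}")
```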

Too much or too little data

Another important thing to consider when evaluating data during preparation is making sure you have the right amount of it: there may be too little training data, or too much. If there is too little training data, the algorithm cannot learn the patterns in the data; this is called underfitting. If there is too much training data, and especially too much noisy or redundant data, the model's output can become inaccurate because it cannot tell which parts are noise and which are signal; fitting the noise in this way is called overfitting. Creating a dataset of the right size for an AI model improves the quality of the model's output (a sketch of how to spot under- and overfitting appears at the end of this section).

Excluding "irrelevant" data

During data preparation, it is important to examine the data carefully and remove anything that is not suitable for the future model. Always double-check before deleting data, because data that initially seems "irrelevant" to someone may not actually be irrelevant. Removing "insignificant" data at this stage can lead to exclusion bias: just because a part of a dataset is small or uncommon does not mean it is unimportant.
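
Returning to the dataset-size point above, here is a minimal sketch of one way to spot under- or overfitting: compare a model's accuracy on its training data with its accuracy on held-out validation data. The dataset, model, and thresholds below are toy stand-ins, not a recommendation of any particular tool.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for a prepared training set (sizes are arbitrary).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)

# A large gap between training and validation accuracy suggests overfitting;
# low accuracy on both suggests underfitting. The thresholds here are arbitrary.
if train_acc - val_acc > 0.15:
    print(f"Possible overfitting: train={train_acc:.2f}, val={val_acc:.2f}")
elif val_acc < 0.60:
    print(f"Possible underfitting: train={train_acc:.2f}, val={val_acc:.2f}")
else:
    print(f"Fit looks reasonable: train={train_acc:.2f}, val={val_acc:.2f}")
```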

Solutions to the problem of bias in data preparation

While there are various ways in which bias can be introduced into a dataset during data preparation, there are many solutions. Below are some ways you can avoid bias during data preparation.

Hire a diverse and representative workforce

One of the most important ways to remove bias from the data preparation process is to ensure broad representation among decision makers and participants. Hiring a diverse workforce goes a long way toward reducing bias in AI training datasets, and companies can go a step further by providing unconscious-bias training to all employees. Unconscious-bias training helps employees better identify their own personal biases and consciously look for them in labeled data.

Adding Bias Checks to the Quality Assurance Process

If there were only one thing that could be done to reduce bias in data preparation, it would be to add bias checks to the quality assurance process. Most bias is unintentional: it creeps into the data because no one is aware of it, or no one thinks to look for it. Making bias checks an explicit part of the quality assurance process reminds employees to actively look for bias in the data and to think critically about what the data should and should not represent. Providing employees with unconscious-bias training makes it easier for them to find and remove bias during data preparation.
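
What such a check looks like in practice depends on the project, but one simple, illustrative example is to compare how labels are distributed across groups in the data and flag large disparities for human review. The field names, labels, and threshold below are assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical labeled records; "group" is whatever attribute is being audited.
records = [
    {"group": "group_a", "label": "approved"},
    {"group": "group_a", "label": "approved"},
    {"group": "group_a", "label": "denied"},
    {"group": "group_b", "label": "denied"},
    {"group": "group_b", "label": "denied"},
    {"group": "group_b", "label": "approved"},
]

def positive_rate_by_group(records, positive_label="approved"):
    """Share of records carrying the positive label, broken down by group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        positives[r["group"]] += r["label"] == positive_label
    return {g: positives[g] / totals[g] for g in totals}

rates = positive_rate_by_group(records)
print({g: round(v, 2) for g, v in rates.items()})  # {'group_a': 0.67, 'group_b': 0.33}

# Flag the dataset for human review if the gap between groups exceeds a chosen threshold.
if max(rates.values()) - min(rates.values()) > 0.2:
    print("Label distribution differs noticeably across groups; flag for bias review.")
```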

Provide good compensation and fair treatment for labelers

Bias is pervasive in AI data, and identifying data gaps requires a keen eye and thorough training. One simple way for companies to address bias in AI training datasets is to ensure that their data annotators are well paid and treated fairly. Employees who are paid well and treated well are far more likely to return high-quality work. Essentially, ethical AI begins with the people who annotate and clean the data used to train AI models; when they are not paid fairly for their work, the chances of bias spreading are greater. To build a more ethical world of AI models, we need to go back to the beginning and start with the data. The AI lifecycle includes four data processing stages, all of which have the potential to introduce bias into training datasets. During the data preparation phase, it is critical to have well-trained, well-paid staff who can identify unconscious biases and help remove as many of them as possible.
