Small data vs. big data: actionable data for AI

In the context of artificial intelligence, you have probably heard the buzzword "big data", but what about the term "small data"? Whether you have heard it or not, small data is everywhere: online shopping experiences, airline recommendations, weather forecasts, and more all rely on it. Small data is data in an accessible, actionable format that is easy for humans to understand, and it is typically what a data scientist analyzes to assess the status quo. In machine learning (ML), the use of small data is growing, most likely because data in general has become more available and new data mining techniques are being explored. As AI spreads into all walks of life, data scientists are paying more and more attention to small data, because it requires only modest computing power and is easy to work with.

 

Small Data vs. Big Data

How exactly is small data different from big data? Big data consists of large volumes of structured and unstructured data; it is huge in size, harder to understand and analyze than small data, and requires substantial processing power to interpret. Small data, by contrast, provides companies with actionable insights without the complex algorithms that big data analysis demands, so companies do not need to invest as much in the data mining process. Big data can be transformed into small data by applying algorithms that break it into actionable chunks, each of which is an integral part of the larger data set. A typical example is social media monitoring during a brand launch: tons of social media posts appear online every second, so data scientists filter the data by publishing platform, time period, keywords, or other relevant characteristics. This process turns big data into more manageable chunks from which relevant insights can be derived, as sketched below.
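As a rough illustration of that filtering step, here is a minimal Python sketch. The post records, field names, and keyword are all hypothetical; a real pipeline would stream posts from the platforms' APIs.

```python
from datetime import datetime

# Hypothetical post records; a real pipeline would stream these from the platforms' APIs.
posts = [
    {"platform": "twitter", "text": "Loving the new BrandX phone!", "ts": "2023-08-01T10:15:00"},
    {"platform": "reddit",  "text": "BrandX launch event thoughts", "ts": "2023-08-01T11:02:00"},
    {"platform": "twitter", "text": "Unrelated post about the weather", "ts": "2023-08-02T09:00:00"},
]

KEYWORDS = {"brandx"}                                  # brand name being monitored (made up)
START, END = datetime(2023, 8, 1), datetime(2023, 8, 2)

def is_relevant(post):
    """Keep posts from the launch window, on the chosen platform, that mention the brand."""
    ts = datetime.fromisoformat(post["ts"])
    mentions_brand = any(k in post["text"].lower() for k in KEYWORDS)
    return START <= ts < END and post["platform"] == "twitter" and mentions_brand

small_data = [p for p in posts if is_relevant(p)]
print(small_data)   # only the first post survives the filter
```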

Advantages of small data

We touched on the advantages of small data over big data above, but a few points are worth emphasizing.

- Big data is hard to manage: using big data at scale is a daunting task, and analyzing it requires powerful computing capability.
- Small data is easy to manage: analyzing small blocks of data is efficient and does not demand much time or effort, which makes small data more actionable than big data.
- Small data is everywhere: it is already widely used across many industries. Social media, for example, provides a wealth of actionable data that can be used for marketing and other purposes.
- Small data focuses on end users: with small data, researchers can put user needs first and use the data to explain what motivates end-user behavior.

In many application scenarios, small data is a fast and effective form of analysis that helps us gain a deep understanding of customers across industries.

Small data processing methods in machine learning

Supervised learning is the most traditional machine learning approach: a model is trained on a large amount of labeled training data. Beyond that, however, there are many other ways to train a model, and several of them are cost-effective, less time-consuming, and increasingly popular. These methods often rely on small data, which makes data quality critical. Data scientists turn to small data when a model only needs a small amount of data or when there is not enough data to train on. In those situations, any of the following machine learning techniques can be used.

Few-shot learning

With few-shot learning, data scientists give a machine learning model only a small amount of training data. The technique is often applied in computer vision, where a model may not need many examples to recognize an object. For example, a facial recognition algorithm that unlocks your smartphone does not need thousands of photos of you; a few photos are enough to enable the security feature. Few-shot learning is low-cost and low-effort, and it is a good fit when a model would otherwise be fully supervised but the training data is insufficient.
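As a minimal sketch of the idea (not a production face-unlock system), the snippet below classifies a query by comparing it to class "prototypes" averaged from a few enrollment examples. The `embed` function is a placeholder for a real pretrained feature extractor, so the prediction itself is arbitrary here.

```python
import numpy as np

# Prototype-based few-shot classification sketch. `embed` stands in for a real
# pretrained feature extractor (e.g. a face-embedding network); it returns
# random vectors just to show the mechanics.
rng = np.random.default_rng(0)

def embed(image):
    return rng.normal(size=128)            # placeholder embedding

# "Support set": a handful of enrollment photos per identity.
support = {
    "owner": [embed(f"owner_photo_{i}") for i in range(3)],
    "other": [embed(f"other_photo_{i}") for i in range(3)],
}

# One prototype per class: the mean embedding of its few examples.
prototypes = {label: np.mean(vectors, axis=0) for label, vectors in support.items()}

def classify(image):
    query = embed(image)
    return min(prototypes, key=lambda label: np.linalg.norm(query - prototypes[label]))

print(classify("new_photo"))               # nearest prototype wins
```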

Knowledge graphs

A knowledge graph is a kind of secondary dataset, because it is produced by filtering and distilling original big data. It consists of a set of data points, or labels, that have defined meanings and describe a specific domain. For example, a knowledge graph might contain data points for the names of famous actresses, with lines (edges) connecting actresses who have worked together. Knowledge graphs are a very useful tool for organizing knowledge in a highly interpretable and reusable way.
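A toy version of that actress graph can be sketched with the third-party `networkx` library; the names and pairings below are invented purely for illustration.

```python
import networkx as nx   # third-party: pip install networkx

# Tiny illustrative knowledge graph: nodes are actresses, an edge means
# "worked together". Names and pairings are invented for the example.
kg = nx.Graph()
kg.add_edges_from([
    ("Actress A", "Actress B"),
    ("Actress B", "Actress C"),
    ("Actress A", "Actress D"),
])

# Direct collaborators of Actress A.
print(sorted(kg.neighbors("Actress A")))           # ['Actress B', 'Actress D']

# Indirect connections can be read off the graph just as easily.
print(nx.shortest_path(kg, "Actress D", "Actress C"))
```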

Transfer learning

Transfer learning is used when one machine learning model serves as the starting point for training another model on a related task. Essentially, knowledge is transferred from one model to another: starting from the original model, training continues with additional data so the model can handle the new task, and parts of the original model that the new task does not need can be removed. Transfer learning is particularly effective in fields that require large amounts of computing power and data, such as natural language processing and computer vision, and it can reduce the effort and time a task requires.
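Here is a hedged sketch of the idea using PyTorch and torchvision (assuming a recent torchvision version with the `weights=` API): an ImageNet-pretrained ResNet-18 backbone is frozen and only a new two-class head is trained, so a small dataset can suffice. The dummy tensors stand in for real images and labels.

```python
import torch
import torch.nn as nn
from torchvision import models   # assumes a recent torchvision with the `weights=` API

# Start from an ImageNet-pretrained ResNet-18, freeze the backbone, and train only
# a new two-class head (e.g. defective vs. non-defective parts).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                      # keep the transferred knowledge fixed

model.fc = nn.Linear(model.fc.in_features, 2)        # new task-specific head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step with dummy tensors standing in for real images/labels.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```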

Self-supervised learning

The principle of self-supervised learning is to let the model derive supervisory signals from the data it already has: the model uses observed data to predict unobserved or hidden data. In natural language processing, for example, a data scientist might feed the model a sentence with a missing word and ask it to predict that word. After getting enough context clues from the unhidden words, the model learns to recognize hidden words in sentences.
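The snippet below is a minimal illustration of how such self-supervised training pairs are created: the target label is simply the word that was hidden, so no human annotation is needed. The tiny corpus is made up for the example, and no actual model is trained here.

```python
import random

# The supervision signal comes from the data itself: hide one word per sentence
# and treat the hidden word as the training target. No human labels are needed.
corpus = [
    "small data is easy to act on",
    "the model predicts the missing word from context",
]

def make_masked_example(sentence, rng):
    tokens = sentence.split()
    i = rng.randrange(len(tokens))
    target = tokens[i]
    tokens[i] = "[MASK]"
    return " ".join(tokens), target

rng = random.Random(42)
training_pairs = [make_masked_example(s, rng) for s in corpus]
for masked, target in training_pairs:
    print(f"input: {masked!r} -> target: {target!r}")
```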

Synthetic data

Synthetic data can be exploited when a dataset has gaps that existing data cannot fill. A common example is facial recognition: such models need facial images covering the full range of human skin tones, but there are fewer photos of darker-skinned faces than of lighter-skinned ones. Rather than shipping a model that struggles to recognize darker faces, data scientists can artificially create data for darker faces to balance their representation. Machine learning experts must still test these models thoroughly in the real world and add training data wherever the computer-generated datasets fall short.

The methods mentioned in this article are not exhaustive, but they show how promising machine learning is in many directions. In general, data scientists are relying less on purely supervised techniques and experimenting more with methods that work on small data.
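To make the synthetic-data idea above concrete, here is a minimal sketch that balances an under-represented class by generating simple augmented variants (mirror flips and brightness shifts) of existing images. Real projects often use generative models instead; the random arrays below are stand-ins for real photos.

```python
import numpy as np

# Balance an under-represented class with simple synthetic variants of existing images.
# Real projects often use generative models; the random arrays below stand in for photos.
rng = np.random.default_rng(0)

def synthesize(image):
    """Create a plausible new sample from an existing one (mirror + lighting shift)."""
    out = np.fliplr(image)                                   # horizontal mirror
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 255)       # vary brightness
    return out.astype(np.uint8)

minority = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
target_count = 100                                           # size of the majority class

synthetic = []
while len(minority) + len(synthetic) < target_count:
    base = minority[rng.integers(len(minority))]
    synthetic.append(synthesize(base))

print(len(minority) + len(synthetic))                        # 100: classes now balanced
```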

Professional Insights from Data Science Director Rahul Parundekar

It is important to clarify that the "small" in small data does not simply mean a small quantity of data. Small data means using the types of data that actually meet your requirements to build models that generate business insights and automate decision-making. We often see inflated expectations for AI, such as expecting a high-quality model after collecting only a handful of images, but that is not what we are discussing here. The point is to find the data best suited for building the model, so that once deployed, the model produces the correct output for your needs. Here are some things to keep in mind when creating "small" datasets:

Data dependency

Be clear about which types of data make up the dataset, and select the right data. Your dataset should contain only the kinds of data the model will be exposed to in practice (that is, in production). For example, if you are detecting defects in products on a production conveyor line, prepare a dataset of images captured by the line camera that includes defective parts, non-defective parts, and an empty conveyor belt, since that is what the camera will actually see in production.
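As a small illustration, the sketch below filters a hypothetical image manifest so that only production-like images from the line camera remain; the field names and values are assumptions, not a real schema.

```python
# Filter a hypothetical image manifest so only production-like data remains.
# Field names and values are assumptions, not a real schema.
manifest = [
    {"file": "img_001.jpg", "source": "line_camera", "scene": "defective_part"},
    {"file": "img_002.jpg", "source": "line_camera", "scene": "good_part"},
    {"file": "img_003.jpg", "source": "line_camera", "scene": "empty_belt"},
    {"file": "img_004.jpg", "source": "web_scrape",  "scene": "catalog_photo"},
]

ALLOWED_SCENES = {"defective_part", "good_part", "empty_belt"}

dataset = [m for m in manifest
           if m["source"] == "line_camera" and m["scene"] in ALLOWED_SCENES]
print([m["file"] for m in dataset])   # the catalog photo never occurs in production, so it is excluded
```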

Data Diversity and Repetition

It is important to cover all of the scenarios the model may encounter in practice, and to keep the different types of data balanced. Do not pad the dataset with redundant copies of data you already have. In the defect detection example, you want to make sure the model sees non-defective items and items with different types of defects, under the different lighting conditions on the factory floor, in various rotations and positions on the conveyor belt, and, where possible, a few samples from maintenance mode. Because defect-free finished products all look the same, you do not need to over-populate the dataset with that kind of data. Another example of unnecessary repetition is video frames with little or no change between them.
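For the video-frame case, here is a minimal sketch that drops near-duplicate frames whose mean pixel difference from the last kept frame falls below a threshold; the threshold value and dummy frames are purely illustrative.

```python
import numpy as np

# Drop near-duplicate video frames: a frame is kept only if its mean pixel
# difference from the last kept frame exceeds a threshold.
THRESHOLD = 5.0   # tune per camera and lighting setup

def deduplicate(frames):
    kept = []
    for frame in frames:
        if not kept:
            kept.append(frame)
            continue
        diff = np.abs(frame.astype(np.int16) - kept[-1].astype(np.int16)).mean()
        if diff > THRESHOLD:
            kept.append(frame)
    return kept

static = np.zeros((48, 64), dtype=np.uint8)          # unchanging scene
changed = np.full((48, 64), 60, dtype=np.uint8)      # visibly different frame
frames = [static, static.copy(), static.copy(), changed]
print(len(deduplicate(frames)))                      # 2: the repeated frames are dropped
```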

Built on Powerful Technology

The small data techniques described above provide a solid technical foundation. You might, for example, benefit from transfer learning: transfer the knowledge of an already trained, well-performing model in a related domain to a new model, then refine the new model with your small data. For the defect detection example, the starting point could be another defect detection model you trained previously, rather than a model adapted from the MS COCO dataset, whose images are nothing like your conveyor-line defect detection scenario.

Data-Centric AI vs. Model-Centric AI

Recent research in the AI industry shows that model performance improves more when the model is trained on the right data. Finding edge cases and data gaps produces better results than sweeping hyperparameters or swapping model architectures; in short, better than assuming a competent data scientist will simply "figure it out" on the modeling side. If your defect detection model cannot accurately detect certain types of defects, invest more effort in adding image data for those defect types rather than trying different model architectures or hyperparameter optimizations.
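A simple, data-centric way to decide where to invest is to break errors down by defect type, as in the sketch below; the evaluation records and field names are hypothetical.

```python
from collections import Counter

# Break evaluation errors down by defect type to decide which data to collect next.
# The records and field names are hypothetical.
eval_results = [
    {"defect_type": "scratch", "correct": True},
    {"defect_type": "scratch", "correct": True},
    {"defect_type": "dent",    "correct": False},
    {"defect_type": "dent",    "correct": False},
    {"defect_type": "crack",   "correct": True},
]

errors = Counter(r["defect_type"] for r in eval_results if not r["correct"])
totals = Counter(r["defect_type"] for r in eval_results)

for defect, total in totals.items():
    print(f"{defect}: error rate {errors[defect] / total:.0%}")
# "dent" stands out, so prioritize collecting and labeling more dent images
# before reaching for a different architecture.
```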

Work with training data experts

With data-centric AI, you also want to focus debugging effort on the data work that domain experts are better at, rather than the model work that data scientists are better at. When the model fails, work with domain experts to identify patterns and hypothesize possible reasons for the failure; this helps you determine exactly which data you need. For example, a defect inspection expert can help you prioritize the data your model needs most, clean out the noise or unwanted data mentioned above, and may even point out nuances a data scientist could use to choose a better model architecture. All in all, small data is "denser" than big data: you want the highest quality data in the smallest possible dataset, keep the data cost-effective, and build your "champion" model with any of the techniques described above.
