Training a Pallet Detection Model with Synthetic Data

Imagine you're a robotics or machine learning (ML) engineer tasked with developing a model to detect pallets so a forklift can maneuver them. You are familiar with traditional deep learning pipelines, have organized manually labeled datasets, and have trained successful models.

You're ready for the next challenge: big piles of densely packed pallets. You may be wondering where to start. Would 2D bounding box detection or instance segmentation be most useful for this task? Should you attempt 3D bounding box detection and, if so, how would you label it? Is a monocular camera, a stereo camera, or lidar best for detection? Given the huge number of pallets that appear in natural warehouse scenes, manual labeling is no easy task. And a wrong choice could be costly.

Here's what I think about when I'm in a similar situation. Fortunately, there is an easy way to get started with relatively low investment: synthetic data.

1. Overview of Synthetic Data

Synthetic data generation (SDG) is a technique that uses rendered images instead of real images to generate training data for neural networks. The advantage of using synthetic rendered data is that you implicitly know the full shape and position of objects in the scene and can generate annotations such as 2D bounding boxes, keypoints, 3D bounding boxes, and segmentation masks.

Synthetic data is a great way to bootstrap deep learning projects because it enables you to iterate on ideas quickly before committing to a lot of manual data labeling, or when data is limited, restricted, or nonexistent. In this case, you may find that synthetic data with domain randomization works well for your application out of the box and saves time as well.

Or you may find that you need to redefine the task or use a different sensor modality. With synthetic data, you can try out these decisions without costly labeling efforts.

In many cases, you can still benefit from using some real-world data. The good news is that by experimenting with synthetic data, you become more familiar with the problem and can focus your labeling efforts where they matter most. Every machine learning task has its own challenges, so it can be difficult to determine exactly how synthetic data will fit in: whether you will need real data, or a mix of synthetic and real data.

2. Training a Pallet Segmentation Model with Synthetic Data

When considering how to use synthetic data to train a pallet detection model, our team started small. Before thinking about 3D box detection or anything complex, we first wanted to see whether we could detect anything at all with a model trained on synthetic data. To do this, we rendered a dataset of simple scenes, each consisting of just one or two pallets with a box on top. We used this data to train a semantic segmentation model.

We chose to train a semantic segmentation model because the task is well defined and the model architecture is relatively simple. It also makes it possible to visually identify where the model fails (mis-segmented pixels).

To train the segmentation model, the team first rendered a rough synthetic scene (Figure 1).

Figure 1. Rough synthetic rendering of two pallets with a box on top

The team suspected that these rendered images alone would lack the diversity needed to train a meaningful pallet detection model. We therefore decided to experiment with augmenting the synthetic renders using generative AI to produce more realistic images. Before training, we applied generative AI to these images to add variation that we believed would improve the model's ability to generalize to the real world.

This was done using a depth-conditioned generative model that roughly preserves the pose of objects in the rendered scene. Note that generative AI is not required for SDG; you could also experiment with traditional domain randomization, such as varying the textures, colors, positions, and orientations of the pallets, and you may find that this alone is sufficient for your application.

Figure 2. Synthetic rendering augmented with generative AI
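
If you go the traditional route, Omniverse Replicator can drive this kind of randomization. The sketch below is a minimal example of the idea, not our exact pipeline: the semantic label, pose ranges, frame count, and output directory are illustrative assumptions.

```python
# Minimal Omniverse Replicator sketch (runs inside Omniverse with the
# Replicator extension). Label names, ranges, and paths are illustrative.
import omni.replicator.core as rep

with rep.new_layer():
    camera = rep.create.camera(position=(0, 5, 10), look_at=(0, 0, 0))
    render_product = rep.create.render_product(camera, (1024, 1024))

    # Grab every prim tagged with the semantic class "pallet"
    pallets = rep.get.prims(semantics=[("class", "pallet")])

    with rep.trigger.on_frame(num_frames=2000):
        with pallets:
            # Traditional domain randomization: jitter pose and color
            rep.modify.pose(
                position=rep.distribution.uniform((-2, 0, -2), (2, 0, 2)),
                rotation=rep.distribution.uniform((0, -180, 0), (0, 180, 0)),
            )
            rep.randomizer.color(
                colors=rep.distribution.uniform((0.1, 0.1, 0.1), (1.0, 1.0, 1.0))
            )

    # Write RGB images plus segmentation labels for each randomized frame
    writer = rep.WriterRegistry.get("BasicWriter")
    writer.initialize(output_dir="_out_pallets", rgb=True,
                      semantic_segmentation=True)
    writer.attach([render_product])
```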

After rendering about 2,000 synthetic images, we trained a ResNet-18-based U-Net segmentation model using PyTorch. The results quickly showed great promise on real-world images (Figure 3).
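
For reference, the sketch below shows what this kind of training setup can look like. The article specifies only a ResNet-18-based U-Net in PyTorch; the segmentation_models_pytorch package and the PalletDataset class here are assumptions for illustration, not the team's actual code.

```python
# Condensed training sketch. segmentation_models_pytorch and the
# PalletDataset class are assumptions, not the team's actual code.
import torch
import segmentation_models_pytorch as smp
from torch.utils.data import DataLoader

model = smp.Unet(
    encoder_name="resnet18",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,          # single "pallet" class, predicted as a logit mask
).cuda()

# PalletDataset is a hypothetical Dataset yielding (image, binary_mask) pairs
train_loader = DataLoader(PalletDataset("renders/"), batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()

for epoch in range(50):
    for images, masks in train_loader:
        images, masks = images.cuda(), masks.cuda()
        logits = model(images)
        loss = loss_fn(logits, masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```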

Figure 3. Segmentation model tested on real pallet images

The model could accurately segment the pallets. Based on this result, we grew more confident in our workflow, but the challenge was far from over. So far, the team's method did not distinguish between instances of pallets, nor did it detect pallets that were not placed on the floor. For images like those shown in Figure 4, the results were barely usable. This suggested we needed to adjust our training distribution.

Figure 4. Semantic segmentation model fails to detect stacked pallets

3. Iteratively Increasing Data Diversity to Improve Accuracy

To improve the accuracy of the segmentation model, the team added more images of various pallets stacked in different random configurations. We added about 2,000 images to the dataset, bringing the total to about 4,000. We created the stacked pallet scenes using the USD Scene Construction Utilities open-source project.

The USD Scene Construction Utilities were used to position pallets relative to each other in configurations that reflect the distributions one might see in the real world. We used Universal Scene Description (OpenUSD) SimReady Assets, which provide a variety of pallet models to choose from.

Figure 5. Structured scenes created with the USD Python API and USD Scene Construction Utilities, then further randomized and rendered with Omniverse Replicator
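
As a rough illustration of how such scenes can be built with the USD Python API, the sketch below stacks several pallets by referencing one asset repeatedly. The asset path and pallet height are hypothetical placeholders; the actual utilities handle SimReady assets and far more varied configurations.

```python
# Sketch of stacking pallets with the USD Python API (pxr). The asset path
# and pallet height are hypothetical placeholders, not actual SimReady files.
from pxr import Usd, UsdGeom

PALLET_USD = "assets/pallet.usd"   # hypothetical pallet asset
PALLET_HEIGHT = 0.15               # meters; depends on the actual asset

stage = Usd.Stage.CreateNew("stacked_pallets.usd")
UsdGeom.SetStageUpAxis(stage, UsdGeom.Tokens.z)
UsdGeom.Xform.Define(stage, "/World")

# Build a vertical stack of five pallets by referencing the same asset
for i in range(5):
    xform = UsdGeom.Xform.Define(stage, f"/World/pallet_{i}")
    xform.GetPrim().GetReferences().AddReference(PALLET_USD)
    UsdGeom.XformCommonAPI(xform.GetPrim()).SetTranslate(
        (0.0, 0.0, i * PALLET_HEIGHT))

stage.Save()
```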

By training on stacked pallets and a wider range of viewpoints, we were able to improve the model's accuracy in these scenarios.

If adding this data helped the model, and synthetic images add no labeling cost, why did we generate only 2,000 at first? We did not start with more because we were sampling from the same synthetic distribution. Adding more images would not necessarily add much diversity to our dataset; instead, we might just add many similar images without improving the model's real-world accuracy.

By starting small, the team was able to quickly train the model, see where it failed, adjust the SDG pipeline, and add more data. For example, after noticing that the model was biased toward pallets of specific colors and shapes, we added more synthetic data to address these failure cases.

Figure 6. Renderings of plastic pallets in various colors

These data changes improved the model's ability to handle the failure scenarios it had encountered (plastic and colored pallets).

If varying the data works so well, why not go all out and add lots of variation at once? Until our team started testing on real data, it was difficult to judge what variations would be required. We might have missed an important factor needed for the model to perform well, or we might have overestimated the importance of other factors and spent our effort unnecessarily. By iterating, we gained a better sense of what data the task needed.

4. Detecting the Center of Pallet Sides

Once we had some promising results with segmentation, the next step was to adapt the task from semantic segmentation to something more practical. We decided that the easiest next task to evaluate was detecting the center of the pallet sides.

Figure 7. Example data for the pallet side center detection task

The pallet side center point is where a forklift centers itself when maneuvering a pallet. While in practice more information may be needed to manipulate the pallet (such as the distance and angle at this point), we saw this as an easy next step in the process that would let the team evaluate how useful our data is for downstream applications.

Detecting these points can be done with heatmap regression, which, like segmentation, operates in the image domain, is easy to implement, and is intuitive to interpret. By training a model for this task, we could quickly evaluate how useful our synthetic dataset is for training a model to detect important operational keypoints.
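
To make the idea concrete, here is a small sketch of heatmap regression for keypoints: a Gaussian target is rendered around each labeled side center for training, and at test time the keypoint is recovered from the predicted heatmap's peak. The resolution, sigma, and threshold values are illustrative assumptions.

```python
# Sketch of the heatmap-regression idea: render a Gaussian target around
# each labeled side center, train with MSE against it, and read peaks back
# at test time. Shapes, sigma, and threshold are illustrative choices.
import torch

def gaussian_heatmap(h, w, cx, cy, sigma=4.0):
    """Target heatmap with a Gaussian bump at keypoint (cx, cy)."""
    ys = torch.arange(h).view(-1, 1).float()
    xs = torch.arange(w).view(1, -1).float()
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_peak(heatmap, threshold=0.5):
    """Return the (x, y) of the strongest peak, or None if below threshold."""
    idx = torch.argmax(heatmap)
    y, x = divmod(idx.item(), heatmap.shape[-1])
    return (x, y) if heatmap[y, x] > threshold else None

target = gaussian_heatmap(256, 256, cx=120, cy=90)
print(decode_peak(target))  # -> (120, 90)
```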

The results after training are promising, as shown in Figure 8.

Figure 8. Pallet side detection model results on real images

The team confirmed that a model trained on synthetic data could detect the sides of pallets, even closely stacked ones. We continued to iterate on the data, model, and training pipeline to improve the model for this task.

5. Corner detection

Once we were satisfied with the side-center detection model, we explored taking the task to the next level: detecting corners. The initial approach was to use a heatmap for each corner, similar to the approach used for the pallet side centers.

Figure 9. Pallet corner detection model using heatmap

However, this approach quickly presented challenges. Because the dimensions of a detected object are unknown, it is difficult for the model to infer exactly where the corners of a pallet should be when they are not directly visible. And with heatmaps, inconsistent peaks are difficult to resolve reliably.

Therefore, instead of using heatmaps, we opted to regress the corner positions after detecting a peak at the face center. We trained a model to infer a vector field containing the offsets of the corners relative to the center of a given pallet face. This approach quickly showed promise for the task, and we were able to produce meaningful estimates of the corner positions even under heavy occlusion.
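
A sketch of how such a prediction might be decoded is shown below, assuming the model outputs a face-center heatmap plus an eight-channel field of (dx, dy) offsets, one pair per corner. The exact output layout of our model may differ; this is only the general idea.

```python
# Sketch of decoding a corner vector field: assumes the model outputs a
# face-center heatmap (H, W) plus an 8-channel field holding (dx, dy)
# offsets for each of the 4 corners, relative to the face center.
import torch

def decode_corners(center_heatmap, offset_field, threshold=0.5):
    """center_heatmap: (H, W); offset_field: (8, H, W) -> 4 corner (x, y) pairs."""
    idx = torch.argmax(center_heatmap)
    cy, cx = divmod(idx.item(), center_heatmap.shape[-1])
    if center_heatmap[cy, cx] < threshold:
        return None  # no confident face-center peak in this image
    corners = []
    for k in range(4):
        dx = offset_field[2 * k, cy, cx].item()
        dy = offset_field[2 * k + 1, cy, cx].item()
        corners.append((cx + dx, cy + dy))
    return corners
```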

Figure 10. Pallet detection results using a face-center heatmap and vector-field-based corner regression

Now that the team had a promising workflow, we iterated and extended it to address the different failure cases that arose. In total, our final model was trained on approximately 25,000 rendered images. The model was trained at a relatively low resolution (256 x 256 pixels) and can detect small pallets by running inference at a higher resolution. In the end, we were able to handle challenging scenes like the ones shown below with relatively high accuracy.

Here's something we can actually use, created entirely with synthetic data. This is where our pallet detection model stands today.

Figure 11. Final pallet detection model results; for ease of visualization, only front-face detections are shown

Figure 12. Pallet detection model running in real time

6. Building Your Own Model with Synthetic Data

Through iterative development with synthetic data, our team built a pallet detection model that works on real images. Further progress may be possible with more iteration, and beyond that, our task might benefit from adding real-world data. But without synthetic data generation, we could not have iterated as quickly, because every change would have required new labeling work.

If you are interested in trying this model, or are developing an application that could use a pallet detection model, you can find the model and inference code in the SDG Pallet Model repository on GitHub. The repository includes a pretrained ONNX model and instructions for optimizing the model with TensorRT and running inference on images. The model can run in real time on NVIDIA Jetson AGX Orin, so you can run it on edge devices.
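
For a quick sanity check before building a TensorRT engine, you could run the ONNX model with ONNX Runtime. The file names, input resolution, and preprocessing below are assumptions; see the repository for the documented workflow.

```python
# Quick sanity check of a pretrained ONNX model with ONNX Runtime. The
# file names, input resolution, and normalization here are assumptions;
# consult the SDG Pallet Model repository for the actual preprocessing.
import numpy as np
import onnxruntime as ort
from PIL import Image

session = ort.InferenceSession("pallet_model.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

image = Image.open("warehouse.jpg").convert("RGB").resize((256, 256))
x = np.asarray(image, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0

outputs = session.run(None, {input_name: x})
print([o.shape for o in outputs])  # e.g., heatmap and vector-field tensors
```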

You can also check out the recently open-sourced USD Scene Construction Utilities project, which contains examples and utilities for building USD scenes with the USD Python API.

We hope our experience inspires you to explore how synthetic data can bootstrap your own AI applications. If you want to start generating synthetic data, NVIDIA provides a set of tools to simplify the process. These include:

  • Universal Scene Description (OpenUSD): Described as the HTML of the metaverse, USD is a framework for fully describing 3D worlds. USD comprises not only primitives such as 3D object meshes, but also the ability to describe materials, lighting, cameras, physics, and more.
  • NVIDIA Omniverse Replicator: Replicator is a core extension of the NVIDIA Omniverse platform that enables developers to generate large and diverse synthetic training data to guide perception model training. With features such as an easy-to-use API, domain randomization, and multi-sensor simulation, Replicator solves data-scarce challenges and accelerates the model training process.
  • SimReady Assets: SimReady Assets are physically accurate 3D objects that contain accurate physical properties, behavior, and connected data streams to represent the real world within a simulated digital world. NVIDIA provides a collection of realistic assets and materials that can be used out of the box to build 3D scenes, including various assets related to warehouse logistics such as pallets, trolleys, and cardboard boxes. To search, display, inspect, and configure SimReady Assets before adding them to an active stage, you can use the SimReady Explorer extension. Each SimReady Asset has its own predefined semantic labels, making it easier to generate labeled data for segmentation or object detection models.

If you have questions about the pallet model, synthetic data generation with NVIDIA Omniverse, or inference on NVIDIA Jetson, visit the GitHub repository or the NVIDIA Omniverse Synthetic Data Generation developer forum and the NVIDIA Jetson Orin Nano developer forum.


