DAU-FI Net open source | Dual Attention U-Net + feature fusion + operators such as Sobel and Canny address the pain points of semantic segmentation

Overview

The proposed architecture, Dual Attentive U-Net with Feature Infusion (DAU-FI Net), addresses the challenges of semantic segmentation, especially on multi-class imbalanced datasets with limited samples. DAU-FI Net integrates a multi-scale spatial-channel attention mechanism and feature injection to improve the accuracy of object localization. Its core adopts a multi-scale depthwise separable convolution block to capture local patterns across scales. This block is complemented by a spatial-channel squeeze-and-excitation (scSE) attention unit, which models the dependencies between channels and spatial regions of the feature maps. Furthermore, additional attention gates on the connections between the encoder and decoder paths refine the segmentation.

To further enhance the model, Gabor filters are used for texture analysis and Sobel and Canny filters for edge detection, and the feature space is strategically expanded under the guidance of semantic masks. Comprehensive experiments on a challenging sewer pipeline and culvert defect dataset as well as a benchmark dataset validate the performance of DAU-FI Net, and ablation studies highlight the progressive benefits of the attention modules and feature injection. DAU-FI Net achieves mean Intersection over Union (IoU) scores of 95.6% and 98.8% on the defect test set and the benchmark dataset respectively, improvements of 8.9% and 12.6% over the previous method. The proposed architecture provides a powerful solution for advancing semantic segmentation of multi-class problems with limited training data. The authors' dataset of sewer pipeline and culvert defects, with pixel-level annotations, opens avenues for further research in this critical area. Overall, this work contributes key innovations in architecture, attention, and feature engineering to improve the effectiveness of semantic segmentation.

Automated defect detection in underground infrastructure is critical and challenging; manual inspection is dangerous, time-consuming, and error-prone. The DAU-FI Net architecture proposed by the authors provides an accurate automated solution that overcomes key limitations of prior methods. By integrating innovations in attention mechanisms and feature engineering, DAU-FI Net achieves 95.6% and 98.8% Intersection over Union on the sewer pipeline and culvert defect dataset and the cell nuclei benchmark respectively, 8.9% and 12.6% higher than state-of-the-art methods. This level of performance on complex real-world data enables reliable automation of infrastructure inspections, increasing efficiency and safety. The curated dataset of pixel-level annotated defects also opens avenues for new research.

With further development, this technology can be deployed in practical applications to autonomously analyze sewer pipelines, culverts, and tunnels, providing rapid anomaly detection to prioritize maintenance and prevent catastrophic failures. This work establishes a thorough benchmark that pushes the boundaries of architectural design, attention models, and feature injection to improve deep learning capabilities.
Code: https://tinyurl.com/DAUFNet

I Introduction

Semantic segmentation is an important subfield of computer vision and continues to keep the research field active and dynamic. This technology enables machines to better understand visual scenes by assigning each pixel a corresponding category and precise spatial location. The most promising advances in this field are fully convolutional networks (FCNs). FCNs excel in semantic segmentation through an elegant encoder-decoder architecture built around convolutional neural networks (CNNs). The encoder part compresses the input image into a compact latent representation through convolution and downsampling layers. This encoder part, which represents high-level features, is then passed to the decoder network, which generates segmentation maps that match the original image dimensions through upsampling. The encoder-decoder structure enables FCNs to perform pixel-level end-to-end predictions for inputs of any size.

FCNs are fully neural network architectures that can accommodate inputs of different sizes and can generate outputs of matching sizes. This unique architectural approach enables FCNs to efficiently assign category labels to individual pixels in images and align them with the corresponding target categories. In particular, the U-Net architecture has recently made significant progress in the field of semantic segmentation, especially its ability to produce accurate segmentation output when trained on limited data.

U-Net’s encoder-decoder structure and skip connections allow it to exploit local details and global context when generating segmentations [1]. By combining convolutional layers that capture hierarchical feature representations and upsampling layers that enable precise local localization, U-Net achieves excellent performance in medical imaging and other segmentation tasks that require detailed definition of complex structures. The network's ability to produce accurate outputs from small datasets is extremely valuable for many applications where obtaining large-scale annotated training data is challenging.

Building on recent advances in semantic segmentation utilizing FCN and U-Net architectures, the authors' previous work aimed to further improve segmentation performance. In that research, they introduced an enhanced U-Net architecture employing a depthwise separable block consisting of a series of separable convolutions with multiple kernel sizes (3x3 and 5x5), followed by a 1x1 convolution that reduces the number of filters. This combination of convolutions captures multiple local patterns and extracts features across multiple scales. Although the model showed promising performance, further improvements were needed to address the limitations of multi-class datasets, especially class-imbalanced instances and limited sample availability. Furthermore, distinguishing classes that are similar to each other within a dataset remains challenging.

This paper addresses the above challenges by introducing different attention mechanisms. Specifically, the authors combine multi-scale depthwise separable blocks with an enhanced squeeze-and-excitation (SE) mechanism. The main goal of this combined block is to enhance the representational power of the model by capturing the spatial and channel dependencies of the input data. This dual attention approach allows the model to take both local patterns and global context into account, thereby improving segmentation performance.

Furthermore, the authors introduce a novel method to integrate extracted image features into the model to improve performance. Although the importance of feature engineering has gradually waned with the rise of deep learning, combining hand-crafted features with deep neural networks still has advantages, especially in scenarios involving limited training data or fewer semantic categories. This hybrid approach can effectively expand the feature space and provide additional information to guide the model’s learning process.

To evaluate the performance of our model, we created a specialized dataset for segmenting defects in drainage pipelines and culverts. The images in this dataset were identified and annotated by domain experts. These defects include cracks, splits, pipeline deformation, joint problems, and other complex damage patterns commonly seen during pipeline and culvert inspections. By manually outlining the defect boundaries in each image using the LabelMe annotation tool, the authors obtained precise pixel-level annotations.

Furthermore, to evaluate the generalization ability of the model, it was tested on the cell nucleus segmentation dataset of the 2018 Data Science Bowl, which serves as a public benchmark. This benchmark consists of 2D light microscopy images with segmentation ground truth. The dual nature of this evaluation, covering both the specific task of segmenting fine pipeline defects and a broader segmentation challenge, allows the authors to comprehensively analyze the model's capabilities.

Creating a customized pipeline and culvert defect segmentation dataset that includes challenging features observed in real-world situations not only facilitates an in-depth evaluation of the authors' model, but also provides a valuable resource for future research in this critical area.

Main Contributions
The key contributions of this work in advancing semantic segmentation with limited training data for multi-class problems include:

A novel dual-attention U-Net architecture (DAU-FI Net) is proposed, which integrates a customized multi-scale spatial-channel attention mechanism and strategically injects engineered image features to improve multi-class segmentation accuracy with limited training data. It introduces a dual attentive block that combines multi-scale convolution with concurrent spatial-channel squeeze-and-excitation modeling to capture both local patterns and global context.

State-of-the-art performance is achieved on a challenging real-world sewer and culvert defect segmentation dataset collected and annotated by the authors. The method significantly outperforms previous methods and demonstrates generalization capability on a nuclei segmentation benchmark. The experimental analysis details the stepwise benefits of the key components.

A safety-critical dataset of sewer culvert defects with pixel-level annotations is provided to advance future research in this area. This in-depth experimental analysis investigates how attention mechanisms and strategic feature injection can progressively improve multi-class segmentation when training data is scarce.

II Literature Review on the Evolution of Semantic Segmentation
In recent years, many semantic segmentation models have been proposed, each of which surpasses their predecessors through different strategies and techniques. A transformative development is the integration of attention mechanisms into segmentation models. By focusing on salient regions and suppressing irrelevant details, the attention module replicates core aspects of human visual perception. When integrated effectively, attention modules can enhance model performance and contribute to the advancement of the field. This section reviews original models of semantic segmentation, analyzes the key limitations that motivated the authors' approach, and introduces related attention mechanisms.

At its core, semantic segmentation classifies each pixel in an image through pixel-level classification, typically implemented in an encoder-decoder architecture built from fully convolutional networks. The encoder compresses the spatial dimensions through successive convolutional and downsampling layers, condensing the input into a compact latent representation of salient features. The decoder then upsamples this encoding to produce a segmentation map that matches the original input resolution. This general design has become standard, with successive models improving on earlier methods. For example, the FCN-8s model [5] utilizes skip connections to recover fine spatial details lost during encoding, thereby improving segmentation accuracy on datasets like PASCAL VOC [5].

Roy et al. proposed the spatial and channel squeeze-and-excitation (scSE) module, a targeted improvement on the traditional squeeze-and-excitation block that enhances the performance of fully convolutional networks in semantic segmentation. This is achieved by recalibrating feature responses with learnable weight layers that capture interdependencies within feature maps. scSE dynamically recalibrates activations during forward propagation, boosting the weights of useful features while suppressing unhelpful ones. Experiments show that integrating scSE into a model improves segmentation accuracy and object delineation, enhancing representational capacity and modeling richer inter-channel relationships. Limitations remain, however, in the increased model complexity and the need for comprehensive analysis across different architectures and tasks.

To complement these advances, Su et al. studied the integration of the lightweight Convolutional Block Attention Module (CBAM) into U-Net. The Channel Attention Module (CAM) operates along the channel dimension, aggregating features through average pooling and using learnable layers to selectively emphasize useful channels. Meanwhile, the Spatial Attention Module (SAM) generates 2D attention maps to highlight salient spatial regions and objects. This combination of channel and spatial attention improves segmentation by capturing global semantics while localizing detailed structures. Evaluation results show the effectiveness of CBAM, with the dual attention mechanism improving representational ability and the accuracy of pixel-level object delineation. Its modular nature also allows flexible integration into architectures like U-Net without excessive computational overhead.

Building on the basic encoder-decoder architecture, ASCU-Net proposes an innovative tripartite attention mechanism to improve semantic segmentation performance. This is achieved by efficiently integrating three complementary attention modules: the attention gate (AG), the spatial attention module (SAM), and the channel attention module (CAM). Attention gates dynamically focus on important target structures in each encoder layer, filtering out irrelevant regions before the information is passed to the decoder. The lightweight SAM uses normalized convolutions to model spatial relationships and locate salient regions.

At the same time, CAM adaptively recalibrates channel features through squeeze and excitation operations to emphasize useful channels. This comprehensive attention mechanism selectively filters and emphasizes the most important information, thereby improving representation quality. Their experimental results show that the attention ensemble method improves segmentation accuracy and enhances the robustness of the model. Elegantly integrating customized attention modules is an important step towards the advancement of deep learning techniques in scene understanding.

While previous innovations have brought unique advantages to semantic segmentation and pushed the boundaries of the field, challenges remain in handling complex multi-class datasets that suffer from class imbalance. To overcome these obstacles, the authors propose a novel model that adopts an integrated approach.

The authors' model combines multiple attention mechanisms, including multi-scale filtering, to improve resolution and refine segmentation. The authors also strategically inject engineered features to extend representation capabilities. This balanced contextual attention and underlying feature design enhances the effectiveness of segmentation. These innovations have the potential to transform the segmentation of complex real-world scenarios, such as the inspection of underground infrastructure. The authors' approach focuses on advancing semantic segmentation research by developing more reliable and accurate solutions that can be easily deployed. By addressing current limitations, the authors' work aims to advance scene understanding to the next level with a flexible architecture specifically designed to handle multiple classes of imbalanced data.

Automated inspection of underground infrastructure, such as culverts and drainage pipelines, is critical to identifying structural and material degradation to ensure proper operation throughout the design life. Automated defect detection is critical to improving infrastructure owners’ ability to make data-driven maintenance decisions and mitigating human risks related to health and safety. However, this presents challenges, as manual inspection is slow, costly and error-prone. These environments feature poor lighting, occlusion, and, most importantly, multiple defect types, including cracks, corrosion, blockages, joint issues, and intrusions. This heterogeneity of defects greatly complicates analysis.

In addition, Haurum et al. and Gao et al. point out further difficulties such as poor illumination, texture variation, water occlusion, and clutter. There is also a class imbalance problem, in which certain defects are underrepresented. Together, these issues make accurate identification and classification very difficult.

However, recent research has made progress in using deep learning and machine learning to detect cracks in infrastructure. Panta et al. proposed an encoder-decoder network, IterLUNet, for pixel-level crack detection in embankment images. A comparative study found that MultiResUnet achieved the highest mean Intersection over Union. Continuous monitoring was emphasized because of the catastrophic risk posed by cracks. Previous research evaluated algorithms for detecting cracks from images and pioneered sand boil detection using machine learning.

These studies demonstrate progress in deploying sophisticated algorithms to detect structural weaknesses, aiding disaster prevention. However, challenges remain regarding complex real-world environments and limitations of aerial inspection data.

The authors' model aims to address these challenges through a custom attention mechanism and extended feature representation, enabling effective segmentation even in situations where data is imbalanced and limited. Strategically injecting features further enhances capabilities by integrating underlying image processing and learned representations. The authors' approach facilitates efficient and reliable automation, making this traditionally manual process more expedient.
III Methodology

[Figure 1: Overall DAU-FI Net architecture, showing the dual attentive blocks, the attention gates, the attentive skip connections between encoder and decoder, and the feature infusion pipeline]

This section introduces the authors' proposed model, Dual Attentive U-Net with Feature Infusion (DAU-FI Net), in detail, as shown in Figure 1. The model contains several key components that enhance semantic segmentation through dual attention modeling and strategic feature engineering:

The dual attentive block (DAB) attention mechanism, shown in the left part of Figure 1.
The attention gates, shown in the upper right corner of Figure 1.
The attentive skip connections between the encoder and decoder paths, shown in the lower right corner of Figure 1.
The feature infusion pipeline, shown in the lower right corner of Figure 1.
U-Net Encoder-Decoder Backbone with Attentive Skip Connections
DAU-FI Net implements an attentive encoder-decoder architecture for accurate semantic segmentation, built on the authors' previously refined U-Net architecture. This U-shaped topology integrates attention gates, attentive skip connections, and coordinated up- and downsampling between the encoder and decoder paths, as indicated by the data flow through the DABs and connection nodes. In the encoder path, max pooling is used for downsampling, while in the decoder path, transposed convolution is used for upsampling. Together, these components improve the network's modeling capability and lead to better segmentation results.

A key innovation of DAU-FI Net is the integration of attention gates (see the upper right corner of Figure 1) into the skip connections between the encoder and decoder paths. The attention gate acts as a specialized filter that suppresses irrelevant regions while emphasizing the most salient structures in the encoder feature maps. As data flows through the attention gates, spatial context is aggregated so that only the most useful information passes through. This attention-gated communication between the encoder and decoder paths enhances the segmentation ability of the model.

Attention gates are strategically placed before connection nodes, where encoder and decoder features are combined. This mechanism allows the attention gate to prioritize feature maps, improving crucial aspects for the object segmentation task while ignoring unnecessary details. The selectivity of the attention gate, guided by the data, helps provide a more focused feature representation, supporting the network's capabilities in semantic segmentation and achieving better performance.

In the DAU-FI Net architecture, strategically placed attentive skip connections (see the bottom right corner of Figure 1) enhance the model's capability, improving segmentation accuracy and reconstruction fidelity. These connections, shown as solid lines, bypass layers to feed earlier feature maps to later layers, helping preserve spatial details lost during downsampling. Effectively integrating downsampling, upsampling, attention modeling, and skip connections greatly improves segmentation accuracy and the ability to reconstruct features.
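As a concrete illustration, below is a minimal PyTorch sketch of an additive attention gate of the kind used on such skip connections (an Attention U-Net style gate; the class name, channel arguments, and exact gating formulation are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate: the decoder (gating) signal decides which
    encoder activations pass through the skip connection."""
    def __init__(self, enc_channels, dec_channels, inter_channels):
        super().__init__()
        self.theta = nn.Conv2d(enc_channels, inter_channels, kernel_size=1)  # project encoder features
        self.phi = nn.Conv2d(dec_channels, inter_channels, kernel_size=1)    # project gating signal
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)               # attention coefficients

    def forward(self, enc_feat, gate):
        # Resize the gating signal to the spatial size of the encoder features.
        gate = F.interpolate(gate, size=enc_feat.shape[2:], mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.theta(enc_feat) + self.phi(gate))))
        return enc_feat * attn  # suppress irrelevant regions, keep salient ones
```

The gated encoder map would then be concatenated with the upsampled decoder features at the connection node, as in a standard attention-gated skip connection.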

Dual Attentive Block
The core of DAU-FI Net is the dual attentive block (DAB), an innovative fusion module that combines local patterns and global dependencies across multiple scales and shapes to enhance segmentation. Specifically, the DAB combines two key components: a multi-scale depthwise separable convolution block and a modified concurrent spatial and channel squeeze-and-excitation (scSE) attention mechanism.

As shown in the upper left corner of Figure 1, the depthwise separable convolution block uses convolution kernels of different sizes, such as 3x3 and 5x5, to match the dimensions of different objects. This multi-scale filtering adapts to size variations and optimizes feature extraction for objects of different sizes.

At the same time, the parallel scSE attention component shown in the lower left corner of Figure 1 includes channel squeeze-and-excitation (CSE) and spatial squeeze-and-excitation (SSE) units for targeted feature recalibration. CSE emphasizes useful channels through global average pooling followed by channel reduction and excitation, while SSE generates a spatial attention map by compressing the channel dimension to highlight important regions.

The authors' key improvement is to make the scaling in CSE and SSE dynamically learnable rather than fixed. This enables adaptive, input-specific recalibration that optimizes the flow of information within the block. Optimized multi-scale filtering and recalibrated attention together bridge the gap between fine-grained pixel detail and broader channel-level features.

As shown in Figure 1, the concurrent scSE output is merged with the initial multi-scale block output by addition. This two-stage approach unifies complementary spatial-channel attention at different scales into a single block. Overall, the DAB tightly fuses local patterns, contextual relationships, and scale dynamics into one block, thereby improving segmentation.
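The following is a minimal PyTorch sketch of such a dual attentive block, combining a multi-scale depthwise separable branch (3x3 and 5x5 kernels) with an scSE unit whose mixing weight is learnable; the class names, reduction ratio, and exact fusion details are assumptions made for illustration rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SCSE(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation with a learnable
    balance between the two branches (standing in for the 'dynamically
    learnable scaling' described in the text)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable cSE/sSE balance

    def forward(self, x):
        return self.alpha * x * self.cse(x) + (1 - self.alpha) * x * self.sse(x)

class DualAttentiveBlock(nn.Module):
    """Multi-scale depthwise separable filtering followed by scSE attention,
    merged by addition."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch3 = DepthwiseSeparableConv(in_ch, out_ch, 3)
        self.branch5 = DepthwiseSeparableConv(in_ch, out_ch, 5)
        self.project = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.scse = SCSE(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        multi_scale = self.act(self.branch3(x) + self.branch5(x) + self.project(x))
        return multi_scale + self.scse(multi_scale)  # additive fusion of the two stages
```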

Strategic Feature Augmentation
Although the DAU-FI Net architecture demonstrates promising segmentation capabilities, further improvements are needed for multi-class imbalanced datasets with limited training samples. To address these challenges, the authors propose a domain-knowledge-based strategy that strategically infuses engineered features into the model. Although deep neural networks demonstrate remarkable feature learning capabilities [18], combining learned features with hand-crafted inputs expands the feature space and provides additional guidance that improves performance when data is sparse or skewed. The authors' hybrid approach effectively blends the complementary strengths of deep learning and specialized feature design.

Specifically, the authors adopted a comprehensive four-pronged approach:

Gabor filters for texture analysis, detecting patterns, smoothness, and irregularities through wavelet transforms;
Canny and Sobel edge detectors to identify boundaries and intensity changes, highlighting object outlines;
Histogram of Oriented Gradients (HOG) to capture morphological properties, represented by quantized gradient-orientation histograms;
color spectrum and intensity analysis to evaluate the color and intensity distribution within the image.
[Figure 2: The four engineered feature extraction techniques]
As shown in Figure 2, these four complementary techniques extract texture, edge, shape, and color/intensity features respectively. Fusing these engineered inputs significantly expands the feature space beyond what deep learning can extract from restricted pipeline inspection data. This allows the model to improve segmentation capabilities by using specialized handcrafted representations, thereby overcoming sample size and class imbalance limitations.

Gabor filters extract multi-scale, multi-directional texture features that robustly describe corrosion, cracking, and clogging, while the Sobel and Canny operators capture useful edge patterns to pinpoint defects; the adaptability of these techniques provides additional cues for the complex sewer pipeline defect dataset. The supplementary documentation contains details of the Gabor-based texture analysis and the Sobel- and Canny-based edge analysis.
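For reference, a minimal OpenCV sketch of this texture and edge feature extraction is shown below; the Gabor kernel parameters, orientation count, and Canny thresholds are illustrative assumptions rather than the authors' settings:

```python
import cv2
import numpy as np

def engineered_feature_maps(gray):
    """Texture (Gabor) and edge (Sobel, Canny) response maps for one
    uint8 grayscale image; all filter parameters here are illustrative."""
    gabor_bank = []
    for theta in np.arange(0, np.pi, np.pi / 4):          # 4 orientations
        kern = cv2.getGaborKernel((21, 21), sigma=4.0, theta=theta,
                                  lambd=10.0, gamma=0.5, psi=0)
        gabor_bank.append(cv2.filter2D(gray, cv2.CV_32F, kern))
    gabor = np.max(np.stack(gabor_bank), axis=0)          # strongest texture response

    sobel_x = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    sobel_y = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    sobel = cv2.magnitude(sobel_x, sobel_y)               # gradient magnitude

    canny = cv2.Canny(gray, 50, 150).astype(np.float32)   # thin binary edges

    return np.stack([gabor, sobel, canny], axis=-1)       # H x W x 3 feature stack
```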

[Figure 3: Filter responses and region-based feature extraction on samples from the authors' dataset]

Figure 3 shows the responses of the mentioned filters and the results obtained by applying the region-based feature extraction method on samples from the authors’ dataset.

III-B1 Gradient Orientation Analysis
In addition to texture and edge features, the authors also use Histogram of Oriented Gradients (HOG) features to describe local shape attributes and the spatial layout of defects. HOG analyzes the gradient orientations of local image regions, enabling robust feature extraction for object detection, which is very useful for the authors' sewer pipeline images. For defects with directional gradient patterns, such as cracks, holes, and disconnections, HOG helps capture their distinctive shape contours. This enhances the model's ability to detect shape-based anomalies and differentiate between defect types with oriented gradient patterns.
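A minimal sketch of HOG extraction with scikit-image is shown below; the cell and block sizes are assumptions chosen for illustration, not the parameters used in the paper:

```python
from skimage.feature import hog

def hog_features(gray):
    """HOG descriptor plus a visualization map that highlights oriented
    gradient patterns such as cracks and disconnections."""
    descriptor, hog_image = hog(gray,
                                orientations=9,
                                pixels_per_cell=(16, 16),
                                cells_per_block=(2, 2),
                                visualize=True,
                                feature_vector=True)
    return descriptor, hog_image
```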

III-B2 Color Spectrum and Intensity Analysis
Furthermore, color and intensity features play a vital role in image analysis. The authors implemented color spectrum and intensity analysis to examine the color distribution and its variations within pipeline images [25]. Defects have distinctive color characteristics compared with intact areas due to factors such as corrosion and mineral deposits, so capturing color and intensity features helps identify these unusual patterns. Fusing these hand-crafted features expands the representation beyond what deep learning can extract from the authors' limited pipeline inspection data.
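A simple sketch of such color and intensity analysis is given below, using per-channel histograms plus grayscale statistics; the bin count and the choice of statistics are illustrative assumptions:

```python
import cv2
import numpy as np

def color_intensity_features(bgr, bins=32):
    """Per-channel color histograms plus global intensity statistics."""
    hists = [cv2.calcHist([bgr], [c], None, [bins], [0, 256]).ravel() for c in range(3)]
    hists = np.concatenate(hists)
    hists = hists / (hists.sum() + 1e-8)                  # normalize to a distribution

    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    stats = np.array([gray.mean(), gray.std(), gray.min(), gray.max()])
    return np.concatenate([hists, stats])
```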

III-B3 Injection of Augmented Features
After feature extraction, the next key step is to integrate these features into the authors' segmentation model, DAU-FI Net. The authors examined multiple layers of the model in depth, trying various operations such as addition and multiplication. After a series of experiments, they determined that the optimal feature injection method introduces the features into the first two layers of the model, as follows:

First, the authors align the extracted features with the corresponding layers of the model. This requires creating convolutional layers with the same number of channels (i.e., the same number of filters, or feature extractors). The authors then combine these layers additively and apply a 1x1 convolutional layer with a single filter. This final step enables element-wise multiplication with the model's input data, effectively integrating the extracted features into the segmentation process. The method is shown in Figure 4.

[Figure 4: Injection of engineered features into the first two layers of the model]
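A minimal PyTorch sketch of this injection scheme is shown below: both streams are projected to the same channel width, combined additively, squeezed to a single-channel map with a 1x1 convolution, and used to modulate the layer input element-wise. The module name, the sigmoid on the gate, and the channel sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class FeatureInjection(nn.Module):
    """Injects engineered feature maps into an early model layer."""
    def __init__(self, feat_channels, model_channels):
        super().__init__()
        self.align = nn.Conv2d(feat_channels, model_channels, kernel_size=1)  # match channel count
        self.squeeze = nn.Conv2d(model_channels, 1, kernel_size=1)            # single-filter 1x1 conv

    def forward(self, layer_input, engineered_feats):
        fused = self.align(engineered_feats) + layer_input   # additive combination
        gate = torch.sigmoid(self.squeeze(fused))            # sigmoid is an assumption to bound the gate
        return layer_input * gate                            # element-wise modulation of the input
```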
III-B4 Annotation-Guided Extraction
To further refine feature extraction, the authors use the pixel-level annotation masks from their custom dataset to guide region filtering. They first apply the selected filters to the image to generate response maps, and then perform element-wise multiplication of each response map with the annotation mask to preserve the defect-related patterns identified by domain knowledge.
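A one-function sketch of this annotation-guided masking (assuming a binary pixel-level mask) might look like:

```python
import numpy as np

def annotation_guided_response(response_map, defect_mask):
    """Keep filter responses only inside annotated defect regions by
    element-wise multiplication with the binary pixel-level mask."""
    return response_map * (defect_mask > 0).astype(response_map.dtype)
```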

By carefully combining domain knowledge, feature design, and data annotation, the authors enhance the model's efficiency in identifying and classifying pipeline defects. Domain knowledge guides appropriate filter selection, and annotation locates key areas, jointly achieving more focused feature extraction.

The authors determine the optimal injection strategy by injecting engineered features into multiple model layers. The authors aim to effectively combine the complementary advantages of deep representation learning and customized feature engineering to improve performance. Specialized feature enhancements are designed to overcome data constraints and deploy customized models for practical sewer inspections.

IV Dataset Description
This section details the authors' approach to creating the sewer and culvert defect segmentation dataset. The authors first give an overview of the data collection process, describing how source videos containing labeled defect instances were acquired and preprocessed to extract keyframes. Next, they discuss the pixel-level annotation strategy for generating fine-grained GT Masks for semantic segmentation: technologists carefully trace the precise boundary of each defect in the extracted frames to produce pixel-level labels. Finally, the authors assess the model's reliability and generalization ability by measuring its performance on a public benchmark dataset for nucleus segmentation.

Data Collection Methodology
The authors collected 580 annotated underground infrastructure inspection videos from two industry sources to build a robust dataset covering a variety of real-world conditions. These videos cover wastewater pipelines and culverts, introducing variations in materials, shapes, sizes, and imaging environments. Technologists carefully annotated each video, identifying bounding boxes and timestamps for nine common structural defect categories based on industry standards.

To guide model training, a professional civil engineer assigned each defect category a weight between 0 and 1 based on U.S. industry standards to reflect economic and safety impacts. Prioritization in the learning process was determined by normalizing the individual scores by dividing them by the highest value.
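As a small illustration of this normalization, with hypothetical placeholder scores (not the engineer's actual ratings):

```python
# Hypothetical expert-assigned severity scores per defect class.
severity = {"crack": 0.9, "joint_issue": 0.6, "deformation": 0.8}
max_score = max(severity.values())
class_weights = {name: score / max_score for name, score in severity.items()}
# The class with the highest score receives weight 1.0; the rest are scaled proportionally.
```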

The constructed dataset includes a variety of materials, shapes, and sizes found in wastewater pipelines and culverts. The diversity of this data set accurately reflects the inherent variability encountered during actual field inspections of wastewater pipelines and culverts. It also presented additional challenges due to the integration of data from various sources and different structures of pipelines and culverts.

Pixel-Level Annotation
To construct the dataset, the authors first segmented each video into separate frames, capturing one frame per annotated defect instance, with annotation points spaced 4 to 10 seconds apart. Each annotation corresponds to a specific defect class label and a timestamp in seconds, and the authors also record the position of each annotation along the pipeline. Currently, the dataset consists of approximately 5000 frames covering the nine defect classes listed in Table I.

A key aspect of the authors' dataset preparation is manual pixel-level annotation to support semantic segmentation. This involves accurately outlining every occurrence of each defect in the video frames. Highly skilled annotators carefully trace the boundaries of each defect in each frame to create pixel-level masks, which serve as ground-truth data for training and evaluating the authors' semantic segmentation model. This pixel-level annotation process ensures that the dataset provides the detail necessary to identify and classify defects at the pixel level, enabling the development and evaluation of powerful semantic segmentation algorithms. Each annotated category is assigned a specific color according to the structural color-coding guidelines of the National Association of Sewer Service Companies (NASSCO) Pipeline Assessment Certification Program (PACP) [26].

Benchmark Evaluation
In addition to the sewer and culvert defect dataset, the authors also evaluated performance on the cell nucleus segmentation dataset from the 2018 Data Science Bowl [4]. This dataset contains approximately 700 segmented nuclei images acquired under a variety of conditions, including different cell types, magnifications, and imaging modalities. Although this biomedical application area differs significantly from infrastructure inspection, evaluating performance on a recognized benchmark with distinct imaging properties provides useful insight. The key aspects the authors aimed to validate were the model's ability to:

Generalize to datasets beyond the data used for development;
segment fine-grained structures from images containing unique artifacts, noise, and other complexities;
handle multi-class segmentation tasks with different output classes.
Analyzing results on the cell nucleus dataset allows a rigorous evaluation of the method's scalability and limitations on a standard dataset with different challenges and imaging modalities. This helps determine where the approach transfers directly and where customization may be needed. The goal is to comprehensively evaluate the model to inform future development and applications.

V Model Training and Evaluation
This section provides an overview of the optimization, regularization, loss functions, and evaluation metrics used to train and rigorously evaluate semantic segmentation models.

Training Optimization and Regularization
V-A1 Loss Function
In model training, the authors use the categorical cross-entropy loss function to optimize multi-class pixel classification. This commonly used function is well suited to multi-class prediction in semantic segmentation tasks and effectively minimizes the error between the predicted class probabilities and the actual labels.

V-A2 Dropout Regularization
To maximize model performance and prevent overfitting, the authors introduce a dropout layer before the output layer, with a rate of 0.2 determined through hyperparameter tuning. Dropout is a regularization technique that temporarily disables random neurons during training to prevent co-adaptation, which improves the model's generalization ability.

V-A3 Optimization Algorithm
The authors use the Adam algorithm, which adaptively adjusts the learning rate of each parameter, with an initial learning rate of 0.001 and exponential decay after 10 epochs. This learning-rate schedule gradually refines the model weights.
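A minimal PyTorch sketch of this training setup is shown below; the model constructor, data loader, epoch count, and decay factor are illustrative placeholders, and the dropout(0.2) layer is assumed to sit inside the model just before its output layer as described above:

```python
import torch
import torch.nn as nn

model = DAUFINet(num_classes=10)          # hypothetical constructor, e.g. 9 defect classes + background
criterion = nn.CrossEntropyLoss()         # categorical cross-entropy over per-pixel class labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hold the initial learning rate for 10 epochs, then decay it exponentially (0.9 is illustrative).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: 1.0 if e < 10 else 0.9 ** (e - 10))

for epoch in range(100):                  # epoch count is illustrative
    for images, masks in train_loader:    # assumed loader of (N,3,H,W) images and (N,H,W) label masks
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```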

Evaluation Metrics
Model performance for semantic segmentation of sewer and culvert defects is evaluated using four key metrics (a minimal computation sketch follows the list):

Intersection over Union (IoU), in both its standard and frequency-weighted (FWIoU) variants, measuring the overlap between predictions and ground truth;
F1 score, which combines precision and recall into a measure that remains informative under data imbalance;
balanced accuracy, the average recall over all classes, giving each class fair representation;
Matthews correlation coefficient (MCC), which measures overall classification quality; higher scores indicate better performance, and it is especially suitable for datasets with imbalanced distributions.
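Below is a minimal sketch of how these metrics can be computed with scikit-learn from flattened per-pixel labels; the averaging choices stand in for the standard and frequency-weighted variants and may differ from the authors' exact weighting:

```python
import numpy as np
from sklearn.metrics import (jaccard_score, f1_score,
                             balanced_accuracy_score, matthews_corrcoef)

def segmentation_metrics(y_true, y_pred):
    """Compute the four reported metrics from integer per-pixel label maps."""
    t, p = np.asarray(y_true).ravel(), np.asarray(y_pred).ravel()
    return {
        "IoU (macro)": jaccard_score(t, p, average="macro"),
        "FWIoU": jaccard_score(t, p, average="weighted"),   # weighted by class frequency
        "F1 (macro)": f1_score(t, p, average="macro"),
        "Balanced accuracy": balanced_accuracy_score(t, p),
        "MCC": matthews_corrcoef(t, p),
    }
```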
VI Results
This section presents the authors’ comprehensive experimental results on the proposed DAU-FI Net model on the challenging sewer and culvert defect dataset. We evaluate overall performance by benchmarking against established architectures and analyze the impact of our core innovations through detailed ablation studies.

Comparative Evaluation
To validate the proposed model, the authors conducted a comprehensive comparison against well-known baseline and state-of-the-art architectures:

U-Net: the pioneering fully convolutional network for segmentation;
Attention U-Net: U-Net with additional attention modules;
CBAM U-Net: U-Net with combined spatial and channel attention blocks;
ASCU-Net: U-Net with a tripartite attention mechanism;
Depthwise Separable U-Net (DWS MF U-Net): U-Net with multi-scale filtering.
The authors evaluate multiple metrics - IoU, FWIoU, F1-Score, balanced accuracy, and MCC.
[Table II: Comparative evaluation results across all metrics]
Table 2 summarizes the key results.
[Table III: Model parameter counts]
In the authors' study, the enhanced DAU-FI Net model, particularly with the upgraded scSE block, outperformed its state-of-the-art competitors on all evaluation metrics. This achievement is notable because the model also has fewer parameters, as shown in Table 3.
[Figure 5: Sample segmentation results from the sewer and culvert defect dataset]
For practical illustration, Figure 5 provides a comparison of sample image segmentation results from the authors' sewer and culvert defect datasets. The figure includes the original image, GT Mask and predictions from various models: U-Net, Attention U-Net, CBAM U-Net, ASCU-Net, Depthwise Separable U-Net and the author's own DAU-FI Net. Visually, DAU-FI Net is more accurate than its competitors in both defect identification and alignment with GT Mask.

[Figure 6: Validation IoU, F1-score, and loss curves over training]

Further quantitative evidence of the effectiveness of DAU-FI Net is shown in Figure 6. Here, the authors use the sewer and culvert dataset to track model performance over the training period. Specifically, Figure 6(a) shows the Intersection over Union (IoU) scores, where the authors' DAU-FI Net (blue line) achieved the highest IoU on the validation set. Similarly, Figure 6(b) charts the F1-score trend, again showing the leading performance of DAU-FI Net. Finally, Figure 6(c) shows the validation loss, where the authors' model converges faster and reaches a lower validation loss than its competitors. Together, these plots underscore the enhanced segmentation accuracy of DAU-FI Net.

The authors conducted an ablation study to analyze the impact of their core innovations: the dual attention module and strategic feature injection. Integrating engineered features into the model yields significant IoU improvements on both the sewer and culvert dataset and the cell nuclei dataset, highlighting the effectiveness of the authors' approach. Table 4 summarizes the results.
[Table IV: Ablation study results]
Furthermore, the authors conducted a series of experiments to determine the optimal module placement configuration (P1-P6), as shown in Table VI. The best architecture is P6, which uses dual attention on both the encoder and decoder paths and attention gates on the skip connections. This configuration achieves the highest segmentation accuracy in Table VII.

VII Discussion
This section provides in-depth analysis and contextualization of the results to highlight the excellence and innovative contributions of the proposed DAU-FI Net model.

Comparative benchmark experiments verify the superior performance of DAU-FI Net over complex state-of-the-art architectures like Attention U-Net and CBAM U-Net. As shown in Table 2, the model achieves the highest scores on all key metrics on the challenging sewer and culvert defect dataset.

Crucially, this excellence is achieved with significantly fewer parameters than competitors, as shown in Table 3. DAU-FI Net has only 1.46 million parameters, significantly reducing the amount of calculation while improving accuracy. This optimization enables efficient deployment in real-world applications.

Furthermore, the ablation study performed in Section VI-B confirms the value of the authors' core innovation. Integrating engineered features into the model achieves significant IoU improvements on both sewer and culvert datasets and cell nucleus datasets, as shown in Table 5. This demonstrates the cross-domain versatility of the author's feature enhancement method.

[Table V: Results for module placement configurations]
Furthermore, ablation experiments identified the optimal architectural configuration, P6, which combines dual attention modules along the encoder-decoder path with attention gates on the skip connections. As shown in Table V, this design achieves the best results by enhancing information flow and feature fusion.

[Figure 7: Visual comparison of segmentation outputs]

To provide a deeper perspective, Figure 7 visually compares the segmentation outputs. DAU-FI Net generates precise and accurate masks in challenging situations, outperforming the other methods. The authors' model can even detect defects that are missing from the ground-truth masks, demonstrating its strong learning ability.

Overall, the systematic evaluation of DAU-FI Net confirms its ability to handle complex multi-class segmentation tasks, even with limited data. The framework effectively integrates complementary mechanisms: dual attention for representation enhancement and feature injection for an expanded embedding space. This balanced approach pushes the boundaries and delivers a new state-of-the-art solution.

By pioneering an optimized architecture, strategic feature enhancement, and creating a novel defect dataset, this work lays a solid foundation for advancing semantic segmentation in the real world. The authors' configurable approach provides a blueprint for solving data-scarce segmentation problems across domains.

Summary

This research makes significant progress in multi-class semantic segmentation, especially under the constraints of limited training data. The DAU-FI Net architecture is the cornerstone of this research, innovatively combining multi-scale depthwise separable convolutions with advanced parallel spatial-channel squeezing and excitation (scSE) attention units. This integration is key to achieving fine-grained segmentation by promoting local feature learning and capturing global dependencies.

The authors demonstrate the robustness of DAU-FI Net in rigorous tests including the challenging sewer and culvert defect dataset and a benchmark dataset. Ablation studies further highlight the value of the approach, in particular the attention mechanisms and the strategic, semantic-mask-guided injection of Gabor, Sobel, and Canny filter features. These techniques achieve measurable improvements in Intersection over Union (IoU) scores on both datasets.

The proposed method improves performance without significantly increasing computational overhead, and it shows how domain-specific engineered features can be effectively integrated into deep learning frameworks.

This research not only pushes the boundaries of multi-class semantic segmentation but also highlights the synergistic potential between deep learning and feature engineering, which is especially valuable for data-scarce scenarios. The author's innovative adaptive feature injection and concurrent spatial-channel attention methods provide a new perspective for solving complex segmentation tasks.

The sewer pipeline and culvert defect dataset introduced in this study adds a valuable dimension for future research. Although DAU-FI Net has been tested on only two datasets, it shows promise for a variety of segmentation tasks. This paper lays the foundation for improving multi-class semantic segmentation architectures. The authors' integration of diverse techniques marks a significant advance and opens the door to further research in this area.

Origin blog.csdn.net/weixin_47869094/article/details/135277242