List of CVPR2022 papers (Chinese-English bilingual)

Cascade Transformers for End-to-End Person Search
Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning
Long-Tailed Recognition via Weight Balancing
InfoGCN: Representation Learning for Human Skeleton-based Action Recognition
Interactive Geometry Editing of Neural Radiance Fields
MLSLT: Towards Multilingual Sign Language Translation
360MonoDepth: High-Resolution 360° Monocular Depth Estimation
Generating Diverse and Natural 3D Human Motions from Textual Descriptions
Masked-attention Mask Transformer for Universal Image Segmentation
Pointly-Supervised Instance Segmentation
A Closer Look at Few-shot Image Generation
Learning Local-Global Contextual Adaptation for Multi-Person Pose Estimation
Neural 3D Scene Reconstruction with the Manhattan-world Assumption
Masked Autoencoders Are Scalable Vision Learners
De-rendering 3D Objects in the Wild
Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction
Finding Badly Drawn Bunnies
GradViT: Gradient Inversion of Vision Transformers
On the Importance of Asymmetry for Siamese Representation Learning
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation
Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks
Rethinking Efficient Lane Detection via Curve Modeling
StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
Learning Fair Classifiers with Partially Annotated Group Labels
Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training?
Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis
A ConvNet for the 2020s
Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning
Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast
Connecting the Complementary-view Videos: Joint Camera Identification and Subject Association
Decoupled Knowledge Distillation
Maximum Spatial Perturbation Consistency for Unpaired Image-to-Image Translation
Compound Domain Generalization via Meta-Knowledge Encoding
Bilateral Video Magnification Filter
EDTER: Edge Detection with Transformer
Structure-Aware Motion Transfer with Deformable Anchor Model
Attentive Fine-Grained Structured Sparsity for Image Restoration
Sign Language Video Retrieval with Free-Form Textual Queries
SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems
Discrepancy for Efficient Out-of-Distribution Detection
LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints
Focal and Global Knowledge Distillation for Detectors
Enhancing Adversarial Robustness for Deep Metric Learning
Novel Class Discovery in Semantic Segmentation
IDEA-Net: Dynamic 3D Point Cloud Interpolation via Deep Embedding Alignment
WarpingGAN: Warping Multiple Uniform Priors for Adversarial 3D Point Cloud Generation
Rethinking Reconstruction Autoencoder-Based Out-of-Distribution Detection
HyperDet3D: Learning a Scene-Conditioned 3D Object Detector
Deep Decomposition for Stochastic Normal-Abnormal Transport
Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production
Self-supervised Video Transformers
HLRTF: Hierarchical Low-Rank Tensor Factorization for Inverse Problems in Multi-Dimensional Imaging
φ-SfT: Shape-from-Template with a Physics-based Deformation Model
Boosting View Synthesis with Residual Transfer
DINE: Domain Adaptation from Single and Multiple Black-box Predictors
Occluded Human Mesh Recovery
Understanding Uncertainty Maps in Vision with Statistical Testing
Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets
Learning from Pixel-Level Label Noise: A New Perspective for Light Field Salient Object Detection
Self-Supervised Global-Local Structure Modeling for Point Cloud Domain Adaptation with Reliable Voted Pseudo Labels
Towards An End-to-End Framework for Flow-Guided Video Inpainting
E-CIR: Event-Enhanced Continuous Intensity Recovery
Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization using Satellite Image
Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers
Forward Propagation, Backward Regression and Pose Association for Hand Tracking in the Wild
FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos
Efficient Neural Radiance Fields
Robust Equivariant Imaging: a fully unsupervised framework for learning to image from noisy and partial measurements
HumanNeRF: Efficiently Generated Human Radiance Field from Sparse Inputs
Attributable Visual Similarity Learning
Efficient Multi-view Stereo by Iterative Dynamic Cost Volume
Replacing Labeled Real-image Datasets with Auto-generated Contours
SOMSI: Spherical Novel View Synthesis with Soft Occlusion Multi-Sphere Images
AutoSDF: Shape Priors for 3D Completion, Reconstruction, and Generation
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
Variable Edge Guided Network
DST: Dynamic Substitute Training for Data-free Black-box Attack
HCSC: Hierarchical Contrastive Selective Coding
Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis
Inertia-Guided Flow Completion and Style Fusion for Video Inpainting
PlaneMVS: 3D Plane Reconstruction from Multi-View Stereo
Ref-NeRF: Structured View-Dependent Appearance for Neural Radiance Fields
Interactiveness Field of Human-Object Interactions
Learning Memory-Augmented Unidirectional Metrics for Cross-modality Person Re-identification
Event-based Video Reconstruction via Potential-assisted Spiking Neural Network
SIGMA: Semantic-complete Graph Matching for Domain Adaptive Object Detection
Surface Reconstruction from Point Clouds by Learning Predictive Context Priors
Active Teacher for Semi-Supervised Object Detection
Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning
RCL: Recurrent Continuous Localization for Temporal Action Detection
GroupNet: Multiscale Hypergraph Neural Networks for Trajectory Prediction with Relational Reasoning
SPAMs: Structured Implicit Parametric Models
A Keypoint-based Global Association Network for Lane Detection
Weakly Supervised Semantic Segmentation using Out-of-Distribution Data
BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment
Investigating Tradeoffs in Real-World Video Super-Resolution
OakInk: A Large-scale Knowledge Repository for Understanding Hand-Object Interaction
Bending Graphs: Hierarchical Shape Matching using Gated Optimal Transport
The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization
Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification
Stochastic Trajectory Prediction via Motion Indeterminacy Diffusion
Unbiased Subclass Regularization for Semi-Supervised Semantic Segmentation
Stratified Transformer for 3D Point Cloud Segmentation
Cloning Outfits from Real-World Images to 3D Characters for Generalizable Person Re-Identification
Sparse Instance Activation for Real-time Instance Segmentation
Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer
Unsupervised Image-to-Image Translation with Generative Prior
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation
Versatile Multi-Modal Pre-Training for Human-Centric Perception
Instance-wise Occlusion and Depth Orders in Natural Scenes
Degradation-agnostic Correspondence from Resolution-asymmetric Stereo
No Pain, Big Gain: Classify Dynamic Point Cloud Sequences with Static Models by Fitting Feature-level Space-time Surfaces
Multi-Dimensional with Intensity: A Crowd-sourced Method for Measuring the Perception of Facial Expression
Class-Incremental Learning with Strong Pretrained Models
A Patch-centric Error Analysis of Image Super-Resolution
IFOR: Iterative Flow Minimization for Robotic Object Rearrangement
3D-aware Image Synthesis via Learning Structural and Textural Representations
DeeCap: Dynamic Early Exiting for Efficient Image Captioning
GAN-Supervised Dense Visual Alignment
Multilayer GAN Inversion and Editing
On Aliased Resizing and Surprising Subtleties in GAN Evaluation
Learning Pixel Trajectories with Multiscale Contrastive Random Walks
Comparing Correspondences: Video Prediction with Correspondences-wise Losses
Mix and Localize: Localizing Sound Sources from Mixtures
AziNorm: Exploiting the Radial Symmetry of Point Cloud for Azimuth-Normalized 3D Perception
Fourier PlenOctrees for Dynamic Radiance Field Rendering in Real-time
Point Cloud Pre-training with Natural 3D Structures
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation
Mr.BiQ: Post-Training Non-Uniform Quantization based on Minimizing the Reconstruction Error
Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Reversible Vision Transformers
RigNeRF: Fully Controllable Neural 3D Portraits
Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation
Integrative Few-Shot Learning for Classification and Segmentation
Learning Affordance Grounding from Exocentric Images
Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection
Exploring Geometry Consistency for Monocular 3D Object Detection
Visual Abductive Reasoning
Putting People in their Place: Monocular Regression of 3D People in Depth
Exploiting Explainable Metrics for Augmented SGD
Rethinking Bayesian Deep Learning Methods for Semi-Supervised Volumetric Medical Image Segmentation
A Hybrid Quantum-Classical Algorithm for Robust Fitting
Dataset Distillation by Matching Training Trajectories
DiLiGenT10^2: A Photometric Stereo Benchmark Dataset with Controlled Shape and Material Variation
Scene Representation Transformer
ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
Injecting Visual Concepts into End-to-End Image Captioning
Learning Neural Light Fields with Ray-Space Embedding Networks
What’s in your hands? 3D Reconstruction of Generic Objects in Hands
Virtual Correspondences: Human as a Cue for Extreme-View Geometry
Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition
SketchEdit: Mask-Free Local Image Manipulation with Partial Sketches
GroupViT: Zero-Shot Transfer to Semantic Segmentation with Text Supervision
LSVC: A Learning-based Stereo Video Compression Framework
Learning to Align Sequential Actions in the Wild
Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos
Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction
Simulated Adversarial Testing of Face Recognition Models
GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping
Ensembling Off-the-shelf Models for GAN Training
Global Tracking Transformers
Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline
Joint Global and Local Hierarchical Priors for Learned Image Compression
D-Grasp: Physically Plausible Dynamic Grasp Synthesis for Hand-Object Interactions
Human-Aware Object Placement for Visual Environment Reconstruction
Dual-path Image Inpainting with Auxiliary GAN Inversion
Accurate 3D Body Shape Regression using Metric and Semantic Attributes
BARC: Learning to Regress 3D Dog Shape from Images by Exploiting Breed Information
Capturing and Inferring Dense Full-Body Human-Scene Contact
Not All Labels Are Equal: Rationalizing The Labeling Costs for Training Object Detection
Background Activation Suppression for Weakly Supervised Object Localization
Attribute Group Editing for Reliable Few-shot Image Generation
Negative-aware Attention for Image-Text Matching
Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects
TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions
HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening
gDNA: Towards Generative Detailed Neural Avatars
CaDeX: Learning Canonical Deformation Coordinate Space for Dynamic Surface Representation via Neural Homeomorphism
BACON: Band-limited Coordinate Networks for Multiscale Scene Representation
Revisiting Near/Remote Sensing with Geospatial Attention
Simple multi-dataset detection
Generalizable Cross-modality Medical Image Segmentation via Style Augmentation and Dual Normalization
Point-to-Voxel Knowledge Distillation for LiDAR Semantic Segmentation
Online Convolutional Re-parameterization
Neural Inertial Localization
MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution
Unsupervised Pre-training for Temporal Action Localization Tasks
Augmented Geometric Distillation for Data-Free Incremental Person ReID
HEAT: Holistic Edge Attention Transformer for Structured Reconstruction
NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition
ContrastMask: Contrastive Learning to Segment Every Thing
Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression
CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs
MAT: Mask-Aware Transformer for Large Hole Image Inpainting
A Comprehensive Study of End-to-End Temporal Action Detection
Rethinking Image Cropping: Exploring Diverse Compositions from Global Views
OcclusionFusion: Occlusion-aware Motion Estimation for Real-time Dynamic 3D Reconstruction
MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation
Asynchronous Event-based Graph Neural Networks
RAMA: A Rapid Multicut Algorithm on GPU
EvUnroll: Neuromorphic Events based Rolling Shutter Image Correction
Cycle-Consistent Counterfactuals by Latent Transformations
Understanding 3D Object Articulation in Internet Videos
Synthetic Generation of Face Videos with Plethysmograph Physiology
MonoJSG: Joint Semantic and Geometric Cost Volume for Monocular 3D Object Detection
Neural Architecture Search with Representation Mutual Information
Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning
Blind2Unblind: Self-Supervised Image Denoising with Visible Blind Spots
Semi-supervised object detection based on multi-instance alignment of global class prototypes
Fine-Grained Predicates Learning for Scene Graph Generation
Meta Distribution Alignment for Generalizable Person Re-Identification
Align Representations with Base: A New Approach to Self-Supervised Learning
Style-Based Global Appearance Flow for Virtual Try-On
Learning Semantic Associations For Mirror Detection
Task Decoupled Framework for Reference-based Super-Resolution
Beyond Semantic to Instance Segmentation: Weakly-Supervised Instance Segmentation via Semantic Knowledge Transfer and Self-Refinement
Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction
GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras
Fast and Unsupervised Action Boundary Detection for Action Segmentation
Neural MoCon: Neural Motion Control for Physically Plausible Human Motion Capture
Unified Transformer Tracker for Object Tracking
NeuralHOFusion: Neural Volumetric Rendering under Human-object Interactions
H^2FA R-CNN: Holistic and Hierarchical Feature Alignment for Cross-domain Weakly Supervised Object Detection
ICON: Implicit Clothed humans Obtained from Normals
Semantic-Aware Domain Generalized Segmentation
ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose Estimation
Detecting Deepfakes with Self-Blended Images
Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization
FreeSOLO: Learning to Segment Objects without Annotations
Auditing Privacy Defenses in Federated Learning via Generative Gradient Leakage
Differentially Private Federated Learning with Local Regularization and Sparsification
Modeling 3D Layout For Group Re-Identification
DASO: Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning
Fields for Human Avatar Modeling
Contrastive Regression for Domain Adaptation on Gaze Estimation
Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition
Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification
Tree Energy Loss: Towards Sparsely Annotated Semantic Segmentation
Learning Second Order Local Anomaly for General Face Forgery Detection
LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network
Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective
Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos
Omnivore: A Single Model for Many Visual Modalities
Multi-Frame Self-Supervised Depth with Transformers
Voice-Face Homogeneity Tells Deepfake
Representation Compensation Networks for Continual Semantic Segmentation
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
FLAVA: A Foundational Language And Vision Alignment Model
Vision Prompt Tuning
Vehicle trajectory prediction works, but not everywhere
Camera-Conditioned Stable Feature Generation for Isolated Camera Supervised Person Re-IDentification
ReSTR: Convolution-free Referring Image Segmentation Using Transformers
DATA: Domain-Aware and Task-Aware Self-supervised Learning
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval
Balanced MSE for Imbalanced Visual Regression
The Devil Is in the Details: Window-based Attention for Image Compression
DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding
Video Frame Interpolation Transformer
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling
LASER: LAtent Space Rendering for 2D Visual Localization
LaTr: Layout-Aware Transformer for Scene-Text VQA
Universal Photometric Stereo Network using Global Lighting Contexts
Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training
Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models
Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory
Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis
Template: Topology-aware reconstruction and disentanglement for 3D mesh generation
CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow
Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition
Cross-Modal Transferable Adversarial Attacks from Images to Videos
PTTR: Relational 3D Point Cloud Object Tracking with Transformer
Deformation and Correspondence Aware Unsupervised Synthetic-to-Real Scene Flow Estimation for Point Clouds
Lifelong Unsupervised Domain Adaptive Person Re-Identification
Object Localization under Single Coarse Point Supervision
Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation
TubeDETR: Spatio-Temporal Video Grounding with TransformersTubeDETR: Spatio-Temporal Video Grounding with Transformers
Reinforced Structured State-Evolution for Vision-Language Navigation Enhanced Structured State Evolution for Visual Language Navigation
Learning to Anticipate Future with Dynamic Context Removal Learning
Program Representations for Food Images and Cooking Recipes
Transferability Estimation using Bhattacharyya Class Separatability
LiDAR Snowfall Simulation for Robust 3D Object Detection
Masked Feature Prediction for Vision Self-Supervised Pre-Training
Unbiased Teacher v2: Semi-supervised Object Detection for Anchor-free and Anchor-based Detectors
Shape from Polarization for Complex Scenes in the Wild
PhotoScene: Physically-Based Material and Lighting Transfer for Indoor Scenes
Node Representation Learning in Graph via Node-to-Neighbourhood Mutual Information Maximization
Selective-Supervised Contrastive Learning with Noisy Labels
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
L2G: A Simple Local-to-Global Knowledge Transfer Framework for Weakly Supervised Semantic Segmentation
TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing
Leveraging Self-Supervision for Cross-Domain Crowd Counting
Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency
TimeReplayer: Unlocking the Potential of Event Cameras for Video Interpolation
Self-supervised Image-specific Prototype Exploration for Weakly Supervised Semantic Segmentation
Class-Balanced Pixel-Level Self-Labeling for Domain Adaptive Semantic Segmentation
Probabilistic Warp Consistency for Weakly-Supervised Semantic Correspondences
DIFNet: Boosting Visual Information Flow for Image Captioning
ScaleNet: A Shallow Architecture for Scale Estimation
HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images
Density-preserving Deep Point Cloud Compression
Exploring Dual-task Correlation for Pose Guided Person Image Generation
Exploring Endogenous Shift for Cross-domain Detection: A Large-scale Benchmark and Perturbation Suppression Network
Transferability metrics for selecting source model integration
The Auto Arborist Dataset: A Large-Scale Benchmark for Multimodal Urban Forest Monitoring Under Domain Shift
EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation
Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection
Learning from Temporal Gradient for Semi-supervised Action Recognition
JoinABLe: Learning Bottom-up Assembly of Parametric CAD Joints
DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion
Defensive Patches for Robust Recognition in the Physical World
UniCoRN: A Unified Conditional Image Repainting Network
APES: Articulated Part Extraction from Sprite Sheets
Learning Deep Implicit Functions for 3D Shapes with Dynamic Code Clouds
Neural Rays for Occlusion-aware Image-based Rendering
DisARM: Displacement Aware Relation Module for 3D Detection
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration
Recurrent Implicit Field for Unsupervised Learning of Hierarchical Shape Structures
Weakly Supervised Object Localization as Domain Adaptation
Reflash Dropout in Image Super-Resolution
Semantic Segmentation by Early Region Proxy
EyePAD++: A Distillation-based approach for joint Eye Authentication and Presentation Attack Detection using Periocular Images
Online Learning of Reusable Abstract Models for Object Goal Navigation
Time Microscope: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion
OSOP: A Multi-Stage One Shot Object Pose Estimation Framework
Localization Distillation for Dense Object Detection
RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
Cross-Image Relational Knowledge Distillation for Semantic Segmentation
Trustworthy Long-tailed Classification
Episodic Memory Question Answering
REX: Reasoning-aware and Grounded Explanation
Query and Attention Augmentation for Knowledge-Based Explainable Reasoning
LOLNerf: Learn from One Look
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
CoNeRF: Controllable Neural Radiance Fields
Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space
UnweaveNet: Unweaving Activity Stories
MeMOT: Multi-Object Tracking with Memory
VisualHow: Multimodal Problem Solving
Affine Medical Image Registration with Coarse-to-Fine Vision Transformer
Unpaired Deep Image Deraining Using Dual Contrastive Learning
DiRA: Discriminative, Restorative, and Adversarial Learning for Self-supervised Medical Image Analysis
Mask Transfiner for High-Quality Instance Segmentation
GLASS: Geometric Latent Augmentation for Shape Spaces
Global Convergence of MAML and Theory-Inspired Neural Architecture Search for Few-Shot Learning
Multi-modal Extreme Classification
CodedVTR: Codebook-Based Sparse Voxel Transformer in Geometric Regions
Frequency-driven Imperceptible Adversarial Attack on Semantic Similarity
Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization
Self-augmented Unpaired Image Dehazing via Density and Depth Decomposition
QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection
Cross-modal Representation Learning for Zero-shot Action Recognition
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation
AUV-Net: Learning Aligned UV Maps for Texture Transfer and Synthesis
ObjectFormer for detection and positioning
GraFormer: Graph-oriented Transformer for 3D Pose Estimation
Multi-Granularity Alignment Domain Adaptation for Object Detection
Adaptive Hierarchical Representation Learning for Long-Tailed Object Detection
Physical Inertial Poser (PIP): Physics-aware Real-time Human Motion Tracking from Sparse Inertial Sensors
3D Scene Painting via Semantic Image Synthesis
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
One-bit Active Query with Contrastive Pairs
HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
JIFF: Jointly-aligned Implicit Face Function for High Fidelity Single View Clothed Human Reconstruction
Prompt Distribution Learning
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning
Beyond 3D Siamese Tracking: A Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds
Noisy Boundaries: Lemon or Lemonade for Semi-supervised Instance Segmentation?
Interactive Image Synthesis with Panoptic Layout Generation
Learning to Find Good Models in RANSAC
Meta-attention for ViT-backed Continual Learning
Deep Anomaly Discovery from Unlabeled Videos via Normality Advantage and Self-Paced Refinement
Improving neural implicit surfaces geometry with patch warping
Rope3D: Take A New Look from the 3D Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task
AME: Attention and Memory Enhancement in Hyper-Parameter Optimization
TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
Automated Progressive Learning for Efficient Training of Vision Transformers
Templates for 3D Object Pose Estimation Revisited: Generalization to New Objects and Robustness to Occlusions
Towards Implicit Text-Guided 3D Shape Generation
Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation
Revisiting Skeleton-based Action Recognition
Mutual Quantization for Cross-Modal Search with Noisy Labels
Revisiting Temporal Alignment for Video Restoration
Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation
Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities
Video Frame Interpolation with Transformer
Autofocus for Event Cameras
Event-based Direct Sparse Odometry
OpenTAL: Towards Open Set Temporal Action Localization
Programmatic Concept Learning for Human Motion Description and Synthesis
MAXIM: Multi-Axis MLP for Image Processing
Temporal Alignment Networks for Long-term Video
Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches
Registering Explicit to Implicit: Towards High-Fidelity Garment Mesh Reconstruction from Single Images
Progressive End-to-End Object Detection in Crowded Scenes
Object-aware Video-language Pre-training for Retrieval
Multi-Source Uncertainty Mining for Deep Unsupervised Saliency Detection
Context-Aware Video Reconstruction for Rolling Shutter Cameras
MonoScene: Monocular 3D Semantic Scene Completion
Weakly But Deeply Supervised Occlusion-Reasoned Parametric Road Layouts
Point Cloud Color Constancy
HDNet: High-resolution Dual-domain Learning for Spectral Compressive Imaging
iPLAN: Interactive and Procedural Layout Planning
End-to-End Multi-Person Pose Estimation with Transformers
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Adversarial Feature Attack
Domain-Aware Representation Learning for Unsupervised Domain Generalization
Sub-word Level Lip Reading With Visual Attention
Efficient Video Instance Segmentation via Tracklet Query and Proposal
Towards cross-modal pose localization from text-based position descriptions
Opening up Open World Tracking
Dynamic Clustering Mask Transformers for Panoptic Segmentation
Compressive Single-Photon 3D Cameras
Style-ERD: Responsive and Coherent Online Motion Style Transfer
MixFormer: Mixing Features across Windows and Dimensions
Robust Image Forgery Detection over Online Social Network Shared Images
Semantic-aligned Fusion Transformer for One-shot Object Detection
Long-term Video Frame Interpolation Via Feature Propagation
Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation
GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection
ETHSeg: An Amodel Instance Segmentation Network and a Real-world Dataset for X-Ray Waste Inspection
SEEG: Semantic Energized Co-speech Gesture Generation
Instance-Dependent Label-Noise Learning With Manifold-Regularized Transition Matrix Estimation
Acquiring a Dynamic Light Field through a Single-Shot Coded Image
How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting
FaceVerse: a Fine-grained and Detail-changeable 3D Neural Face Model from a Hybrid Dataset
Learning Where to Learn in Cross-View Self-Supervised Learning
Automatic Relation-aware Graph Network Proliferation
CoSSL: Co-Learning of Representation and Classifier for Imbalanced Semi-Supervised Learning
P3Depth: Monocular Depth Estimation with a Piecewise Planarity Prior
Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability
En-Compactness: Self-Distillation Embedding & Contrastive Generation for Generalized Zero-Shot Learning
Unsupervised Learning of Accurate Siamese Tracking
Accelerating DETR Convergence via Semantic-Aligned Matching
Co-advise: Cross Inductive Bias Distillation
Coupled Iterative Refinement for 6D Multi-Object Pose Estimation
DeepCurrents: Learning Implicit Representations of Shapes with Boundaries
Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image
Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation
Day-to-Night Image Synthesis for Training Nighttime Neural ISPs
Playable Environments: Video Manipulation in Space and Time
Unified Contrastive Learning in Image-Text-Label Space
Many-to-many Splatting for Efficient Video Frame Interpolation
Uncertainty-Aware Deep Multi-View Photometric Stereo
Location-free Human Pose Estimation
Multiview Transformers for Video Recognition
RIO: Rotation-equivariance supervised learning of robust inertial odometry
Few Shot Generative Model Adaption via Relaxed Spatial Structural Alignment
MiniViT: Compressing Vision Transformers with Weight Multiplexing
Pop-Out Motion: 3D-Aware Image Deformation via Learning Shape Laplacian
On the Road to Online Adaptation for Semantic Image Segmentation
Generalized Binary Search Network for Highly-Efficient Multi-View Stereo
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens
Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning
Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation
DLFormer: Discrete Latent Transformer for Video Inpainting
Continuous Scene Representations for Embodied AI
vCLIMB: A Novel Video Class Incremental Learning Benchmark
NODEO: A Neural Ordinary Differential Equation Based Optimization Framework for Deformable Image Registration
ONCE-3DLanes: Building Monocular 3D Lane Detection
ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer
HairMapper: Removing Hair from Portraits Using GANs
Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective
Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection
Interactive Multi-Class Tiny-Object Detection
Generalizable Human Pose Triangulation
Towards Discriminative Representation: Multi-view Trajectory Contrastive Learning for Online Multi-object Tracking
A Simple Episodic Linear Probe Improves Visual Recognition in the Wild
Learning to Learn by Jointly Optimizing Neural Architecture and Weights
Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning
Soft estimator for key point scale and orientation
Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence Margin
Cross Domain Object Detection by Target-Perceived Dual Branch Distillation
Depth-Aware Generative Adversarial Network for Talking Head Video Generation
OccAM's Laser: Occlusion-based Attribution Maps for 3D Object Detectors on LiDAR Data
Improving Adversarially Robust Few-shot Image Classification with Generalizable Representations
DyTox: Transformers for Continuous Learning with DYnamic TOken eXpansion
Stable Long-Term Recurrent Video Super-Resolution
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization
SelfD: Self-Learning Large-Scale Driving Policies From the Web
InstaFormer: Instance-Aware Image-to-Image Translation with Transformer
AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation

GASP, a generalized framework for agglomerative clustering of signed graphs and its application to Instance Segmentation
Image Restoration Potential in Dynamic Scenes
Multi-level Feature Learning for Contrastive Multi-view Clustering
Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data
StyleSwin: Transformer-based GAN for High-resolution Image Generation
Semi-Supervised Learning of Semantic Correspondence with Pseudo-Labels
Divide and Conquer: Compositional Experts for Generalized Novel Class Discovery
Splicing ViT Features for Semantic Appearance Transfer
Optimizing Video Prediction via Video Frame Interpolation
Iterative Corresponding Geometry: Fusing Region and Depth for Highly Efficient 3D Tracking of Textureless Objects
HARA: A Hierarchical Approach for Robust Rotation Averaging
Revisiting Weakly Supervised Pre-Training of Visual Perception Models
Safe-Student for Safe Deep Semi-Supervised Learning with Unseen-Class Unlabeled Data
PatchFormer: An Efficient Point Transformer with Patch Attention
Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning
Neural Global Shutter: Learn to Restore Video from a Rolling Shutter Camera with Global Reset Feature
Conditional Prompt Learning for Vision-Language Models
Stability-driven Contact Reconstruction From Monocular Color Images
SharpContour: A Contour-based Boundary Refinement Approach for Efficient and Accurate Instance Segmentation
MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning
GeneralDepth: Unsupervised Learning of Single-Image Depth Estimation in General Scenes
Revisiting AP Loss for Dense Object Detection: Adaptive Ranking Pair Selection
No-Reference Point Cloud Quality Assessment via Domain Adaptation
DArch: Dental Arch Prior-assisted 3D Tooth Instance Segmentation with Weak Annotations
Self-Supervised Keypoint Discovery in Behavioral Videos
Toward Practical Self-Supervised Monocular Indoor Depth Estimation
Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
DPGEN: Differentially Private Generative Energy-Guided Network for Natural Image Synthesis
Learning the Degradation Distribution for Blind Image Super-Resolution
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization
Exploiting Rigidity Constraints for LiDAR Scene Flow Estimation
Democracy Does Matter: Comprehensive Feature Mining for Co-Salient Object Detection
Unsupervised Domain Adaptation for Nighttime Aerial Tracking
UDA-COPE: Unsupervised Domain Adaptation for Category-level Object Pose Estimation
3D Shape Reconstruction from 2D Images with Disentangled Attribute Flow
Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification
Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer
StyTr2: Image Style Transfer with Transformers
BokehMe: When Neural Rendering Meets Classical Rendering
Memory-augmented Deep Conditional Unfolding Network for Pan-sharpening
Learning Object Context for Novel-view Scene Layout Generation
FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment
TCTrack: Temporal Contexts for Aerial Tracking
RBGNet: Ray-based Grouping for 3D Object Detection
3PSDF: Three-Pole Signed Distance Function for Learning Surfaces with Arbitrary Topologies
PanopticNeRF: A Semantic Object-Aware Neural Scene Representation
Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation
Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer
Reconstructing Surfaces for Sparse Point Clouds with On-Surface Priors

Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
Details or artifacts: a local discriminative learning method for real image super-resolution
Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera
A Voxel Graph CNN for Object Classification with Event Cameras
How Good Is Aesthetic Ability of a Fashion Model?
Recurrent Dynamic Embedding for Video Object Segmentation
Self-Distillation from the Last Mini-Batch for Consistency Regularization
Group Contextualization for Video Recognition
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos
Dual Adversarial Adaptation for Cross-Device Real-World Image Super-Resolution
Urban Radiance Fields
Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack
PINA: Learning a Personalized Implicit Neural Avatar from a Single RGB-D Video Sequence
Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images
Global Sensing and Measurements Reuse for Image Compressed Sensing
AKB-48: A Real-World Articulated Object Knowledge Base
Structured Sparse R-CNN for Direct Scene Graph Generation
Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing
Spectral Unsupervised Domain Adaptation for Visual Recognition
SimMatch: Semi-supervised Learning with Similarity Matching
Multi-grained Spatio-Temporal Features Perceived Network for Event-based Lip-Reading
POCO: Point Convolution for Surface Reconstruction
HerosNet: Hyperspectral Explicable Reconstruction and Optimal Sampling Deep Network for Snapshot Compressive Imaging
Towards Robust Rain Removal Against Adversarial Attacks: A Comprehensive Benchmark Analysis and Beyond
Coupling and Correction Using non-IID data for federated learning
Open-set Text Recognition via Character-Context Decoupling
Generalized Few-shot Semantic Segmentation
Causal Transportability for Neural Representations
Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition
Matching Feature Sets for Few-Shot Image Classification
Interactron: Embodied Adaptive Object Detection
It's About Time: Analog Clock Reading in the Wild
A Graph Matching Perspective with Transformers on Video Instance Segmentation
GIF: Neural Implicit Function for General Shape Representation
Adaptive Vision Transformer for Efficient Image Recognition
Language as Queries for Referring Video Object Segmentation
Federated Class-Incremental Learning
Human Hands as Probes for Interactive Object Understanding
STIF: Learning Continuous Video Representation for Space-Time Super-Resolution
Bridging Video-text Retrieval with Multiple Choice Questions
FoggyStereo: Stereo Matching with Fog Volume Representation
MonoGround: Detecting Monocular 3D Objects from the Ground
CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation
ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding
Local Texture Estimator for Implicit Representation Function
Neural Recognition of Dashed Curves with Gestalt Law of Continuity
Voxel Field Fusion for 3D Object Detection
Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers

Both Style and Fog Matter: Cumulative Domain Adaptation for Semantic Foggy Scene Understanding Contrastive Learning
H4D: Human 4D Modeling by Learning Neural Compositional Representation
PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer
A Unified Query-based Paradigm for Point Cloud Understanding
AdaInt: Learning Adaptive Intervals for 3D Lookup Tables on Real-time Image Enhancement
FS6D: Few-Shot 6D Pose Estimation of Novel Objects
CLIP-Event: Connecting Text and Images with Event Structures
Category Contrast for Unsupervised Domain Adaptation in Visual Tasks
GateHUB: Gated History Unit with Background Suppression for Online Action Detection
MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video
Learning 3D Object Shape and Layout without 3D Supervision
Discrete Cosine Transform Network for Guided Depth Super-Resolution
DTFD-MIL: Double-Tier Feature Distillation Multiple Instance Learning for Histopathology Whole Slide Image Classification
Recurrent Glimpse-based Decoder for Detection with Transformer
HSC4D: Human-centered 4D Scene Capture in Large-scale Indoor-outdoor Space Using Wearable IMUs and LiDAR
ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues
Task-specific Inconsistency Alignment for Domain Adaptive Object Detection
Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization
Global-Aware Registration of Less-Overlap RGB-D Scans
XMP-Font: Self-Supervised Cross-Modality Pre-training for Few-Shot Font Generation
A Simple Data Mixing Prior for Improving Self-Supervised Vision Transformer
Dense Learning based Semi-Supervised Object Detection
RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization
Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation
Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution
End-to-end Generative Pretraining for Multimodal Video Captioning
Exposure Normalization and Compensation for Multiple Exposure Correction
Interpretable part-whole hierarchies and conceptual-semantic relationships in neural networks
Multi-label Classification with Partial Annotations using Class-aware Selective Loss
Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction
IterMVS: Iterative Probability Estimation for Efficient Multi-View Stereo
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction
Decoupling Makes Weakly Supervised Local Feature Better
Not All Points Are Equal: Learning Highly Efficient Point-based Detectors for 3D LiDAR Point Clouds
Expanding Large Pre-trained Unimodal Models with Multimodal Information Injection for Image-Text Multimodal Classification
Semi-Weakly-Supervised Learning of Complex Actions from Instructional Videos
Set-Supervised Action Learning in Procedural Videos via Pairwise Order Consistency
SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation
BANMo: Building Animatable 3D Neural Models from Many Casual Videos
HD-CSE: Learning Dense Correspondence of Clothed Humans with Vision Transformers
Efficient Geometry-aware 3D Generative Adversarial Networks
CAPRI-Net: Learning Compact CAD Shapes with Adaptive Primitive Assembly
HL-Net: Heterophily Learning Network for Scene Graph Generation
Towards Efficient Data Free Black-box Adversarial Attack
Neural Collaborative Graph Machines for Table Structure Recognition
Dimension Embeddings for Monocular 3D Object Detection
Nested Collaborative Learning for Long-Tailed Visual Recognition
Scalable Penalized Regression for Noise Detection in Learning with Noisy Labels
Calibrating Deep Neural Networks by Pairwise Constraints
HybridCR: Weakly-Supervised 3D Point Cloud Semantic Segmentation via Hybrid Contrastive Regularization
Few-Shot Font Generation by Learning Fine-Grained Local Styles
Point-NeRF: Point-based Neural Radiance Fields
Spatial-Temporal Space Hand-in-Hand: Spatial-Temporal Video Super-Resolution via Cycle-Projected Mutual Learning
Learning from All Vehicles
Gait Recognition in the Wild with Dense 3D Representations and A Benchmark
DETReg: Unsupervised Pretraining with Region Priors for Object Detection
Rethinking Semantic Segmentation: A Prototype View
Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection
MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image
Spatio-temporal Relation Modeling for Few-shot Action Recognition
RestoreFormer: High-Quality Blind Face Restoration from Undegraded Key-Value Pairs
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis
Unimodal-Concentrated Loss: Fully Adaptive Label Distribution Learning for Ordinal Regression
Pyramid Grafting Network for One-Stage High Resolution Saliency Detection
Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation
Towards Discovering the Effectiveness of Moderately Confident Samples for Semi-Supervised Learning
Semi-Supervised Video Semantic Segmentation with Inter-Frame Feature Reconstruction
Revisiting the “Video” in Video-Language Understanding
SNUG: Self-Supervised Neural Dynamic Garments
FocalClick: Towards Practical Interactive Image Segmentation
DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
GRAM: Generative Radiance Manifolds for 3D-Aware Image Generation
Temporally Efficient Vision Transformer for Video Instance Segmentation
C-CAM: Causal CAM for Weakly Supervised Semantic Segmentation on Medical Image

Adversarial Texture for Fooling Person Detectors in the Physical World
TemporalUV: Capturing Loose Clothing with Temporally Coherent UV Coordinates
Kernelized Few-shot Object Detection by Integral Aggregation
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data
Amodal Segmentation through Out-of-Task and Out-of-Distribution Generalization with a Bayesian Model
FocusCut: Diving into a Focus View in Interactive Segmentation
Mutual Information-driven Pan-sharpening
Gradient-SDF: A Semi-Implicit Surface Representation for 3D Reconstruction
Neural Head Avatars from Monocular RGB Videos
Point-Level Region Contrast for Object Detection Pre-Training
HODEC: Towards Efficient High-Order DEcomposed Convolutional Neural Networks
Bridging Global Context Interactions for High-Fidelity Image Completion
CDGNet: Class Distribution Guided Network for Human Parsing
Primitive3D: Learning from 3D Objects Assembled with Random Primitives
HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video
TransMix: Attend to Mix for Vision Transformers
JRDB-Act: A Large-scale Dataset for Spatio-temporal Action, Social Group and Activity Detection
Few-shot Head Swapping in the Wild
Neural Texture Extraction and Distribution for Controllable Person Image Synthesis
Embracing Single Stride 3D Object Detector with Sparse Transformer
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning
Portrait Eyeglasses and Shadow Removal by Leveraging 3D Synthetic Data
Expanding Low-Density Latent Regions for Open-Set Object Detection
GMFlow: Learning Optical Flow via Global Matching
Source-Free Domain Adaptation via Distribution Estimation
Aesthetic Text Logo Synthesis via Content-aware Layout Inferring
An Image Patch is a Wave: Phase-Aware Vision MLP
FisherMatch: Semi-Supervised Rotation Regression via Entropy-based Filtering
BE-STI: Spatial-Temporal Integrated Network for Class-agnostic Motion Prediction with Bidirectional Enhancement
DC-SSL: Addressing Mismatched Class Distribution in Semi-supervised Learning
Deterministic Point Cloud Registration via Novel Transformation Decomposition
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
Deep Visual Geo-localization Benchmark
LC-FDNet: Learned Lossless Image Compression with Frequency Decomposition Network
Towards Robust Vision Transformer
Volumetric Bundle Adjustment for Photorealistic Real-time Reconstruction
Continual Test-Time Domain Adaptation
Scribble-Supervised LiDAR Semantic Segmentation
TableFormer: Table Structure Understanding with Transformers
CLRNet: Cross Layer Refinement Network for Lane Detection
Transformer Based Line Segment Classifier with Image Context for Real-Time Vanishing Point Detection in Manhattan World
NeRFReN: Neural Radiance Fields with Reflections
HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing
Ditto: Building Digital Twins of Articulated Objects from Interaction
CroMo: Cross-Modal Learning for Monocular Depth Estimation
Mobile-Former: Bridging MobileNet and Transformer
MetaFormer is Actually What You Need for Vision
RU-Net: Regularized Unrolling Network for Scene Graph Generation
Dreaming to Prune Image Deraining Networks
Salvage of Supervision in Weakly Supervised Object Detection
Lagrange Motion Analysis and View Embeddings for Improved Gait Recognition
Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
FMCNet: Feature-Level Modality Compensation for Visible-Infrared Person Re-Identification
Generalizing Gaze Estimation with Rotation Consistency

SIOD: Single Instance Annotated Per Category Per Image for Object Detection
Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification
A Differentiable Two-stage Alignment Scheme for Burst Image Reconstruction with Large Shift
Manifold Learning Benefits GANs
Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing
OW-DETR: Open-world Detection Transformer
Learning Optimal K-space Acquisition and Reconstruction using Physics-Informed Neural Networks
Global Tracking via Ensemble of Local Trackers
Robust Region Feature Synthesizer for Zero-Shot Object Detection
PartGlot: Learning Shape Part Segmentation from Language Reference Games
Self-Taught Metric Learning without Labels
GPV-Pose: Category-level Object Pose Estimation via Geometry-guided Point-wise Voting
OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion
DIveR: Real-time and Accurate Neural Radiance Fields with Deterministic Integration for Volume Rendering
Boosting Robustness of Image Matting with Context Assembling and Strong Data Augmentation
Cross-modal Clinical Graph Transformer For Ophthalmic Report Generation
Correlation-Aware Deep Tracking
Learning to Imagine: Diversify Memory for Incremental Learning using Unlabeled Data
Block-NeRF: Scalable Large Scene Neural View Synthesis
Vector Quantized Diffusion Model for Text-to-Image Synthesis
Boosting Crowd Counting via Multifaceted Attention
Physically-guided Disentangled Implicit Rendering for 3D Face Modeling
IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation
TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers
Back to Reality: Weakly-supervised 3D Detection with Shape-guided Label Enhancement
Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding
Blind Image Super-resolution with Elaborate Degradation Modeling on Noise and Kernel
Reduce Information Loss in Transformers for Pluralistic Image Inpainting
OCSampler: Compressing Videos to One Clip with Single-step Sampling
Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network
SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation
High-resolution Face Swapping via Latent Semantics Disentanglement
Deep Rectangling for Image Stitching: A Learning Baseline
Detector-Free Weakly Supervised Group Activity Recognition
Unsupervised Domain Generalization by learning a Bridge Across Domains
RSCFed: Random Sampling Consensus Federated Semi-supervised Learning
IntraQ: Learning Synthetic Images with Intra-Class Heterogeneity for Zero-Shot Network Quantization
A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution
Learned Queries for Efficient Local Attention
Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling
HVH: Learning a Hybrid Neural Volumetric Representation for Dynamic Hair Performance Capture
Robust Contrastive Learning against Noisy Views
Discovering Objects that Can Move
TubeFormer-DeepLab: Video Mask Transformer
Sparse and Complete Latent Organization for Geospatial Semantic Segmentation
ITSA: An Information Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks
Few-shot Backdoor Defense Using Shapley Estimation
Exploring Domain-Invariant Parameters for Source Free Domain Adaptation
Ev-TTA: Test-Time Adaptation for Event-Based Object Recognition
Likert Scoring with Grade Decoupling for Long-term Action Assessment
Unpaired Cartoon Image Synthesis via Gated Cycle Mapping
Contextual Instance Decoupling for Robust Multi-Person Pose Estimation
Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes
Modulated Contrast for Versatile Image Translation
Oriented RepPoints for Aerial Object Detection
INS-Conv: Incremental Sparse Convolution for Online 3D Segmentation
PanopticDepth: Instance-Decoupled Depth Estimation for Unified Depth-Aware Panoptic Segmentation
Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling
Implicit Sample Extension for Unsupervised Person Re-Identification
Incorporating Semi-Supervised and Positive-Unlabeled learning for Boosting Full Reference Image Quality Assessment
HairCLIP: Design Your Hair by Text and Reference Image
C2AM Loss: Chasing a Better Decision Boundary for Long-Tail Object Detection
MogFace: Towards a Deeper Appreciation on Face Detection
RegionCLIP: Region-based Language-Image Pretraining
HP-Capsule: Unsupervised Face Part Discovery by Hierarchical Parsing Capsule Network
Structure-Aware Flow Generation for Human Body Reshaping
Revisiting Document Image Dewarping by Grid Regularization
GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Bridging the Gap between Classification and Localization for Weakly Supervised Object Localization
Shunted Self-Attention via Multi-Scale Token Aggregation
VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention
MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer
YouMVOS: An Actor-centric Multi-shot Video Object Segmentation Dataset
Single-Stage is Enough: Multi-Person Absolute 3D Pose Estimation
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
DiSparse: Disentangled Sparsification for Multitask Model Compression
Coarse-to-fine Deep Video Coding with Hyperprior-guided Mode Prediction
Weakly Supervised High-Fidelity Clothing Model Generation
Deep Generalized Unfolding Networks for Image Restoration
Panoptic-PHNet: Towards Real-Time and High-Precision LiDAR Panoptic Segmentation via Clustering Pseudo Heatmap
ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework
Iterative Deep Homography Estimation
Homography Loss for Monocular 3D Object Detection
Infrared Invisible Clothing: Hiding from Infrared Detectors at Multiple Angles in Real World
Deep Stereo Image Compression via Bi-directional Coding
Degree-of-linear-polarization-based Color Constancy
Unleashing Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification
Aladdin: Joint Atlas Building and Diffeomorphic Registration Learning with Pairwise Alignment
Learning Transferable Human-Object Interaction Detector with Natural Language Supervision
PNP: Robust Learning from Noisy Labels by Probabilistic Noise Prediction
RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo
Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search
Few-shot Keypoint Detection with Uncertainty Learning for Unseen Species
Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation
"The Pedestrian next to the Lamppost" Adaptive Object Graphs for Better Instantaneous Mapping
Point2Seq: Detecting 3D Objects as Sequences
Towards Noiseless Object Contours for Weakly Supervised Semantic Segmentation
Syntax-Aware Network for Handwritten Mathematical Expression Recognition
RAGO: Recurrent Graph Optimizer For Multiple Rotation Averaging
A Brand New Dance Partner: Music-Conditioned Pluralistic Dancing Controlled by Multiple Dance Genres
BNV-Fusion: Dense 3D Reconstruction using Bi-level Neural Volume Fusion
AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks
Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework
Cross-domain Few-shot Learning with Task-specific Adapters
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Geometric and Textural Augmentation for Domain Gap Reduction
Geometric Transformer for Fast and Robust Point Cloud Registration
Group R-CNN for Point-based Weakly Semi-supervised Object Detection
Wnet: Audio-Guided Video Semantic Segmentation via Wavelet-Based Cross-Modal Denoising Networks
3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds
ELSR: Efficient Line Segment Reconstruction with Planes and Points Guidance
A Proposal-based Paradigm for Self-supervised Sound Source Localization in Videos
Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer
End-to-End Referring Video Object Segmentation with Multimodal Transformers
Neural Fields as Learnable Kernels for 3D Reconstruction
IDR: Self-Supervised Image Denoising via Iterative Data Refinement
TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers
SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization
On Adversarial Robustness of Trajectory Prediction for Autonomous Vehicles
Learning Multiple Dense Prediction Tasks from Partially Annotated Data
Quarantine: Sparsity Can Uncover the Trojan Attack Trigger for Free
Video Demoireing with Relation-based Temporal Consistency
Generate flow-based 3D avatar
Learning an Optimal Linear Program for Multi-Target Tracking
IRON: Inverse Rendering by Optimizing Neural SDFs and Materials from Photometric Images
Stereoscopic Universal Perturbations across Different Architectures and Datasets
The Flag Median and FlagIRLS
NeRF in the Dark: High Dynamic Range View Synthesis from Noisy Raw Images
BoxeR: Box-Attention for 2D and 3D Transformers
DynamicEarthNet: Daily Multi-Spectral Satellite Dataset for Semantic Change Segmentation
UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection
Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection
CADTransformer: Panoptic Symbol Spotting Transformer for CAD Drawings
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Learning To Recognize Procedural Activities with Distant Supervision
Audio-driven Neural Gesture Reenactment with Video Motion Graphs
Towards Bidirectional Arbitrary Image Rescaling: Joint Optimization and Cycle Idempotence
Hire-MLP: Vision MLP via Hierarchical Rearrangement
Escaping Data Scarcity for High-Resolution Heterogeneous Face Hallucination
DeepDPM: Deep Clustering With an Unknown Number of Clusters
ZeroWaste Dataset: Towards Deformable Object Segmentation in Cluttered Scenes
Context-Aware Sequence Alignment using 4D Skeletal Augmentation
COAP: Compositional Articulated Occupancy of People
Sound and Visual Representation Learning with Multiple Pretraining Tasks
The Wanderings of Odysseus in 3D Scenes
Deblurring via Stochastic Refinement
SMPL-A: Modeling Person-Specific Deformable Anatomy
Neural Point Light Fields
FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning
ADeLA: Automatic Dense Labeling with Attention for Viewpoint Shift in Semantic Segmentation
Adversarial Parametric Pose Prior
Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior
Pre-Training meets Self-Training for Supersizing 3D Reconstruction
Safe Self-Refinement for Transformer-based Domain Adaptation
ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses
Towards Multimodal Depth Estimation from Light Fields
Deformable Sprites for Unsupervised Video Decomposition
Can You Spot the Chameleon? Adversarially Camouflaging Images from Co-Salient Object Detection
MISF: Multi-level Interactive Siamese Filtering for High-Fidelity Image Inpainting
Aug-NeRF: Training Stronger Neural Radiance Fields with Triple-Level Physically-Grounded Augmentations
Semi-supervised Semantic Segmentation with Error Localization Network
Quantization-aware Deep Optics for Snapshot Hyperspectral Imaging
Gravitationally Lensed Black Hole Emission Tomography
Improving Video Model Transfer with Dynamic Representation Learning
FWD: Real-time Novel View Synthesis with Forward Warping and Depth
Enhancing Adversarial Training with Second-Order Statistics of Weights
Patch Slimming for Efficient Vision Transformers
3DAC: Learning Attribute Compression for Point Clouds
SNR-Aware Low-light Image Enhancement
Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation
Motion-modulated Temporal Fragment Alignment Network For Few-Shot Action Recognition
Self-Supervised Bulk Motion Artifact Removal in Optical Coherence Tomography Angiography
Salient-to-Broad Transition for Video Person Re-identification
Which images to label for few-shot medical landmark detection?
Hybrid Relation Guided Set Matching for Few-shot Action Recognition
Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction
Bringing Old Films Back to Life
Face Relighting with Geometrically Consistent Shadows
Learning Cloth-Irrelevant Features for Cloth-Changing Person Re-identification
DPICT: Deep Progressive Image Compression Using Trit-Planes
From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
Simple but Effective: CLIP Embeddings for Embodied AI
Scene Consistency Representation Learning for Video Scene Segmentation
Neural Data-Dependent Transform for Learned Image Compression
CamLiFlow: Bidirectional Camera-LiDAR Fusion for Joint Optical Flow and Scene Flow Estimation
Global Matching with Overlapping Attention for Optical Flow Estimation
Meta Agent Teaming Active Learning for Pose Estimation
Robust Combination of Distributed Gradients Under Adversarial Perturbations
Toward Fast, Flexible, and Robust Low-Light Image Enhancement
Motion-aware Contrastive Video Representation Learning via Foreground-background Merging
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
L-Verse: Bidirectional Generation Between Image and Text
GANORCON: Are Generative Models Useful for Few-shot Segmentation?
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Towards Robust Adaptive Object Detection under Noisy Annotations
Point2Cyl: Reverse Engineering 3D Objects – from Point Clouds to Extrusion Cylinders
MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation
Subspace Adversarial Training
Structural and Statistical Texture Knowledge Distillation for Semantic Segmentation
UniVIP: A Unified Framework for Self-Supervised Visual Pre-training
MUM: Mix Image Tiles and UnMix Feature Tiles for Semi-Supervised Object Detection
SS3D: Sparsely-Supervised 3D Object Detection from Point Cloud
On the Integration of Self-Attention and Convolution
Single-Domain Generalized Object Detection in Urban Scene via Cyclic-Disentangled Self-Distillation
Human Instance Matting via Mutual Guidance and Multi-Instance Refinement
Delving Deep into the Generalization of Vision Transformers under Distribution Shifts
Causality Inspired Representation Learning for Domain Generalization
Learning Local Displacements for Point Cloud Completion
Remember Intentions: Retrospective-Memory-based Trajectory Prediction
Contextual Similarity Distillation for Asymmetric Image Retrieval
Self-Supervised Models are Continual Learners
High-Fidelity Human Avatars from a Single RGB Camera
Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
TWIST: Two-Way Inter-label Self-Training for Semi-supervised 3D Instance Segmentation
Focal length and object pose estimation via render and compare
Kubric: A scalable dataset generator
VRDFormer: End-to-End Video Visual Relation Detection with Transformers
A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection
Brain-inspired Multilayer Perceptron with Spiking Neurons
Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection
High Quality Segmentation for Ultra High-resolution Images
Physically Disentangled Intra- and Inter-domain Adaptation for Varicolored Haze Removal
HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network
Future Transformer for Long-term Action Anticipation
Decoupling Zero-Shot Semantic Segmentation
Long-tail Recognition via Compositional Knowledge Transfer
Open Challenges in Deep Stereo: the Booster Dataset
BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations
Recall@k Surrogate Loss with Large Batches and Similarity Mixup
PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision
Output Diffusion Model
End-to-End Human-Gaze-Target Detection with Transformers
EMOCA: Emotion Driven Monocular Face Capture and Animation
R(Det)^2: Randomized Decision Routing for Object Detection
a Meaningful and Decodable Representation
A Simple Face Anti-Spoofing Framework for Fine-grained Patch Recognition
NeurMiPs: Neural Mixture of Planar Experts for View Synthesis
Learning to generate line drawings that convey geometry and semantics
AlignQ: Alignment Quantization with ADMM-based Correlation Preservation
Learning Embodied Object-Search Strategies from 50k Human Demonstrations
Longitudinal Self-Supervision for Learning 2D Amodal Representation
Controllable Dynamic Multi-Task Architectures
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
Depth-supervised NeRF: Fewer Views and Faster Training for Free
Learning to Detect Mobile Objects from LiDAR Scans Without Labels
Revisiting Random Channel Pruning for Neural Network Compression
ActiveZero: Mixed Domain Learning for Active Stereovision with Zero Annotation
Learning sRGB-to-Raw De-rendering with Content-Aware Metadata
SimVQA: Exploring Simulated Environments for Visual Question Answering
Cross-Domain Adaptive Teacher for Object Detection
Modality-Agnostic Learning for Radar-Lidar Fusion in Vehicle Detection
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering
Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture
Holocurtains: Programming Light Curtains via Binary Holography
Leverage Your Local and Global Representations: A New Self-Supervised Learning Strategy
3D human tongue reconstruction from single “in-the-wild” images
Pushing the Performance Limit of Scene Text Recognizer without Human Annotation
SAR-Net: Shape Alignment and Recovery Network for Category-level 6D Object Pose and Size Estimation
Improving Subgraph Recognition with Variational Graph Information Bottleneck
Towards Multi-domain Single Image Dehazing via Test-time Training
EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
CHEX: CHannel EXploration for CNN Model Compression
ImFace: A Nonlinear 3D Morphable Face Model with Implicit Neural Representations
Deblur-NeRF: Neural Radiance Fields from Blurry Images
An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation
Distribution Consistent Neural Architecture Search
Training Object Detectors from Scratch: An Empirical Study in the Era of Vision Transformer
Glass Segmentation using Intensity and Spectral Polarization Cues
GAT-CADNet: Graph Attention Network for Panoptic Symbol Spotting in CAD Drawings
Unsupervised Deraining: Where Contrastive Learning Meets Self-similarity
Delving into the Estimation Shift of Batch Normalization in a Network
Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light
Full-Range Virtual Try-On with Recurrent Tri-Level Transformation
Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation
Generalizing Interactive Backpropagating Refinement for Dense Prediction Networks
Protecting Celebrities from DeepFake with Identity Consistency Transformer
SVIP: Sequence Verification for Procedures in Videos
Cannot See the Forest for the Trees: Aggregating Multiple Viewpoints to Better Classify Objects in Videos
Deep Saliency Prior for Reducing Visual Distraction
ClothFormer: Taming Video Virtual Try-on in All Module
FLARF: Fast LArge-scale Radiance Field Reconstruction
Estimating Structural Disparities in Face Models
Faithful Extreme Rescaling via Generative Prior Reciprocated Invertible Representations
Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding
Uniform Subdivision of Omnidirectional Camera Space for Efficient Spherical Stereo Matching
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Scene Graph Expansion for Semantics-Guided Image Outpainting
Deep Constrained Least Squares for Blind Image Super-Resolution
MaskGIT: Masked Generative Image Transformer
CMT: Convolutional Neural Networks Meet Vision Transformers
GraftNet: Towards Domain Generalized Stereo Matching with a Broad-Spectrum and Task-Oriented Feature
SoftGroup for 3D Instance Segmentation on Point Clouds
text-to-face synthesis and manipulation


PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound
LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection
Make It Move: Controllable Image-to-Video Generation with Text Descriptions
Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels
Learning What Not to Segment: A New Perspective on Few-Shot Segmentation
TT-VSR: Learning Trajectory-Aware Transformer for Video Super-Resolution
Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes
DyRep: Bootstrapping Training with Dynamic Re-parameterization
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
GreedyNASv2: Greedier Search with a Greedy Path Filter
HDR-NeRF: High Dynamic Range Neural Radiance Fields
Novel-View Object Selection in Neural Volumetric Representations
Relieving Long-Tailed Instance Segmentation via Pairwise Class Balance
Complex Video Action Reasoning via Learnable Markov Logic Network
PCL: Proxy-based Contrastive Learning for Domain Generalization
Unifying Motion Deblurring and Frame Interpolation with Events
Shape-invariant 3D Adversarial Point Clouds
Learning Pixel-Level Distinctions for Video Highlight Detection
Wavelet Knowledge Distillation: Towards Efficient Image-to-Image Translation
ADAS: A Direct Adaptation Strategy for Multi-Target Domain Adaptive Semantic Segmentation
PSTR: End-to-End One-Step Person Search With Transformers
Towards real-world navigation with deep differentiable planners
Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
Fourier Document Restoration for Robust Document Dewarping and Recognition
Neural RGB-D Surface Reconstruction
LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking
ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation
Spatio-Temporal Gating-Adjacency GCN for Human Motion Prediction
What Matters For Meta-Learning Vision Regression Tasks?
Self-supervised Learning of Adversarial Examples: Towards Good Generalizations for Deepfake Detection
Ray Priors through Reprojection: Improving Neural Radiance Fields for Novel View Extrapolation
Perception Prioritized Training of Diffusion Models
Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Driving
Human Trajectory Prediction with Momentary Observation
General Facial Representation Learning in a Visual-Linguistic Manner
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Contextual Outpainting with Object-level Contrastive Learning
Optical Flow Estimation for Spiking Camera
PointCLIP: Point Cloud Understanding by CLIP
Large-scale pre-training for person re-identification with noisy labels
Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection
Blended Diffusion for Text-driven Editing of Natural Images
CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping
Finding Fallen Objects Via Asynchronous Audio-Visual Integration
HeadNeRF: A Real-time NeRF-Based Parametric Head Model
Interacting Attention Graph for Single Image Two-Hand Reconstruction
Learning based Multi-modality Image and Video Compression
DR.VIC: Decomposition and Reasoning for Video Individual Counting
End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection
BaLeNAS: Differentiable Architecture Search via Bayesian Learning Rule
Task Adaptive Parameter Sharing for Multi-Task Learning
ViM: Out-Of-Distribution with Virtual-logit Matching
Pyramid Adversarial Training Improves ViT Performance
Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows
Part-based Pseudo Label Refinement for Unsupervised Person Re-identification
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions
Consistent Explanations by Contrastive Learning
FvOR: Robust Joint Shape and Pose Optimization for Few-view Object Reconstruction
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
Frame Averaging for Equivariant Shape Space Learning
iFS-RCNN: An Incremental Few-shot Instance Segmenter
Bring Evanescent Representations to Life in Lifelong Class Incremental Learning
Text to Image Generation with Semantic-Spatial Aware GAN
Real-Time Light-Weight Near-Field Photometric Stereo
DESTR: Object Detection with Split Transformer
Backdoor Attacks on Self-Supervised Learning
Diverse Image Outpainting via GAN Inversion
High-Resolution Image Synthesis with Latent Diffusion Models
NFormer: Robust Person Re-identification with Neighbor Transformer
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data
SceneSqueezer: Learning to Compress Scene for Camera Relocalization
Dancing under the stars: video denoising in starlight
Tracking People by Predicting 3D Appearance, Location and Pose
BCOT: A Markerless High-Precision 3D Object Tracking Benchmark
Continual Stereo Matching of Continuous Driving Scenes with Growing Architecture
CVF-SID: Cyclic multi-Variate Function for Self-Supervised Image Denoising by Disentangling Noise from Image
Unknown-Aware Object Detection: Learning What You Don't Know from Videos in the Wild
BodyGAN: General-purpose Controllable Neural Human Body Generation
Training-free Transformer Architecture Search
Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification
On Generalizing Beyond Domains in Cross-Domain Continual Learning




Practical Learned Lossless JPEG Recompression with Multi-Level Cross-Channel Entropy Model in the DCT Domain
RendNet: Unified 2D/3D Recognizer with Latent Space Rendering
Identifying Ambiguous Similarity Conditions via Semantic Matching
Learn from Others and Be Yourself in Heterogeneous Federated Learning
Enhancing Face Recognition with Self-Supervised 3D Reconstruction
Visual Vibration Tomography: Estimating Interior Material Properties from Monocular Video
ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification
The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization
Perturbed and Strict Mean Teachers for Semi-supervised Semantic Segmentation
Directional Self-supervised Learning for Heavy Image Augmentations
CPPF: Towards Robust Category-Level 9D Pose Estimation in the Wild
Cross-patch Dense Contrastive Learning for Semi-supervised Segmentation of Cellular Nuclei in Histopathologic Images
Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
UCC: Uncertainty guided Cross-head Co-training for Semi-Supervised Semantic Segmentation
Target detection
Exploiting Temporal Relations on Radar Perception for Autonomous Driving
Unsupervised Visual Representation Learning by Online Constrained K-Means
Contextual Debiasing for Visual Recognition with Causal Mechanisms
Learning to Estimate Robust 3D Human Mesh from In-the-Wild Crowded Scenes
Towards Accurate Facial Landmark Detection via Cascaded Transformers
DIP: Deep Inverse Patchmatch for High-Resolution Optical Flow
Critical Regularizations for Neural Surface Reconstruction in the Wild
Per-Clip Video Object Segmentation
CAFE: Learning to Condense Dataset by Aligning Features
ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis
SphereSR: 360° Image Super-Resolution with Arbitrary Projection via Continuous Spherical Image Representation
Learning to Restore 3D Face from In-the-Wild Degraded Images
BEVT: BERT Pretraining of Video Transformers
A Hybrid Egocentric Activity Anticipation Framework via Memory-Augmented Recurrent and One-shot Representation Forecasting
Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion
MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection
Synthetic Aperture Imaging with Events and Frames
AP-BSN: Self-Supervised Denoising for Real-World Images via Asymmetric PD and Blind-Spot Network
Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information
Lepard: Learning partial point cloud matching in rigid and deformable scenes
Neural Compression-Based Feature Learning for Video Restoration
Learning to Collaborate in Decentralized Learning of Personalized Models
Rethinking Parsing Branch for Human Densepose Estimation
Collaborative Transformers for Grounded Situation Recognition
ISNet: Shape Matters for Infrared Small Target Detection
Bi-level Doubly Variational Learning for Energy-based Latent Variable Models
PSMNet: Position-aware Stereo Merging Network for Room Layout Estimation
Bi-level Alignment for Cross-Domain Crowd Counting
Unsupervised Homography Estimation with Coplanarity-Aware GAN
Real-time Object Detection for Streaming Perception
Neural Window Fully-connected CRFs for Monocular Depth Estimation
Deep Hyperspectral-Depth Reconstruction Using Single Color-Dot Projection
Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing
Shadows can be Dangerous: Stealthy and Effective Physical-world Adversarial Attack by Natural Phenomenon
Towards Understanding Adversarial Robustness of Optical Flow Networks
Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation
A Continuous Video Generator with the Price, Quality and Perks of StyleGAN2
Self-Supervised Learning of Object Parts for Semantic Segmentation
High-Resolution Image Harmonization via Collaborative Dual Transformations
Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation
FIFO: Learning Fog-invariant Features for Foggy Scene Segmentation
Features of 3D Pose
Equalized Focal Loss for Dense Long-tailed Object Detection
Style Neophile: Constantly Seeking Novel Styles for Domain Generalization
Mining Multi-View Information: A Strong Self-Supervised Framework for Depth-based 3D Hand Pose and Mesh Estimation
The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
Correlation Verification for Image Retrieval
Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization
UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection
Multi-View Mesh Reconstruction with Neural Deferred Shading
SoftCollage: A Differentiable Probabilistic Tree Generator for Image Collage
OVE6D: Object Viewpoint Encoding for Depth-based 6D Object Pose Estimation
Smooth-Swap: A Simple Enhancement for Face-Swapping with Smoothness
3D-SPS: Single-Stage 3D Visual Grounding via Referral Point Progressive Selection
Disentanglement Autoencoder for Steganography without Embedding
Gated2Gated: Self-Supervised Depth Estimation from Gated Images
Interact before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching
Enhancing Classifier Conservativeness and Robustness by Polynomiality
Raw High-Definition Radar for Multi-Task Learning
Self-Supervised Image Representation Learning with Geometric Set Consistency
Multi-View Transformer for 3D Visual Grounding
Semiconductor Defect Detection by Hybrid Classical-Quantum Deep Learning
Attention Reveals Occlusions
Revisiting Domain Generalized Stereo Matching Networks from a Feature Consistency Perspective
NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning
SwapMix: Diagnosing and Regularizing the Over-reliance on Visual Context in Visual Question Answering
Learning Part Segmentation through Unsupervised Domain Adaptation from Synthetic Vehicles
CellTypeGraph: A New Geometric Computer Vision Benchmark
Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning






Reference-based Video Super-Resolution Using Multi-Camera Video Triplets
End-to-End Semi-Supervised Learning for Video Action Detection
Parameter-free Online Test-time Adaptation
3D Shape Variational Autoencoder Latent Disentanglement via Mini-Batch Feature Swapping for Bodies and Faces
Dual-Key Multimodal Backdoors for Visual Question Answering
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective
RePaint: Inpainting using Denoising Diffusion Probabilistic Models
Improving GAN Equilibrium by Raising Spatial Awareness
Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
A variational Bayesian method for similarity learning in non-rigid image registration
Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data
Adaptive Trajectory Prediction via Transferable GNN
Learning to Learn across Diverse Data Biases in Deep Face Recognition
RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding
Total Variation Optimization Layers for Computer Vision
Transforming Model Prediction for Tracking
Human Mesh Recovery from Multiple Shots
FastDOG: Fast Discrete Optimization on GPU
Estimating Example Difficulty using Variance of Gradients
Closing the Generalization Gap of Cross-silo Federated Medical Image Segmentation
Scale-Equivalent Distillation for Semi-Supervised Object Detection
Long-term Visual Map Sparsification with Heterogeneous GNN
ResSFL: A Resistance Transfer Framework for Defending Model Inversion Attack in Split Federated Learning
Fast Point Transformer
Sketch3T: Test-time Training for Zero-Shot SBIR
Generative Flows with Invertible Attentions
ABO: Dataset and Benchmarks for Real-World 3D Object Understanding
A Dual Weighting Label Assignment Scheme for Object Detection
ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
Explore the Spatio-temporal Aggregation for Insubstantial Object Detection: Benchmark Dataset and Baseline
A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information
DGECN: A Depth-Guided Edge Convolutional Network for End-to-End 6D Pose Estimation
BNUDC: A Two-Branched Deep Neural Network for Restoring Images from Under-Display Cameras
Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation
Hallucinated Neural Radiance Fields in the Wild
The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration
Deep Depth from Focus with Differential Focus Volume
Towards Layer-wise Image Vectorization
Robust Federated Learning with Noisy and Heterogeneous Clients
Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis
Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation
Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training
It's All In the Teacher: Zero-Shot Quantization Brought Closer to the Teacher
VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation
Rethinking Spatial Invariance of Convolutional Networks for Object Counting
Self-supervised Correlation Mining Network for Person Image Generation
ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-high Resolution Segmentation
Exploring Effective Data for Surrogate Training Towards Black-box Attack
Contrastive Learning for Space-Time Correspondence via Self-cycle Consistency
Accelerating Video Object Segmentation with Compressed Video
Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory
Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis
Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Non-parametric Depth Distribution Modelling based Depth Inference for Multi-view Stereo
LISA: Learning Implicit Shape and Appearance of Hands
GIQE: Generic Image Quality Enhancement via Nth Order Iterative Degradation
Continual Learning for Visual Search with Backward Consistent Feature Embedding
STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes
Differentiable Stereopsis: Meshes from multiple views using differentiable rendering
ST++: Make Self-training Work Better for Semi-supervised Semantic Segmentation
Arbitrary-Scale Image Synthesis
CRIS: CLIP-Driven Referring Image Segmentation
ShapeFormer: Transformer-based Shape Completion via Sparse Representation
Quantifying Societal Bias Amplification in Image Captioning
Omni-DETR: Omni-Supervised Object Detection with Transformers
XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding
Cross-Architecture Self-supervised Video Representation Learning
Feature Erasing and Diffusion Network for Occluded Person Re-Identification
Styleformer: Transformer-based Generative Adversarial Network with Style Vector
A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty
360-Attack: Distortion-Aware Perturbations from Perspective-Views
CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing
NICE-SLAM: Neural Implicit Scalable Encoding for SLAM
FIBA: Frequency-Injection based Backdoor Attack in Medical Image Analysis
Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification
Continual Predictive Learning from Videos
BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning
Learning to Zoom Inside Camera Imaging Pipeline
TeachAugment: Data Augmentation Optimization Using Teacher Knowledge
PhyIR: Physics-based Inverse Rendering for Panoramic Indoor Images
Finding Good Configurations of Planar Primitives in Unorganized Point Clouds
Towards Better Understanding Attribution Methods
B-cos Networks: Alignment is All We Need for Interpretability
TO-FLOW: Efficient Continuous Normalizing Flows with Temporal Optimization adjoint with Moving Speed
Learning Invisible Markers for Hidden Codes in Offline-to-online Photography
Learning Distinctive Margin toward Active Domain Adaptation
Adiabatic Quantum Computing for Multi Object Tracking
Learnable Lookup Table for Neural Network Quantization
Artistic Style Discovery With Independent Components
Occlusion-Aware Cost Constructor for Light Field Depth Estimation
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning
Which Model to Transfer? Finding the Needle in the Growing Haystack
Using 3D Topological Connectivity for Ghost Particle Reduction in Flow Reconstruction
Neural Points: Point Cloud Representation with Neural Fields
C^2AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation
RCP: Recurrent Closest Point for Point Cloud
Label, Verify, Correct: A Simple Few-Shot Object Detection Method
Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction
Dual-Generator Face Reenactment
BoostMIS: Boosting Medical Image Semi-supervised Learning with Adaptive Pseudo Labeling and Informative Active Annotation
Volume rendering ray entropy minimization
Balanced Contrastive Learning for Long-Tailed Visual Recognition
The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning via Pose-aware Convolution
Partially Does It: Towards Scene-Level FG-SBIR with Partial Input
Source-Free Object Detection by Learning to Overlook Domain Style
Region-Aware Face Swapping
COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
SkinningNet: Two-Stream Graph Convolutional Neural Network for Skinning Prediction of Synthetic Characters
Efficient Large-scale Localization by Global Instance Recognition
All-photon Polarimetric Time-of-Flight Imaging
Parametric Scattering Networks
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
Coarse-to-Fine Feature Mining for Video Semantic Segmentation
Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation
Robust Egocentric Photo-realistic Facial Expression Transfer for Virtual Reality
Rethinking Visual Geo-localization for Large-Scale Applications

Polymorphic-GAN: Generating Aligned Samples across Multiple Domains with Learned Morph Maps
And hierarchical relationship learning
High-Fidelity GAN Inversion for Image Attribute Editing
Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC
IM Avatar: Implicit Morphable Head Avatars from Videos
Proactive Image Manipulation Detection
Text Spotting Transformers
Learning a Structured Latent Space for Unsupervised Point Cloud Completion
PCA-Based Knowledge Distillation Towards Lightweight and Content-Style Balanced Photorealistic Style Transfer Models
Grounding Answers for Visual Questions Asked by Visually Impaired People
Efficient Classification of Very Large Images with Tiny Objects
Leveraging Adversarial Examples to Quantify Membership Information Leakage
Towards Practical Deployment-Stage Backdoor Attack on Deep
When to Prune? A Policy towards Early Structural Pruning
Robust Optimization as Data Augmentation for Large-scale Graphs
Sylph: A Hypernetwork Framework for Incremental Few-shot Object Detection
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Harmony: A Generic Unsupervised Approach for Disentangling Semantic Content from Parameterized Transformations
The Implicit Values of A Good Hand Shake: Handheld Multi-Frame Neural Depth Refinement
Noise2NoiseFlow: Realistic Camera Noise Modeling without Clean Images
MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision
Virtual Elastic Objects
StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation
Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning
Self-supervised Neural Articulated Shape and Appearance Models
A Self-Supervised Descriptor for Image Copy Detection
Rethinking Deep Face Restoration
Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes
Rethinking Controllable Variational Autoencoders
Convolutions for Spatial Interaction Modeling
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
AdaFace: Quality Adaptive Margin for Face Recognition
Towards End-to-End Unified Scene Text Detection and Layout Analysis
Active Learning by Feature Mixing
Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs
Towards Better Plasticity-Stability Trade-off in Incremental Learning: A Simple Linear Connector
Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization
SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Editing
Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Non-generative Generalized Zero-shot Learning via Task-correlated Disentanglement and Controllable Samples Synthesis
Knowledge-Driven Self-Supervised Representation Learning for Facial Action Unit Recognition
Coupling Vision and Proprioception for Navigation of Legged Robots
URetinex-Net: Retinex-based Deep Unfolding Network for Low-light Image Enhancement
Modeling Image Composition for Complex Scene Generation
Think Twice Before Detecting GAN-generated Fake Images from their Spectral Domain Imprints
Undoing the Damage of Label Shift for Cross-domain Semantic Segmentation
Implicit Motion Handling for Video Camouflaged Object Detection
Contrastive Conditional Neural Processes
Exploring Set Similarity for Dense Self-supervised Representation Learning
E2V-SDE: From Asynchronous Events to Fast and Continuous Video Reconstruction via Neural Stochastic Differential Equations
Catching Both Gray and Black Swans: Open-set Supervised Anomaly Detection
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
CycleMix: A Holistic Strategy for Medical Image Segmentation from Scribble Supervision
Mixed Multimodal Tokens for Vision Transformers
Rethinking the Augmentation Module in Contrastive Learning: Learning Hierarchical Augmentation Invariance with Expanded Views
AirObject: A Temporally Evolving Graph Embedding for Object Identification
Balanced Multimodal Learning via On-the-fly Gradient Modulation
Ray3D: Ray-Based 3D Human Pose Estimation for Monocular Absolute 3D Localization
Computing Wasserstein-p Distance Between Images with Linear Cost
Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video
Feature Statistics Mixing Regularization for Generative Adversarial Networks
Expressive Talking Head Generation with Granular Audio-Visual Control
Geometric Anchor Correspondence Mining with Uncertainty Modelling for Universal Domain Adaptation
OSSO: Obtaining Skeletal Shape from Outside
How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs
GIRAFFE HD: A High-Resolution 3D-aware Generative Model
Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism
Pixel screening based intermediate correction for blind deblurring
LAS-AT: Adversarial Training with Learnable Attack Strategy
Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes
A new method
SC^2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud Registration
APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers
Eigencontours: Novel Contour Descriptors Based on Low-Rank Approximation
Cross-modal Background Suppression for Audio-Visual Event Localization
WebQA: Multihop and Multimodal QA
Fairness-aware Adversarial Perturbation Towards Bias Mitigation for Deployed Deep Models
Distribution-aware single-stage model for multi-person 3D pose estimation
Active Learning for Open-set Annotation
E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation
Self-Supervised Arbitrary-Scale Point Clouds Upsampling via Implicit Neural Representation
Relative Pose from a Calibrated and an Uncalibrated Smartphone Image
Learning Optical Flow with Kernel Patch Attention
Contrastive Learning for Unsupervised Video Highlight Detection
ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior
MVSE: A Large-Scale Benchmark Dataset for Multi-Modal Videos Similarity Evaluation
Discrete time convolution for fast event-based stereo
Proper Reuse of Image Classification Features Improves Object Detection
Object-Region Video Transformers
Vision-Language Pre-Training for Boosting Scene Text Detectors
Bandits for Structure Perturbation-based Black-box Attacks to Graph Neural Networks with Theoretical Guarantees
Revisiting Large Kernel Design in Convolutional Neural Networks
Generating High Fidelity Data from Low-density Regions using Diffusion Models
Colar: Effective and Efficient Online Action Detection by Consulting Exemplars
Learning Visual-Semantic Explanations of Deep Visual Latent Representations
StyleMesh: Style Transfer for Indoor 3D Scene Reconstructions
Probing Representation Forgetting in Supervised and Unsupervised Continual Learning
Light Field Neural Rendering
ROCA: Robust CAD Model Retrieval and Alignment from a Single Image
Pix2NeRF: Unsupervised Conditional pi-GAN for Single Image to Neural Radiance Fields Translation
Non-Iterative Recovery from Nonlinear Observations using Generative Models
Forecasting from LiDAR via Future Object Detection
Towards Total Recall in Industrial Anomaly Detection
Low-Resource Adaptation for Personalized Co-Speech Gesture Generation
Integrating Language Guidance into Vision-based Deep Metric Learning
Non-isotropy Regularization for Proxy-based Deep Metric Learning
Estimating Egocentric 3D Human Pose in the Wild with External Weak Supervision
Less is More: Generating Grounded Navigation Instructions from Landmarks
Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis
Performance-Aware Mutual Knowledge Distillation for Improving Neural Architecture Search
End-to-End Reconstruction-Classification Learning for Face Forgery Detection
Black-Box Attack with Partially Transferred Conditional Adversarial Distribution


Style Transformer for Image Inversion and Editing
Uformer: A General U-Shaped Transformer for Image Restoration
Speech Driven Tongue Animation
DO-GAN: A Double Oracle Framework for Generative Adversarial Networks
IntentVizor: Towards Generic Query Guided Interactive Video Summarization
Self-supervised depth image restoration based on adaptive stochastic gradient Langevin Dynamics
Sound-Guided Semantic Image Manipulation
Adaptive Gating for Single-Photon 3D Imaging
Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection
GaTector: A Unified Framework for Gaze Object Prediction
Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation
Anomaly Detection via Reverse Distillation from One-Class Embedding
Dynamic 3D Gaze from Afar: Deep Gaze Estimation from Temporal Eye-Head-Body Coordination
Maximum Consensus of Function Weighted Impact
Beyond Fixation: Dynamic Window Visual Transformer
Dressing in the Wild by Watching Dance Videos
Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers
Contrastive Boundary Learning for Point Cloud Segmentation
Proto2Proto: Can you recognize the car, the way I do?
Bridged Transformer for Vision and Point Cloud 3D Object Detection
Training method
SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing
SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition
Task Discrepancy Maximization for Fine-grained Few-Shot Classification
Reflection and Rotation Symmetry Detection via Equivariant Learning
Self-Supervised Equivariant Learning for Oriented Keypoint Detection
Improving the Transferability of Targeted Adversarial Examples through Object-Based Diverse Input
3DeformRS: Certifying Spatial Deformations on Point Clouds
DiGS: Divergence guided shape implicit neural representation for unoriented point clouds
UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning
Vision Transformer with Deformable Attention
Diverse Plausible 360-Degree Image Outpainting for Efficient 3DCG Background Creation
Industrial Style Transfer with Large-scale Geometric Warping and Content Preservation
Hierarchical Modular Network for Video Captioning
Optimal LED Spectral Multiplexing for NIR2RGB Translation
Exploring Frequency Adversarial Attacks for Face Forgery Detection
LAR-SR: A Local Autoregressive Model for Image Super Resolution
What do navigation agents learn about their environment?
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
Entropy-based Active Learning for Object Detection with Progressive Diversity Constraint
Class Similarity Weighted Knowledge Distillation for Continual Semantic Segmentation
Swin Transformer V2: Scaling Up Capacity and Resolution
Knowledge Distillation via the Target-aware Transformer
Sparse Object-level Supervision for Instance Segmentation with Pixel Embeddings
Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources
Exemplar-based Pattern Synthesis with Implicit Periodic Field Network
RigidFlow: Self-Supervised Scene Flow Learning on Point Clouds by Local Rigidity Prior
Weakly Supervised Segmentation on Outdoor 4D point clouds with Temporal Matching and Spatial Graph Propagation
E^2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Spiking Transformers for Event-based Single Object Tracking
Few-Shot Incremental Learning for Label-to-Image Translation
CD^2-pFed: Cyclic Distillation-guided Channel Decoupling for Model Personalization in Federated Learning
OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization
Speed up Object Detection on Gigapixel-level Image with Patch Arrangement
Learning Adaptive Warping for Real-World Rolling Shutter Correction
Robust and Accurate Superquadric Recovery: a Probabilistic Approach
SimVP: Simpler yet Better Video Prediction
Hyperspherical Consistency Regularization
Dense Depth Priors for Neural Radiance Fields from Sparse Input Views
HyperInverter: Improving StyleGAN Inversion via Hypernetwork
Target-Relevant Knowledge Preservation for Multi-Source Domain Adaptive Object Detection
Whose Hands are These? Hand Detection and Hand-Body Association in the Wild
Blind Face Restoration via Integrating Face Shape and Generative Priors
Multimodal Material Segmentation
Do explanation methods explain? Model knows best
Deep Hybrid Models for Out-of-Distribution Detection
Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetics
Detecting Camouflaged Object in Frequency Domain
Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection
Appearance and Structure Aware Robust Deep Visual Graph Matching: Attack, Defense and Beyond
PhoCaL: A Multi-Modal Dataset for Category-Level Object Pose Estimation with Photometrically Challenging Objects
HINT: Hierarchical Neuron Concept Explainer
Vox2Cortex: Fast Explicit Reconstruction of Cortical Surfaces from 3D MRI Scans with Geometric Deep Neural Networks
Generative Cooperative Learning for Unsupervised Video Anomaly Detection
Panoptic, Instance and Semantic Relations: A Relational Context Encoder to Enhance Panoptic Segmentation
Object-Relation Reasoning Graph for Action Recognition
Lifelong Graph Learning
A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation
Arch-Graph: Acyclic Architecture Relation Predictor for Task-Transferable Neural Architecture Search
Rethinking Minimal Sufficient Representation in Contrastive Learning
Physical Simulation Layer for Accurate 3D Modeling
Image Animation with Perturbed Masks
Sparse to Dense Dynamic 3D Facial Expression Generation
AIM: an Auto-Augmenter for Images and Meshes
PlanarRecon: Real-time 3D Plane Detection and Reconstruction from Posed Monocular Videos
Modular Action Concept Grounding in Semantic Video Prediction
Generating Representative Samples for Few-Shot Classification
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings
Sequential Voting with Relational Box Fields for Active Object Detection
Are Multimodal Transformers Robust to Missing Modality?
Debiased Learning from Naturally Imbalanced Pseudo-Labels
Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos
Learning to Deblur using Light Field Generated and Real Defocus Images
TOAD: Topologically-Aware Deformation Fields for Single-view 3D Reconstruction
An Empirical Study of Training End-
PLAD: Learning to Infer Shape Programs with Pseudo-Labels and Approximate Distributions
The Neurally-Guided Shape Parser: Grammar-based Labeling of 3D Shape Regions with Approximate Inference
Imposing Consistency for Optical Flow Estimation
Generating Diverse 3D Reconstructions from a Single Occluded Face Image
RecDis-SNN: Rectifying Membrane Potential Distribution for Directly Training Spiking Neural Networks
3D Moments from Near-Duplicate Photos
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
MatteFormer: Transformer-Based Image Matting via Prior-Tokens
Deformable ProtoPNet: An Interpretable Image Classifier Using Deformable Prototypes
Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning
Category-Aware Transformer Network for Better Human-Object Interaction Detection
Segment, Magnify and Reiterate: Detecting Camouflaged Objects the Hard Way
UNIST: Unpaired Neural Implicit Shape-to-Shape Translation
REGTR: End-to-end Point Cloud Correspondences with Transformers
Show, Deconfound and Tell: Image Captioning with Causal Inference
DeepFake Disrupter: The Detector of DeepFake Is My Friend
Lite Vision Transformer with Enhanced Self-Attention
Bi-directional Object-context Prioritization Learning for Saliency Ranking
OSKDet: Orientation-sensitive Keypoint Localization for Rotated Object Detection
Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification
Invariant Grounding for Video Question Answering
Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning
Learning Robust Image-Based Rendering on Sparse Scene Geometry via Depth Completion
FENeRF: Face Editing in Neural Radiance Fields
A Probabilistic Graphical Model Based on Neural-symbolic Reasoning for Visual Relationship Detection
CVNet: Contour Vibration Network for Building Extraction
What to Look at and Where: Semantic and Spatial Refined Transformer for Detecting Human-Object Interactions
Nested Hyperbolic Spaces for Dimensionality Reduction and Hyperbolic NN Design
ABPN: Adaptive Blend Pyramid Network for Real-Time Local Retouching of Ultra High-Resolution Photo
Does Robustness on ImageNet Transfer to Downstream Tasks?
Crowd Counting in the Frequency Domain
SimMIM: A Simple Framework for Masked Image Modeling
GrainSpace: A Large-scale Dataset for Fine-grained and Domain-adaptive Recognition of Cereal Grains
End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps
MPViT: Multi-Path Vision Transformer for Dense Prediction
Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer
ARCS: Accurate Rotation and Correspondence Search
Ranking Distance Calibration for Cross-Domain Few-Shot Learning
MetaFSCIL: A Meta-Learning Approach for Few-Shot Class Incremental Learning
Fisher Information Guidance for Learned Time-of-Flight Imaging
Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer
Physical Correction Enhancement for Human Motion Prediction
Deep Color Consistent Network for Low-Light Image Enhancement
Non-Probability Sampling Network for Stochastic Human Trajectory Prediction
GCFSR: a Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors
Attacks to improve adversarial transferability
HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction
Pooling Revisited: Your Receptive Field is Sub-optimal
Compressing Models with Few Samples: Mimicking then Replacing
Range measurement
Layered Depth Refinement with Mask Guidance
Highly-efficient Incomplete Large-scale Multi-view Clustering with Consensus Bipartite Graph
Scaling Up Vision-Language Pretraining for Image Captioning
Optimal Correction Cost for Object Detection Evaluation
Deformable Video Transformer
High-fidelity Monocular Human Reconstruction by Combining Implicit and Explicit Representations
Nonlocal Sparse CRF
Long-Short Temporal Contrastive Learning of Video Transformers
QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation
All-In-One Image Restoration for Unknown Corruption
Learning to Detect Scene Landmarks for Camera Localization
WildNet: Learning Domain Generalized Semantic Segmentation from the Wild
Pushing the Envelope of Gradient Boosting Forests via Globally-Optimized Oblique Trees
Egocentric Scene Understanding via Multimodal Spatial Rectifier
OSSGAN: Open-Set Semi-Supervised Image Generation
Large-scale Video Panoptic Segmentation in the Wild: A Benchmark
Unsupervised Representation Learning for Binary Networks by Joint Classifier Learning
β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search
Stereo Depth from Events Cameras: Concentrate and Focus on the Future
Transferable Sparse Adversarial Attack
FAM: Visual Explanations for the Feature Representations from Deep Convolutional Networks
Noise-Aware NeRFs for Burst-Denoising
Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds
Bayesian Invariant Risk Minimization
Extracting Triangular 3D Models, Materials, and Lighting From Images
RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition
Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution
SphericGAN: Semi-supervised Hyper-spherical Generative Adversarial Networks for Fine-grained Image Synthesis
LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition
Unifying Panoptic Segmentation for Autonomous Driving
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Interspace Pruning: Using Adaptive Filter Representations to Improve Training of Sparse CNNs
NightLab: A Dual-level Architecture with Hardness Detection for Segmentation at Night
Learning to Memorize Feature Hallucination for One-Shot Image Generation
FedCorr: Multi-Stage Federated Learning for Label Noise Correction
GeoNeRF: Generalizing NeRF with Geometry Priors
Neural 3D Video Synthesis
TransforMatcher: Match-to-Match Attention for Semantic Correspondence
Represent, Compare, and Learn: A Similarity-Aware Framework for Class-Agnostic Counting
AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval
Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase
Burst Image Restoration and Enhancement
Modeling Indirect Illumination for Inverse Rendering
Knowledge Mining with Scene Text for Fine-Grained Recognition
FlexIT: Towards Flexible Semantic Image Translation
Surpassing the Human Accuracy: Detecting Gallbladder Cancer from USG Images with Curriculum Learning
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning
Multi-Person Extreme Motion Prediction

Does text attract attention on e-commerce images: A novel saliency prediction dataset and method
Network Quantization
Energy-based Latent Aligner for Incremental Learning
Semi-supervised Video Paragraph Grounding with Contrastive Encoder
Personalized Image Aesthetics Assessment with Rich Attributes
Attention Concatenation Volume for Accurate and Efficient Stereo Matching
Split Hierarchical Variational Compression
MS2DG-Net: Progressive Correspondence Learning via Multi Sparse Semantic Dynamic Graph
Large Loss Matters in Weakly Supervised Multi-Label Classification
Recurring the Transformer for Video Action Recognition
Look Closer to Supervise Better: One-Shot Font Generation via Component-Based Discriminator
KG-SP: Knowledge Guided Simple Primitives for Open World Compositional Zero-Shot Learning
Hyperbolic Vision Transformers: Combining Improvements in Metric Learning
Camera Pose Estimation using Implicit Distortion Models
A Structured Dictionary Perspective on Implicit Neural Representations
ST-MFNet: A Spatio-Temporal Multi-Flow Network for Frame Interpolation
Geometric Structure Preserving Warp for Natural Image Stitching
Slimmable Domain Adaptation
Meta Convolutional Neural Networks for Single Domain Generalization
Label Matching Semi-Supervised Object Detection
Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning
Abandoning the Bayer-Filter to See in the Dark
Deep Hierarchical Semantic Segmentation
MixFormer: End-to-End Tracking with Iterative Mixed Attention
ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics
Occlusion-robust Face Alignment using A Viewpoint-invariant Hierarchical Network Architecture
Segment-Fusion: Hierarchical Context Fusion for Robust 3D Semantic Segmentation
STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction
Boosting 3D Object Detection by Simulating Multimodality on Point Clouds
RADU: Ray-Aligned Depth Update Convolutions for ToF Data Denoising
Auto-Encoder is All You Need
Whose Track Is It Anyway? Improving Robustness to Tracking Errors with Affinity-Based Prediction
Multi-marginal Contrastive Learning for Multi-label Subcellular Protein Localization
Stand-Alone Inter-Frame Attention in Video Models
Hyperbolic Image Segmentation
RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality
Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving
SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization
ART-Point: Improving Rotation Robustness of Point Cloud Classifiers via Adversarial Rotation
Super-Fibonacci Spirals: Fast, Low-Discrepancy Sampling of SO(3)
Learning to Learn and Remember Super Long Multi-Domain Task Sequence
Noise Is Also Useful: Negative Correlation-Steered Latent Contrastive Learning
FLOAT: Factorized Learning of Object Attributes for Improved Multi-object Multi-part Scene Parsing
Surface-Aligned Neural Radiance Fields for Controllable 3D Human Synthesis
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
Real World Self-Supervised Multi-Image Super-Resolution for Multi-Exposure Push-Frame Satellites
Knowledge Distillation with the Reused Teacher Classifier
Geometry-Aware Guided Loss for Deep Crack Recognition
AdaMixer: A Simple and Accurate Query-based Object Detector
Learning Structured Gaussians to Approximate Deep Ensembles
Input-level Inductive Biases for 3D Reconstruction
BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild
Stereo Magnification with Multi-Layer Images
Segment and Complete: Defending Object Detectors against Adversarial Patch Attacks with Robust Patch Detection
Coherent Point Drift Revisited for Non-rigid Shape Matching and Registration
Alleviating Semantics Distortion in Unsupervised Low-Level Image-to-Image Translation via Structure Consistency Constraint
CNN Filter DB: An Empirical Investigation of Trained Convolutional Filters
Text2Mesh: Text-Driven Neural Stylization for Meshes
Quasi-fused unsupervised network
Image Dehazing Transformer with Transmission-Aware 3D Position Embedding
Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification
RGB-Multispectral Matching: Dataset, Learning Methodology, Evaluation
Maintaining Reasoning Consistency in Compositional Visual Question Answering
PolyWorld: Polygonal Building Extraction with Graph Neural Networks in Satellite Images
Fast Algorithm for Low-rank Tensor Completion in Delay-embedded Space
Dynamic Sparse R-CNN
Improving Robustness Against Stealthy Weight Bit-Flip Attacks by Output Code Matching
NPBG++: Accelerating Neural Point-Based Graphics
Forward Compatible Few-Shot Class-Incremental Learning
Weakly-supervised Metric Learning with Cross-Module Communications for the Classification of Anterior Chamber Angle Images
Learning Canonical F-Correlation Projection for Compact Multiview Representation
Learning Non-target Knowledge for Few-shot Semantic Segmentation
Towards Low-Cost and Efficient Malaria Detection
PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking
NeuralHDHair: Automatic High-fidelity Hair Modeling from a Single Image Using Implicit Neural Representations
ClusterGNN: Cluster-based Coarse-to-fine Graph Neural Network for Efficient Feature Matching
An Iterative Quantum Approach for Transformation Estimation from Point Sets
ATPFL: Automatic Trajectory Prediction Model Design under Federated Learning Framework
Understanding and Increasing Efficiency of Frank-Wolfe Adversarial Training
Targeted Supervised Contrastive Learning for Long-Tailed Recognition
Optimizing Elimination Templates by Greedy Parameter Search
M3T: three-dimensional Medical image classifier using Multi-plane and Multi-slice Transformer
Projective Manifold Gradient Layer for Deep Rotation Regression
PUMP: Pyramidal and Uniqueness Matching Priors for Unsupervised Learning of Local Descriptors
Deep orientation-aware functional maps: Tackling symmetry issues in Shape Matching
A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation
Lite-MDETR: A Lightweight Multi-Modal Detector
Cross Modal Retrieval with Querybank Normalization
Learning Contrastive Representations for Learning with Noisy Labels
Cross-view transformers for real-time map-view semantic segmentation
Towards Data-Free Model Stealing in a Hard Label Setting
The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting
Unseen Classes at a Later Time? No Problem
Channel Balancing for Accurate Quantization of Winograd Convolutions
Instance masks are what you need: Segmentation parity from object boundaries
TVConv: Efficient Translation Variant Convolution for Layout-aware Visual Processing
Scanline Homographies for Rolling-Shutter Plane Absolute Pose
Dual-Shutter Optical Vibration Sensing
DoubleField: Bridging the Neural Surface and Radiance Fields for High-fidelity Human Reconstruction and Rendering

Robust Structured Declarative Classifiers for 3D Point Clouds: Defending Adversarial Attacks with Implicit Gradients
Transformer
Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization
Contour-Hugging Heatmaps for Landmark Detection
Local Attention Pyramid for Scene Image Generation
Implicit Feature Decoupling with Depthwise Quantization
InsetGAN for Full-Body Image Generation
Recurrent Variational Network: A Deep Learning Inverse Problem Solver applied to the task of Accelerated MRI Reconstruction
Robust Invertible Image Steganography
Disentangling visual and written concepts in CLIP
Causal CLIP Fine-tuning for Fashion Product Retrieval
Accelerating Neural Network Optimization Through an Automated Control Theory Lens
Comprehending and Ordering Semantics for Image Captioning
Grounded Language-Image Pre-training
Hierarchical Self-supervised Representation Learning for Movie Understanding
RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution
: Oriented Attention in Transformer Approaches for Robust Action Recognition
Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes
Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention
How Well Do Sparse ImageNet Models Transfer?
Towards Principled Disentanglement for Domain Generalization
Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition
Path-CNN: Topology-Aware Centerline Segmentation Using Sparse Annotation
Image Based Reconstruction of Liquids from 2D Surface Detections
Neural Convolutional Surfaces
Graph-context Attention Networks for Size-varied Deep Graph Matching
Learning to Solve Hard Minimal Problems
Neural Mesh Simplification
SPAct: Self-supervised Privacy Preservation for Action Recognition
Towards Language-free Training for Text-to-Image Generation
Rep-Net: Efficient On-Device Learning via Feature Reprogramming
3D-VField: Learning to Adversarially Deform Point Clouds for Robust 3D Object Detection
TrackFormer: Multi-Object Tracking with Transformers
Deep 3D-to-2D Watermarking: Embedding Messages in 3D Meshes and Extracting Them from 2D Renderings
A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes
EnvEdit: Environment Editing for Vision-and-Language Navigation
DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover's Distance Improves Out-of-Distribution Face Identification
Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs
MulT: An End-to-End Multitask Learning Transformer
Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields
Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection
Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework
Plenoxels: Radiance Fields without Neural Networks
Pushing the Limits of Simple Pipelines for Practical Few-Shot Learning
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning
CO-SNE: Dimensionality Reduction and Visualization for Hyperbolic Data
EASE: Unsupervised Discriminant Subspace Learning for Transductive Few-Shot Learning
3D Photo Stylization: Learning to Generate Stylized Novel Views from a Single Image
SIMBAR: Single Image-Based Scene Relighting For Effective Data Augmentation For Automated Driving Vision Tasks
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
VALHALLA: Visual Hallucination for Machine Translation
Learning Pairwise Affinity for Open-World Instance Segmentation
CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification
Investigating the Impact of Multi-LiDAR Placement on Object Detection for Autonomous Driving
Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning
Generalized Category Discovery
Deep Image-based Illumination Harmonization
Mixed Differential Privacy in Computer Vision
MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction
UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
Rotation-invariant aerial target detection network
Evaluation-oriented Knowledge Distillation for Deep Face Recognition
Robust Cross-Modal Representation Learning with Progressive Self-Distillation
Transformer Tracking with Cyclic Shifting Window Attention
LTP: Lane-based Trajectory Prediction for Autonomous Driving
Generating 3D Bio-Printable Patches Using Wound Segmentation and Reconstruction to Treat Diabetic Foot Ulcers
Multi-instance Point Cloud Registration by Efficient Correspondence Clustering
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
AutoLoss-GMS: Searching Generalized Margin-based Softmax Loss Function for Person Re-identification
Convolution of Convolution: Let Kernels Spatially Collaborate
DiffPoseNet: Direct Differentiable Camera Pose Estimation
Modeling sRGB Camera Noise with Normalizing Flows
Semantic-shape Adaptive Feature Modulation for Semantic Image Synthesis
Federated Learning with Position-Aware Neurons
Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation
Point Density-Aware Voxels for LiDAR 3D Object Detection
A Conservative Approach for Unbiased Learning on Unknown Biases
The Majority Can Help the Minority: Context-rich Minority Oversampling for Long-tailed Classification
Symmetry-aware Neural Architecture for Embodied Visual Exploration
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
Egocentric Prediction of Action Target in 3D
What makes transfer learning work for medical images: feature reuse & other factors
Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification
DECORE: Deep Compression with Reinforcement Learning
RGB-Depth Fusion GAN for Indoor Depth Completion
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Class-Aware Contrastive Semi-Supervised Learning
Learning to Prompt for Continual Learning
DEFEAT: Deep Hidden Feature Backdoor Attacks by Imperceptible Perturbation and Latent Representation Constraints
Self-Supervised Dense Consistency Regularization for Image-to-Image Translation
Forward Compatible Training for Large-Scale Embedding Retrieval Systems
Joint Forecasting of Panoptic Segmentations with Difference Attention
Revisiting the Transferability of Supervised Pretraining: an MLP Perspective
Disentangling Visual Embeddings for Attributes and Objects
SeeThroughNet: Resurrection of Auxiliary Loss by Preserving Class Probability Information
Neural Reflectance for Shape Recovery with Shadow Handling
Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow
XYDeblur: Divide and Conquer for Single Image Deblurring
ScePT: Scene-consistent, Policy-based Trajectory Predictions for Planning
Visual Acoustic Matching
Fair Contrastive Learning for Facial Attribute Classification
Neural Prior for Trajectory Estimation
AutoMine: An Unmanned Mine Dataset
SMARTADAPT: Multi-branch Object Detection Framework for Videos on Mobiles
Neural Face Identification in a 2D Wireframe Projection of a Manifold Object
AlignMixup: Improving Representations By Interpolating Aligned Features
Memory-Augmented Non-Local Attention for Video Super-Resolution
ESCNet: Gaze Target Detection with the Understanding of 3D Scenes

AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation
When Does Contrastive Visual Representation Learning Work?
Privacy-preserving Online AutoML for Domain-Specific Face Detection
Robust outlier detection by de-biasing VAE likelihoods
GridShift: A Faster Mode-Seeking Algorithm for Image Segmentation and Object Tracking
Continual Learning with Lifelong Vision Transformer
M2I: From Factored Marginal Trajectory Prediction to Interactive Prediction
Stochastic Variance Reduced Ensemble Adversarial Attack for Boosting the Adversarial Transferability
Representing 3D Shapes with Probabilistic Directed Distance Fields
Restormer: Efficient Transformer for High-Resolution Image Restoration
Learning with Twin Noisy Labels for Visible-Infrared Person Re-Identification
Few-shot Learning with Noisy Labels
Co-Domain Symmetry for Complex-Valued Deep Learning
Pyramid Architecture for Multi-Scale Processing in Point Cloud Segmentation

GCR: Gradient Coreset based Replay Buffer Selection for Continual Learning
Ranking-Based Siamese Visual Tracking
Coarse-to-Fine Disentangling Transformer for Human-Object Interaction Detection
MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis
AdaSTE: An Adaptive Straight-Through Estimator to Train Binary Neural Networks
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
DTA: Physical Camouflage Attacks using Differentiable Transformation Network
Layer-wised Model Aggregation for Personalized Federated Learning
Video Swin Transformer
Online Continual Learning on a Contaminated Data Stream with Blurry Task Boundaries
General Incremental Learning with Domain-aware Categorical Representations
Crafting Better Contrastive Views for Siamese Representation Learning
A Style-aware Discriminator for Controllable Image Translation
BoosterNet: Improving Domain Generalization of Deep Neural Nets using Culpability-Ranked Features
A Unified Framework for Implicit Sinkhorn Differentiation
Brain-Supervised Image Editing
Neural Shape Mating: Self-Supervised Object Assembly with Adversarial Shape Priors
Multimodal Colored Point Cloud to Image Alignment
Graph-based Spatial Transformer with Memory Replay For Multi-future Pedestrian Trajectory Prediction
Multi-Objective Diverse Human Motion Prediction with Knowledge Distillation
Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart
Autoregressive Image Generation using Residual Quantization
SGTR: End-to-end Scene Graph Generation with Transformer
Protecting Facial Privacy: Generating Adversarial Identity Masks via Style-robust Makeup Transfer
PPDL: Predicate Probability Distribution based Loss for Unbiased Scene Graph Generation
Localized Adversarial Domain Generalization
Patch-level Representation Learning for Self-supervised Vision Transformers
KNN Local Attention for Image Restoration
Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework

DAD-3DHeads: A Large-scale Dense, Accurate and Diverse Dataset for 3D Dense Head Alignment from a Single Image
Is Mapping Necessary for Realistic PointGoal Navigation?
Cross-Domain Correlation Distillation for Unsupervised Domain Adaptation in Nighttime Semantic Segmentation
LiT: Zero-Shot Transfer with Locked-image text Tuning
Scaling Vision Transformers
Spatial Commonsense Graph for Object Localization in Partial Scenes
Trajectory Optimization for Physics-Based Reconstruction of 3d Human Pose from Monocular Video
3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos
Upright-Net: Learning Upright Orientation for 3D Point Cloud
D*-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection
Differential Dynamics of 3d Human Motion Reconstruction
Clean Implicit 3D Structure from Noisy 2D STEM Images
MPC: Multi-view Probabilistic Clustering
Node-aligned Graph Convolutional Network for Whole-slide Image Representation and Classification
Multidimensional Belief Quantification for Label-Efficient Meta-Learning
Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection
Uni6D: A Unified CNN Framework without Projection Breakdown in 6D Pose Estimation
Exploring Patch-wise Semantic Relation for Contrastive Learning in Image-to-Image Translation Tasks
Enabling Equivariance for Arbitrary Lie Groups
Multi-Scale Memory-Based Video Deblurring
Privacy Preserving Partial Localization
Towards Robust and Reproducible Active Learning using Neural Networks
Marginal Contrastive Correspondence for Exemplar-based Image Translation
TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repeated Action Counting
Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation
FaceFormer: Speech-Driven 3D Facial Animation with Transformers
LARGE: Latent-Based Regression Through GAN Semantics
TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation
AR-NeRF: Unsupervised Learning of Depth and Defocus Effects from Natural Images with Aperture Rendering Neural Radiance Fields
CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection
SASIC: Stereo Image Compression with Latent Shifts and Stereo Attention
Controllable Animation of Fluid Elements in Still Images
Revisiting BatchNorm's Learnable Affines in Few-Shot Transfer Learning
Learning Graph Regularization for Guided Super-Resolution
Topology Preserving Local Road Network Estimation from Single Onboard Camera Image
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning
face Data deviation for expression recognition
Leveraging Equivariant Features for Absolute Pose Regression
Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut
Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry
ZZ-Net: A Universal Rotation Equivariant Architecture for 2D Point Clouds
Interactive Disentanglement: Learning Concepts by Interacting with their Prototype Representations
Incremental Learning in Semantic Segmentation from Image Labels
Complex Backdoor Detection by Symmetric Feature Differencing
Constrained Few-shot Class-incremental Learning
HyperSegNAS: Bridging One-Shot Neural Architecture Search with 3D Medical Image Segmentation using HyperNet
Amodal Panoptic Segmentation
Not Just Selection, but Exploration: Online Class-Incremental Continual Learning via Dual View Consistency
Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation
Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation
Pin the Memory: Learning to Generalize Semantic Segmentation
Long-tailed Visual Recognition via Gaussian Clouded Logit Adjustment
Knowledge distillation: A good teacher is patient and consistent
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Searching the Deployable Convolution Neural Networks for GPUs
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Condensing CNNs with Partial Differential Equations
Adaptive Early-Learning Correction for Segmentation from Noisy Annotations
Bounded Adversarial Attack on Deep Content Features
Towards Driving-Oriented Metric for Lane Detection Models
Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness
Better Trigger Inversion Optimization in Backdoor Scanning
Leveling Down in Computer Vision: Pareto Inefficiencies in Fair Deep Classifiers

Towards Understanding and Simplifying MoCo: Dual Temperature Helps Contrastive Learning without Many Negative Samples
Smooth Maximum Unit: Smooth Activation Function for Deep Networks using Smoothing Maximum Technique
Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer
Image Segmentation Using Text and Image Prompts
Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation
Vision-Language Pre-Training with Triple Contrastive Learning
Temporal Context Matters: Enhancing Single Image Prediction with Disease Progression Representations
Globetrotter: Connecting Languages by Connecting Images
Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures with Uncalibrated Stereo Data
It's Time for Artistic Correspondence in Music and Video
Equivariant Point Set Analysis via Learning Orientations for Message Passing
KeyTr: Keypoint Transporter for 3D Reconstruction of Deformable Objects in Videos
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction
MatchFAME: Fast, Accurate and Memory-Efficient Multi-Object Matching
Neural Emotion Director: Speech-preserving semantic control of facial expressions in “in-the-wild” videos
Id-Free Person Similarity Learning
Alleviating Emotional bias in Affective Image Captioning by Contrastive Data Collection
A study on the distribution of social biases in self-supervised learning visual models
Motron: Multimodal Probabilistic Human Motion Forecasting
Gaussian Process Modeling of Approximate Inference Errors for Variational Autoencoders
Real-time hyperspectral imaging in hardware via trained metasurface encoders
SmartPortraits: Depth Powered Handheld Smartphone Dataset of Human Portraits for State Estimation, Reconstruction and Synthesis
Improving Segmentation of the Inferior Alveolar Nerve through Deep Label Propagation
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos
Self-supervised Spatial Reasoning on Multi-View Line Drawings
Contrastive Test-Time Adaptation
Why Discard if You can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis
Do learned representations respect causal relationships?
Zero-Query Transfer Attacks on Context-Aware Object Detectors
Training Quantised Neural Networks with STE Variants: the Additive Noise Annealing Algorithm
Contrastive Dual Gating: Learning Sparse Features With Contrastive Learning
Efficient Maximal Coding Rate Reduction by Variational Forms
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
Towards Efficient and Scalable Sharpness-Aware Minimization
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
Merry Go Round: Rotate a Frame and Fool a DNN
Style-Structure Disentangled Features and Normalizing Flows for Diverse Icon Colorization
How Much More Data Do I Need? Estimating Requirements For Downstream Tasks
A sampling-based approach for efficient clustering in large datasets
Deep Equilibrium Optical Flow Estimation
Polarity Sampling: Quality and Diversity Control of Pre-Trained Generative Networks via Singular Values
Multi-label Iterated Learning for Image Classification with Label Ambiguity
Cross-modal Map Learning for Vision and Language Navigation
Learning with Neighbor Consistency for Noisy Labels
Measuring Compositional Consistency for Video Question Answering
Failure Modes of Domain Generalization Algorithms
AutoRF: Learning 3D Object Radiance Fields from Single View Observations
A Unified Model for Line Projections in Catadioptric Cameras
OrphicX: A Causality-Inspired Latent Variable Model for Interpreting Graph Neural Networks
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
Cluster-guided Image Synthesis with Unconditional Models
Self-supervised object detection from audio-visual correspondence
Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers
Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning
Weak supervision generation and foundation
How much does input data type impact final face model accuracy?
Certified Patch Robustness via Smoothed Vision Transformers
PubTables-1M: Towards comprehensive table extraction from unstructured documents
Fine-tuning Image Transformers using Learnable Memory
GuideFormer: Transformers for Image Guided Depth Completion
Motion-Adjustable Neural Implicit Video Representation
LiDARCap: Long-range Marker-less 3D Human Motion Capture with LiDAR Point Clouds
Multi-modal Alignment using Representation Codebook
NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge
Investigating Top-k White-Box and Transferable Black-box Attack
GPU-Based Homotopy Continuation for Minimal Problems in Computer Vision
On the Instability of Relative Pose Estimation and RANSAC's Role
Dual Task Learning by Leveraging Both Dense Correspondence and Mis-Correspondence for Robust Change Detection With Imperfect Matches
M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers
Dynamic Scene Graph Generation via Anticipatory Pre-training
ScanQA: 3D Question Answering for Spatial Scene Understanding
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
Large Images as Long Documents: Hierarchical ViTs with Self-Supervised Pretraining in Gigapixel Image Pyramids
Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection
On Guiding Visual Attention with Language Specification
OnePose: One-Shot Object Pose Estimation without CAD Models
Thin-Plate Spline Motion Model for Image Animation
PokeBNN: A Binary Pursuit of Lightweight Accuracy
Semi-Supervised Few-shot Learning via Multi-Factor Clustering
FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
CLIPstyler: Image Style Transfer with a Single Text Condition
Ithaca365: Dataset and Driving Perception under Repeated and Challenging Weather Conditions
Out-of-distribution Generalization with Causal Invariant Transformations
Zero-Shot Text-Guided Object Generation with Dream Fields
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score Matching
TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization
NICGSlowDown: Evaluating the Efficiency Robustness of Neural Image Caption Generation Models
Deep Unlearning via Randomized Conditionally Independent Hessians
Multi-Modal Dynamic Graph Transformer for Visual Grounding
Propagation Regularizer for Semi-supervised Learning with Extremely Scarce Labeled Samples
Discrete Wasserstein Distributional Matching for Quantization in Image Hashing
Robust fine-tuning of zero-shot models
Probabilistic Representations for Video Contrastive Learning
Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction
Alignment for fine-grained object classification
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
A Framework for Learning Ante-hoc Explainable Models via Concepts
Retrieval Augmented Classification for Long Tail Visual Recognition
Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
Learning Video Representations of Human Motion from Synthetic Data
Exploiting Pseudo Labels in a Self-Supervised Learning Framework for Improved Monocular Depth Estimation
Efficient Deep Embedded Subspace Clustering
Local-Adaptive Face Recognition via Graph-based Meta-Clustering and Regularized Adaptation
GenDR: A Generalized Differentiable Renderer
Fingerprinting Deep Neural Networks Globally via Universal Adversarial Perturbations
Learning Multiple Adverse Weather Removal via Two-stage Knowledge Learning and Multi-contrastive Regularization: Toward a Unified Model

Origin blog.csdn.net/dovings/article/details/125607045