Paper Digest: ECCV 2020 Highlights
Readers are also encouraged to read our ECCV 2020 Papers with Code/Data Page, which lists those papers that have published their code or data.
The European Conference on Computer Vision (ECCV) is one of the top computer vision conferences in the world. In 2020, it is being held virtually due to the COVID-19 pandemic.
To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights / summaries to quickly get the main idea of each paper.
If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.
Paper Digest Team
team@paperdigest.org
TABLE 1: Paper Digest: ECCV 2020 Highlights
No. | Title | Authors | Highlight |
---|---|---|---|
1 | Quaternion Equivariant Capsule Networks for 3D Point Clouds | Yongheng Zhao; Tolga Birdal; Jan Eric Lenssen; Emanuele Menegatti; Leonidas Guibas; Federico Tombari; | We present a 3D capsule module for processing point clouds that is equivariant to 3D rotations and translations, as well as invariant to permutations of the input points. |
2 | DeepFit: 3D Surface Fitting via Neural Network Weighted Least Squares | Yizhak Ben-Shabat; Stephen Gould; | We propose a surface fitting method for unstructured 3D point clouds. |
3 | NSGANetV2: Evolutionary Multi-Objective Surrogate-Assisted Neural Architecture Search | Zhichao Lu; Kalyanmoy Deb; Erik Goodman; Wolfgang Banzhaf; Vishnu Naresh Boddeti; | In this paper, we propose an efficient NAS algorithm for generating task-specific models that are competitive under multiple competing objectives. |
4 | Describing Textures using Natural Language | Chenyun Wu; Mikayla Timm; Subhransu Maji; | In this paper, we study the problem of describing visual attributes of texture on a novel dataset containing rich descriptions of textures, and conduct a systematic study of current generative and discriminative models for grounding language to images on this dataset. |
5 | Empowering Relational Network by Self-Attention Augmented Conditional Random Fields for Group Activity Recognition | Rizard Renanda Adhi Pramono; Yie Tarng Chen; Wen Hsien Fang; | This paper presents a novel relational network for group activity recognition. |
6 | AiR: Attention with Reasoning Capability | Shi Chen; Ming Jiang; Jinhui Yang; Qi Zhao; | In this work, we propose an Attention with Reasoning capability (AiR) framework that uses attention to understand and improve the process leading to task outcomes. |
7 | Self6D: Self-Supervised Monocular 6D Object Pose Estimation | Gu Wang; Fabian Manhardt; Jianzhun Shao; Xiangyang Ji; Nassir Navab; Federico Tombari; | To overcome this shortcoming, we propose the idea of monocular 6D pose estimation by means of self-supervised learning, removing the need for real annotations. |
8 | Invertible Image Rescaling | Mingqing Xiao; Shuxin Zheng; Chang Liu; Yaolong Wang; Di He; Guolin Ke; Jiang Bian; Zhouchen Lin; Tie-Yan Liu; | In this work, we propose to solve this problem by modeling the downscaling and upscaling processes from a new perspective, i.e., an invertible bijective transformation, which can largely mitigate the ill-posed nature of image upscaling. |
9 | Synthesize then Compare: Detecting Failures and Anomalies for Semantic Segmentation | Yingda Xia; Yi Zhang; Fengze Liu; Wei Shen; Alan L. Yuille; | In this paper, we systematically study failure and anomaly detection for semantic segmentation and propose a unified framework, consisting of two modules, to address these two related problems. |
10 | House-GAN: Relational Generative Adversarial Networks for Graph-constrained House Layout Generation | Nelson Nauata; Kai-Hung Chang; Chin-Yi Cheng; Greg Mori; Yasutaka Furukawa; | This paper proposes a novel graph-constrained generative adversarial network, whose generator and discriminator are built upon relational architecture. |
11 | Crowdsampling the Plenoptic Function | Zhengqi Li; Wenqi Xian; Abe Davis; Noah Snavely; | In this paper, we present a new approach to novel view synthesis under time-varying illumination from such data. |
12 | VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment | Hanyue Tu; Chunyu Wang; Wenjun Zeng; | We present VoxelPose to estimate 3D poses of multiple people from multiple camera views. |
13 | End-to-End Object Detection with Transformers | Nicolas Carion; Francisco Massa; Gabriel Synnaeve; Nicolas Usunier; Alexander Kirillov; Sergey Zagoruyko; | We present a new method that views object detection as a direct set prediction problem. |
14 | DeepSFM: Structure From Motion Via Deep Bundle Adjustment | Xingkui Wei; Yinda Zhang; Zhuwen Li; Yanwei Fu; Xiangyang Xue; | In this work, we design a physically driven architecture, namely DeepSFM, inspired by traditional Bundle Adjustment (BA), which consists of two cost-volume-based architectures for depth and pose estimation, respectively, that run iteratively to improve both. |
15 | Ladybird: Quasi-Monte Carlo Sampling for Deep Implicit Field Based 3D Reconstruction with Symmetry | Yifan Xu; Tianqi Fan; Yi Yuan; Gurprit Singh; | Based on the Farthest Point Sampling algorithm, we propose a sampling scheme that theoretically encourages better generalization performance and results in fast convergence for SGD-based optimization algorithms. |
16 | Segment as Points for Efficient Online Multi-Object Tracking and Segmentation | Zhenbo Xu; Wei Zhang; Xiao Tan; Wei Yang; Huan Huang; Shilei Wen; Errui Ding; Liusheng Huang; | In this paper, we propose a highly effective method for learning instance embeddings based on segments by converting the compact image representation to an unordered 2D point cloud representation. |
17 | Conditional Convolutions for Instance Segmentation | Zhi Tian; Chunhua Shen; Hao Chen; | We propose a simple yet effective instance segmentation framework, termed CondInst (conditional convolutions for instance segmentation). |
18 | MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution | Taojiannan Yang; Sijie Zhu; Chen Chen; Shen Yan; Mi Zhang; Andrew Willis; | We propose the width-resolution mutual learning method (MutualNet) to train a network that is executable at dynamic resource constraints to achieve adaptive accuracy-efficiency trade-offs at runtime. |
19 | Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset | Menglin Jia; Mengyun Shi; Mikhail Sirotenko; Yin Cui; Claire Cardie; Bharath Hariharan; Hartwig Adam; Serge Belongie; | In order to solve this challenging task, we propose a novel Attribute-Mask R-CNN model to jointly perform instance segmentation and localized attribute recognition, and provide a novel evaluation metric for the task. |
20 | Privacy Preserving Structure-from-Motion | Marcel Geppert; Viktor Larsson; Pablo Speciale; Johannes L. Schönberger; Marc Pollefeys; | In this paper, we further build upon this idea and propose solutions to the different core algorithms of an incremental Structure-from-Motion pipeline based on random line features. |
21 | Rewriting a Deep Generative Model | David Bau; Steven Liu; Tongzhou Wang; Jun-Yan Zhu; Antonio Torralba; | In this paper, we introduce a new problem setting: manipulation of specific rules encoded by a deep generative model. |
22 | Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets | Jiuniu Wang; Wenjia Xu; Qingzhong Wang; Antoni B. Chan; | In this paper, we aim to improve the distinctiveness of image captions through training with sets of similar images. |
23 | Long-term Human Motion Prediction with Scene Context | Zhe Cao; Hang Gao; Karttikeya Mangalam; Qi-Zhi Cai; Minh Vo; Jitendra Malik; | In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. |
24 | NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis | Ben Mildenhall; Pratul P. Srinivasan; Matthew Tancik; Jonathan T. Barron; Ravi Ramamoorthi; Ren Ng; | We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. |
25 | ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes | Panos Achlioptas; Ahmed Abdelreheem; Fei Xia; Mohamed Elhoseiny; Leonidas Guibas; | In this work we study the problem of using referential language to identify common objects in real-world 3D scenes. |
26 | MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images | Benjamin Attal; Selena Ling; Aaron Gokaslan; Christian Richardt; James Tompkin; | We introduce a method to convert stereo 360 (omnidirectional stereo) imagery into a layered, multi-sphere image representation for six degree-of-freedom (6DoF) rendering. |
27 | Learning and Aggregating Deep Local Descriptors for Instance-level Recognition | Giorgos Tolias; Tomas Jenicek; Ondřej Chum; | We propose an efficient method to learn deep local descriptors for instance-level recognition. |
28 | A Consistently Fast and Globally Optimal Solution to the Perspective-n-Point Problem | George Terzakis; Manolis Lourakis; | An approach for estimating the pose of a camera given a set of 3D points and their corresponding 2D image projections is presented. |
29 | Learn to Recover Visible Color for Video Surveillance in a Day | Guangming Wu; Yinqiang Zheng; Zhiling Guo; Zekun Cai; Xiaodan Shi; Xin Ding; Yifei Huang; Yimin Guo; Ryosuke Shibasaki; | In this paper, we present a deep learning based approach that directly generates human-friendly, visible color for video surveillance in a day. |
30 | Deep Fashion3D: A Dataset and Benchmark for 3D Garment Reconstruction from Single Images | Heming Zhu; Yu Cao; Hang Jin; Weikai Chen; Dong Du; Zhangye Wang; Shuguang Cui; Xiaoguang Han; | We propose to fill this gap by introducing DeepFashion3D, the largest collection to date of 3D garment models, with the goal of establishing a novel benchmark and dataset for the evaluation of image-based garment reconstruction systems. |
31 | Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation | Zhenda Xie; Zheng Zhang; Xizhou Zhu; Gao Huang; Stephen Lin; | Towards reducing this superfluous computation, we propose to compute features only at sparsely sampled locations, which are probabilistically chosen according to activation responses, and then densely reconstruct the feature map with an efficient interpolation procedure. |
32 | BorderDet: Border Feature for Dense Object Detection | Han Qiu; Yuchen Ma; Zeming Li; Songtao Liu; Jian Sun; | In this paper, we propose a simple and efficient operator called Border-Align to extract “border features” from the extreme point of the border to enhance the point feature. |
33 | Regularization with Latent Space Virtual Adversarial Training | Genki Osada; Budrul Ahsan; Revoti Prasad Bora; Takashi Nishide; | To address this problem, we propose LVAT (Latent space VAT), which injects perturbation in the latent space instead of the input space. |
34 | Du²Net: Learning Depth Estimation from Dual-Cameras and Dual-Pixels | Yinda Zhang; Neal Wadhwa; Sergio Orts-Escolano; Christian Häne; Sean Fanello; Rahul Garg; | We present a novel approach based on neural networks for depth estimation that combines stereo from dual cameras with stereo from a dual-pixel sensor, which is increasingly common on consumer cameras. |
35 | Model-Agnostic Boundary-Adversarial Sampling for Test-Time Generalization in Few-Shot learning | Jaekyeom Kim; Hyoungseok Kim; Gunhee Kim; | We propose a model-agnostic method that improves the test-time performance of any few-shot learning models with no additional training, and thus is free from the training-test domain gap. |
36 | Targeted Attack for Deep Hashing based Retrieval | Jiawang Bai; Bin Chen; Yiming Li; Dongxian Wu; Weiwei Guo; Shu-Tao Xia; En-Hui Yang; | In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval. |
37 | Gradient Centralization: A New Optimization Technique for Deep Neural Networks | Hongwei Yong; Jianqiang Huang; Xiansheng Hua; Lei Zhang; | Different from those previous methods that mostly operate on activations or weights, we present a new optimization technique, namely gradient centralization (GC), which operates directly on gradients by centralizing the gradient vectors to have zero mean. |
38 | Content-Aware Unsupervised Deep Homography Estimation | Jirong Zhang; Chuan Wang; Shuaicheng Liu; Lanpeng Jia; Nianjin Ye; Jue Wang; Ji Zhou; Jian Sun; | To overcome these problems, in this work we propose an unsupervised deep homography method with a new architecture design. |
39 | Multi-View Optimization of Local Feature Geometry | Mihai Dusmanu; Johannes L. Schönberger; Marc Pollefeys; | In this work, we address the problem of refining the geometry of local image features from multiple views without known scene or camera geometry. |
40 | The Phong Surface: Efficient 3D Model Fitting using Lifted Optimization | Jingjing Shen; Thomas J. Cashman; Qi Ye; Tim Hutton; Toby Sharp; Federica Bogo; Andrew Fitzgibbon; Jamie Shotton; | To solve model-fitting problems for HoloLens 2 hand tracking, where the computational budget is approximately 100 times smaller than that of an iPhone 7, we introduce a new surface model: the ‘Phong surface’. |
41 | Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video | Miao Liu; Siyu Tang; Yin Li; James M. Rehg; | Motivated by this observation, we adopt intentional hand movement as a feature representation, and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action. |
42 | Learning Stereo from Single Images | Jamie Watson; Oisin Mac Aodha; Daniyar Turmukhambetov; Gabriel J. Brostow; Michael Firman; | We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. |
43 | Prototype Rectification for Few-Shot Learning | Jinlu Liu; Liang Song; Yongqiang Qin; | In this paper, we identify two key factors influencing the process: the intra-class bias and the cross-class bias. We then propose a simple yet effective approach for prototype rectification in the transductive setting. |
44 | Learning Feature Descriptors using Camera Pose Supervision | Qianqian Wang; Xiaowei Zhou; Bharath Hariharan; Noah Snavely; | In this paper we propose a novel weakly-supervised framework that can learn feature descriptors solely from relative camera poses between images. |
45 | Semantic Flow for Fast and Accurate Scene Parsing | Xiangtai Li; Ansheng You; Zhen Zhu; Houlong Zhao; Maoke Yang; Kuiyuan Yang; Shaohua Tan; Yunhai Tong; | In this paper, we focus on designing an effective method for fast and accurate scene parsing. |
46 | Appearance Consensus Driven Self-Supervised Human Mesh Recovery | Jogendra Nath Kundu; Mugalodi Rakesh; Varun Jampani; Rahul Mysore Venkatesh; R. Venkatesh Babu; | We present a self-supervised human mesh recovery framework to infer human pose and shape from monocular images in the absence of any paired supervision. |
47 | Diffraction Line Imaging | Mark Sheinin; Dinesh N. Reddy; Matthew O’Toole; Srinivasa G. Narasimhan; | We present a novel computational imaging principle that combines diffractive optics with line (1D) sensing. |
48 | Aligning and Projecting Images to Class-conditional Generative Networks | Minyoung Huh; Richard Zhang; Jun-Yan Zhu; Sylvain Paris; Aaron Hertzmann; | We present a method for projecting an input image into the space of a class-conditional generative neural network. |
49 | Suppress and Balance: A Simple Gated Network for Salient Object Detection | Xiaoqi Zhao; Youwei Pang; Lihe Zhang; Huchuan Lu; Lei Zhang; | In this work, we propose a simple gated network (GateNet) to solve both issues at once. |
50 | Visual Memorability for Robotic Interestingness via Unsupervised Online Learning | Chen Wang; Wenshan Wang; Yuheng Qiu; Yafei Hu; Sebastian Scherer; | In this paper, we explore the problem of interesting scene prediction for mobile robots. |
51 | Post-Training Piecewise Linear Quantization for Deep Neural Networks | Jun Fang; Ali Shafiee; Hamzah Abdel-Aziz; David Thorsley; Georgios Georgiadis; Joseph H. Hassoun; | In this paper, we propose a PieceWise Linear Quantization (PWLQ) scheme to enable accurate approximation for tensor values that have bell-shaped distributions with long tails. |
52 | Joint Disentangling and Adaptation for Cross-Domain Person Re-Identification | Yang Zou; Xiaodong Yang; Zhiding Yu; B.V.K. Vijaya Kumar; Jan Kautz; | In this paper, we seek to improve adaptation by purifying the representation space to be adapted. |
53 | In-Home Daily-Life Captioning Using Radio Signals | Lijie Fan; Tianhong Li; Yuan Yuan; Dina Katabi; | We introduce RF-Diary, a new model for captioning daily life by analyzing the privacy-preserving radio signal in the home with the home’s floormap. |
54 | Self-Challenging Improves Cross-Domain Generalization | Zeyi Huang; Haohan Wang; Eric P. Xing; Dong Huang; | We introduce a simple training heuristic, Representation Self-Challenging (RSC), that significantly improves the generalization of CNNs to out-of-domain data. |
55 | A Competence-aware Curriculum for Visual Concepts Learning via Question Answering | Qing Li; Siyuan Huang; Yining Hong; Song-Chun Zhu; | To mimic this efficient learning ability, we propose a competence-aware curriculum for visual concept learning in a question-answering manner. |
56 | Multitask Learning Strengthens Adversarial Robustness | Chengzhi Mao; Amogh Gupta; Vikram Nitin; Baishakhi Ray; Shuran Song; Junfeng Yang; Carl Vondrick; | We present both theoretical and empirical analyses that connect the adversarial robustness of a model to the number of tasks that it is trained on. |
57 | S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search | Zhihang Yuan; Bingzhe Wu; Guangyu Sun; Zheng Liang; Shiwan Zhao; Weichen Bi; | In this paper, we introduce a general framework, S2DNAS, which can transform various static CNN models to support dynamic inference via neural architecture search. |
58 | Improving Deep Video Compression by Resolution-adaptive Flow Coding | Zhihao Hu; Zhenghao Chen; Dong Xu; Guo Lu; Wanli Ouyang; Shuhang Gu; | In this work, we propose a new framework called Resolution-adaptive Flow Coding (RaFC) to effectively compress the flow maps globally and locally, in which we use multi-resolution representations instead of single-resolution representations for both the input flow maps and the output motion features of the MV encoder. |
59 | Motion Capture from Internet Videos | Junting Dong; Qing Shuai; Yuanqing Zhang; Xian Liu; Xiaowei Zhou; Hujun Bao; | To address these challenges, we propose a novel optimization-based framework and experimentally demonstrate its ability to recover much more precise and detailed motion from multiple videos, compared against monocular motion capture methods. |
60 | Appearance-Preserving 3D Convolution for Video-based Person Re-identification | Xinqian Gu; Hong Chang; Bingpeng Ma; Hongkai Zhang; Xilin Chen; | To address this problem, we propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel. |
61 | Solving the Blind Perspective-n-Point Problem End-To-End With Robust Differentiable Geometric Optimization | Dylan Campbell; Liu Liu; Stephen Gould; | We instead propose the first fully end-to-end trainable network for solving the blind PnP problem efficiently and globally, that is, without the need for pose priors. |
62 | Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation | Xingang Pan; Xiaohang Zhan; Bo Dai; Dahua Lin; Chen Change Loy; Ping Luo; | This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images. |
63 | Deep Spatial-angular Regularization for Compressive Light Field Reconstruction over Coded Apertures | Mantang Guo; Junhui Hou; Jing Jin; Jie Chen; Lap-Pui Chau; | To tackle this challenge, we propose a novel learning-based framework for the reconstruction of high-quality LFs from acquisitions via learned coded apertures. |
64 | Video-based Remote Physiological Measurement via Cross-verified Feature Disentangling | Xuesong Niu; Zitong Yu; Hu Han; Xiaobai Li; Shiguang Shan; Guoying Zhao; | To address these challenges, we propose a cross-verified feature disentangling strategy to disentangle the physiological features from non-physiological representations such as head movements and lighting conditions, and then use the distilled physiological features for robust multi-task physiological measurements. |
65 | Combining Implicit Function Learning and Parametric Models for 3D Human Reconstruction | Bharat Lal Bhatnagar; Cristian Sminchisescu; Christian Theobalt; Gerard Pons-Moll; | Given sparse 3D point clouds sampled on the surface of a dressed person, we use an Implicit Part Network (IP-Net) to jointly predict the outer 3D surface of the dressed person, the inner body surface, and the semantic correspondences to a parametric body model. |
66 | Orientation-aware Vehicle Re-identification with Semantics-guided Part Attention Network | Tsai-Shien Chen; Chih-Ting Liu; Chih-Wei Wu; Shao-Yi Chien; | In this work, we propose a dedicated Semantics-guided Part Attention Network (SPAN) to robustly predict part attention masks for different views of vehicles given only image-level semantic labels during training. |
67 | Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation | Guolei Sun; Wenguan Wang; Jifeng Dai; Luc Van Gool; | This paper studies the problem of learning semantic segmentation from image-level supervision only. |
68 | CoReNet: Coherent 3D Scene Reconstruction from a Single RGB Image | Stefan Popov; Pablo Bauszat; Vittorio Ferrari; | Building on common encoder-decoder architectures for this task, we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation-equivariant models while also encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. |
69 | Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs | Lei Huang; Jie Qin; Li Liu; Fan Zhu; Ling Shao; | To this end, we propose layer-wise conditioning analysis, which explores the optimization landscape with respect to each layer independently. |
70 | RAFT: Recurrent All-Pairs Field Transforms for Optical Flow | Zachary Teed; Jia Deng; | We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for estimating optical flow. |
71 | Domain-invariant Stereo Matching Networks | Feihu Zhang; Xiaojuan Qi; Ruigang Yang; Victor Prisacariu; Benjamin Wah; Philip Torr; | In this paper, we aim at designing a domain-invariant stereo matching network (DSMNet) that generalizes well to unseen scenes. |
72 | DeepHandMesh: A Weakly-supervised Deep Encoder-Decoder Framework for High-fidelity Hand Mesh Modeling | Gyeongsik Moon; Takaaki Shiratori; Kyoung Mu Lee; | In this study, we first propose DeepHandMesh, a weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. |
73 | Content Adaptive and Error Propagation Aware Deep Video Compression | Guo Lu; Chunlei Cai; Xiaoyun Zhang; Li Chen; Wanli Ouyang; Dong Xu; Zhiyong Gao; | To address these two problems, we propose a content adaptive and error propagation aware video compression system. |
74 | Towards Streaming Perception | Mengtian Li; Yu-Xiong Wang; Deva Ramanan; | To these ends, we present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception, which we refer to as “streaming accuracy”. |
75 | Towards Automated Testing and Robustification by Semantic Adversarial Data Generation | Rakshith Shetty; Mario Fritz; Bernt Schiele; | In this work, we propose semantic adversarial editing, a method to synthesize plausible but difficult data points on which our target model breaks down. |
76 | Adversarial Generative Grammars for Human Activity Prediction | AJ Piergiovanni; Anelia Angelova; Alexander Toshev; Michael S. Ryoo; | In this paper we propose an adversarial generative grammar model for future prediction. |
77 | GDumb: A Simple Approach that Questions Our Progress in Continual Learning | Ameya Prabhu; Philip H. S. Torr; Puneet K. Dokania; | To validate this, we propose GDumb that (1) greedily stores samples in memory as they come and (2) at test time, trains a model from scratch using samples only in the memory. |
78 | Learning Lane Graph Representations for Motion Forecasting | Ming Liang; Bin Yang; Rui Hu; Yun Chen; Renjie Liao; Song Feng; Raquel Urtasun; | We propose a motion forecasting model that exploits a novel structured map representation as well as actor-map interactions. |
79 | What Matters in Unsupervised Optical Flow | Rico Jonschkowski; Austin Stone; Jonathan T. Barron; Ariel Gordon; Kurt Konolige; Anelia Angelova; | By combining the results of our investigation with our improved model components, we are able to present a new unsupervised flow technique that significantly outperforms the previous unsupervised state-of-the-art and performs on par with supervised FlowNet2 on the KITTI 2015 dataset, while also being significantly simpler than related approaches. |
80 | Synthesis and Completion of Facades from Satellite Imagery | Xiaowei Zhang; Christopher May; Daniel Aliaga; | We present a machine learning-based inverse procedural modeling method to automatically create synthetic facades from satellite imagery. |
81 | Mapillary Planet-Scale Depth Dataset | Manuel López Antequera; Pau Gargallo; Markus Hofinger; Samuel Rota Bulò; Yubin Kuang; Peter Kontschieder; | We introduce a new depth dataset that is an order of magnitude larger than previous datasets, but more importantly, contains an unprecedented gamut of locations, camera models and scene types while offering metric depth (not just up-to-scale). |
82 | V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction | Tsun-Hsuan Wang; Sivabalan Manivasagam; Ming Liang; Bin Yang; Wenyuan Zeng; Raquel Urtasun; | In this paper, we explore the use of vehicle-to-vehicle (V2V) communication to improve the perception and motion forecasting performance of self-driving vehicles. |
83 | Training Interpretable Convolutional Neural Networks by Differentiating Class-specific Filters | Haoyu Liang; Zhihao Ouyang; Yuyuan Zeng; Hang Su; Zihao He; Shu-Tao Xia; Jun Zhu; Bo Zhang; | Inspired by cellular differentiation, we propose a novel strategy to train interpretable CNNs by encouraging class-specific filters, among which each filter responds to only one (or a few) classes. |
84 | EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning | Bailin Li; Bowen Wu; Jiang Su; Guangrun Wang; | In this work, we present a pruning method called EagleEye, in which a simple yet efficient evaluation component based on adaptive batch normalization is applied to unveil a strong correlation between different pruned DNN structures and their final settled accuracy. |
85 | Intrinsic Point Cloud Interpolation via Dual Latent Space Navigation | Marie-Julie Rakotosaona; Maks Ovsjanikov; | We present a learning-based method for interpolating and manipulating 3D shapes represented as point clouds, that is explicitly designed to preserve intrinsic shape properties. |
86 | Cross-Domain Cascaded Deep Translation | Oren Katzir; Dani Lischinski; Daniel Cohen-Or; | We mitigate this by descending the deep layers of a pre-trained network, where the deep features contain more semantics, and applying the translation between these deep features. |
87 | “Look Ma, no landmarks!” – Unsupervised, Model-based Dense Face Alignment | Tatsuro Koizumi; William A. P. Smith; | In this paper, we show how to train an image-to-image network to predict dense correspondence between a face image and a 3D morphable model using only the model for supervision. |
88 | Online Invariance Selection for Local Feature Descriptors | Rémi Pautrat; Viktor Larsson; Martin R. Oswald; Marc Pollefeys; | We propose to overcome this limitation with a disentanglement of invariance in local descriptors and with an online selection of the most appropriate invariance given the context. |
89 | Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations | Hongyu Liu; Bin Jiang; Yibing Song; Wei Huang; Chao Yang; | In this paper, we propose a mutual encoder-decoder CNN for joint recovery of both. |
90 | TextCaps: a Dataset for Image Captioning with Reading Comprehension | Oleksii Sidorov; Ronghang Hu; Marcus Rohrbach; Amanpreet Singh; | To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images. |
91 | It is not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction | Karttikeya Mangalam; Harshayu Girase; Shreyas Agarwal; Kuan-Hui Lee; Ehsan Adeli; Jitendra Malik; Adrien Gaidon; | In this work, we present Predicted Endpoint Conditioned Network (PECNet) for flexible human trajectory prediction. |
92 | Learning What to Learn for Video Object Segmentation | Goutam Bhat; Felix Järemo Lawin; Martin Danelljan; Andreas Robinson; Michael Felsberg; Luc Van Gool; Radu Timofte; | We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. |
93 | SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing | Garvita Tiwari; Bharat Lal Bhatnagar; Tony Tung; Gerard Pons-Moll; | In this paper, we introduce SizerNet to predict 3D clothing conditioned on human body shape and garment size parameters, and ParserNet to infer garment meshes and shape under clothing with personal details in a single pass from an input mesh. |
94 | LIMP: Learning Latent Shape Representations with Metric Preservation Priors | Luca Cosmo; Antonio Norelli; Oshri Halimi; Ron Kimmel; Emanuele Rodolà; | In this paper, we advocate the adoption of metric preservation as a powerful prior for learning latent representations of deformable 3D shapes. |
95 | Unsupervised Sketch to Photo Synthesis | Runtao Liu; Qian Yu; Stella X. Yu; | We study unsupervised sketch to photo synthesis for the first time, learning from unpaired sketch and photo data where the target photo for a sketch is unknown during training. |
96 | A Simple Way to Make Neural Networks Robust Against Diverse Image Corruptions | Evgenia Rusak; Lukas Schott; Roland S. Zimmermann; Julian Bitterwolf; Oliver Bringmann; Matthias Bethge; Wieland Brendel; | Here, we demonstrate that a simple but properly tuned training with additive Gaussian and Speckle noise generalizes surprisingly well to unseen corruptions, easily reaching the state of the art on the corruption benchmark ImageNet-C (with ResNet50) and on MNIST-C. |
97 | SoftPoolNet: Shape Descriptor for Point Cloud Completion and Classification | Yida Wang; David Joseph Tan; Nassir Navab; Federico Tombari; | In this paper, we propose a method for 3D object completion and classification based on point clouds. |
98 | Hierarchical Face Aging through Disentangled Latent Characteristics | Peipei Li; Huaibo Huang; Yibo Hu; Xiang Wu; Ran He; Zhenan Sun; | To explore the age effects on facial images, we propose a Disentangled Adversarial Autoencoder (DAAE) to disentangle the facial images into three independent factors: age, identity and extraneous information. |
99 | Hybrid Models for Open Set Recognition | Hongjie Zhang; Ang Li; Jie Guo; Yanwen Guo; | We propose the OpenHybrid framework, which is composed of an encoder to encode the input data into a joint embedding space, a classifier to classify samples to inlier classes, and a flow-based density estimator to detect whether a sample belongs to the unknown category. |
100 | TopoGAN: A Topology-Aware Generative Adversarial Network | Fan Wang; Huidong Liu; Dimitris Samaras; Chao Chen; | In this paper, we propose a novel GAN model that learns the topology of real images, i.e., connectedness and loopy-ness. |
101 | Learning to Localize Actions from Moments | Fuchen Long; Ting Yao; Zhaofan Qiu; Xinmei Tian; Jiebo Luo; Tao Mei; | In this paper, we introduce a new design of transfer learning type to learn action localization for a large set of action categories, but only on action moments from the categories of interest and temporal annotations of untrimmed videos from a small set of action classes. |
102 | ForkGAN: Seeing into the Rainy Night | Ziqiang Zheng; Yang Wu; Xinran Han; Jianbo Shi; | We present a ForkGAN for task-agnostic image translation that can boost multiple vision tasks in adverse weather conditions. |
103 | TCGM: An Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning | Xinwei Sun; Yilun Xu; Peng Cao; Yuqing Kong; Lingjing Hu; Shanghang Zhang; Yizhou Wang; | In this paper, we propose a novel information-theoretic approach, namely Total Correlation Gain Maximization (TCGM), for semi-supervised multi-modal learning, which is endowed with promising properties: (i) it can effectively utilize the information across different modalities of unlabeled data points to facilitate training classifiers of each modality; (ii) it has a theoretical guarantee to identify Bayesian classifiers, i.e., the ground truth posteriors of all modalities. |
104 | ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Image Retrieval | Quan Cui; Qing-Yuan Jiang; Xiu-Shen Wei; Wu-Jun Li; Osamu Yoshie; | In this paper, we study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. |
105 | TSIT: A Simple and Versatile Framework for Image-to-Image Translation | Liming Jiang; Changxu Zhang; Mingyang Huang; Chunxiao Liu; Jianping Shi; Chen Change Loy; | We introduce a simple and versatile framework for image-to-image translation. |
106 | ProxyBNN: Learning Binarized Neural Networks via Proxy Matrices | Xiangyu He; Zitao Mo; Ke Cheng; Weixiang Xu; Qinghao Hu; Peisong Wang; Qingshan Liu; Jian Cheng; | In this paper, by introducing an appropriate proxy matrix, we reduce the weights quantization error while circumventing explicit binary regularizations on the full-precision auxiliary variables. |
107 | HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation | Can Wang; Jiefeng Li; Wentao Liu; Chen Qian; Cewu Lu; | In this paper, we attempt to address the lack of a global perspective of the top-down approaches by introducing a novel form of supervision – Hierarchical Multi-person Ordinal Relations (HMOR). |
108 | Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve | Weicheng Kuo; Anelia Angelova; Tsung-Yi Lin; Angela Dai; | We present Mask2CAD, which jointly detects objects in real-world images and for each detected object, optimizes for the most similar CAD model and its pose. |
109 | A Unified Framework of Surrogate Loss by Refactoring and Interpolation | Lanlan Liu; Mingzhe Wang; Jia Deng; | We introduce UniLoss, a unified framework to generate surrogate losses for training deep networks with gradient descent, reducing the amount of manual design of task-specific surrogate losses. |
110 | Deep Reflectance Volumes: Relightable Reconstructions from Multi-View Photometric Images | Sai Bi; Zexiang Xu; Kalyan Sunkavalli; Miloš Hašan; Yannick Hold-Geoffroy; David Kriegman; Ravi Ramamoorthi; | We present a deep learning approach to reconstruct scene appearance from unstructured images captured under collocated point lighting. |
111 | Memory-augmented Dense Predictive Coding for Video Representation Learning | Tengda Han; Weidi Xie; Andrew Zisserman; | The objective of this paper is self-supervised learning from video, in particular for representations for action recognition. |
112 | PointMixup: Augmentation for Point Clouds | Yunlu Chen; Vincent Tao Hu; Efstratios Gavves; Thomas Mensink; Pascal Mettes; Pengwan Yang; Cees G. M. Snoek; | In this paper, we define data augmentation between point clouds as a shortest path linear interpolation. |
113 | Identity-Guided Human Semantic Parsing for Person Re-Identification | Kuan Zhu; Haiyun Guo; Zhiwei Liu; Ming Tang; Jinqiao Wang; | In this paper, we propose the identity-guided human semantic parsing approach (ISP) to locate both the human body parts and personal belongings at pixel-level for aligned person re-ID only with person identity labels. |
114 | Learning Gradient Fields for Shape Generation | Ruojin Cai; Guandao Yang; Hadar Averbuch-Elor; Zekun Hao; Serge Belongie; Noah Snavely; Bharath Hariharan; | In this work, we propose a novel technique to generate shapes from point cloud data. |
115 | COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder | Kuniaki Saito; Kate Saenko; Ming-Yu Liu; | To address the issue, we propose a new few-shot image translation model, COCO-FUNIT, which computes the style embedding of the example images conditioned on the input image and a new module called the constant style bias. |
116 | Corner Proposal Network for Anchor-free, Two-stage Object Detection | Kaiwen Duan; Lingxi Xie; Honggang Qi; Song Bai; Qingming Huang; Qi Tian; | This paper proposes a novel anchor-free, two-stage framework which first extracts a number of object proposals by finding potential corner keypoint combinations and then assigns a class label to each proposal by a standalone classification stage. |
117 | PhraseClick: Toward Achieving Flexible Interactive Segmentation by Phrase and Click | Henghui Ding; Scott Cohen; Brian Price; Xudong Jiang; | We propose to employ phrase expressions as another interaction input to infer the attributes of the target object. |
118 | Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing | Yapeng Tian; Dingzeyu Li; Chenliang Xu; | In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as either audible, visible, or both. |
119 | Learning Delicate Local Representations for Multi-Person Pose Estimation | Yuanhao Cai; Zhicheng Wang; Zhengxiong Luo; Binyi Yin; Angang Du; Haoqian Wang; Xiangyu Zhang; Xinyu Zhou; Erjin Zhou; Jian Sun; | In this paper, we propose a novel method called Residual Steps Network (RSN). |
120 | Learning to Plan with Uncertain Topological Maps | Edward Beeching; Jilles Dibangoye; Olivier Simonin; Christian Wolf; | Our main contribution is a data-driven, learning-based approach for planning under uncertainty in topological maps, requiring an estimate of shortest paths in valued graphs with a probabilistic structure. |
121 | Neural Design Network: Graphic Layout Generation with Constraints | Hsin-Ying Lee; Lu Jiang; Irfan Essa; Phuong B Le; Haifeng Gong; Ming-Hsuan Yang; Weilong Yang; | We propose a method for design layout generation that can satisfy user-specified constraints. |
122 | Learning Open Set Network with Discriminative Reciprocal Points | Guangyao Chen; Limeng Qiao; Yemin Shi; Peixi Peng; Jia Li; Tiejun Huang; Shiliang Pu; Yonghong Tian; | In this paper, we propose a new concept, Reciprocal Point, which is the potential representation of the extra-class space corresponding to each known category. |
123 | Convolutional Occupancy Networks | Songyou Peng; Michael Niemeyer; Lars Mescheder; Marc Pollefeys; Andreas Geiger; | In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. |
124 | Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry | He Chen; Pengfei Guo; Pengfei Li; Gim Hee Lee; Gregory Chirikjian; | In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation. |
125 | TIDE: A General Toolbox for Identifying Object Detection Errors | Daniel Bolya; Sean Foley; James Hays; Judy Hoffman; | We introduce TIDE, a framework and associated toolbox for analyzing the sources of error in object detection and instance segmentation algorithms. |
126 | PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding | Saining Xie; Jiatao Gu; Demi Guo; Charles R. Qi; Leonidas Guibas; Or Litany; | In this work, we aim at facilitating research on 3D representation learning. |
127 | DSA: More Efficient Budgeted Pruning via Differentiable Sparsity Allocation | Xuefei Ning; Tianchen Zhao; Wenshuo Li; Peng Lei; Yu Wang; Huazhong Yang; | In this paper, we propose Differentiable Sparsity Allocation (DSA), an efficient end-to-end budgeted pruning flow. |
128 | Circumventing Outliers of AutoAugment with Knowledge Distillation | Longhui Wei; An Xiao; Lingxi Xie; Xiaopeng Zhang; Xin Chen; Qi Tian; | This paper delves deep into the working mechanism, and reveals that AutoAugment may remove part of discriminative information from the training image and so insisting on the ground-truth label is no longer the best option. |
129 | S2DNet: Learning Image Features for Accurate Sparse-to-Dense Matching | Hugo Germain; Guillaume Bourmaud; Vincent Lepetit; | In this paper, we introduce S2DNet, a novel feature matching pipeline, designed and trained to efficiently establish both robust and accurate correspondences. |
130 | RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving | Peixuan Li; Huaici Zhao; Pengfei Liu; Feidao Cao; | In this work, we propose an efficient and accurate monocular 3D detection framework in single shot. |
131 | Video Object Segmentation with Episodic Graph Memory Networks | Xiankai Lu; Wenguan Wang; Martin Danelljan; Tianfei Zhou; Jianbing Shen; Luc Van Gool; | In this work, a graph memory network is developed to address the novel idea of “learning to update the segmentation model”. |
132 | Rethinking Bottleneck Structure for Efficient Mobile Network Design | Daquan Zhou; Qibin Hou; Yunpeng Chen; Jiashi Feng; Shuicheng Yan; | In this paper, we rethink the necessity of such design change and find it may bring risks of information loss and gradient confusion. |
133 | Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks | Jeffrey O. Zhang; Alexander Sax; Amir Zamir; Leonidas Guibas; Jitendra Malik; | The most commonly employed approaches for network adaptation are fine-tuning and using the pre-trained network as a fixed feature extractor, among others. In this paper, we propose a straightforward alternative: side-tuning. |
134 | Towards Part-aware Monocular 3D Human Pose Estimation: An Architecture Search Approach | Zerui Chen; Yan Huang; Hongyuan Yu; Bin Xue; Ke Han; Yiru Guo; Liang Wang; | To accurately estimate 3D poses of different body parts, we attempt to build a part-aware 3D pose estimator by searching a set of network architectures. |
135 | REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets | Angelina Wang; Arvind Narayanan; Olga Russakovsky; | Overall, the key aim of our work is to tackle the machine learning bias problem early in the pipeline. |
136 | Contrastive Learning for Weakly Supervised Phrase Grounding | Tanmay Gupta; Arash Vahdat; Gal Chechik; Xiaodong Yang; Jan Kautz; Derek Hoiem; | We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. |
137 | Collaborative Learning of Gesture Recognition and 3D Hand Pose Estimation with Multi-Order Feature Analysis | Siyuan Yang; Jun Liu; Shijian Lu; Meng Hwa Er; Alex C. Kot; | In this paper, we present a novel collaborative learning network for joint gesture recognition and 3D hand pose estimation. |
138 | Making an Invisibility Cloak: Real World Adversarial Attacks on Object Detectors | Zuxuan Wu; Ser-Nam Lim; Larry S. Davis; Tom Goldstein; | We present a systematic study of adversarial attacks on state-of-the-art object detection frameworks. |
139 | TuiGAN: Learning Versatile Image-to-Image Translation with Two Unpaired Images | Jianxin Lin; Yingxue Pang; Yingce Xia; Zhibo Chen; Jiebo Luo; | In this paper, we argue that even if each domain contains a single image, unpaired image-to-image translation (UI2I) can still be achieved. |
140 | Semi-Siamese Training for Shallow Face Learning | Hang Du; Hailin Shi; Yuchi Liu; Jun Wang; Zhen Lei; Dan Zeng; Tao Mei; | In this paper, we aim to address the problem by introducing a novel training method named Semi-Siamese Training (SST). |
141 | GAN Slimming: All-in-One GAN Compression by A Unified Optimization Framework | Haotao Wang; Shupeng Gui; Haichuan Yang; Ji Liu; Zhangyang Wang; | To this end, we propose the first end-to-end optimization framework combining multiple compression means for GAN compression, dubbed GAN Slimming (GS). |
142 | Human Interaction Learning on 3D Skeleton Point Clouds for Video Violence Recognition | Yukun Su; Guosheng Lin; Jinhui Zhu; Qingyao Wu; | This paper introduces a new method for recognizing violent behavior by learning contextual relationships between related people from human skeleton points. |
143 | Binarized Neural Network for Single Image Super Resolution | Jingwei Xin; Nannan Wang; Xinrui Jiang; Jie Li; Heng Huang; Xinbo Gao; | We propose a simple but effective binary neural network (BNN)-based SISR model with a novel binarization scheme. |
144 | Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation | Huiyu Wang; Yukun Zhu; Bradley Green; Hartwig Adam; Alan Yuille; Liang-Chieh Chen; | In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. |
145 | Adaptive Computationally Efficient Network for Monocular 3D Hand Pose Estimation | Zhipeng Fan; Jun Liu; Yao Wang; | In this paper, we investigate the problem of reducing the overall computation cost yet maintaining the high accuracy for 3D hand pose estimation from video sequences. |
146 | Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking | Jinlong Peng; Changan Wang; Fangbin Wan; Yang Wu; Yabiao Wang; Ying Tai; Chengjie Wang; Jilin Li; Feiyue Huang; Yanwei Fu; | Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all the three subtasks into an end-to-end solution (the first as far as we know). |
147 | Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets | Tong Wu; Qingqiu Huang; Ziwei Liu; Yu Wang; Dahua Lin; | We present a new loss function called Distribution-Balanced Loss for the multi-label recognition problems that exhibit long-tailed class distributions. |
148 | Hamiltonian Dynamics for Real-World Shape Interpolation | Marvin Eisenberger; Daniel Cremers; | We revisit the classical problem of 3D shape interpolation and propose a novel, physically plausible approach based on Hamiltonian dynamics. |
149 | Learning to Scale Multilingual Representations for Vision-Language Tasks | Andrea Burns; Donghyun Kim; Derry Wijaya; Kate Saenko; Bryan A. Plummer; | In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. |
150 | Multi-modal Transformer for Video Retrieval | Valentin Gabeur; Chen Sun; Karteek Alahari; Cordelia Schmid; | In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. |
151 | Feature Representation Matters: End-to-End Learning for Reference-based Image Super-resolution | Yanchun Xie; Jimin Xiao; Mingjie Sun; Chao Yao; Kaizhu Huang; | In this paper, we are aiming for a general reference-based super-resolution setting: it does not require the low-resolution image and the high-resolution reference image to be well aligned or with a similar texture. |
152 | RobustFusion: Human Volumetric Capture with Data-driven Visual Cues using a RGBD Camera | Zhuo Su; Lan Xu; Zerong Zheng; Tao Yu; Yebin Liu; Lu Fang; | In this paper, inspired by the huge potential of learning-based human modeling, we propose RobustFusion, a robust human performance capture system combined with various data-driven visual cues using a single RGBD camera. |
153 | Surface Normal Estimation of Tilted Images via Spatial Rectifier | Tien Do; Khiem Vuong; Stergios I. Roumeliotis; Hyun Soo Park; | In this paper, we present a spatial rectifier to estimate surface normals of tilted images. |
154 | Multimodal Shape Completion via Conditional Generative Adversarial Networks | Rundi Wu; Xuelin Chen; Yixin Zhuang; Baoquan Chen; | Hence, we pose a multi-modal shape completion problem, in which we seek to complete the partial shape with multiple outputs by learning a one-to-many mapping. |
155 | Generative Sparse Detection Networks for 3D Single-shot Object Detection | JunYoung Gwak; Christopher Choy; Silvio Savarese; | To this end, we propose Generative Sparse Detection Network (GSDN), a fully-convolutional single-shot sparse detection network that efficiently generates the support for object proposals. |
156 | Grounded Situation Recognition | Sarah Pratt; Mark Yatskar; Luca Weihs; Ali Farhadi; Aniruddha Kembhavi; | We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g. agent, tool), and bounding-box groundings of entities. |
157 | Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos | Shaoxiang Chen; Wenhao Jiang; Wei Liu; Yu-Gang Jiang; | Inspired by the fact that there exist cross-modal interactions in the human brain, we propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos and thus improve performance on both tasks. |
158 | Unpaired Learning of Deep Image Denoising | Xiaohe Wu; Ming Liu; Yue Cao; Dongwei Ren; Wangmeng Zuo; | We investigate the task of learning blind image denoising networks from an unpaired set of clean and noisy images. |
159 | Self-supervising Fine-grained Region Similarities for Large-scale Image Localization | Yixiao Ge; Haibo Wang; Feng Zhu; Rui Zhao; Hongsheng Li; | To tackle this challenge, we propose to self-supervise image-to-region similarities in order to fully explore the potential of difficult positive images alongside their sub-regions. |
160 | Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | Youngjoong Kwon; Stefano Petrangeli; Dahun Kim; Haoliang Wang; Eunbyung Park; Viswanathan Swaminathan; Henry Fuchs; | To tackle these challenges, we introduce a human-specific framework that employs a learned 3D-aware representation. |
161 | Side-Aware Boundary Localization for More Precise Object Detection | Jiaqi Wang; Wenwei Zhang; Yuhang Cao; Kai Chen; Jiangmiao Pang; Tao Gong; Jianping Shi; Chen Change Loy; Dahua Lin; | In this paper, we propose an alternative approach, named as Side-Aware Boundary Localization (SABL), where each side of the bounding box is respectively localized with a dedicated network branch. |
162 | SF-Net: Single-Frame Supervision for Temporal Action Localization | Fan Ma; Linchao Zhu; Yi Yang; Shengxin Zha; Gourab Kundu; Matt Feiszli; Zheng Shou; | In this paper, we study an intermediate form of supervision, i.e., single-frame supervision, for temporal action localization (TAL). |
163 | Negative Margin Matters: Understanding Margin in Few-shot Classification | Bin Liu; Yue Cao; Yutong Lin; Qi Li; Zheng Zhang; Mingsheng Long; Han Hu; | In this paper, we unconventionally propose to adopt appropriate negative-margin to softmax loss for few-shot classification, which surprisingly works well for the open-set scenarios of few-shot classification. |
164 | Particularity beyond Commonality: Unpaired Identity Transfer with Multiple References | Ruizheng Wu; Xin Tao; Yingcong Chen; Xiaoyong Shen; Jiaya Jia; | We accordingly propose a new multi-reference identity transfer framework by simultaneously making use of particularity and commonality of reference. |
165 | Tracking Objects as Points | Xingyi Zhou; Vladlen Koltun; Philipp Krähenbühl; | In this paper, we present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. |
166 | CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis | Jiadong Liang; Wenjie Pei; Feng Lu; | In this paper, we circumvent this problem by focusing on parsing the content of both the input text and the synthesized image thoroughly to model the text-to-image consistency at the semantic level. |
167 | Transporting Labels via Hierarchical Optimal Transport for Semi-Supervised Learning | Fariborz Taherkhani; Ali Dabouei; Sobhan Soleymani; Jeremy Dawson; Nasser M. Nasrabadi; | In this work, we consider the general setting of the SSL problem for image classification, where the labeled and unlabeled data come from the same underlying distribution. |
168 | MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning | Simon Vandenhende; Stamatios Georgoulis; Luc Van Gool; | In this paper, we argue for the importance of considering task interactions at multiple scales when distilling task information in a multi-task learning setup. |
169 | Learning to Factorize and Relight a City | Andrew Liu; Shiry Ginosar; Tinghui Zhou; Alexei A. Efros; Noah Snavely; | We propose a learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors. |
170 | Region Graph Embedding Network for Zero-Shot Learning | Guo-Sen Xie; Li Liu; Fan Zhu; Fang Zhao; Zheng Zhang; Yazhou Yao; Jie Qin; Ling Shao; | In this paper, to model the relations among local image regions, we incorporate the region-based relation reasoning into ZSL. |
171 | GRAB: A Dataset of Whole-Body Human Grasping of Objects | Omid Taheri; Nima Ghorbani; Michael J. Black; Dimitrios Tzionas; | Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. |
172 | DEMEA: Deep Mesh Autoencoders for Non-Rigidly Deforming Objects | Edgar Tretschk; Ayush Tewari; Michael Zollhöfer; Vladislav Golyanik; Christian Theobalt; | We propose a general-purpose DEep MEsh Autoencoder (DEMEA) which adds a novel embedded deformation layer to a graph-convolutional mesh autoencoder. |
173 | RANSAC-Flow: Generic Two-stage Image Alignment | Xi Shen; François Darmon; Alexei A. Efros; Mathieu Aubry; | We propose a two-stage process: first, a feature-based parametric coarse alignment using one or more homographies, followed by non-parametric fine pixel-wise alignment. |
174 | Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds | Arun Balajee Vasudevan; Dengxin Dai; Luc Van Gool; | We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360° camera. |
175 | Neural Object Learning for 6D Pose Estimation Using a Few Cluttered Images | Kiru Park; Timothy Patten; Markus Vincze; | This paper proposes a method, Neural Object Learning (NOL), that creates synthetic images of objects in arbitrary poses by combining only a few observations from cluttered images. |
176 | Dense Hybrid Recurrent Multi-view Stereo Net with Dynamic Consistency Checking | Jianfeng Yan; Zizhuang Wei; Hongwei Yi; Mingyu Ding; Runze Zhang; Yisong Chen; Guoping Wang; Yu-Wing Tai; | In this paper, we propose an efficient and effective dense hybrid recurrent multi-view stereo net with dynamic consistency checking, namely D²HC-RMVSNet, for accurate dense point cloud reconstruction. |
177 | Pixel-Pair Occlusion Relationship Map (P2ORM): Formulation, Inference & Application | Xuchong Qiu; Yang Xiao; Chaohui Wang; Renaud Marlet; | The former provides a way to generate large-scale accurate occlusion datasets while, based on the latter, we propose a novel method for task-independent pixel-level occlusion relationship estimation from single images. |
178 | MovieNet: A Holistic Dataset for Movie Understanding | Qingqiu Huang; Yu Xiong; Anyi Rao; Jiaze Wang; Dahua Lin; | In this paper, we introduce MovieNet — a holistic dataset for movie understanding. |
179 | Short-Term and Long-Term Context Aggregation Network for Video Inpainting | Ang Li; Shanshan Zhao; Xingjun Ma; Mingming Gong; Jianzhong Qi; Rui Zhang; Dacheng Tao; Ramamohanarao Kotagiri; | In this work, we present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting. |
180 | DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization | Juan Du; Rui Wang; Daniel Cremers; | For relocalization in large-scale point clouds, we propose the first approach that unifies global place recognition and local 6DoF pose refinement. |
181 | Face Super-Resolution Guided by 3D Facial Priors | Xiaobin Hu; Wenqi Ren; John LaMaster; Xiaochun Cao; Xiaoming Li; Zechao Li; Bjoern Menze; Wei Liu; | In this paper, we propose a novel face super-resolution method that explicitly incorporates 3D facial priors which grasp the sharp facial structures. |
182 | Label Propagation with Augmented Anchors: A Simple Semi-Supervised Learning baseline for Unsupervised Domain Adaptation | Yabin Zhang; Bin Deng; Kui Jia; Lei Zhang; | In this work, we take a step further to study the proper extensions of SSL techniques for UDA. |
183 | Are Labels Necessary for Neural Architecture Search? | Chenxi Liu; Piotr Dollár; Kaiming He; Ross Girshick; Alan Yuille; Saining Xie; | In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? |
184 | BLSM: A Bone-Level Skinned Model of the Human Mesh | Haoyang Wang; Riza Alp Güler; Iasonas Kokkinos; George Papandreou; Stefanos Zafeiriou; | We introduce BLSM, a bone-level skinned model of the human body mesh where bone scales are set prior to template synthesis, rather than the common, inverse practice. |
185 | Associative Alignment for Few-shot Image Classification | Arman Afrasiyabi; Jean-François Lalonde; Christian Gagné; | This paper proposes the idea of associative alignment for leveraging part of the base data by aligning the novel training instances to the closely related ones in the base training set. |
186 | Cyclic Functional Mapping: Self-supervised Correspondence between Non-isometric Deformable Shapes | Dvir Ginzburg; Dan Raviv; | We present the first utterly self-supervised network for dense correspondence mapping between non-isometric shapes. |
187 | View-Invariant Probabilistic Embedding for Human Pose | Jennifer J. Sun; Jiaping Zhao; Liang-Chieh Chen; Florian Schroff; Hartwig Adam; Ting Liu; | In this paper, we propose an approach for learning a compact view-invariant embedding space from 2D joint keypoints alone, without explicitly predicting 3D poses. |
188 | Contact and Human Dynamics from Monocular Video | Davis Rempe; Leonidas J. Guibas; Aaron Hertzmann; Bryan Russell; Ruben Villegas; Jimei Yang; | In this paper, we present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input. |
189 | PointPWC-Net: Cost Volume on Point Clouds for (Self-)Supervised Scene Flow Estimation | Wenxuan Wu; Zhi Yuan Wang; Zhuwen Li; Wei Liu; Li Fuxin; | We propose a novel end-to-end deep scene flow model, called PointPWC-Net, that directly processes 3D point cloud scenes with large motions in a coarse-to-fine fashion. |
190 | Points2Surf: Learning Implicit Surfaces from Point Clouds | Philipp Erler; Paul Guerrero; Stefan Ohrhallinger; Niloy J. Mitra; Michael Wimmer; | We present Points2Surf, a novel patch-based learning framework that produces accurate surfaces directly from raw scans without normals. |
191 | Few-Shot Scene-Adaptive Anomaly Detection | Yiwei Lu; Frank Yu; Mahesh Kumar Krishna Reddy; Yang Wang; | In this paper, we propose a novel few-shot scene-adaptive anomaly detection problem to address the limitations of previous approaches. |
192 | Personalized Face Modeling for Improved Face Reconstruction and Motion Retargeting | Bindita Chaudhuri; Noranart Vesdapunt; Linda Shapiro; Baoyuan Wang; | We propose an end-to-end framework that jointly learns a personalized face model per user and per-frame facial motion parameters from a large corpus of in-the-wild videos of user expressions. |
193 | Entropy Minimisation Framework for Event-based Vision Model Estimation | Urbano Miguel Nunes; Yiannis Demiris; | We propose a novel Entropy Minimisation (EMin) framework for event-based vision model estimation. |
194 | Reconstructing NBA Players | Luyang Zhu; Konstantinos Rematas; Brian Curless; Steven M. Seitz; Ira Kemelmacher-Shlizerman; | Based on these models, we introduce a new method that takes as input a single photo of a clothed player performing any basketball pose and outputs a high resolution mesh and pose of that player. |
195 | PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments | Zhiming Chen; Kean Chen; Weiyao Lin; John See; Hui Yu; Yan Ke; Cong Yang; | Therefore, a novel loss, Pixels-IoU (PIoU) Loss, is formulated to exploit both the angle and IoU for accurate OBB regression. |
196 | TENet: Triple Excitation Network for Video Salient Object Detection | Sucheng Ren; Chu Han; Xin Yang; Guoqiang Han; Shengfeng He; | In this paper, we propose a simple yet effective approach, named Triple Excitation Network, to reinforce the training of video salient object detection (VSOD) from three aspects: spatial, temporal, and online excitations. |
197 | Deep Feedback Inverse Problem Solver | Wei-Chiu Ma; Shenlong Wang; Jiayuan Gu; Sivabalan Manivasagam; Antonio Torralba; Raquel Urtasun; | We present an efficient, effective, and generic approach towards solving inverse problems. |
198 | Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification | Liuyu Xiang; Guiguang Ding; Jungong Han; | In this paper, we propose a novel self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME). |
199 | Hallucinating Visual Instances in Total Absentia | Jiayan Qiu; Yiding Yang; Xinchao Wang; Dacheng Tao; | In this paper, we investigate a new visual restoration task, termed hallucinating visual instances in total absentia (HVITA). |
200 | Weakly-supervised 3D Shape Completion in the Wild | Jiayuan Gu; Wei-Chiu Ma; Sivabalan Manivasagam; Wenyuan Zeng; Zihao Wang; Yuwen Xiong; Hao Su; Raquel Urtasun; | To this end, we propose a weakly-supervised method to estimate both 3D canonical shape and 6-DoF pose for alignment, given multiple partial observations associated with the same instance. |
201 | DTVNet: Dynamic Time-lapse Video Generation via Single Still Image | Jiangning Zhang; Chao Xu; Liang Liu; Mengmeng Wang; Xia Wu; Yong Liu; Yunliang Jiang; | This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. |
202 | CLIFFNet for Monocular Depth Estimation with Hierarchical Embedding Loss | Lijun Wang; Jianming Zhang; Yifan Wang; Huchuan Lu; Xiang Ruan; | This paper proposes a hierarchical loss for monocular depth estimation, which measures the differences between the prediction and ground truth in hierarchical embedding spaces of depth maps. |
203 | Collaborative Video Object Segmentation by Foreground-Background Integration | Zongxin Yang; Yunchao Wei; Yi Yang; | This paper investigates the principles of embedding learning to tackle the challenging semi-supervised video object segmentation. |
204 | Adaptive Margin Diversity Regularizer for handling Data Imbalance in Zero-Shot SBIR | Titir Dutta; Anurag Singh; Soma Biswas; | Since most real-world training data have a fair amount of imbalance, in this work, for the first time in the literature, we extensively study the effect of training data imbalance on the generalization to unseen categories, with ZS-SBIR as the application area. |
205 | ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation | Xucong Zhang; Seonwook Park; Thabo Beeler; Derek Bradley; Siyu Tang; Otmar Hilliges; | In this paper, we propose a new gaze estimation dataset called ETH-XGaze, consisting of over one million high-resolution images of varying gaze under extreme head poses. |
206 | Calibration-free Structure-from-Motion with Calibrated Radial Trifocal Tensors | Viktor Larsson; Nicolas Zobernig; Kasim Taskin; Marc Pollefeys; | In this paper we consider the problem of Structure-from-Motion from images with unknown intrinsic calibration. |
207 | Occupancy Anticipation for Efficient Exploration and Navigation | Santhosh K. Ramakrishnan; Ziad Al-Halah; Kristen Grauman; | We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions. |
208 | Unified Image and Video Saliency Modeling | Richard Droste; Jianbo Jiao; J. Alison Noble; | To address this we propose four novel domain adaptation techniques – Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN – in addition to an improved formulation of learned Gaussian priors. |
209 | TAO: A Large-Scale Benchmark for Tracking Any Object | Achal Dave; Tarasha Khurana; Pavel Tokmakov; Cordelia Schmid; Deva Ramanan; | To bridge this gap, we introduce a similarly diverse dataset for Tracking Any Object (TAO). |
210 | A Generalization of Otsu’s Method and Minimum Error Thresholding | Jonathan T. Barron; | We present Generalized Histogram Thresholding (GHT), a simple, fast, and effective technique for histogram-based image thresholding. |
211 | A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks | Unnat Jain; Luca Weihs; Eric Kolve; Ali Farhadi; Svetlana Lazebnik; Aniruddha Kembhavi; Alexander Schwing; | Addressing this, we introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. |
212 | Big Transfer (BiT): General Visual Representation Learning | Alexander Kolesnikov; Lucas Beyer; Xiaohua Zhai; Joan Puigcerver; Jessica Yung; Sylvain Gelly; Neil Houlsby; | We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). |
213 | VisualCOMET: Reasoning about the Dynamic Context of a Still Image | Jae Sung Park; Chandra Bhagavatula; Roozbeh Mottaghi; Ali Farhadi; Yejin Choi; | We propose Visual COMET, the novel framework of visual common-sense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present. |
214 | Few-shot Action Recognition with Permutation-invariant Attention | Hongguang Zhang; Li Zhang; Xiaojuan Qi; Hongdong Li; Philip H. S. Torr; Piotr Koniusz; | Many few-shot learning models focus on recognising images. In contrast, we tackle a challenging task of few-shot action recognition from videos. |
215 | Character Grounding and Re-Identification in Story of Videos and Text Descriptions | Youngjae Yu; Jongseok Kim; Heeseung Yun; Jiwan Chung; Gunhee Kim; | In order to solve these related tasks in a mutually rewarding way, we propose a model named Character in Story Identification Network (CiSIN). |
216 | AABO: Adaptive Anchor Box Optimization for Object Detection via Bayesian Sub-sampling | Wenshuo Ma; Tingzhong Tian; Hang Xu; Yimin Huang; Zhenguo Li; | In this paper, we study the problem of automatically optimizing anchor boxes for object detection. |
217 | Learning Visual Context by Comparison | Minchul Kim; Jongchan Park; Seil Na; Chang Min Park; Donggeun Yoo; | In this paper, we present Attend-and-Compare Module (ACM) for capturing the difference between an object of interest and its corresponding context. |
218 | Large Scale Holistic Video Understanding | Ali Diba; Mohsen Fayyaz; Vivek Sharma; Manohar Paluri; Jürgen Gall; Rainer Stiefelhagen; Luc Van Gool; | We fill this gap by presenting a large-scale “Holistic Video Understanding Dataset” (HVU). |
219 | Indirect Local Attacks for Context-aware Semantic Segmentation Networks | Krishna Kanth Nakka; Mathieu Salzmann; | To this end, we introduce an indirect attack strategy, namely adaptive local attacks, aiming to find the best image location to perturb, while preserving the labels at this location and producing a realistic-looking segmentation map. |
220 | Predicting Visual Overlap of Images Through Interpretable Non-Metric Box Embeddings | Anita Rau; Guillermo Garcia-Hernando; Danail Stoyanov; Gabriel J. Brostow; Daniyar Turmukhambetov; | While we don’t obviate the need for geometric verification, we propose an interpretable image-embedding that cuts the search in scale space to essentially a lookup. |
221 | Connecting Vision and Language with Localized Narratives | Jordi Pont-Tuset; Jasper Uijlings; Soravit Changpinyo; Radu Soricut; Vittorio Ferrari; | We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. |
222 | Adversarial T-shirt! Evading Person Detectors in A Physical World | Kaidi Xu; Gaoyuan Zhang; Sijia Liu; Quanfu Fan; Mengshu Sun; Hongge Chen; Pin-Yu Chen; Yanzhi Wang; Xue Lin; | In this work, we propose adversarial T-shirts, a robust physical adversarial example for evading person detectors even when it undergoes non-rigid deformation due to a moving person’s pose changes. |
223 | Bounding-box Channels for Visual Relationship Detection | Sho Inayoshi; Keita Otani; Antonio Tejero-de-Pablos; Tatsuya Harada; | In this paper, we propose the bounding-box channels, a novel architecture capable of relating the semantic, spatial, and image features strongly. |
224 | Minimal Rolling Shutter Absolute Pose with Unknown Focal Length and Radial Distortion | Zuzana Kukelova; Cenek Albl; Akihiro Sugimoto; Konrad Schindler; Tomas Pajdla; | We present the first minimal solutions for the absolute pose of a rolling shutter camera with unknown rolling shutter parameters, focal length, and radial distortion. |
225 | SRFlow: Learning the Super-Resolution Space with Normalizing Flow | Andreas Lugmayr; Martin Danelljan; Luc Van Gool; Radu Timofte; | In this work, we therefore propose SRFlow: a normalizing flow based super-resolution method capable of learning the conditional distribution of the output given the low-resolution input. |
226 | DeepGMR: Learning Latent Gaussian Mixture Models for Registration | Wentao Yuan; Benjamin Eckart; Kihwan Kim; Varun Jampani; Dieter Fox; Jan Kautz; | In this paper, we introduce Deep Gaussian Mixture Registration (DeepGMR), the first learning-based registration method that explicitly leverages a probabilistic registration paradigm by formulating registration as the minimization of KL-divergence between two probability distributions modeled as mixtures of Gaussians. |
227 | Active Perception using Light Curtains for Autonomous Driving | Siddharth Ancha; Yaadhav Raaj; Peiyun Hu; Srinivasa G. Narasimhan; David Held; | In this work, we propose a method for 3D object recognition using light curtains, a resource-efficient active sensor that measures depth at selected locations in the environment in a controllable manner. |
228 | Invertible Neural BRDF for Object Inverse Rendering | Zhe Chen; Shohei Nobuhara; Ko Nishino; | We introduce a novel neural network-based BRDF model and a Bayesian framework for object inverse rendering, i.e., joint estimation of reflectance and natural illumination from a single image of an object of known geometry. |
229 | Semi-supervised Semantic Segmentation via Strong-weak Dual-branch Network | Wenfeng Luo; Meng Yang; | To fully explore the potential of the weak labels, we propose to impose separate treatments of strong and weak annotations via a strong-weak dual-branch network, which discriminates the massive inaccurate weak supervision from the strong supervision. |
230 | Practical Deep Raw Image Denoising on Mobile Devices | Yuzhi Wang; Haibin Huang; Qin Xu; Jiaming Liu; Yiqun Liu; Jue Wang; | In this work, we propose a light-weight, efficient neural network-based raw image denoiser that runs smoothly on mainstream mobile devices, and produces high quality denoising results. |
231 | SoundSpaces: Audio-Visual Navigation in 3D Environments | Changan Chen; Unnat Jain; Carl Schissler; Sebastia Vicenc Amengual Gari; Ziad Al-Halah; Vamsi Krishna Ithapu; Philip Robinson; Kristen Grauman; | We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. |
232 | Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization | Yuanhao Zhai; Le Wang; Wei Tang; Qilin Zhang; Junsong Yuan; Gang Hua; | In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. |
233 | Erasing Appearance Preservation in Optimization-based Smoothing | Lvmin Zhang; Chengze Li; Yi Ji; Chunping Liu; Tien-tsin Wong; | In this paper, we call this manipulation Erasing Appearance Preservation (EAP). |
234 | Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler | Tsu-Jui Fu; Xin Eric Wang; Matthew F. Peterson; Scott T. Grafton; Miguel P. Eckstein; William Yang Wang; | We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data. |
235 | Guided Deep Decoder: Unsupervised Image Pair Fusion | Tatsumi Uezato; Danfeng Hong; Naoto Yokoya; Wei He; | To address this limitation, in this study, we propose a guided deep decoder network as a general prior. |
236 | Filter Style Transfer between Photos | Jonghwa Yim; Jisung Yoo; Won-joon Do; Beomsu Kim; Jihwan Choe; | In this paper, we introduce a new concept of style transfer, Filter Style Transfer (FST). |
237 | JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image | Linpu Fang; Xingyan Liu; Li Liu; Hang Xu; Wenxiong Kang; | In this paper, a novel pixel-wise prediction-based method is proposed to address the above issues. |
238 | Dynamic Group Convolution for Accelerating Convolutional Neural Networks | Zhuo Su; Linpu Fang; Wenxiong Kang; Dewen Hu; Matti Pietikäinen; Li Liu; | In this paper, we propose dynamic group convolution (DGC) that adaptively selects, on the fly and for each individual sample, which input channels are connected within each group. |
239 | RD-GAN: Few/Zero-Shot Chinese Character Style Transfer via Radical Decomposition and Rendering | Yaoxiong Huang; Mengchao He; Lianwen Jin; Yongpan Wang; | In this paper, a novel radical decomposition-and-rendering-based GAN (RD-GAN) is proposed to utilize the radical-level compositions of Chinese characters and achieve few-shot/zero-shot Chinese character style transfer. |
240 | Object-Contextual Representations for Semantic Segmentation | Yuhui Yuan; Xilin Chen; Jingdong Wang; | In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. |
241 | Efficient Spatio-Temporal Recurrent Neural Network for Video Deblurring | Zhihang Zhong; Ye Gao; Yinqiang Zheng; Bo Zheng; | To improve the network efficiency, we adopt residual dense blocks into RNN cells, so as to efficiently extract the spatial features of the current frame. |
242 | Joint Semantic Instance Segmentation on Graphs with the Semantic Mutex Watershed | Steffen Wolf; Yuyan Li; Constantin Pape; Alberto Bailoni; Anna Kreshuk; Fred A. Hamprecht; | We propose a greedy algorithm for joint graph partitioning and labeling derived from the efficient Mutex Watershed partitioning algorithm. |
243 | Photon-Efficient 3D Imaging with A Non-Local Neural Network | Jiayong Peng; Zhiwei Xiong; Xin Huang; Zheng-Ping Li; Dong Liu; Feihu Xu; | In this paper, we first analyze the long-range correlations in both spatial and temporal dimensions of the measurements. Then we propose a non-local neural network for depth reconstruction by exploiting the long-range correlations. |
244 | GeLaTO: Generative Latent Textured Objects | Ricardo Martin-Brualla; Rohit Pandey; Sofien Bouaziz; Matthew Brown; Dan B Goldman; | Inspired by billboards and geometric proxies used in computer graphics, this paper proposes Generative Latent Textured Objects (GeLaTO), a compact representation that combines a set of coarse shape proxies defining low frequency geometry with learned neural textures, to encode both medium and fine scale geometry as well as view-dependent appearance. |
245 | Improving Vision-and-Language Navigation with Image-Text Pairs from the Web | Arjun Majumdar; Ayush Shrivastava; Stefan Lee; Peter Anderson; Devi Parikh; Dhruv Batra; | Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction (‘…stop at the brown sofa’) and a trajectory of panoramic RGB images captured by the agent. |
246 | Directional Temporal Modeling for Action Recognition | Xinyu Li; Bing Shuai; Joseph Tighe; | In this paper, we introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features. |
247 | Shonan Rotation Averaging: Global Optimality by Surfing SO(p)^n | Frank Dellaert; David M. Rosen; Jing Wu; Robert Mahony; Luca Carlone; | Our method employs semidefinite relaxation in order to recover provably globally optimal solutions of the rotation averaging problem. |
248 | Semantic Curiosity for Active Visual Learning | Devendra Singh Chaplot; Helen Jiang; Saurabh Gupta; Abhinav Gupta; | In this paper, we study the task of embodied interactive learning for object detection. |
249 | Multi-Temporal Recurrent Neural Networks For Progressive Non-Uniform Single Image Deblurring With Incremental Temporal Training | Dongwon Park; Dong Un Kang; Jisoo Kim; Se Young Chun; | To realize the multi-temporal (MT) approach, we propose progressive deblurring over iterations and incremental temporal training with temporally augmented training data. |
250 | ProgressFace: Scale-Aware Progressive Learning for Face Detection | Jiashu Zhu; Dong Li; Tiantian Han; Lu Tian; Yi Shan; | In this work, we propose a novel scale-aware progressive training mechanism to address large scale variations across faces. |
251 | Learning Multi-layer Latent Variable Model via Variational Optimization of Short Run MCMC for Approximate Inference | Erik Nijkamp; Bo Pang; Tian Han; Linqi Zhou; Song-Chun Zhu; Ying Nian Wu; | In this paper, we propose to use noise initialized non-persistent short run MCMC, such as finite step Langevin dynamics initialized from the prior distribution of the latent variables, as an approximate inference engine, where the step size of the Langevin dynamics is variationally optimized by minimizing the Kullback-Leibler divergence between the distribution produced by the short run MCMC and the posterior distribution. |
252 | CoTeRe-Net: Discovering Collaborative Ternary Relations in Videos | Zhensheng Shi; Cheng Guan; Liangjie Cao; Qianqian Li; Ju Liang; Zhaorui Gu; Haiyong Zheng; Bing Zheng; | In this paper, we propose a novel relation model that discovers relations of both implicit and explicit cues as well as their collaboration in videos. |
253 | Modeling the Effects of Windshield Refraction for Camera Calibration | Frank Verbiest; Marc Proesmans; Luc Van Gool; | In this paper, we study the effects of windshield refraction for autonomous driving applications. |
254 | Unsupervised Domain Adaptation for Semantic Segmentation of NIR Images through Generative Latent Search | Prashant Pandey; Aayush Kumar Tyagi; Sameer Ambekar; Prathosh AP; | We propose a method for target-independent segmentation where the ‘nearest-clone’ of a target image in the source domain is searched and used as a proxy in the segmentation network trained only on the source domain. |
255 | PROFIT: A Novel Training Method for sub-4-bit MobileNet Models | Eunhyeok Park; Sungjoo Yoo; | In this work, we report that the activation instability induced by weight quantization (AIWQ) is the key obstacle to sub-4-bit quantization of mobile networks. |
256 | Visual Relation Grounding in Videos | Junbin Xiao; Xindi Shang; Xun Yang; Sheng Tang; Tat-Seng Chua; | In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV). |
257 | Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows | Andrei Zanfir; Eduard Gabriel Bazavan; Hongyi Xu; William T. Freeman; Rahul Sukthankar; Cristian Sminchisescu; | In this paper we present new priors as well as large-scale weakly supervised models for 3D human pose and shape estimation. |
258 | Controlling Style and Semantics in Weakly-Supervised Image Generation | Dario Pavllo; Aurelien Lucchi; Thomas Hofmann; | We propose a weakly-supervised approach for conditional image generation of complex scenes where a user has fine control over objects appearing in the scene. |
259 | Jointly learning visual motion and confidence from local patches in event cameras | Daniel R. Kepple; Daewon Lee; Colin Prepsius; Volkan Isler; Il Memming Park; Daniel D. Lee; | We propose the first network to jointly learn visual motion and confidence from events in spatially local patches. |
260 | SODA: Story Oriented Dense Video Captioning Evaluation Framework | Soichiro Fujita; Tsutomu Hirao; Hidetaka Kamigaito; Manabu Okumura; Masaaki Nagata; | This paper proposes a new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), for measuring the performance of video story description systems. |
261 | Sketch-Guided Object Localization in Natural Images | Aditay Tripathi; Rajath R. Dani; Anand Mishra; Anirban Chakraborty; | We introduce a novel problem of localizing all the instances of an object (seen or unseen during training) in a natural image via sketch query. |
262 | A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses | Malik Boudiaf; Jérôme Rony; Imtiaz Masud Ziko; Eric Granger; Marco Pedersoli; Pablo Piantanida; Ismail Ben Ayed; | However, we provide a theoretical analysis that links the cross-entropy to several well-known and recent pairwise losses. |
263 | Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models | Jize Cao; Zhe Gan; Yu Cheng; Licheng Yu; Yen-Chun Chen; Jingjing Liu; | To reveal the secrets behind the scene, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection) generalizable to standard pre-trained V+L models, to decipher the inner workings of multimodal pre-training (e.g., implicit knowledge garnered in individual attention heads, inherent cross-modal alignment learned through contextualized multimodal embeddings). |
264 | The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement | William Peebles; John Peebles; Jun-Yan Zhu; Alexei Efros; Antonio Torralba; | In this paper, we propose the Hessian Penalty, a simple regularization function that encourages the input Hessian of a function to be diagonal. |
265 | STAR: Sparse Trained Articulated Human Body Regressor | Ahmed A. A. Osman; Timo Bolkart; Michael J. Black; | To address this, we define per-joint pose correctives and learn the subset of mesh vertices that are influenced by each joint movement. This sparse formulation results in more realistic deformations and significantly reduces the number of model parameters to 20% of SMPL. |
266 | Optical Flow Distillation: Towards Efficient and Stable Video Style Transfer | Xinghao Chen; Yiman Zhang; Yunhe Wang; Han Shu; Chunjing Xu; Chang Xu; | This paper proposes to learn a lightweight video style transfer network via a knowledge distillation paradigm. |
267 | Collaboration by Competition: Self-coordinated Knowledge Amalgamation for Multi-talent Student Learning | Sihui Luo; Wenwen Pan; Xinchao Wang; Dazhou Wang; Haihong Tang; Mingli Song; | In this paper, we study how to reuse such heterogeneous pre-trained models as teachers, and build a versatile and compact student model, without accessing human annotations. |
268 | Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians | Shizhen Zhao; Changxin Gao; Jun Zhang; Hao Cheng; Chuchu Han; Xinyang Jiang; Xiaowei Guo; Wei-Shi Zheng; Nong Sang; Xing Sun; | To address this problem, this paper presents a novel deep network termed Pedestrian-Interference Suppression Network (PISNet). |
269 | Learning 3D Part Assembly from a Single Image | Yichen Li; Kaichun Mo; Lin Shao; Minhyuk Sung; Leonidas Guibas; | Towards this end, we introduce a novel problem, single-image-guided 3D part assembly, along with a learning-based solution. |
270 | PT2PC: Learning to Generate 3D Point Cloud Shapes from Part Tree Conditions | Kaichun Mo; He Wang; Xinchen Yan; Leonidas Guibas; | In order to learn such a conditional shape generation procedure in an end-to-end fashion, we propose a conditional GAN “part tree”-to-“point cloud” model (PT2PC) that disentangles the structural and geometric factors. |
271 | Highly Efficient Salient Object Detection with 100K Parameters | Shang-Hua Gao; Yong-Qiang Tan; Ming-Ming Cheng; Chengze Lu; Yunpeng Chen; Shuicheng Yan; | In this paper, we aim to relieve the contradiction between computation cost and model performance by improving the network efficiency to a higher degree. |
272 | HardGAN: A Haze-Aware Representation Distillation GAN for Single Image Dehazing | Qili Deng; Ziling Huang; Chung-Chi Tsai; Chia-Wen Lin; | In this paper, we present a Haze-Aware Representation Distillation Generative Adversarial Network named HardGAN for single-image dehazing. |
273 | Lifespan Age Transformation Synthesis | Roy Or-El; Soumyadip Sengupta; Ohad Fried; Eli Shechtman; Ira Kemelmacher-Shlizerman; | We propose a new multi-domain image-to-image generative adversarial network architecture, whose learned latent space accurately models the continuous aging process in both directions. |
274 | Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation | Xingchao Peng; Yichen Li; Kate Saenko; | To describe and learn relations between different domains, we propose a novel Domain2Vec model to provide vectorial representations of visual domains based on joint learning of feature disentanglement and Gram matrix. |
275 | Simulating Content Consistent Vehicle Datasets with Attribute Descent | Yue Yao; Liang Zheng; Xiaodong Yang; Milind Naphade; Tom Gedeon; | We propose an attribute descent approach to let VehicleX approximate the attributes in real-world datasets. |
276 | Multiview Detection with Feature Perspective Transformation | Yunzhong Hou; Liang Zheng; Stephen Gould; | To address these questions, we introduce a novel multiview detector, MVDet. |
277 | Learning Object Relation Graph and Tentative Policy for Visual Navigation | Heming Du; Xin Yu; Liang Zheng; | Aiming to improve these two components, this paper proposes three complementary techniques: object relation graph (ORG), trial-driven imitation learning (IL), and a memory-augmented tentative policy network (TPN). |
278 | Adversarial Self-Supervised Learning for Semi-Supervised 3D Action Recognition | Chenyang Si; Xuecheng Nie; Wei Wang; Liang Wang; Tieniu Tan; Jiashi Feng; | To address these issues, we present Adversarial Self-Supervised Learning (ASSL), a novel framework that tightly couples SSL and the semi-supervised scheme via neighbor relation exploration and adversarial learning. |
279 | Across Scales & Across Dimensions: Temporal Super-Resolution using Deep Internal Learning | Liad Pollak Zuckerman; Eyal Naor; George Pisha; Shai Bagon; Michal Irani; | In this paper we propose a “Deep Internal Learning” approach for true TSR. |
280 | Inducing Optimal Attribute Representations for Conditional GANs | Binod Bhattarai; Tae-Kyun Kim; | We propose a novel end-to-end learning framework based on Graph Convolutional Networks to learn the attribute representations to condition the generator. |
281 | AR-Net: Adaptive Frame Resolution for Efficient Action Recognition | Yue Meng; Chung-Ching Lin; Rameswar Panda; Prasanna Sattigeri; Leonid Karlinsky; Aude Oliva; Kate Saenko; Rogerio Feris; | In this paper, we propose a novel approach, called AR-Net (Adaptive Resolution Network), that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition in long untrimmed videos. |
282 | Image-to-Voxel Model Translation for 3D Scene Reconstruction and Segmentation | Vladimir V. Kniaz; Vladimir A. Knyaz; Fabio Remondino; Artem Bordodymov; Petr Moshkantsev; | We propose a single shot image-to-semantic voxel model translation framework. We collected a SemanticVoxels dataset with 116k images, ground-truth semantic voxel models, depth maps, and 6D object poses. |
283 | Consistency Guided Scene Flow Estimation | Yuhua Chen; Luc Van Gool; Cordelia Schmid; Cristian Sminchisescu; | The model takes two temporal stereo pairs as input, and predicts disparity and scene flow. |
284 | Autoregressive Unsupervised Image Segmentation | Yassine Ouali; Céline Hudelot; Myriam Tami; | In this work, we propose a new unsupervised image segmentation approach based on mutual information maximization between different constructed views of the inputs. |
285 | Controllable Image Synthesis via SegVAE | Yen-Chi Cheng; Hsin-Ying Lee; Min Sun; Ming-Hsuan Yang; | In this work, we specifically target generating semantic maps given a label set consisting of desired categories. |
286 | Off-Policy Reinforcement Learning for Efficient and Effective GAN Architecture Search | Yuan Tian; Qin Wang; Zhiwu Huang; Wen Li; Dengxin Dai; Minghao Yang; Jun Wang; Olga Fink; | In this paper, we introduce a new reinforcement learning (RL) based neural architecture search (NAS) methodology for effective and efficient generative adversarial network (GAN) architecture search. |
287 | Efficient Non-Line-of-Sight Imaging from Transient Sinograms | Mariko Isogawa; Dorian Chan; Ye Yuan; Kris Kitani; Matthew O’Toole; | We propose a circular and confocal non-line-of-sight (C$^2$NLOS) scan that involves illuminating and imaging a common point, and scanning this point in a circular path along a wall. |
288 | Texture Hallucination for Large-Factor Painting Super-Resolution | Yulun Zhang; Zhifei Zhang; Stephen DiVerdi; Zhaowen Wang; Jose Echevarria; Yun Fu; | We aim to super-resolve digital paintings, synthesizing realistic details from high-resolution reference painting materials for very large scaling factors (e.g., 8×, 16×). |
289 | Learning Progressive Joint Propagation for Human Motion Prediction | Yujun Cai; Lin Huang; Yiwei Wang; Tat-Jen Cham; Jianfei Cai; Junsong Yuan; Jun Liu; Xu Yang; Yiheng Zhu; Xiaohui Shen; Ding Liu; Jing Liu; Nadia Magnenat Thalmann; | In this paper, we address this problem in three aspects. First, to capture the long-range spatial correlations and temporal dependencies, we apply a transformer-based architecture with the global attention mechanism. |
290 | Image Stitching and Rectification for Hand-Held Cameras | Bingbing Zhuang; Quoc-Huy Tran; | In this paper, we derive a new differential homography that can account for the scanline-varying camera poses in Rolling Shutter (RS) cameras, and demonstrate its application to carry out RS-aware image stitching and rectification at one stroke. |
291 | ParSeNet: A Parametric Surface Fitting Network for 3D Point Clouds | Gopal Sharma; Difan Liu; Subhransu Maji; Evangelos Kalogerakis; Siddhartha Chaudhuri; Radomír Měch; | We propose a novel, end-to-end trainable, deep network called ParSeNet. |
292 | The Group Loss for Deep Metric Learning | Ismail Elezi; Sebastiano Vascon; Alessandro Torcinovich; Marcello Pelillo; Laura Leal-Taixé; | We propose Group Loss, a loss function based on a differentiable label-propagation method that enforces embedding similarity across all samples of a group while promoting, at the same time, low-density regions amongst data points belonging to different groups. |
293 | Learning Object Depth from Camera Motion and Video Object Segmentation | Brent A. Griffin; Jason J. Corso; | To leverage this progress in 3D applications, this paper addresses the problem of learning to estimate the depth of segmented objects given some measurement of camera motion (e.g., from robot kinematics or vehicle odometry). |
294 | OnlineAugment: Online Data Augmentation with Less Domain Knowledge | Zhiqiang Tang; Yunhe Gao; Leonid Karlinsky; Prasanna Sattigeri; Rogerio Feris; Dimitris Metaxas; | In this work, we offer an orthogonal online data augmentation scheme together with three new augmentation networks, co-trained with the target learning task. |
295 | Learning Pairwise Inter-Plane Relations for Piecewise Planar Reconstruction | Yiming Qian; Yasutaka Furukawa; | This paper proposes a novel single-image piecewise planar reconstruction technique that infers and enforces inter-plane relationships. |
296 | Intra-class Feature Variation Distillation for Semantic Segmentation | Yukang Wang; Wei Zhou; Tao Jiang; Xiang Bai; Yongchao Xu; | In this paper, different from previous methods performing knowledge distillation for densely pairwise relations, we propose a novel intra-class feature variation distillation (IFVD) to transfer the intra-class feature variation (IFV) of the cumbersome model (teacher) to the compact model (student). |
297 | Temporal Distinct Representation Learning for Action Recognition | Junwu Weng; Donghao Luo; Yabiao Wang; Ying Tai; Chengjie Wang; Jilin Li; Feiyue Huang; Xudong Jiang; Junsong Yuan; | In this paper, we attempt to tackle this issue in two ways. 1) Design a sequential channel filtering mechanism, Progressive Enhancement Module (PEM), to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction. 2) Create a Temporal Diversity Loss (TD Loss) to force the kernels to concentrate on and capture the variations among frames rather than the image regions with similar appearance. |
298 | Representative Graph Neural Network | Changqian Yu; Yifan Liu; Changxin Gao; Chunhua Shen; Nong Sang; | In this paper, we present a Representative Graph (RepGraph) layer to dynamically sample a few representative features, which dramatically reduces redundancy. |
299 | Deformation-Aware 3D Model Embedding and Retrieval | Mikaela Angelina Uy; Jingwei Huang; Minhyuk Sung; Tolga Birdal; Leonidas Guibas; | We introduce a new problem of retrieving 3D models that are deformable to a given query shape and present a novel deep deformation-aware embedding to solve this retrieval task. |
300 | Atlas: End-to-End 3D Scene Reconstruction from Posed Images | Zak Murez; Tarrence van As; James Bartolozzi; Ayan Sinha; Vijay Badrinarayanan; Andrew Rabinovich; | We present an end-to-end 3D reconstruction of a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. |
301 | Multiple Class Novelty Detection Under Data Distribution Shift | Poojan Oza; Hien V. Nguyen; Vishal M. Patel; | To this end, we consider the problem of multiple class novelty detection under dataset distribution shift to improve the novelty detection performance. |
302 | Colorization of Depth Map via Disentanglement | Chung-Sheng Lai; Zunzhi You; Ching-Chun Huang; Yi-Hsuan Tsai; Wei-Chen Chiu; | In this paper, we propose a depth map colorization method via disentangling appearance and structure factors, so that our model could 1) learn depth-invariant appearance features from an appearance reference and 2) generate colorized images by combining a given depth map and the appearance feature obtained from any reference. |
303 | Beyond Controlled Environments: 3D Camera Re-Localization in Changing Indoor Scenes | Johanna Wald; Torsten Sattler; Stuart Golodetz; Tommaso Cavallari; Federico Tombari; | In this paper, we adapt 3RScan — a recently introduced indoor RGB-D dataset designed for object instance re-localization — to create RIO10, a new long-term camera re-localization benchmark focused on indoor scenes. |
304 | GeoGraph: Graph-based multi-view object detection with geometric cues end-to-end | Ahmed Samy Nassar; Stefano D’Aronco; Sébastien Lefèvre; Jan D. Wegner; | In this paper, we propose an end-to-end learnable approach that detects static urban objects from multiple views, re-identifies instances, and finally assigns a geographic position per object. |
305 | Localizing the Common Action Among a Few Videos | Pengwan Yang; Vincent Tao Hu; Pascal Mettes; Cees G. M. Snoek; | To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. |
306 | TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot classification | Moshe Lichtenstein; Prasanna Sattigeri; Rogerio Feris; Raja Giryes; Leonid Karlinsky; | In this paper, we propose yet another simple technique that is important for few-shot learning performance: a search for a compact feature sub-space that is discriminative for a given few-shot test task. |
307 | Traffic Accident Benchmark for Causality Recognition | Tackgeun You; Bohyung Han; | We propose a brand new benchmark for analyzing causality in traffic accident videos by decomposing an accident into a pair of events, cause and effect. |
308 | Face Anti-Spoofing with Human Material Perception | Zitong Yu; Xiaobai Li; Xuesong Niu; Jingang Shi; Guoying Zhao; | In this paper we rephrase face anti-spoofing as a material recognition problem and combine it with classical human material perception, intending to extract discriminative and robust features for FAS. |
309 | How Can I See My Future? FvTraj: Using First-person View for Pedestrian Trajectory Prediction | Huikun Bi; Ruisi Zhang; Tianlu Mao; Zhigang Deng; Zhaoqi Wang; | This work presents a novel First-person View based Trajectory predicting model (FvTraj) to estimate the future trajectories of pedestrians in a scene given their observed trajectories and the corresponding first-person view images. |
310 | Multiple Expert Brainstorming for Domain Adaptive Person Re-identification | Yunpeng Zhai; Qixiang Ye; Shijian Lu; Mengxi Jia; Rongrong Ji; Yonghong Tian; | In this paper, we propose a multiple expert brainstorming network (MEB-Net) for domain adaptive person re-ID, opening up a promising direction for the model ensemble problem under unsupervised conditions. |
311 | NASA Neural Articulated Shape Approximation | Boyang Deng; JP Lewis; Timothy Jeruzalski; Gerard Pons-Moll; Geoffrey Hinton; Mohammad Norouzi; Andrea Tagliasacchi; | This paper introduces neural articulated shape approximation (NASA), an alternative framework that enables efficient representation of articulated deformable objects using neural indicator functions that are conditioned on pose. |
312 | Towards Unique and Informative Captioning of Images | Zeyu Wang; Berthy Feng; Karthik Narasimhan; Olga Russakovsky; | We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground truth captions, and that evaluation metrics like SPICE can be ‘topped’ using simple captioning systems relying on object detectors. |
313 | When Does Self-supervision Improve Few-shot Learning? | Jong-Chyi Su; Subhransu Maji; Bharath Hariharan; | Based on this analysis we present a technique that automatically selects images for SSL from a large, generic pool of unlabeled images for a given dataset that provides further improvements. |
314 | Two-branch Recurrent Network for Isolating Deepfakes in Videos | Iacopo Masi; Aditya Killekar; Royston Marian Mascarenhas; Shenoy Pratik Gurudatt; Wael AbdAlmageed; | We present a method for deepfake detection based on a two-branch network structure that isolates digitally manipulated faces by learning to amplify artifacts while suppressing the high-level face content. |
315 | Incremental Few-Shot Meta-Learning via Indirect Discriminant Alignment | Qing Liu; Orchid Majumder; Alessandro Achille; Avinash Ravichandran; Rahul Bhotika; Stefano Soatto; | We propose a method to train a model so it can learn new classification tasks while improving with each task solved. |
316 | BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models | Jiahui Yu; Pengchong Jin; Hanxiao Liu; Gabriel Bender; Pieter-Jan Kindermans; Mingxing Tan; Thomas Huang; Xiaodan Song; Ruoming Pang; Quoc Le; | In this work, we propose BigNAS, an approach that challenges the conventional wisdom that post-processing of the weights is necessary to get good prediction accuracies. |
317 | Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation | Sheng Jin; Wentao Liu; Enze Xie; Wenhai Wang; Chen Qian; Wanli Ouyang; Ping Luo; | In this paper, we investigate a new perspective of human part grouping and reformulate it as a graph clustering task. |
318 | Global Distance-distributions Separation for Unsupervised Person Re-identification | Xin Jin; Cuiling Lan; Wenjun Zeng; Zhibo Chen; | To address this problem, we introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage the clear separation of positive and negative samples from a global view. |
319 | I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image | Gyeongsik Moon; Kyoung Mu Lee; | To resolve the above issues, we propose I2L-MeshNet, an image-to-lixel (line+pixel) prediction network. |
320 | Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose | Hongsuk Choi; Gyeongsik Moon; Kyoung Mu Lee; | To overcome the above weaknesses, we propose Pose2Mesh, a novel graph convolutional neural network (GraphCNN)-based system that estimates the 3D coordinates of human mesh vertices directly from the 2D human pose. |
321 | ALRe: Outlier Detection for Guided Refinement | Mingzhu Zhu; Zhang Gao; Junzhi Yu; Bingwei He; Jiantao Liu; | In this paper, we propose a general outlier detection method for guided refinement. |
322 | Weakly-Supervised Crowd Counting Learns from Sorting rather than Locations | Yifan Yang; Guorong Li; Zhe Wu; Li Su; Qingming Huang; Nicu Sebe; | In this paper, we propose a weakly-supervised counting network, which directly regresses the crowd numbers without the location supervision. |
323 | Unsupervised Domain Attention Adaptation Network for Caricature Attribute Recognition | Wen Ji; Kelei He; Jing Huo; Zheng Gu; Yang Gao; | To facilitate research in attribute learning of caricatures, we propose a caricature attribute dataset, namely WebCariA. |
324 | Many-shot from Low-shot: Learning to Annotate using Mixed Supervision for Object Detection | Carlo Biffi; Steven McDonagh; Philip Torr; Aleš Leonardis; Sarah Parisot; | Towards solving this problem we introduce, for the first time, an online annotation module (OAM) that learns to generate a many-shot set of reliable annotations from a larger volume of weakly labelled images. |
325 | Curriculum DeepSDF | Yueqi Duan; Haidong Zhu; He Wang; Li Yi; Ram Nevatia; Leonidas J. Guibas; | In this paper, we design a “shape curriculum” for learning continuous Signed Distance Function (SDF) on shapes, namely Curriculum DeepSDF. |
326 | Meshing Point Clouds with Predicted Intrinsic-Extrinsic Ratio Guidance | Minghua Liu; Xiaoshuai Zhang; Hao Su; | Instead, we propose to leverage the input point cloud as much as possible, by only adding connectivity information to existing points. |
327 | Improved Adversarial Training via Learned Optimizer | Yuanhao Xiong; Cho-Jui Hsieh; | In this paper, we empirically demonstrate that the commonly used PGD attack may not be optimal for inner maximization, and that an improved inner optimizer can lead to a more robust model. |
328 | Component Divide-and-Conquer for Real-World Image Super-Resolution | Pengxu Wei; Ziwei Xie; Hannan Lu; Zongyuan Zhan; Qixiang Ye; Wangmeng Zuo; Liang Lin; | In this paper, we present a large-scale Diverse Real-world image Super-Resolution dataset, i.e., DRealSR, as well as a divide-and-conquer Super-Resolution (SR) network, exploring the utility of guiding SR model with low-level image components. |
329 | Enabling Deep Residual Networks for Weakly Supervised Object Detection | Yunhang Shen; Rongrong Ji; Yan Wang; Zhiwei Chen; Feng Zheng; Feiyue Huang; Yunsheng Wu; | In this paper, we discover the intrinsic root with sophisticated analysis and propose a sequence of design principles to take full advantage of deep residual learning for WSOD from the perspectives of adding redundancy, improving robustness and aligning features. |
330 | Deep near-light photometric stereo for spatially varying reflectances | Hiroaki Santo; Michael Waechter; Yasuyuki Matsushita; | This paper presents a near-light photometric stereo method for spatially varying reflectances. |
331 | Learning Visual Representations with Caption Annotations | Mert Bulent Sariyildiz; Julien Perez; Diane Larlus; | To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks. |
332 | Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier | Tz-Ying Wu; Pedro Morgado; Pei Wang; Chih-Hui Ho; Nuno Vasconcelos; | Motivated by this, we propose a deep realistic taxonomic classifier (Deep-RTC) as a new solution to the long-tail problem, combining realism with hierarchical predictions. |
333 | Regression of Instance Boundary by Aggregated CNN and GCN | Yanda Meng; Wei Meng; Dongxu Gao; Yitian Zhao; Xiaoyun Yang; Xiaowei Huang; Yalin Zheng; | This paper proposes a straightforward, intuitive deep learning approach for (biomedical) image segmentation tasks. |
334 | Social Adaptive Module for Weakly-supervised Group Activity Recognition | Rui Yan; Lingxi Xie; Jinhui Tang; Xiangbo Shu; Qi Tian; | This paper presents a new task named weakly-supervised group activity recognition (GAR) which differs from conventional GAR tasks in that only video-level labels are available, yet the important persons within each frame are not provided even in the training data. |
335 | RGB-D Salient Object Detection with Cross-Modality Modulation and Selection | Chongyi Li; Runmin Cong; Yongri Piao; Qianqian Xu; Chen Change Loy; | We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD). |
336 | RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval | Hung-Yu Tseng; Hsin-Ying Lee; Lu Jiang; Ming-Hsuan Yang; Weilong Yang; | In this work, we aim to synthesize images from scene description with retrieved patches as reference. |
337 | Cheaper Pre-training Lunch: An Efficient Paradigm for Object Detection | Dongzhan Zhou; Xinchi Zhou; Hongwen Zhang; Shuai Yi; Wanli Ouyang; | In this paper, we propose a general and efficient pre-training paradigm, Montage pre-training, for object detection. |
338 | Faster Person Re-Identification | Guan’an Wang; Shaogang Gong; Jian Cheng; Zengguang Hou; | In this work, we introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine (CtF) hashing code search strategy, which complementarily uses short and long codes, achieving both faster speed and better accuracy. |
339 | Quantization Guided JPEG Artifact Correction | Max Ehrlich; Ser-Nam Lim; Larry Davis; Abhinav Shrivastava; | We solve this problem by creating a novel architecture which is parameterized by the JPEG file’s quantization matrix. |
340 | 3PointTM: Faster Measurement of High-Dimensional Transmission Matrices | Yujun Chen; Manoj Kumar Sharma; Ashutosh Sabharwal; Ashok Veeraraghavan; Aswin C. Sankaranarayanan; | In this paper, we propose 3PointTM, an approach for sensing TMs that uses a minimal number of measurements per pixel – reducing the measurement budget by a factor of two as compared to state of the art in phase-shifting holography for measuring TMs – and has a low computational complexity as compared to phase retrieval. |
341 | Joint Bilateral Learning for Real-time Universal Photorealistic Style Transfer | Xide Xia; Meng Zhang; Tianfan Xue; Zheng Sun; Hui Fang; Brian Kulis; Jiawen Chen; | We propose a new end-to-end model for photorealistic style transfer that is both fast and inherently generates photorealistic results. |
342 | Beyond 3DMM Space: Towards Fine-grained 3D Face Reconstruction | Xiangyu Zhu; Fan Yang; Di Huang; Chang Yu; Hao Wang; Jianzhu Guo; Zhen Lei; Stan Z. Li; | We also propose a Fine-Grained reconstruction Network (FGNet) that can concentrate on shape modification by warping the network input and output to the UV space. |
343 | World-Consistent Video-to-Video Synthesis | Arun Mallya; Ting-Chun Wang; Karan Sapra; Ming-Yu Liu; | In this work, we propose a framework for utilizing all past generated frames when synthesizing each frame. |
344 | Commonality-Parsing Network across Shape and Appearance for Partially Supervised Instance Segmentation | Qi Fan; Lei Ke; Wenjie Pei; Chi-Keung Tang; Yu-Wing Tai; | We propose to learn the underlying class-agnostic commonalities that can be generalized from mask-annotated categories to novel categories. |
345 | GMNet: Graph Matching Network for Large Scale Part Semantic Segmentation in the Wild | Umberto Michieli; Edoardo Borsato; Luca Rossi; Pietro Zanuttigh; | In this work, we propose a novel framework combining higher object-level context conditioning and part-level spatial relationships to address the task. |
346 | Event-based Asynchronous Sparse Convolutional Networks | Nico Messikommer; Daniel Gehrig; Antonio Loquercio; Davide Scaramuzza; | In this work, we present a general framework for converting models trained on synchronous image-like event representations into asynchronous models with identical output, thus directly leveraging the intrinsic asynchronous and sparse nature of the event data. |
347 | AtlantaNet: Inferring the 3D Indoor Layout from a Single 360° Image beyond the Manhattan World Assumption | Giovanni Pintore; Marco Agus; Enrico Gobbetti; | We introduce a novel end-to-end approach to predict a 3D room layout from a single panoramic image. |
348 | AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification | Xiaofang Wang; Xuehan Xiong; Maxim Neumann; AJ Piergiovanni; Michael S. Ryoo; Anelia Angelova; Kris M. Kitani; Wei Hua; | We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. |
349 | REMIND Your Neural Network to Prevent Catastrophic Forgetting | Tyler L. Hayes; Kushal Kafle; Robik Shrestha; Manoj Acharya; Christopher Kanan; | Here, we propose REMIND, a brain-inspired approach that enables efficient replay with compressed representations. |
350 | Image Classification in the Dark using Quanta Image Sensors | Abhiram Gnanasambandam; Stanley H. Chan; | In this paper, we present a new low-light image classification solution using Quanta Image Sensors (QIS). |
351 | n-Reference Transfer Learning for Saliency Prediction | Yan Luo; Yongkang Wong; Mohan S. Kankanhalli; Qi Zhao; | To solve this problem, we propose a few-shot transfer learning paradigm for saliency prediction, which enables efficient transfer of knowledge learned from the existing large-scale saliency datasets to a target domain with limited labeled examples. |
352 | Progressively Guided Alternate Refinement Network for RGB-D Salient Object Detection | Shuhan Chen; Yun Fu; | In this paper, we aim to develop an efficient and compact deep network for RGB-D salient object detection, where the depth image provides complementary information to boost performance in complex scenarios. |
353 | Bottom-Up Temporal Action Localization with Mutual Regularization | Peisen Zhao; Lingxi Xie; Chen Ju; Ya Zhang; Yanfeng Wang; Qi Tian; | To alleviate this problem, we introduce two regularization terms to mutually regularize the learning procedure: the Intra-phase Consistency (IntraC) regularization is proposed to make the predictions verified inside each phase and the Inter-phase Consistency (InterC) regularization is proposed to keep consistency between these phases. |
354 | On Modulating the Gradient for Meta-Learning | Christian Simon; Piotr Koniusz; Richard Nock; Mehrtash Harandi; | Inspired by optimization techniques, we propose a novel meta-learning algorithm with gradient modulation to encourage fast-adaptation of neural networks in the absence of abundant data. |
355 | Domain-Specific Mappings for Generative Adversarial Style Transfer | Hsin-Yu Chang; Zhixiang Wang; Yung-Yu Chuang; | To address this issue, this paper leverages domain-specific mappings for remapping latent features in the shared content space to domain-specific content spaces. |
356 | DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning | Timo Milbich; Karsten Roth; Homanga Bharadhwaj; Samarth Sinha; Yoshua Bengio; Björn Ommer; Joseph Paul Cohen; | To this end, we propose and study multiple complementary learning tasks, targeting conceptually different data relationships by only resorting to the available training samples and labels of a standard DML setting. |
357 | DHP: Differentiable Meta Pruning via HyperNetworks | Yawei Li; Shuhang Gu; Kai Zhang; Luc Van Gool; Radu Timofte; | To circumvent this problem, this paper introduces a differentiable pruning method via hypernetworks for automatic network pruning. |
358 | Deep Transferring Quantization | Zheng Xie; Zhiquan Wen; Jing Liu; Zhiqiang Liu; Xixian Wu; Mingkui Tan; | Specifically, we propose a method named deep transferring quantization (DTQ) to effectively exploit the knowledge in a pre-trained full-precision model. |
359 | Deep Credible Metric Learning for Unsupervised Domain Adaptation Person Re-identification | Guangyi Chen; Yuhao Lu; Jiwen Lu; Jie Zhou; | In this paper, we propose a deep credible metric learning (DCML) method for unsupervised domain adaptation person re-identification. |
360 | Temporal Coherence or Temporal Motion: Which is More Critical for Video-based Person Re-identification? | Guangyi Chen; Yongming Rao; Jiwen Lu; Jie Zhou; | To distill the temporal coherence part of video representation from frame representations, we propose a simple yet effective Adversarial Feature Augmentation (AFA) method, which highlights the temporal coherence features by introducing adversarial augmented temporal motion noise. |
361 | Arbitrary-Oriented Object Detection with Circular Smooth Label | Xue Yang; Junchi Yan; | We design a new rotation detection baseline, to address the boundary problem by transforming angular prediction from a regression problem to a classification task with little accuracy loss, whereby high-precision angle classification is devised in contrast to previous works using coarse-granularity in rotation detection. |
362 | Learning Event-Driven Video Deblurring and Interpolation | Songnan Lin; Jiawei Zhang; Jinshan Pan; Zhe Jiang; Dongqing Zou; Yongtian Wang; Jing Chen; Jimmy Ren; | In this paper, we propose an effective event-driven video deblurring and interpolation algorithm based on deep convolutional neural networks (CNNs). |
363 | Vectorizing World Buildings: Planar Graph Reconstruction by Primitive Detection and Relationship Inference | Nelson Nauata; Yasutaka Furukawa; | This paper tackles a 2D architecture vectorization problem, whose task is to infer an outdoor building architecture as a 2D planar graph from a single RGB image. |
364 | Learning to Combine: Knowledge Aggregation for Multi-Source Domain Adaptation | Hang Wang; Minghao Xu; Bingbing Ni; Wenjun Zhang; | To mitigate these problems, we propose a Learning to Combine for Multi-Source Domain Adaptation (LtC-MSDA) framework via exploring interactions among domains. |
365 | CSCL: Critical Semantic-Consistent Learning for Unsupervised Domain Adaptation | Jiahua Dong; Yang Cong; Gan Sun; Yuyang Liu; Xiaowei Xu; | To address the above challenges, we develop a new Critical Semantic-Consistent Learning (CSCL) model, which mitigates the discrepancy of both domain-wise and category-wise distributions. |
366 | Prototype Mixture Models for Few-shot Semantic Segmentation | Boyu Yang; Chang Liu; Bohao Li; Jianbin Jiao; Qixiang Ye; | In this paper, we propose prototype mixture models (PMMs), which correlate diverse image regions with multiple prototypes to enforce the prototype-based semantic representation. |
367 | Webly Supervised Image Classification with Self-Contained Confidence | Jingkang Yang; Litong Feng; Weirong Chen; Xiaopeng Yan; Huabin Zheng; Ping Luo; Wayne Zhang; | Inspired by the ability of DNNs to predict confidence, we introduce self-contained confidence (SCC) by adapting model uncertainty to the WSL setting, and use it to balance $\mathcal{L}_s$ and $\mathcal{L}_w$ on a per-sample basis. |
368 | Search What You Want: Barrier Panelty NAS for Mixed Precision Quantization | Haibao Yu; Qi Han; Jianbo Li; Jianping Shi; Guangliang Cheng; Bin Fan; | In this paper, we propose a novel soft Barrier Penalty based NAS (BP-NAS) for mixed precision quantization, which ensures that all searched models lie inside the valid domain defined by the complexity constraint, and thus can return an optimal model under the given constraint with a single search. |
369 | Monocular 3D Object Detection via Feature Domain Adaptation | Lele Chen; Guofeng Cui; Celong Liu; Zhong Li; Ziyi Kou; Yi Xu; Chenliang Xu; | In this paper, we propose a novel domain adaptation based monocular 3D object detection framework named DA-3Ddet, which adapts the feature from unsound image-based pseudo-LiDAR domain to the accurate real LiDAR domain for performance boosting. |
370 | AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation | Xiaofeng Liu; Tong Che; Yiqun Lu; Chao Yang; Site Li; Jane You; | In the viewer-centered coordinates, we construct an end-to-end trainable conditional variational framework to disentangle the unsupervisely learned relative-pose/rotation and implicit global 3D representation (shape, texture and the origin of viewer-centered coordinates, etc.). |
371 | VPN: Learning Video-Pose Embedding for Activities of Daily Living | Srijan Das; Saurav Sharma; Rui Dai; François Brémond; Monique Thonnat; | In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). |
372 | Soft Anchor-Point Object Detection | Chenchen Zhu; Fangyi Chen; Zhiqiang Shen; Marios Savvides; | In this work, we boost the performance of the anchor-point detector over the key-point counterparts while maintaining the speed advantage. |
373 | Beyond Fixed Grid: Learning Geometric Image Representation with a Deformable Grid | Jun Gao; Zian Wang; Jinchen Xuan; Sanja Fidler; | We introduce Deformable Grid (DefGrid), a learnable neural network module that predicts location offsets of vertices of a 2-dimensional triangular grid such that the edges of the deformed grid align with image boundaries. |
374 | Soft Expert Reward Learning for Vision-and-Language Navigation | Hu Wang; Qi Wu; Chunhua Shen; | In this paper, we introduce a Soft Expert Reward Learning (SERL) model to overcome the reward-engineering and generalisation problems of the VLN task. |
375 | Part-aware Prototype Network for Few-shot Semantic Segmentation | Yongfei Liu; Xiangyi Zhang; Songyang Zhang; Xuming He; | In this paper, we propose a novel few-shot semantic segmentation framework based on the prototype representation. |
376 | Learning from Extrinsic and Intrinsic Supervisions for Domain Generalization | Shujun Wang; Lequan Yu; Caizi Li; Chi-Wing Fu; Pheng-Ann Heng; | To this end, we present a new domain generalization framework that learns how to generalize across domains simultaneously from extrinsic relationship supervision and intrinsic self-supervision for images from multi-source domains. |
377 | Joint Learning of Social Groups, Individuals Action and Sub-group Activities in Videos | Mahsa Ehsanpour; Alireza Abedin; Fatemeh Saleh; Javen Shi; Ian Reid; Hamid Rezatofighi; | In this paper, we solve the problem of simultaneously grouping people by their social interactions, predicting their individual actions and the social activity of each social group, which we call the social task. |
378 | Whole-Body Human Pose Estimation in the Wild | Sheng Jin; Lumin Xu; Jin Xu; Can Wang; Wentao Liu; Chen Qian; Wanli Ouyang; Ping Luo; | To fill this gap, we introduce COCO-WholeBody, which extends the COCO dataset with whole-body annotations. |
379 | Relative Pose Estimation of Calibrated Cameras with Known SE(3) Invariants | Bo Li; Evgeniy Martyushev; Gim Hee Lee; | In this paper, we present a comprehensive study of the relative pose estimation problem for a calibrated camera constrained by known $\mathrm{SE}(3)$ invariants, which involves 5 minimal problems in total. |
380 | Sequential Convolution and Runge-Kutta Residual Architecture for Image Compressed Sensing | Runkai Zheng; Yinqi Zhang; Daolang Huang; Qingliang Chen; | To address the two challenges, this paper proposes a novel Runge-Kutta Convolutional Compressed Sensing Network (RK-CCSNet). |
381 | Deep Hough Transform for Semantic Line Detection | Qi Han; Kai Zhao; Jun Xu; Ming-Ming Cheng; | In this paper, we put forward a simple yet effective method to detect meaningful straight lines, a.k.a. semantic lines, in given scenes. |
382 | Structured Landmark Detection via Topology-Adapting Deep Graph Learning | Weijian Li; Yuhang Lu; Kang Zheng; Haofu Liao; Chihung Lin; Jiebo Luo; Chi-Tung Cheng; Jing Xiao; Le Lu; Chang-Fu Kuo; Shun Miao; | In this work, we present a new topology-adapting deep graph learning approach for accurate anatomical facial and medical (e.g., hand, pelvis) landmark detection. |
383 | 3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning | Xiangyu Xu; Hao Chen; Francesc Moreno-Noguer; László A. Jeni; Fernando De la Torre; | To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. |
384 | Learning to Balance Specificity and Invariance for In and Out of Domain Generalization | Prithvijit Chattopadhyay; Yogesh Balaji; Judy Hoffman; | We introduce Domain-specific Masks for Generalization, a model for improving both in-domain and out-of-domain generalization performance. |
385 | Contrastive Learning for Unpaired Image-to-Image Translation | Taesung Park; Alexei A. Efros; Richard Zhang; Jun-Yan Zhu; | We propose a straightforward method for doing so: maximizing mutual information between the two, using a framework based on contrastive learning. |
386 | DLow: Diversifying Latent Flows for Diverse Human Motion Prediction | Ye Yuan; Kris Kitani; | To address these problems, we propose a novel sampling method, Diversifying Latent Flows (DLow), to produce a diverse set of samples from a pretrained deep generative model. |
387 | GRNet: Gridding Residual Network for Dense Point Cloud Completion | Haozhe Xie; Hongxun Yao; Shangchen Zhou; Jiageng Mao; Shengping Zhang; Wenxiu Sun; | To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds. |
388 | Gait Lateral Network: Learning Discriminative and Compact Representations for Gait Recognition | Saihui Hou; Chunshui Cao; Xu Liu; Yongzhen Huang; | In this work, we propose a novel network named Gait Lateral Network (GLN) which can learn both discriminative and compact representations from the silhouettes for gait recognition. |
389 | Blind Face Restoration via Deep Multi-scale Component Dictionaries | Xiaoming Li; Chaofeng Chen; Shangchen Zhou; Xianhui Lin; Wangmeng Zuo; Lei Zhang; | To address this issue, this paper proposes a deep face dictionary network (termed DFDNet) to guide the restoration process of degraded observations. |
390 | Robust Neural Networks inspired by Strong Stability Preserving Runge-Kutta methods | Byungjoo Kim; Bryce Chudomelka; Jinyoung Park; Jaewoo Kang; Youngjoon Hong; Hyunwoo J. Kim; | Motivated by the SSP property and a generalized Runge-Kutta method, we propose Strong Stability Preserving networks (SSP networks), which improve robustness against adversarial attacks. |
391 | Inequality-Constrained and Robust 3D Face Model Fitting | Evangelos Sariyanidi; Casey J. Zampella; Robert T. Schultz; Birkan Tunc; | We propose a new formulation that does not require the tuning of any weight parameter. |
392 | Gabor Layers Enhance Network Robustness | Juan C. Pérez; Motasem Alfarra; Guillaume Jeanneret; Adel Bibi; Ali Thabet; Bernard Ghanem; Pablo Arbeláez; | In particular, we explore the effect of replacing the first layers of various deep architectures with Gabor layers (i.e. convolutional layers with filters that are based on learnable Gabor parameters) on robustness against adversarial attacks. |
393 | Conditional Image Repainting via Semantic Bridge and Piecewise Value Function | Shuchen Weng; Wenbo Li; Dawei Li; Hongxia Jin; Boxin Shi; | In this work, we improve the compositing by breaking through the latent ceiling using a novel piecewise value function. |
394 | Learnable Cost Volume Using the Cayley Representation | Taihong Xiao; Jinwei Yuan; Deqing Sun; Qifei Wang; Xin-Yu Zhang; Kehan Xu; Ming-Hsuan Yang; | To address this issue, we propose a learnable cost volume (LCV) using an elliptical inner product, which generalizes the standard inner product by a positive definite kernel matrix. |
395 | HALO: Hardware-Aware Learning to Optimize | Chaojian Li; Tianlong Chen; Haoran You; Zhangyang Wang; Yingyan Lin; | To this end, we propose hardware-aware learning to optimize (HALO), a practical meta optimizer dedicated to resource-efficient on-device adaptation. |
396 | Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling | Jia Zheng; Junfei Zhang; Jing Li; Rui Tang; Shenghua Gao; Zihan Zhou; | In this paper, we present a new synthetic dataset, Structured3D, with the aim of providing large-scale photo-realistic images with rich 3D structure annotations for a wide spectrum of structured 3D modeling tasks. |
397 | BroadFace: Looking at Tens of Thousands of People at Once for Face Recognition | Yonghyun Kim; Wonpyo Park; Jongju Shin; | To overcome this difficulty, we propose a novel method called BroadFace, a learning process that comprehensively considers a massive set of identities. |
398 | Interpretable Visual Reasoning via Probabilistic Formulation under Natural Supervision | Xinzhe Han; Shuhui Wang; Chi Su; Weigang Zhang; Qingming Huang; Qi Tian; | In this paper, we rethink the implicit reasoning process in VQA, and propose a new formulation which maximizes the log-likelihood of the joint distribution of the observed question and predicted answer. |
399 | Domain Adaptive Semantic Segmentation Using Weak Labels | Sujoy Paul; Yi-Hsuan Tsai; Samuel Schulter; Amit K. Roy-Chowdhury; Manmohan Chandraker; | We propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain. In experiments, we show considerable improvements over the existing state of the art in UDA and present a new benchmark in the WDA setting. |
400 | Knowledge Distillation Meets Self-Supervision | Guodong Xu; Ziwei Liu; Xiaoxiao Li; Chen Change Loy; | In this paper, we discuss practical ways to exploit those noisy self-supervision signals with selective transfer for distillation. |
401 | Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions | Ignacio Rocco; Relja Arandjelović; Josef Sivic; | In this work we target the problem of estimating accurately localised correspondences between a pair of images. |
402 | Reconstructing the Noise Variance Manifold for Image Denoising | Ioannis Marras; Grigorios G. Chrysos; Ioannis Alexiou; Gregory Slabaugh; Stefanos Zafeiriou; | To fill the gap, in this work we introduce the idea of a cGAN which explicitly leverages structure in the image noise variance space. |
403 | Occlusion-Aware Depth Estimation with Adaptive Normal Constraints | Xiaoxiao Long; Lingjie Liu; Christian Theobalt; Wenping Wang; | We present a new learning-based method for multi-frame depth estimation from a color video, which is a fundamental problem in scene understanding, robot navigation or handheld 3D reconstruction. |
404 | VisualEchoes: Spatial Image Representation Learning through Echolocation | Ruohan Gao; Changan Chen; Ziad Al-Halah; Carl Schissler; Kristen Grauman; | We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. |
405 | Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval | Andrew Brown; Weidi Xie; Vicky Kalogeiton; Andrew Zisserman; | To this end, we introduce an objective that optimises instead a smoothed approximation of AP, coined Smooth-AP. |
406 | Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation | Liang-Chieh Chen; Raphael Gontijo Lopes; Bowen Cheng; Maxwell D. Collins; Ekin D. Cubuk; Barret Zoph; Hartwig Adam; Jonathon Shlens; | In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. |
407 | Spatially Aware Multimodal Transformers for TextVQA | Yash Kant; Dhruv Batra; Peter Anderson; Alexander Schwing; Devi Parikh; Jiasen Lu; Harsh Agrawal; | In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. |
408 | Every Pixel Matters: Center-aware Feature Alignment for Domain Adaptive Object Detector | Cheng-Chun Hsu; Yi-Hsuan Tsai; Yen-Yu Lin; Ming-Hsuan Yang; | Different from existing solutions, we propose a domain adaptation framework that accounts for each pixel, especially via predicting pixel-wise objectness and centerness. |
409 | URIE: Universal Image Enhancement for Visual Recognition in the Wild | Taeyoung Son; Juwon Kang; Namyup Kim; Sunghyun Cho; Suha Kwak; | To tackle this issue, we present a Universal and Recognition-friendly Image Enhancement network, dubbed URIE, which is attached in front of existing recognition models and enhances distorted input to improve their performance without retraining them. |
410 | Pyramid Multi-view Stereo Net with Self-adaptive View Aggregation | Hongwei Yi; Zizhuang Wei; Mingyu Ding; Runze Zhang; Yisong Chen; Guoping Wang; Yu-Wing Tai; | In this paper, we propose an effective and efficient pyramid multi-view stereo (MVS) net with self-adaptive view aggregation for accurate and complete dense point cloud reconstruction. |
411 | SPL-MLL: Selecting Predictable Landmarks for Multi-Label Learning | Junbing Li; Changqing Zhang; Pengfei Zhu; Baoyuan Wu; Lei Chen; Qinghua Hu; | In this work, we propose to select a small subset of labels as landmarks which are easy to predict according to input (predictable) and can well recover the other possible labels (representative). |
412 | Unpaired Image-to-Image Translation using Adversarial Consistency Loss | Yihao Zhao; Ruihai Wu; Hao Dong; | In this paper, we propose a novel adversarial-consistency loss for image-to-image translation. |
413 | Discriminability Distillation in Group Representation Learning | Manyuan Zhang; Guanglu Song; Hang Zhou; Yu Liu; | We claim that the most significant indicator of whether the group representation can benefit from one of its elements is not the quality or an inexplicable score, but the discriminability w.r.t. the model. |
414 | Monocular Expressive Body Regression through Body-Driven Attention | Vasileios Choutas; Georgios Pavlakos; Timo Bolkart; Dimitrios Tzionas; Michael J. Black; | We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. |
415 | Dual Adversarial Network: Toward Real-world Noise Removal and Noise Generation | Zongsheng Yue; Qian Zhao; Lei Zhang; Deyu Meng; | In this work, we propose a novel unified framework to simultaneously deal with the noise removal and noise generation tasks. |
416 | Linguistic Structure Guided Context Modeling for Referring Image Segmentation | Tianrui Hui; Si Liu; Shaofei Huang; Guanbin Li; Sansi Yu; Faxi Zhang; Jizhong Han; | To tackle this problem, we propose a “gather-propagate-distribute” scheme to model multimodal context by crossmodal interaction and implement this scheme as a novel Linguistic Structure guided Context Modeling (LSCM) module. |
417 | Federated Visual Classification with Real-World Data Distribution | Tzu-Ming Harry Hsu; Hang Qi; Matthew Brown; | In this work, we characterize the effect these real-world data distributions have on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm. |
418 | Robust Re-Identification by Multiple Views Knowledge Distillation | Angelo Porrello; Luca Bergamini; Simone Calderara; | In this work, we devise a training strategy that allows the transfer of superior knowledge arising from a set of views depicting the target object. |
419 | Defocus Deblurring Using Dual-Pixel Data | Abdullah Abuolaim; Michael S. Brown; | We propose an effective defocus deblurring method that exploits data available on dual-pixel (DP) sensors found on most modern cameras. |
420 | RhyRNN: Rhythmic RNN for Recognizing Events in Long and Complex Videos | Tianshu Yu; Yikang Li; Baoxin Li; | To address this, we propose Rhythmic RNN (RhyRNN) which is capable of handling long video sequences (up to 3,000 frames) as well as capturing rhythms at different scales. |
421 | Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping | Uttaran Bhattacharya; Christian Roncal; Trisha Mittal; Rohan Chandra; Kyra Kapsaskis; Kurt Gray; Aniket Bera; Dinesh Manocha; | We present an autoencoder-based semi-supervised approach to classify perceived human emotions from walking styles obtained from videos or motion-captured data and represented as sequences of 3D poses. |
422 | Weighing Counts: Sequential Crowd Counting by Reinforcement Learning | Liang Liu; Hao Lu; Hongwei Zou; Haipeng Xiong; Zhiguo Cao; Chunhua Shen; | Inspired by scale weighing, we propose a novel ‘counting scale’ termed LibraNet where the count value is analogized by weight. |
423 | Reflection Backdoor: A Natural Backdoor Attack on Deep Neural Networks | Yunfei Liu; Xingjun Ma; James Bailey; Feng Lu; | In this paper, we present a new type of backdoor attack inspired by an important natural phenomenon: reflection. |
424 | Learning to Learn with Variational Information Bottleneck for Domain Generalization | Yingjun Du; Jun Xu; Huan Xiong; Qiang Qiu; Xiantong Zhen; Cees G. M. Snoek; Ling Shao; | Domain generalization models learn to generalize to previously unseen domains, but suffer from prediction uncertainty and domain shift. In this paper, we address both problems. |
425 | Deep Positional and Relational Feature Learning for Rotation-Invariant Point Cloud Analysis | Ruixuan Yu; Xin Wei; Federico Tombari; Jian Sun; | In this paper we propose a rotation-invariant deep network for point cloud analysis. |
426 | Thanks for Nothing: Predicting Zero-Valued Activations with Lightweight Convolutional Neural Networks | Gil Shomron; Ron Banner; Moran Shkolnik; Uri Weiser; | Inspired by the observation that spatial correlation exists in CNN output feature maps (ofms), we propose a method to dynamically predict whether ofm activations are zero-valued or not according to their neighboring activation values, thereby avoiding zero-valued activations and reducing the number of convolution operations. |
427 | Layered Neighborhood Expansion for Incremental Multiple Graph Matching | Zixuan Chen; Zhihui Xie; Junchi Yan; Yinqiang Zheng; Xiaokang Yang; | In this paper, we treat the graphs as nodes on a super-graph, and propose a novel breadth-first-search-based method for expanding the neighborhood on the super-graph for each newly arriving graph, so that matching with the new graph can be performed efficiently within the constructed neighborhood. |
428 | SCAN: Learning to Classify Images without Labels | Wouter Van Gansbeke; Simon Vandenhende; Stamatios Georgoulis; Marc Proesmans; Luc Van Gool; | In this paper, we deviate from recent works, and advocate a two-step approach where feature learning and clustering are decoupled. |
429 | Graph convolutional networks for learning with few clean and many noisy labels | Ahmet Iscen; Giorgos Tolias; Yannis Avrithis; Ondřej Chum; Cordelia Schmid; | In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given. |
430 | Object-and-Action Aware Model for Visual Language Navigation | Yuankai Qi; Zizheng Pan; Shengping Zhang; Anton van den Hengel; Qi Wu; | In this paper, we propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural language based instruction separately. |
431 | A Comprehensive Study of Weight Sharing in Graph Networks for 3D Human Pose Estimation | Kenkun Liu; Rongqi Ding; Zhiming Zou; Le Wang; Wei Tang; | The objective of this paper is to conduct a comprehensive and systematic study of weight sharing in graph convolutional networks (GCNs) for 3D human pose estimation (HPE). |
432 | MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution | Wenbo Li; Xin Tao; Taian Guo; Lu Qi; Jiangbo Lu; Jiaya Jia; | Motivated by these findings, we propose a temporal multi-correspondence aggregation strategy to leverage most similar patches across frames, and also a cross-scale nonlocal-correspondence aggregation scheme to explore self-similarity of images across scales. |
433 | Efficient Semantic Video Segmentation with Per-frame Inference | Yifan Liu; Chunhua Shen; Changqian Yu; Jingdong Wang; | In contrast, here we explicitly consider the temporal consistency among frames as extra constraints during training and process each frame independently in the inference phase. |
434 | Increasing the Robustness of Semantic Segmentation Models with Painting-by-Numbers | Christoph Kamann; Carsten Rother; | We present a new training schema that increases the shape bias of semantic segmentation models. |
435 | Deep Spiking Neural Network: Energy Efficiency Through Time based Coding | Bing Han; Kaushik Roy; | In this work, we propose an ANN to SNN conversion methodology that uses a time-based coding scheme, named Temporal-Switch-Coding (TSC), and a corresponding TSC spiking neuron model. |
436 | InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling | Jun Wang; Shiyi Lan; Mingfei Gao; Larry S. Davis; | To address this issue, we propose a novel 3D object detection framework with dynamic information modeling. |
437 | Utilizing Patch-level Category Activation Patterns for Multiple Class Novelty Detection | Poojan Oza; Vishal M. Patel; | In this paper, we propose a novel method that makes deep convolutional neural networks robust to novel classes. |
438 | People as Scene Probes | Yifan Wang; Brian L. Curless; Steven M. Seitz; | By analyzing the motion of people and other objects in a scene, we demonstrate how to infer depth, occlusion, lighting, and shadow information from video taken from a single camera viewpoint. This information is then used to composite new objects into the same scene with a high degree of automation and realism. |
439 | Mapping in a Cycle: Sinkhorn Regularized Unsupervised Learning for Point Cloud Shapes | Lei Yang; Wenxi Liu; Zhiming Cui; Nenglun Chen; Wenping Wang; | We propose an unsupervised learning framework with the pretext task of finding dense correspondences between point cloud shapes from the same category based on the cycle-consistency formulation. |
440 | Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions | Matheus Gadelha; Aruni RoyChowdhury; Gopal Sharma; Evangelos Kalogerakis; Liangliang Cao; Erik Learned-Miller; Rui Wang; Subhransu Maji; | In this paper, we investigate the use of Approximate Convex Decompositions (ACD) as a self-supervisory signal for label-efficient learning of point cloud representations. |
441 | TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video | Tiancheng Zhi; Christoph Lassner; Tony Tung; Carsten Stoll; Srinivasa G. Narasimhan; Minh Vo; | We present TexMesh, a novel approach to reconstruct detailed human meshes with high-resolution full-body texture from RGB-D video. |
442 | Consistency-based Semi-supervised Active Learning: Towards Minimizing Labeling Cost | Mingfei Gao; Zizhao Zhang; Guo Yu; Sercan Ö. Arık; Larry S. Davis; Tomas Pfister; | Here, we propose to unify unlabeled sample selection and model training towards minimizing labeling cost, and make two contributions towards that end. |
443 | Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation | Fangyun Wei; Xiao Sun; Hongyang Li; Jingdong Wang; Stephen Lin; | While this center-point regression is simple and efficient, we argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries, due to object deformation and scale/orientation variation. To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions. |
444 | Modeling 3D Shapes by Reinforcement Learning | Cheng Lin; Tingxiang Fan; Wenping Wang; Matthias Nießner; | Inspired by such artist-based modeling, we propose a two-step neural framework based on RL to learn 3D modeling policies. |
445 | LST-Net: Learning a Convolutional Neural Network with a Learnable Sparse Transform | Lida Li; Kun Wang; Shuai Li; Xiangchu Feng; Lei Zhang; | In this paper, we propose to mitigate this issue by learning a CNN with a learnable sparse transform (LST), which converts the input features into a more compact and sparser domain so that the spatial and channel-wise redundancy can be more effectively reduced. |
446 | Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision | Damien Teney; Ehsan Abbasnejad; Anton van den Hengel; | We propose an auxiliary training objective that improves the generalization capabilities of neural networks by leveraging an overlooked supervisory signal found in existing datasets. |
447 | CN: Channel Normalization For Point Cloud Recognition | Zetong Yang; Yanan Sun; Shu Liu; Xiaojuan Qi; Jiaya Jia; | In this paper, we analyze these point recognition frameworks in depth and present a factor, called difference ratio, to measure the influence of structure information among different levels on the final representation. |
448 | Rethinking the Defocus Blur Detection Problem and A Real-Time Deep DBD Model | Ning Zhang; Junchi Yan; | In this work, we propose novel perspectives on the defocus blur detection (DBD) problem and design a convenient approach to build a real-time, cost-effective DBD model. |
449 | AutoMix: Mixup Networks for Sample Interpolation via Cooperative Barycenter Learning | Jianchao Zhu; Liangliang Shi; Junchi Yan; Hongyuan Zha; | This paper proposes new ways of sample mixing by thinking of the process as generation of barycenter in a metric space for data augmentation. |
450 | Scene Text Image Super-resolution in the wild | Wenjia Wang; Enze Xie; Xuebo Liu; Wenhai Wang; Ding Liang; Chunhua Shen; Xiang Bai; | For this purpose, a new Text Super-Resolution Network, termed TSRN, with three novel modules is developed. |
451 | Coupling Explicit and Implicit Surface Representations for Generative 3D Modeling | Omid Poursaeed; Matthew Fisher; Noam Aigerman; Vladimir G. Kim; | We propose a novel neural architecture for representing 3D surfaces, which harnesses two complementary shape representations: (i) an explicit representation via an atlas, i.e., embeddings of 2D domains into 3D; and (ii) an implicit-function representation, i.e., a scalar function over the 3D volume, whose levels denote surfaces. |
452 | Learning Disentangled Representations with Latent Variation Predictability | Xinqi Zhu; Chang Xu; Dacheng Tao; | This paper defines the variation predictability of latent disentangled representations. |
453 | Deep Space-Time Video Upsampling Networks | Jaeyeon Kang; Younghyun Jo; Seoung Wug Oh; Peter Vajda; Seon Joo Kim; | In this paper, we investigate the problem of jointly upsampling videos both in space and time, which is becoming more important with advances in display systems. |
454 | Large-Scale Few-Shot Learning via Multi-Modal Knowledge Discovery | Shuo Wang; Jun Yue; Jianzhuang Liu; Qi Tian; Meng Wang; | To solve these problems, we propose a method based on multi-modal knowledge discovery. |
455 | Fast Video Object Segmentation using the Global Context Module | Yu Li; Zhuoran Shen; Ying Shan; | We developed a real-time, high-quality semi-supervised video object segmentation algorithm. |
456 | Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos | Anurag Arnab; Chen Sun; Arsha Nagrani; Cordelia Schmid; | In this paper, we present a spatio-temporal action recognition model that is trained with only video-level labels, which are significantly easier to annotate. |
457 | Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification | Nikita Dvornik; Cordelia Schmid; Julien Mairal; | In this work, we propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches. |
458 | MessyTable: Instance Association in Multiple Camera Views | Zhongang Cai; Junzhe Zhang; Daxuan Ren; Cunjun Yu; Haiyu Zhao; Shuai Yi; Chai Kiat Yeo; Chen Change Loy; | We present an interesting and challenging dataset that features a large number of scenes with messy tables captured from multiple camera views. |
459 | A Unified Framework for Shot Type Classification Based on Subject Centric Lens | Anyi Rao; Jiaze Wang; Linning Xu; Xuekun Jiang; Qingqiu Huang; Bolei Zhou; Dahua Lin; | To address these issues, we propose a learning framework Subject Guidance Network (SGNet) for shot type recognition. |
460 | BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues | Samuel Albanie; Gül Varol; Liliane Momeni; Triantafyllos Afouras; Joon Son Chung; Neil Fox; Andrew Zisserman; | In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We also propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting, and provide baselines which we hope will stimulate research in this area. |
461 | HTML: A Parametric Hand Texture Model for 3D Hand Reconstruction and Personalization | Neng Qian; Jiayi Wang; Franziska Mueller; Florian Bernard; Vladislav Golyanik; Christian Theobalt; | To fill this gap, in this work we present HTML, the first parametric texture model of human hands. |
462 | CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions | Zhongdao Wang; Jingwei Zhang; Liang Zheng; Yixuan Liu; Yifan Sun; Yali Li; Shengjin Wang; | This paper proposes a self-supervised learning method for the person re-identification (re-ID) problem, where existing unsupervised methods usually rely on pseudo labels, such as those from video tracklets or clustering. |
463 | Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions | Xihui Liu; Zhe Lin; Jianming Zhang; Handong Zhao; Quan Tran; Xiaogang Wang; Hongsheng Li; | We propose a novel algorithm, named Open-Edit, which is the first attempt at open-domain image manipulation with open-vocabulary instructions. |
464 | Towards Real-Time Multi-Object Tracking | Zhongdao Wang; Liang Zheng; Yixuan Liu; Yali Li; Shengjin Wang; | In this paper, we propose an MOT system that allows target detection and appearance embedding to be learned in a shared model. |
465 | A Balanced and Uncertainty-aware Approach for Partial Domain Adaptation | Jian Liang; Yunbo Wang; Dapeng Hu; Ran He; Jiashi Feng; | In this paper, we build on domain adversarial learning and propose a novel domain adaptation method BA$^3$US with two new techniques termed Balanced Adversarial Alignment (BAA) and Adaptive Uncertainty Suppression (AUS), respectively. |
466 | Unsupervised Deep Metric Learning with Transformed Attention Consistency and Contrastive Clustering Loss | Yang Li; Shichao Kan; Zhihai He; | To characterize the consistent pattern of human attention during image comparisons, we introduce the idea of transformed attention consistency. |
467 | STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos | Ali Athar; Sabarinath Mahadevan; Aljosa Osep; Laura Leal-Taixé; Bastian Leibe; | In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. |
468 | Hierarchical Style-based Networks for Motion Synthesis | Jingwei Xu; Huazhe Xu; Bingbing Ni; Xiaokang Yang; Xiaolong Wang; Trevor Darrell; | In this paper, we propose an unsupervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location. |
469 | Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop | Benjamin Biggs; Oliver Boyne; James Charles; Andrew Fitzgibbon; Roberto Cipolla; | We introduce an automatic, end-to-end method for recovering the 3D pose and shape of dogs from monocular internet images. |
470 | Learning to Count in the Crowd from Limited Labeled Data | Vishwanath A. Sindagi; Rajeev Yasarla; Deepak Sam Babu; R. Venkatesh Babu; Vishal M. Patel; | In this work, we focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples while leveraging a large pool of unlabeled data. |
471 | SPOT: Selective Point Cloud Voting for Better Proposal in Point Cloud Object Detection | Hongyuan Du; Linjun Li; Bo Liu; Nuno Vasconcelos; | In this work, we propose Selective Point clOud voTing (SPOT) module, a simple, effective component that can be easily trained end-to-end in point cloud object detectors to solve this problem. |
472 | Explainable Face Recognition | Jonathan R. Williford; Brandon B. May; Jeffrey Byrne; | In this paper, we provide the first comprehensive benchmark and baseline evaluation for explainable face recognition (XFR), comparing five state-of-the-art XFR algorithms on three facial matchers. |
473 | From Shadow Segmentation to Shadow Removal | Hieu Le; Dimitris Samaras; | We propose a shadow removal method that can be trained using only shadow and non-shadow patches cropped from the shadow images themselves. |
474 | Diverse and Admissible Trajectory Prediction through Multimodal Context Understanding | Seong Hyeon Park; Gyubok Lee; Jimin Seo; Manoj Bhat; Minseok Kang; Jonathan Francis; Ashwin Jadhav; Paul Pu Liang; Louis-Philippe Morency; | In this paper, we propose a model that synthesizes multiple input signals from the multimodal world (the environment’s scene context and interactions between multiple surrounding agents) to best model all diverse and admissible trajectories. |
475 | CONFIG: Controllable Neural Face Image Generation | Marek Kowalski; Stephan J. Garbin; Virginia Estellers; Tadas Baltrušaitis; Matthew Johnson; Jamie Shotton; | To this end we propose ConfigNet, a neural face model that allows for controlling individual aspects of output images in semantically meaningful ways and that is a significant step on the path towards finely-controllable neural rendering. |
476 | Single View Metrology in the Wild | Rui Zhu; Xingyi Yang; Yannick Hold-Geoffroy; Federico Perazzi; Jonathan Eisenmann; Kalyan Sunkavalli; Manmohan Chandraker; | We present a novel approach to single view metrology that can recover the absolute scale of a scene, represented by 3D heights of objects or camera height above the ground, as well as camera parameters of orientation and field of view, using just a monocular image acquired in unconstrained conditions. |
477 | Procedure Planning in Instructional Videos | Chien-Yi Chang; De-An Huang; Danfei Xu; Ehsan Adeli; Li Fei-Fei; Juan Carlos Niebles; | In this paper, we study the problem of procedure planning in instructional videos, which can be seen as the first step towards enabling autonomous agents to plan for complex tasks in everyday settings such as cooking. |
478 | Funnel Activation for Visual Recognition | Ningning Ma; Xiangyu Zhang; Jian Sun; | We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. |
479 | GIQA: Generated Image Quality Assessment | Shuyang Gu; Jianmin Bao; Dong Chen; Fang Wen; | We introduce three GIQA algorithms from two perspectives: learning-based and data-based. |
480 | Adversarial Continual Learning | Sayna Ebrahimi; Franziska Meier; Roberto Calandra; Trevor Darrell; Marcus Rohrbach; | We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks. |
481 | Adapting Object Detectors with Conditional Domain Normalization | Peng Su; Kun Wang; Xingyu Zeng; Shixiang Tang; Dapeng Chen; Di Qiu; Xiaogang Wang; | In this work, we present the Conditional Domain Normalization (CDN) to bridge the domain distribution gap. |
482 | HARD-Net: Hardness-AwaRe Discrimination Network for 3D Early Activity Prediction | Tianjiao Li; Jun Liu; Wei Zhang; Lingyu Duan; | In this paper, we propose a novel Hardness-AwaRe Discrimination Network (HARD-Net) to specifically investigate the relationships between similar activity pairs that are hard to discriminate. |
483 | Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction | Lokender Tiwari; Pan Ji; Quoc-Huy Tran; Bingbing Zhuang; Saket Anand; Manmohan Chandraker; | In this paper, we demonstrate that coupling monocular SLAM and depth prediction, so that each leverages the strengths of the other, mitigates their respective shortcomings. |
484 | Interpretable and Generalizable Person Re-Identification with Query-Adaptive Convolution and Temporal Lifting | Shengcai Liao; Ling Shao; | In this paper, beyond representation learning, we consider how to formulate person image matching directly in deep feature maps. |
485 | Self-supervised Bayesian Deep Learning for Image Recovery with Applications to Compressive Sensing | Tongyao Pang; Yuhui Quan; Hui Ji; | Motivated by the practical value of reducing the cost and complexity of constructing labeled training datasets, this paper proposes a self-supervised deep learning approach for image recovery, which is dataset-free. |
486 | Graph-PCNN: Two Stage Human Pose Estimation with Graph Pose Refinement | Jian Wang; Xiang Long; Yuan Gao; Errui Ding; Shilei Wen; | In this paper, we aim to find a better approach to get more accurate localization results. |
487 | Semi-supervised Learning with a Teacher-student Network for Generalized Attribute Prediction | Minchul Shin; | With that in mind, we propose a multi-teacher-single-student (MTSS) approach inspired by multi-task learning and by distillation in semi-supervised learning. |
488 | Unsupervised Domain Adaptation with Noise Resistible Mutual-Training for Person Re-identification | Fang Zhao; Shengcai Liao; Guo-Sen Xie; Jian Zhao; Kaihao Zhang; Ling Shao; | To suppress noise in pseudo-labels, this paper proposes a Noise Resistible Mutual-Training (NRMT) method, which maintains two networks during training to perform collaborative clustering and mutual instance selection. |
489 | DPDist: Comparing Point Clouds Using Deep Point Cloud Distance | Dahlia Urbach; Yizhak Ben-Shabat; Michael Lindenbaum; | We introduce a new deep learning method for point cloud comparison. |
490 | Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation | Xiaokang Chen; Kwan-Yee Lin; Jingbo Wang; Wayne Wu; Chen Qian; Hongsheng Li; Gang Zeng; | In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately. |
491 | DataMix: Efficient Privacy-Preserving Edge-Cloud Inference | Zhijian Liu; Zhanghao Wu; Chuang Gan; Ligeng Zhu; Song Han; | In this paper, we mediate between the resource-constrained edge devices and the privacy-invasive cloud servers by introducing a novel privacy-preserving edge-cloud inference framework, DataMix. |
492 | Neural Re-Rendering of Humans from a Single Image | Kripasindhu Sarkar; Dushyant Mehta; Weipeng Xu; Vladislav Golyanik; Christian Theobalt; | To address these challenges, we propose a new method for neural re-rendering of a human under a novel user-defined pose and viewpoint given one input image. |
493 | Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation | Filippo Aleotti; Fabio Tosi; Li Zhang; Matteo Poggi; Stefano Mattoccia; | To soften typical stereo artefacts, we propose a novel self-supervised paradigm that reverses the usual link between stereo and monocular depth estimation, in which stereo provides supervision for monocular networks. |
494 | PIPAL: a Large-Scale Image Quality Assessment Dataset for Perceptual Image Restoration | Jinjin Gu; Haoming Cai; Haoyu Chen; Xiaoxing Ye; Jimmy S. Ren; Chao Dong; | Based on PIPAL, we present new benchmarks for both IQA and super-resolution methods. |
495 | Why do These Match? Explaining the Behavior of Image Similarity Models | Bryan A. Plummer; Mariya I. Vasileva; Vitali Petsiuk; Kate Saenko; David Forsyth; | In this paper, we introduce Salient Attributes for Network Explanation (SANE) to explain image similarity models, where a model’s output is a score measuring the similarity of two inputs rather than a classification score. |
496 | CooGAN: A Memory-Efficient Framework for High-Resolution Facial Attribute Editing | Xuanhong Chen; Bingbing Ni; Naiyuan Liu; Ziang Liu; Yiliu Jiang; Loc Truong; Qi Tian; | To address these issues, we propose a novel pixel translation framework called Cooperative GAN (CooGAN) for high-resolution (HR) facial image editing. |
497 | Progressive Transformers for End-to-End Sign Language Production | Ben Saunders; Necati Cihan Camgoz; Richard Bowden; | In this paper, we propose Progressive Transformers, the first SLP model to translate from discrete spoken language sentences to continuous 3D sign pose sequences in an end-to-end manner. |
498 | Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting | Minghui Liao; Guan Pang; Jing Huang; Tal Hassner; Xiang Bai; | To tackle these problems, we propose Mask TextSpotter v3, an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN. |
499 | Making Affine Correspondences Work in Camera Geometry Computation | Daniel Barath; Michal Polic; Wolfgang Förstner; Torsten Sattler; Tomas Pajdla; Zuzana Kukelova; | We propose a method for refining the local feature geometries by symmetric intensity-based matching, combine uncertainty propagation inside RANSAC with preemptive model verification, show a general scheme for computing uncertainty of minimal solvers results, and adapt the sample cheirality check for homography estimation to region-to-region correspondences. |
500 | Sub-center ArcFace: Boosting Face Recognition by Large-scale Noisy Web Faces | Jiankang Deng; Jia Guo; Tongliang Liu; Mingming Gong; Stefanos Zafeiriou; | In this paper, we relax the intra-class constraint of ArcFace to improve the robustness to label noise. |
501 | Foley Music: Learning to Generate Music from Videos | Chuang Gan; Deng Huang; Peihao Chen; Joshua B. Tenenbaum; Antonio Torralba; | In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. |
502 | Contrastive Multiview Coding | Yonglong Tian; Dilip Krishnan; Phillip Isola; | We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. |
503 | Regional Homogeneity: Towards Learning Transferable Universal Adversarial Perturbations Against Defenses | Yingwei Li; Song Bai; Cihang Xie; Zhenyu Liao; Xiaohui Shen; Alan Yuille; | This paper focuses on learning transferable adversarial examples specifically against defense models (models designed to defend against adversarial attacks). |
504 | Generative Low-bitwidth Data Free Quantization | Shoukai Xu; Haokun Li; Bohan Zhuang; Jing Liu; Jiezhang Cao; Chuangrun Liang; Mingkui Tan; | In this paper, we investigate a simple-yet-effective method called Generative Low-bitwidth Data Free Quantization (GDFQ) to remove the data dependence burden. |
505 | Local Correlation Consistency for Knowledge Distillation | Xiaojie Li; Jianlong Wu; Hongyu Fang; Yue Liao; Fei Wang; Chen Qian; | In this paper, we propose the local correlation exploration framework for knowledge distillation. |
506 | Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild | Jason Y. Zhang; Sam Pepose; Hanbyul Joo; Deva Ramanan; Jitendra Malik; Angjoo Kanazawa; | We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single image in-the-wild captured in an uncontrolled environment. |
507 | Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation | Hang Zhou; Xudong Xu; Dahua Lin; Xiaogang Wang; Ziwei Liu; | To overcome this challenge, we propose to leverage the abundantly available mono data to facilitate the generation of stereophonic audio. |
508 | CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations | Yuanhan Zhang; ZhenFei Yin; Yidong Li; Guojun Yin; Junjie Yan; Jing Shao; Ziwei Liu; | Our key insight is that, compared with the commonly-used binary supervision or mid-level geometric representations, rich semantic annotations as auxiliary tasks can greatly boost the performance and generalizability of face anti-spoofing across a wide range of spoof attacks. |
509 | Thinking in Frequency: Face Forgery Detection by Mining Frequency-aware Clues | Yuyang Qian; Guojun Yin; Lu Sheng; Zixuan Chen; Jing Shao; | To introduce frequency into face forgery detection, we propose a novel Frequency in Face Forgery Network (F$^3$-Net), taking advantage of two different but complementary frequency-aware clues, 1) frequency-aware decomposed image components, and 2) local frequency statistics, to deeply mine the forgery patterns via our two-stream collaborative learning framework. |
510 | Weakly-Supervised Cell Tracking via Backward-and-Forward Propagation | Kazuya Nishimura; Junya Hayashida; Chenyang Wang; Dai Fei Elmer Ker; Ryoma Bise; | We propose a weakly-supervised cell tracking method that can train a convolutional neural network (CNN) by using only the annotation of "cell detection" (i.e., the coordinates of cell positions) without association information, in which cell positions can be easily obtained by nuclear staining. |
511 | SeqHAND: RGB-Sequence-Based 3D Hand Pose and Shape Estimation | John Yang; Hyung Jin Chang; Seungeui Lee; Nojun Kwak; | In this paper, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which leads to the necessity of a large-scale dataset with sequential RGB hand images. |
512 | Rethinking the Distribution Gap of Person Re-identification with Camera-based Batch Normalization | Zijie Zhuang; Longhui Wei; Lingxi Xie; Tianyu Zhang; Hengheng Zhang; Haozhe Wu; Haizhou Ai; Qi Tian; | This paper rethinks the working mechanism of conventional ReID approaches and puts forward a new solution. |
513 | AMLN: Adversarial-based Mutual Learning Network for Online Knowledge Distillation | Xiaobing Zhang; Shijian Lu; Haigang Gong; Zhipeng Luo; Ming Liu; | In this work, we propose an innovative adversarial-based mutual learning network (AMLN) that introduces process-driven learning beyond outcome-driven learning for augmented online knowledge distillation. |
514 | Online Multi-modal Person Search in Videos | Jiangyue Xia; Anyi Rao; Qingqiu Huang; Linning Xu; Jiangtao Wen; Dahua Lin; | In this paper, we propose an online person search framework, which can recognize people in a video on the fly. |
515 | Single Image Super-Resolution via a Holistic Attention Network | Ben Niu; Weilei Wen; Wenqi Ren; Xiangde Zhang; Lianping Yang; Shuzhen Wang; Kaihao Zhang; Xiaochun Cao; Haifeng Shen; | To address this problem, we propose a new holistic attention network (HAN), which consists of a layer attention module (LAM) and a channel-spatial attention module (CSAM), to model the holistic interdependencies among layers, channels, and positions. |
516 | Can You Read Me Now? Content Aware Rectification using Angle Supervision | Amir Markovitz; Inbal Lavi; Or Perel; Shai Mazor; Roee Litman; | We present CREASE: Content Aware Rectification using Angle Supervision, the first learned method for document rectification that relies on the document’s content, the location of the words and specifically their orientation, as hints to assist in the rectification process. |
517 | Momentum Batch Normalization for Deep Learning with Small Batch Size | Hongwei Yong; Jianqiang Huang; Deyu Meng; Xiansheng Hua; Lei Zhang; | To gain a deeper understanding of BN, in this work we prove that BN actually introduces a certain level of noise into the sample mean and variance during the training process, while the noise level depends only on the batch size. |
518 | AdvPC: Transferable Adversarial Perturbations on 3D Point Clouds | Abdullah Hamdi; Sara Rojas; Ali Thabet; Bernard Ghanem; | In this work, we present novel data-driven adversarial attacks against 3D point cloud networks. |
519 | Edge-aware Graph Representation Learning and Reasoning for Face Parsing | Gusi Te; Yinglu Liu; Wei Hu; Hailin Shi; Tao Mei; | To this end, we propose to model and reason about the region-wise relations by learning graph representations, and to leverage the edge information between regions for optimized abstraction. |
520 | BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network | Deng-Ping Fan; Yingjie Zhai; Ali Borji; Jufeng Yang; Ling Shao; | In this paper, we make the first attempt to leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to develop a novel cascaded refinement network. |
521 | G-LBM: Generative Low-dimensional Background Model Estimation from Video Sequences | Behnaz Rezaei; Amirreza Farnoosh; Sarah Ostadabbas; | In this paper, we propose a computationally tractable and theoretically supported non-linear low-dimensional generative model to represent real-world data in the presence of noise and sparse outliers. |
522 | H3DNet: 3D Object Detection Using Hybrid Geometric Primitives | Zaiwei Zhang; Bo Sun; Haitao Yang; Qixing Huang; | We introduce H3DNet, which takes a colorless 3D point cloud as input and outputs a collection of oriented object bounding boxes (or BB) and their semantic labels. |
523 | Expressive Telepresence via Modular Codec Avatars | Hang Chu; Shugao Ma; Fernando De la Torre; Sanja Fidler; Yaser Sheikh; | This paper aims in this direction and presents Modular Codec Avatars (MCA), a method to generate hyper-realistic faces driven by the cameras in the VR headset. |
524 | Cascade Graph Neural Networks for RGB-D Salient Object Detection | Ao Luo; Xin Li; Fan Yang; Zhicheng Jiao; Hong Cheng; Siwei Lyu; | In this paper, we study the problem of salient object detection for RGB-D images by using both color and depth information. |
525 | FairALM: Augmented Lagrangian Method for Training Fair Models with Little Regret | Vishnu Suresh Lokhande; Aditya Kumar Akash; Sathya N. Ravi; Vikas Singh; | Here, we study mechanisms that impose fairness concurrently while training the model. |
526 | Generating Videos of Zero-Shot Compositions of Actions and Objects | Megha Nawhal; Mengyao Zhai; Andreas Lehrmann; Leonid Sigal; Greg Mori; | In this paper we develop methods for generating such videos — making progress toward addressing the important, open problem of video generation in complex scenes. |
527 | ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language | Zhe Wang; Zhiyuan Fang; Jun Wang; Yezhou Yang; | To be concrete, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into sub-spaces corresponding to attributes using a light auxiliary attribute segmentation layer. It then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. |
528 | Renovating Parsing R-CNN for Accurate Multiple Human Parsing | Lu Yang; Qing Song; Zhihui Wang; Mengjie Hu; Chun Liu; Xueshi Xin; Wenhe Jia; Songcen Xu; | To reverse this phenomenon, we present Renovating Parsing R-CNN (RP R-CNN), which introduces a global semantic enhanced feature pyramid network and a parsing re-scoring network into the existing high-performance pipeline. |
529 | Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning | Qing Yu; Daiki Ikami; Go Irie; Kiyoharu Aizawa; | Instead of training an OOD detector and SSL separately, we propose a multi-task curriculum learning framework. |
530 | Gradient-Induced Co-Saliency Detection | Zhao Zhang; Wenda Jin; Jun Xu; Ming-Ming Cheng; | In this paper, inspired by human behavior, we propose a gradient-induced co-saliency detection (GICD) method. To evaluate the performance of Co-SOD methods on discovering the co-salient object among multiple foregrounds, we construct a challenging CoCA dataset, where each image contains at least one extraneous foreground along with the co-salient object. |
531 | Nighttime Defogging Using High-Low Frequency Decomposition and Grayscale-Color Networks | Wending Yan; Robby T. Tan; Dengxin Dai; | In this paper, we address the problem of nighttime defogging from a single image. |
532 | SegFix: Model-Agnostic Boundary Refinement for Segmentation | Yuhui Yuan; Jingyi Xie; Xilin Chen; Jingdong Wang; | We present a model-agnostic post-processing scheme to improve the boundary quality for the segmentation result that is generated by any existing segmentation model. |
533 | Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction | Cunjun Yu; Xiao Ma; Jiawei Ren; Haiyu Zhao; Shuai Yi; | In this paper, we present STAR, a Spatio-Temporal grAph tRansformer framework, which tackles trajectory prediction using only attention mechanisms. |
534 | Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars | Egor Zakharov; Aleksei Ivakhnenko; Aliaksandra Shysheya; Victor Lempitsky; | We propose a neural rendering-based system that creates head avatars from a single photograph. |
535 | Neural Geometric Parser for Single Image Camera Calibration | Jinwoo Lee; Minhyuk Sung; Hyunjoon Lee; Junho Kim; | We propose a neural geometric parser learning single image camera calibration for man-made scenes. |
536 | Learning Flow-based Feature Warping for Face Frontalization with Illumination Inconsistent Supervision | Yuxiang Wei; Ming Liu; Haolin Wang; Ruifeng Zhu; Guosheng Hu; Wangmeng Zuo; | We propose a novel Flow-based Feature Warping Model (FFWM) which can learn to synthesize photo-realistic and illumination preserving frontal images with illumination inconsistent supervision. |
537 | Learning Architectures for Binary Networks | Dahyun Kim; Kunal Pratap Singh; Jonghyun Choi; | Suspecting that architectures designed for floating-point (FP) networks might not be the best for binary networks, we propose to search architectures for binary networks (BNAS) by defining a new search space for binary architectures and a novel search objective. |
538 | Semantic View Synthesis | Hsin-Ping Huang; Hung-Yu Tseng; Hsin-Ying Lee; Jia-Bin Huang; | To address the drawbacks, we propose a two-step approach. First, we focus on synthesizing the color and depth of the visible surface of the 3D scene. We then use the synthesized color and depth to impose explicit constraints on the multiple-plane image (MPI) representation prediction process. |
539 | An Analysis of Sketched IRLS for Accelerated Sparse Residual Regression | Daichi Iwata; Michael Waechter; Wen-Yan Lin; Yasuyuki Matsushita; | This paper studies the problem of sparse residual regression, i.e., learning a linear model using a norm that favors solutions in which the residuals are sparsely distributed. |
540 | Relative Pose from Deep Learned Depth and a Single Affine Correspondence | Ivan Eichhardt; Daniel Barath; | We propose a new approach for combining deep-learned nonmetric monocular depth with affine correspondences (ACs) to estimate the relative pose of two calibrated cameras from a single correspondence. |
541 | Video Super-Resolution with Recurrent Structure-Detail Network | Takashi Isobe; Xu Jia; Shuhang Gu; Songjiang Li; Shengjin Wang; Qi Tian; | In this work, we propose a novel recurrent video super-resolution method which is both effective and efficient in exploiting previous frames to super-resolve the current frame. |
542 | Shape Adaptor: A Learnable Resizing Module | Shikun Liu; Zhe Lin; Yilin Wang; Jianming Zhang; Federico Perazzi; Edward Johns; | We present a novel resizing module for neural networks: shape adaptor, a drop-in enhancement built on top of traditional resizing layers, such as pooling, bilinear sampling, and strided convolution. |
543 | Shuffle and Attend: Video Domain Adaptation | Jinwoo Choi; Gaurav Sharma; Samuel Schulter; Jia-Bin Huang; | We address the problem of domain adaptation in videos for the task of human action recognition. |
544 | DRG: Dual Relation Graph for Human-Object Interaction Detection | Chen Gao; Jiarui Xu; Yuliang Zou; Jia-Bin Huang; | In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph (one human-centric and one object-centric). |
545 | Flow-edge Guided Video Completion | Chen Gao; Ayush Saraf; Jia-Bin Huang; Johannes Kopf; | We present a new flow-based video completion algorithm. |
546 | End-to-End Trainable Deep Active Contour Models for Automated Image Segmentation: Delineating Buildings in Aerial Imagery | Ali Hatamizadeh; Debleena Sengupta; Demetri Terzopoulos; | As a solution, we present Trainable Deep Active Contours (TDACs), an automatic image segmentation framework that intimately unites Convolutional Neural Networks (CNNs) and Active Contour Models (ACMs). |
547 | Towards End-to-end Video-based Eye-Tracking | Seonwook Park; Emre Aksan; Xucong Zhang; Otmar Hilliges; | In response to this understanding, we propose a novel dataset and accompanying method which aims to explicitly learn these semantic and temporal relationships. |
548 | Generating Handwriting via Decoupled Style Descriptors | Atsunobu Kotani; Stefanie Tellex; James Tompkin; | Instead, we introduce the Decoupled Style Descriptor (DSD) model for handwriting, which factors both character- and writer-level styles and allows our model to represent an overall greater space of styles. |
549 | LEED: Label-Free Expression Editing via Disentanglement | Rongliang Wu; Shijian Lu; | This paper presents an innovative label-free expression editing via disentanglement (LEED) framework that is capable of editing the expression of both frontal and profile facial images without requiring any expression labels. |
550 | Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards | Xuewen Yang; Heming Zhang; Di Jin; Yingru Liu; Chi-Hao Wu; Jianchao Tan; Dongliang Xie; Jue Wang; Xin Wang; | The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. |
551 | Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder | Gouthaman KV; Anurag Mittal; | In this work, we propose a novel model-agnostic question encoder, Visually-Grounded Question Encoder (VGQE), for VQA that reduces this effect. |
552 | Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation | Jogendra Nath Kundu; Ambareesh Revanur; Govind Vitthal Waghmare; Rahul Mysore Venkatesh; R. Venkatesh Babu; | We present a deployment-friendly, fast bottom-up framework for multi-person 3D human pose estimation. |
553 | Class-Incremental Domain Adaptation | Jogendra Nath Kundu; Rahul Mysore Venkatesh; Naveen Venkat; Ambareesh Revanur; R. Venkatesh Babu; | In this work, we effectively identify the limitations of these approaches in the CIDA paradigm. |
554 | Anti-Bandit Neural Architecture Search for Model Defense | Hanlin Chen; Baochang Zhang; Song Xue; Xuan Gong; Hong Liu; Rongrong Ji; David Doermann; | In this paper, we defend against adversarial attacks using neural architecture search (NAS) which is based on a comprehensive search of denoising blocks, weight-free operations, Gabor filters and convolutions. |
555 | Wavelet-Based Dual-Branch Network for Image Demoiréing | Lin Liu; Jianzhuang Liu; Shanxin Yuan; Gregory Slabaugh; Aleš Leonardis; Wengang Zhou; Qi Tian; | In this paper, we design a wavelet-based dual-branch network (WDNet) with a spatial attention mechanism for image demoireing. |
556 | Low Light Video Enhancement using Synthetic Data Produced with an Intermediate Domain Mapping | Danai Triantafyllidou; Sean Moran; Steven McDonagh; Sarah Parisot; Gregory Slabaugh; | By generating dynamic video data synthetically, we enable a recently proposed state-of-the-art RAW-to-RGB model to attain higher image quality (improved colour, reduced artifacts) and improved temporal consistency, compared to the same model trained with only static real video data. |
557 | Non-Local Spatial Propagation Network for Depth Completion | Jinsun Park; Kyungdon Joo; Zhe Hu; Chi-Kuei Liu; In So Kweon; | In this paper, we propose a robust and efficient end-to-end non-local spatial propagation network for depth completion. |
558 | DanbooRegion: An Illustration Region Dataset | Lvmin Zhang; Yi JI; Chunping Liu; | We detail the challenges in building this dataset and present a human-in-the-loop workflow, named Feasibility-based Assignment Recommendation (FAR), to enable large-scale annotation. |
559 | Event Enhanced High-Quality Image Recovery | Bishan Wang; Jingwei He; Lei Yu; Gui-Song Xia; Wen Yang; | Based on this, we propose an explainable network, an event-enhanced sparse learning network (eSL-Net), to recover the high-quality images from event cameras. |
560 | PackDet: Packed Long-Head Object Detector | Kun Ding; Guojin He; Huxiang Gu; Zisha Zhong; Shiming Xiang; Chunhong Pan; | To solve this issue, we propose a packing operator (PackOp) to combine all head branches together spatially. |
561 | A Generic Graph-based Neural Architecture Encoding Scheme for Predictor-based NAS | Xuefei Ning; Yin Zheng; Tianchen Zhao; Yu Wang; Huazhong Yang; | This work proposes a novel Graph-based neural ArchiTecture Encoding Scheme, a.k.a. GATES, to improve the predictor-based neural architecture search. |
562 | Learning Semantic Neural Tree for Human Parsing | Ruyi Ji; Dawei Du; Libo Zhang; Longyin Wen; Yanjun Wu; Chen Zhao; Feiyue Huang; Siwei Lyu; | In this paper, we design a novel semantic neural tree for human parsing, which uses a tree architecture to encode the physiological structure of the human body, and design a coarse-to-fine process in a cascade manner to generate accurate results. |
563 | Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation | Wenbin Wang; Ruiping Wang; Shiguang Shan; Xilin Chen; | Therefore, we argue that a desirable scene graph should also be hierarchically constructed, and introduce a new scheme for modeling scene graphs. |
564 | Burst Denoising via Temporally Shifted Wavelet Transforms | Xuejian Rong; Denis Demandolx; Kevin Matzen; Priyam Chatterjee; Yingli Tian; | We propose an end-to-end trainable burst denoising pipeline which jointly captures high-resolution and high-frequency deep features derived from wavelet transforms. |
565 | JSSR: A Joint Synthesis, Segmentation, and Registration System for 3D Multi-Modal Image Alignment of Large-scale Pathological CT Scans | Fengze Liu; Jinzheng Cai; Yuankai Huo; Chi-Tung Cheng; Ashwin Raju; Dakai Jin; Jing Xiao; Alan Yuille; Le Lu; ChienHung Liao; Adam P. Harrison; | In this work, we propose a novel multi-task learning system, JSSR, based on an end-to-end 3D convolutional neural network that is composed of a generator, a registration and a segmentation component. |
566 | SimAug: Learning Robust Representations from Simulation for Trajectory Prediction | Junwei Liang; Lu Jiang; Alexander Hauptmann; | We propose a novel approach to learn robust representation through augmenting the simulation training data such that the representation can better generalize to unseen real-world test data. |
567 | ScribbleBox: Interactive Annotation Framework for Video Object Segmentation | Bowen Chen; Huan Ling; Xiaohui Zeng; Jun Gao; Ziyue Xu; Sanja Fidler; | We introduce ScribbleBox, an interactive framework for annotating object instances with masks in videos with a significant boost in efficiency. |
568 | Rethinking Pseudo-LiDAR Representation | Xinzhu Ma; Shinan Liu; Zhiyi Xia; Hongwen Zhang; Xingyu Zeng; Wanli Ouyang; | In this paper, we perform an in-depth investigation and observe that the pseudo-LiDAR representation is effective because of the coordinate transformation, instead of data representation itself. |
569 | Deep Multi Depth Panoramas for View Synthesis | Kai-En Lin; Zexiang Xu; Ben Mildenhall; Pratul P. Srinivasan; Yannick Hold-Geoffroy; Stephen DiVerdi; Qi Sun; Kalyan Sunkavalli; Ravi Ramamoorthi; | We propose a learning-based approach for novel view synthesis for multi-camera 360$^\circ$ panorama capture rigs. |
570 | MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection | Fa-Ting Hong; Xuanteng Huang; Wei-Hong Li; Wei-Shi Zheng; | In this work, we propose casting weakly supervised video highlight detection modeling for a given specific event as multiple instance ranking network (MINI-Net) learning. |
571 | ContactPose: A Dataset of Grasps with Object Contact and Hand Pose | Samarth Brahmbhatt; Chengcheng Tang; Christopher D. Twigg; Charles C. Kemp; James Hays; | We introduce ContactPose, the first dataset of hand-object contact paired with hand pose, object pose, and RGB-D images. |
572 | API-Net: Robust Generative Classifier via a Single Discriminator | Xinshuai Dong; Hong Liu; Rongrong Ji; Liujuan Cao; Qixiang Ye; Jianzhuang Liu; Qi Tian; | This work aims for a generative classifier that can profit from the merits of both generative and discriminative classifiers. |
573 | Bias-based Universal Adversarial Patch Attack for Automatic Check-out | Aishan Liu; Jiakai Wang; Xianglong Liu; Bowen Cao; Chongzhi Zhang; Hang Yu; | To address the problem, this paper proposes a bias-based framework to generate class-agnostic universal adversarial patches with strong generalization ability, which exploits both the perceptual and semantic bias of models. |
574 | Imbalanced Continual Learning with Partitioning Reservoir Sampling | Chris Dongjoo Kim; Jinseo Jeong; Gunhee Kim; | We jointly address two problems that have so far been solved independently, catastrophic forgetting and the long-tailed label distribution, by first empirically showing a new challenge of destructive forgetting of the minority concepts on the tail. |
575 | Guided Collaborative Training for Pixel-wise Semi-Supervised Learning | Zhanghan Ke; Di Qiu; Kaican Li; Qiong Yan; Rynson W.H. Lau; | In this paper, we present a new SSL framework, named Guided Collaborative Training (GCT), for pixel-wise tasks, with two main technical contributions. |
576 | Stacking Networks Dynamically for Image Restoration Based on the Plug-and-Play Framework | Haixin Wang; Tianhao Zhang; Muzhi Yu; Jinan Sun; Wei Ye; Chen Wang; Shikun Zhang; | To address this challenge, we leverage the iterative process of the traditional plug-and-play method to provide a dynamic stacked network for Image Restoration. |
577 | Efficient Transfer Learning via Joint Adaptation of Network Architecture and Weight | Ming Sun; Haoxuan Dou; Junjie Yan; | To remedy the above issues, we reduce the super-network size by randomly dropping connections between network blocks while embedding a larger search space. |
578 | Spatial Attention Pyramid Network for Unsupervised Domain Adaptation | Congcong Li; Dawei Du; Libo Zhang; Longyin Wen; Tiejian Luo; Yanjun Wu; Pengfei Zhu; | To that end, in this paper, we design a new spatial attention pyramid network for unsupervised domain adaptation. |
579 | GSIR: Generalizable 3D Shape Interpretation and Reconstruction | Jianren Wang; Zhaoyuan Fang; | We propose to recover 3D shape structures as cuboids from partially reconstructed objects and use the predicted structures to further guide 3D reconstruction. |
580 | Weakly Supervised 3D Object Detection from Lidar Point Cloud | Qinghao Meng; Wenguan Wang; Tianfei Zhou; Jianbing Shen; Luc Van Gool; Dengxin Dai; | This work proposes a weakly supervised approach for 3D object detection, only requiring a small set of weakly annotated scenes, associated with a few precisely labeled object instances. |
581 | Two-phase Pseudo Label Densification for Self-training based Domain Adaptation | Inkyu Shin; Sanghyun Woo; Fei Pan; In So Kweon; | In order to tackle this problem, we propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD. |
582 | Adaptive Offline Quintuplet Loss for Image-Text Matching | Tianlang Chen; Jiajun Deng; Jiebo Luo; | In this paper, we propose solutions by sampling negatives offline from the whole training set. |
583 | Learning Object Placement by Inpainting for Compositional Data Augmentation | Lingzhi Zhang; Tarmily Wen; Jie Min; Jiancong Wang; David Han; Jianbo Shi; | We propose a self-learning framework that automatically generates the necessary training data without any manual labeling by detecting, cutting, and inpainting objects from an image. |
584 | Deep Vectorization of Technical Drawings | Vage Egiazarian; Oleg Voynov; Alexey Artemov; Denis Volkhonskiy; Aleksandr Safin; Maria Taktasheva; Denis Zorin; Evgeny Burnaev; | We present a new method for vectorization of technical line drawings, such as floor plans, architectural drawings, and 2D CAD images. |
585 | CAD-Deform: Deformable Fitting of CAD Models to 3D Scans | Vladislav Ishimtsev; Alexey Bokhovkin; Alexey Artemov; Savva Ignatyev; Matthias Niessner; Denis Zorin; Evgeny Burnaev; | In this work, we address this shortcoming by introducing CAD-Deform, a method which obtains more accurate CAD-to-scan fits by non-rigidly deforming retrieved CAD models. |
586 | An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices | Xiaolong Ma; Wei Niu; Tianyun Zhang; Sijia Liu; Sheng Lin; Hongjia Li; Wujie Wen; Xiang Chen; Jian Tang; Kaisheng Ma; Bin Ren; Yanzhi Wang; | To solve the problem, we introduce a new sparsity dimension, namely pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware-friendly. |
587 | AutoTrajectory: Label-free Trajectory Extraction and Prediction from Videos using Dynamic Points | Yuexin Ma; Xinge Zhu; Xinjing Cheng; Ruigang Yang; Jiming Liu; Dinesh Manocha; | In this paper, we present a novel, label-free algorithm, AutoTrajectory, for trajectory extraction and prediction that uses raw videos directly. |
588 | Multi-Agent Embodied Question Answering in Interactive Environments | Sinan Tan; Weilai Xiang; Huaping Liu; Di Guo; Fuchun Sun; | We investigate a new AI task — Multi-Agent Interactive Question Answering — where several agents explore the scene jointly in interactive environments to answer a question. |
589 | Conditional Sequential Modulation for Efficient Global Image Retouching | Jingwen He; Yihao Liu; Yu Qiao; Chao Dong; | In this paper, we investigate some commonly-used retouching operations and mathematically find that these pixel-independent operations can be approximated or formulated by multi-layer perceptrons (MLPs). |
590 | Segmenting Transparent Objects in the Wild | Enze Xie; Wenjia Wang; Wenhai Wang; Mingyu Ding; Chunhua Shen; Ping Luo; | To address this important problem, this work proposes a large-scale dataset for transparent object segmentation, named Trans10K, consisting of 10,428 images of real scenarios with careful manual annotations, which is 10 times larger than existing datasets. |
591 | Length-Controllable Image Captioning | Chaorui Deng; Ning Ding; Mingkui Tan; Qi Wu; | In this paper, we propose to use a simple length level embedding to endow image captioning models with the ability to control caption length. |
592 | Few-Shot Semantic Segmentation with Democratic Attention Networks | Haochen Wang; Xudong Zhang; Yutao Hu; Yandan Yang; Xianbin Cao; Xiantong Zhen; | In this paper, we propose the Democratic Attention Network (DAN) for few-shot semantic segmentation. |
593 | Defocus Blur Detection via Depth Distillation | Xiaodong Cun; Chi-Man Pun; | To solve these problems, we introduce depth information into DBD for the first time. |
594 | Motion Guided 3D Pose Estimation from Videos | Jingbo Wang; Sijie Yan; Yuanjun Xiong; Dahua Lin; | We propose a new loss function, called motion loss, for the problem of monocular 3D human pose estimation from 2D pose. |
595 | Reflection Separation via Multi-bounce Polarization State Tracing | Rui Li; Simeng Qiu; Guangming Zang; Wolfgang Heidrich; | In this paper we aim to generalize the reflection removal to real-world scenarios with more complicated light interactions. |
596 | SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation | Jiale Cao; Rao Muhammad Anwer; Hisham Cholakkal; Fahad Shahbaz Khan; Yanwei Pang; Ling Shao; | We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating mask prediction of an instance to different sub-regions of a detected bounding-box. |
597 | SemanticAdv: Generating Adversarial Examples via Attribute-conditioned Image Editing | Haonan Qiu; Chaowei Xiao; Lei Yang; Xinchen Yan; Honglak Lee; Bo Li; | In this paper, we propose SemanticAdv to generate a new type of semantically realistic adversarial examples via attribute-conditioned image editing. |
598 | Learning with Noisy Class Labels for Instance Segmentation | Longrong Yang; Fanman Meng; Hongliang Li; Qingbo Wu; Qishang Cheng; | To solve this issue, a novel method is proposed in this paper, which uses different losses describing different roles of noisy class labels to enhance the learning. |
599 | Deep Image Clustering with Category-Style Representation | Junjie Zhao; Donghuan Lu; Kai Ma; Yu Zhang; Yefeng Zheng; | In this paper, we propose a novel deep image clustering framework to learn a category-style latent representation in which the category information is disentangled from image style and can be directly used as the cluster assignment. |
600 | Self-supervised Motion Representation via Scattering Local Motion Cues | Yuan Tian; Zhaohui Che; Wenbo Bao; Guangtao Zhai; Zhiyong Gao; | In this paper, we leverage the massive unlabeled video data to learn an accurate explicit motion representation that aligns well with the semantic distribution of the moving objects. |
601 | Improving Monocular Depth Estimation by Leveraging Structural Awareness and Complementary Datasets | Tian Chen; Shijie An; Yuan Zhang; Chongyang Ma; Huayan Wang; Xiaoyan Guo; Wen Zheng; | One key limitation of existing approaches lies in their lack of structural information exploitation, which leads to inaccurate spatial layout, discontinuous surface, and ambiguous boundaries. In this paper, we tackle this problem in three aspects. |
602 | BMBC: Bilateral Motion Estimation with Bilateral Cost Volume for Video Interpolation | Junheum Park; Keunsoo Ko; Chul Lee; Chang-Su Kim; | We propose a novel deep-learning-based video interpolation algorithm based on bilateral motion estimation. |
603 | Hard negative examples are hard, but useful | Hong Xuan; Abby Stylianou; Xiaotong Liu; Robert Pless; | In this paper, we characterize the space of triplets and derive why hard negatives make triplet loss training fail. |
604 | ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions | Zechun Liu; Zhiqiang Shen; Marios Savvides; Kwang-Ting Cheng; | In this paper, we propose several ideas for enhancing a binary network to close its accuracy gap from real-valued networks without incurring any additional computational cost. |
605 | Video Object Detection via Object-level Temporal Aggregation | Chun-Han Yao; Chen Fang; Xiaohui Shen; Yangyue Wan; Ming-Hsuan Yang; | In this work we propose to improve video object detection via temporal aggregation. |
606 | Object Detection with a Unified Label Space from Multiple Datasets | Xiangyun Zhao; Samuel Schulter; Gaurav Sharma; Yi-Hsuan Tsai; Manmohan Chandraker; Ying Wu; | Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. |
607 | Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D | Jonah Philion; Sanja Fidler; | We propose a new end-to-end architecture that directly extracts a bird’s-eye-view representation of a scene given image data from an arbitrary number of cameras. |
608 | Comprehensive Image Captioning via Scene Graph Decomposition | Yiwu Zhong; Liwei Wang; Jianshu Chen; Dong Yu; Yin Li; | We address the challenging problem of image captioning by revisiting the representation of image scene graph. |
609 | Symbiotic Adversarial Learning for Attribute-based Person Search | Yu-Tong Cao; Jingya Wang; Dacheng Tao; | In this paper, we present a symbiotic adversarial learning framework, called SAL. |
610 | Amplifying Key Cues for Human-Object-Interaction Detection | Yang Liu; Qingchao Chen; Andrew Zisserman; | In this paper we introduce two methods to amplify key cues in the image, and also a method to combine these and other cues when considering the interaction between a human and an object. |
611 | Rethinking Few-shot Image Classification: A Good Embedding is All You Need? | Yonglong Tian; Yue Wang; Dilip Krishnan; Joshua B. Tenenbaum; Phillip Isola; | In this work, we show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods. |
612 | Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization | Kyle Min; Jason J. Corso; | Despite recent advances, existing methods for weakly-supervised temporal activity localization struggle to recognize when an activity is not occurring. To address this issue, we propose a novel method named A2CL-PT. |
613 | Action Localization through Continual Predictive Learning | Sathyanarayanan Aakur; Sudeep Sarkar; | In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. |
614 | Generative View-Correlation Adaptation for Semi-Supervised Multi-View Learning | Yunyu Liu; Lichen Wang; Yue Bai; Can Qin; Zhengming Ding; Yun Fu; | To address the challenges, we propose a novel View-Correlation Adaptation (VCA) framework in semi-supervised fashion. |
615 | READ: Reciprocal Attention Discriminator for Image-to-Video Re-Identification | Minho Shim; Hsuan-I Ho; Jinhyung Kim; Dongyoon Wee; | In this work, we focus on image-to-video re-ID which compares a single query image to videos in the gallery. |
616 | 3D Human Shape Reconstruction from a Polarization Image | Shihao Zou; Xinxin Zuo; Yiming Qian; Sen Wang; Chi Xu; Minglun Gong; Li Cheng; | This paper tackles the problem of estimating 3D body shape of clothed humans from single polarized 2D images, i.e. polarization images. |
617 | The Devil is in the Details: Self-Supervised Attention for Vehicle Re-Identification | Pirazh Khorramshahi; Neehar Peri; Jun-cheng Chen; Rama Chellappa; | In this paper, we present Self-supervised Attention for Vehicle Re-identification (SAVER), a novel approach to effectively learn vehicle-specific discriminative features. |
618 | Improving One-stage Visual Grounding by Recursive Sub-query Construction | Zhengyuan Yang; Tianlang Chen; Liwei Wang; Jiebo Luo; | To address this query modeling deficiency, we propose a recursive sub-query construction framework, which reasons between image and query for multiple rounds and reduces the referring ambiguity step by step. |
619 | Multi-level Wavelet-based Generative Adversarial Network for Perceptual Quality Enhancement of Compressed Video | Jianyi Wang; Xin Deng; Mai Xu; Congyong Chen; Yuhang Song; | In this paper, we focus on enhancing the perceptual quality of compressed video. |
620 | Example-Guided Image Synthesis using Masked Spatial-Channel Attention and Self-Supervision | Haitian Zheng; Haofu Liao; Lele Chen; Wei Xiong; Tianlang Chen; Jiebo Luo; | In this paper, we tackle a more challenging and general task, where the exemplar is a scene image that is semantically different from the given label map. |
621 | Content-Consistent Matching for Domain Adaptive Semantic Segmentation | Guangrui Li; Guoliang Kang; Wu Liu; Yunchao Wei; Yi Yang; | This paper considers the adaptation of semantic segmentation from the synthetic source domain to the real target domain. |
622 | AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting | Wenhai Wang; Xuebo Liu; Xiaozhong Ji; Enze Xie; Ding Liang; ZhiBo Yang; Tong Lu; Chunhua Shen; Ping Luo; | Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection. |
623 | History Repeats Itself: Human Motion Prediction via Motion Attention | Wei Mao; Miaomiao Liu; Mathieu Salzmann; | Here, we introduce an attention-based feed-forward network that explicitly leverages this observation. |
624 | Unsupervised Video Object Segmentation with Joint Hotspot Tracking | Lu Zhang; Jianming Zhang; Zhe Lin; Radomír Měch; Huchuan Lu; You He; | Specifically, we propose a Weighted Correlation Siamese Network (WCS-Net) which employs a Weighted Correlation Block (WCB) for encoding the pixel-wise correspondence between a template frame and the search frame. |
625 | SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach | Ailing Zeng; Xiao Sun; Fuyang Huang; Minhao Liu; Qiang Xu; Stephen Lin; | We propose to take advantage of this fact for better generalization to rare and unseen poses. |
626 | CAFE-GAN: Arbitrary Face Attribute Editing with Complementary Attention Feature | Jeong gi Kwak; David K. Han; Hanseok Ko; | To address this unintended altering problem, we propose a novel GAN model which is designed to edit only the parts of a face pertinent to the target attributes by the concept of Complementary Attention Feature (CAFE). |
627 | MimicDet: Bridging the Gap Between One-Stage and Two-Stage Object Detection | Xin Lu; Quanquan Li; Buyu Li; Junjie Yan; | In this paper, we propose MimicDet, a novel and efficient framework to train a one-stage detector by directly mimicking the two-stage features, aiming to bridge the accuracy gap between one-stage and two-stage detectors. |
628 | Latent Topic-aware Multi-Label Classification | Jianghong Ma; Yang Liu; | This paper shows that sample and feature extraction, two important procedures for removing noisy and redundant information encoded in training samples from both the sample and feature perspectives, can be performed effectively and efficiently in the latent topic space by considering topic-based feature-label correlation. |
629 | Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning | Xiangxi Shi; Xu Yang; Jiuxiang Gu; Shafiq Joty; Jianfei Cai; | In this paper, we propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task. |
630 | Attract, Perturb, and Explore: Learning a Feature Alignment Network for Semi-supervised Domain Adaptation | Taekyung Kim; Changick Kim; | We propose an SSDA framework that aims to align features via alleviation of the intra-domain discrepancy. |
631 | Curriculum Manager for Source Selection in Multi-Source Domain Adaptation | Luyu Yang; Yogesh Balaji; Ser-Nam Lim; Abhinav Shrivastava; | In this paper, we propose an adversarial agent that learns a dynamic curriculum for source samples, called Curriculum Manager for Source Selection (CMSS). |
632 | Powering One-shot Topological NAS with Stabilized Share-parameter Proxy | Ronghao Guo; Chen Lin; Chuming Li; Keyu Tian; Ming Sun; Lu Sheng; Junjie Yan; | In this work, we try to enhance the one-shot NAS by exploring high-performing network architectures in our large-scale Topology Augmented Search Space (i.e., over 3.4×10^10 different topological structures). |
633 | Classes Matter: A Fine-grained Adversarial Approach to Cross-domain Semantic Segmentation | Haoran Wang; Tong Shen; Wei Zhang; Ling-Yu Duan; Tao Mei; | To fully exploit the supervision in the source domain, we propose a fine-grained adversarial learning strategy for class-level feature alignment while preserving the internal structure of semantics across domains. |
634 | Boundary-preserving Mask R-CNN | Tianheng Cheng; Xinggang Wang; Lichao Huang; Wenyu Liu; | To remedy this, we propose a conceptually simple yet effective Boundary-guided Mask R-CNN (BMask R-CNN) to leverage object boundary information to improve mask localization accuracy. |
635 | Self-supervised Single-view 3D Reconstruction via Semantic Consistency | Xueting Li; Sifei Liu; Kihwan Kim; Shalini De Mello; Varun Jampani; Ming-Hsuan Yang; Jan Kautz; | The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). |
636 | MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation | Benlin Liu; Yongming Rao; Jiwen Lu; Jie Zhou; Cho-Jui Hsieh; | Specifically, we propose that better soft targets with higher compatibility can be generated by using a label generator to fuse the feature maps from deeper stages in a top-down manner, and we can employ the meta-learning technique to optimize this label generator. |
637 | Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling | Yuliang Zou; Pan Ji; Quoc-Huy Tran; Jia-Bin Huang; Manmohan Chandraker; | In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. |
638 | The Devil is in Classification: A Simple Framework for Long-tail Instance Segmentation | Tao Wang; Yu Li; Bingyi Kang; Junnan Li; Junhao Liew; Sheng Tang; Steven Hoi; Jiashi Feng; | Based on such an observation, we first consider various techniques for improving long-tail classification performance which indeed enhance instance segmentation results. We then propose a simple calibration framework to more effectively alleviate classification head bias with a bi-level class balanced sampling approach. |
639 | What is Learned in Deep Uncalibrated Photometric Stereo? | Guanying Chen; Michael Waechter; Boxin Shi; Kwan-Yee K. Wong; Yasuyuki Matsushita; | In this paper, we analyze the features learned by this method and find that they strikingly resemble attached shadows, shadings, and specular highlights, which are known to provide useful clues in resolving the generalized bas-relief (GBR) ambiguity. |
640 | Prior-based Domain Adaptive Object Detection for Hazy and Rainy Conditions | Vishwanath A. Sindagi; Poojan Oza; Rajeev Yasarla; Vishal M. Patel; | To address this issue, we propose an unsupervised prior-based domain adversarial object detection framework for adapting the detectors to hazy and rainy conditions. |
641 | Adversarial Ranking Attack and Defense | Mo Zhou; Zhenxing Niu; Le Wang; Qilin Zhang; Gang Hua; | In this paper, we propose two attacks against deep ranking systems, i.e., Candidate Attack and Query Attack, that can raise or lower the rank of chosen candidates by adversarial perturbations. |
642 | ReDro: Efficiently Learning Large-sized SPD Visual Representation | Saimunur Rahman; Lei Wang; Changming Sun; Luping Zhou; | This work proposes a novel scheme called Relation Dropout (ReDro). It is inspired by the fact that eigen-decomposition of a block diagonal matrix can be efficiently obtained by decomposing each of its diagonal square matrices, which are of smaller sizes. |
643 | Graph-Based Social Relation Reasoning | Wanhua Li; Yueqi Duan; Jiwen Lu; Jianjiang Feng; Jie Zhou; | In this paper, we propose a simpler, faster, and more accurate method named graph relational reasoning network (GR²N) for social relation recognition. |
644 | EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection | Tengteng Huang; Zhe Liu; Xiwu Chen; Xiang Bai; | In this paper, we aim at addressing two critical issues in the 3D detection task, including the exploitation of multiple sensors (namely LiDAR point cloud and camera image), as well as the inconsistency between the localization and classification confidence. |
645 | Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency | Jiaxiang Shang; Tianwei Shen; Shiwei Li; Lei Zhou; Mingmin Zhen; Tian Fang; Long Quan; | In contrast to previous works that only enforce 2D feature constraints, we propose a self-supervised training architecture by leveraging the multi-view geometry consistency, which provides reliable constraints on face pose and depth estimation. |
646 | Asynchronous Interaction Aggregation for Action Detection | Jiajun Tang; Jin Xia; Xinzhi Mu; Bo Pang; Cewu Lu; | We propose the Asynchronous Interaction Aggregation network (AIA) that leverages different interactions to boost action detection. |
647 | Shape and Viewpoint without Keypoints | Shubham Goel; Angjoo Kanazawa; Jitendra Malik; | We present a learning framework that learns to recover the 3D shape, pose and texture from a single image, trained on an image collection without any ground truth 3D shape, multi-view, camera viewpoints or keypoint supervision. |
648 | Learning Attentive and Hierarchical Representations for 3D Shape Recognition | Jiaxin Chen; Jie Qin; Yuming Shen; Li Liu; Fan Zhu; Ling Shao; | This paper proposes a novel method for 3D shape representation learning, namely Hyperbolic Embedded Attentive Representation (HEAR). |
649 | TF-NAS: Rethinking Three Search Freedoms of Latency-Constrained Differentiable Neural Architecture Search | Yibo Hu; Xiang Wu; Ran He; | In this paper, we rethink three freedoms of differentiable NAS, i.e. operation-level, depth-level and width-level, and propose a novel method, named Three-Freedom NAS (TF-NAS), to achieve both good classification accuracy and precise latency constraint. |
650 | Associative3D: Volumetric Reconstruction from Sparse Views | Shengyi Qian; Linyi Jin; David F. Fouhey; | We propose a new approach that estimates reconstructions, distributions over the camera/object and camera/camera transformations, as well as an inter-view object affinity matrix. |
651 | PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit | Yongqiang Mou; Lei Tan; Hui Yang; Jingying Chen; Leyuan Liu; Rui Yan; Yaohong Huang; | In this paper, we address the problem of recognizing degraded images that suffer from heavy blur or low resolution. |
652 | Memory Selection Network for Video Propagation | Ruizheng Wu; Huaijia Lin; Xiaojuan Qi; Jiaya Jia; | To tackle this challenge, we propose a memory selection network, which learns to select suitable guidance from all previous frames for effective and robust propagation. |
653 | Disentangled Non-local Neural Networks | Minghao Yin; Zhuliang Yao; Yue Cao; Xiu Li; Zheng Zhang; Stephen Lin; Han Hu; | Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. |
654 | URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark | Seonguk Seo; Joon-Young Lee; Bohyung Han; | We propose a unified referring video object segmentation network (URVOS). |
655 | Generalizing Person Re-Identification by Camera-Aware Invariance Learning and Cross-Domain Mixup | Chuanchen Luo; Chunfeng Song; Zhaoxiang Zhang; | As for the latter issue, we propose a novel cross-domain mixup scheme. |
656 | Semi-Supervised Crowd Counting via Self-Training on Surrogate Tasks | Yan Liu; Lingqiao Liu; Peng Wang; Pingping Zhang; Yinjie Lei; | Specifically, we propose a novel semi-supervised crowd counting method built upon two innovative components: (1) a set of inter-related binary segmentation tasks, derived from the original density map regression task, that serve as the surrogate prediction targets; and (2) a self-training scheme that learns the surrogate target predictors from both labeled and unlabeled data while fully exploiting the underlying constraints of these binary segmentation tasks. |
657 | Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training | Hongkai Zhang; Hong Chang; Bingpeng Ma; Naiyan Wang; Xilin Chen; | In this work, we first point out the inconsistency problem between the fixed network settings and the dynamic training procedure, which greatly affects the performance. |
658 | Boosting Decision-based Black-box Adversarial Attacks with Random Sign Flip | Weilun Chen; Zhaoxiang Zhang; Xiaolin Hu; Baoyuan Wu; | In this paper, we show that just randomly flipping the signs of a small number of entries in adversarial perturbations can significantly boost the attack performance. |
659 | Knowledge Transfer via Dense Cross-Layer Mutual-Distillation | Anbang Yao; Dawei Sun; | In this paper, we propose Dense Cross-layer Mutual-distillation (DCM), an improved two-way knowledge transfer (KT) method in which the teacher and student networks are trained collaboratively from scratch. |
660 | Matching Guided Distillation | Kaiyu Yue; Jiangfan Deng; Feng Zhou; | In this paper, we present Matching Guided Distillation (MGD), an efficient and parameter-free method for solving these problems. |
661 | Clustering Driven Deep Autoencoder for Video Anomaly Detection | Yunpeng Chang; Zhigang Tu; Wei Xie; Junsong Yuan; | Since abnormal events usually differ from normal events in appearance and/or motion behavior, we address this issue by designing a novel convolutional autoencoder architecture that captures informative spatial and temporal representations separately. |
662 | Learning to Compose Hypercolumns for Visual Correspondence | Juhong Min; Jongmin Lee; Jean Ponce; Minsu Cho; | In this work, we introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match. |
663 | Stochastic Bundle Adjustment for Efficient and Scalable 3D Reconstruction | Lei Zhou; Zixin Luo; Mingmin Zhen; Tianwei Shen; Shiwei Li; Zhuofei Huang; Tian Fang; Long Quan; | In this work, we propose a stochastic bundle adjustment algorithm which seeks to decompose the reduced camera system (RCS) approximately inside the LM iterations to improve the efficiency and scalability. |
664 | Object-based Illumination Estimation with Rendering-aware Neural Networks | Xin Wei; Guojun Chen; Yue Dong; Stephen Lin; Xin Tong; | We present a scheme for fast environment light estimation from the RGBD appearance of individual objects and their local image areas. |
665 | Progressive Point Cloud Deconvolution Generation Network | Le Hui; Rui Xu; Jin Xie; Jianjun Qian; Jian Yang; | In this paper, we propose an effective point cloud generation method, which can generate multi-resolution point clouds of the same shape from a latent vector. |
666 | SSCGAN: Facial Attribute Editing via Style Skip Connections | Wenqing Chu; Ying Tai; Chengjie Wang; Jilin Li; Feiyue Huang; Rongrong Ji; | In this work, we focus on solving this issue by editing the channel-wise global information denoted as the style feature. |
667 | Negative Pseudo Labeling using Class Proportion for Semantic Segmentation in Pathology | Hiroki Tokunaga; Brian Kenji Iwana; Yuki Teramoto; Akihiko Yoshizawa; Ryoma Bise; | In this paper, we propose a subtype segmentation method that uses such proportional labels as weakly supervised labels. |
668 | Learn to Propagate Reliably on Noisy Affinity Graphs | Lei Yang; Qingqiu Huang; Huaiyi Huang; Linning Xu; Dahua Lin; | To overcome these difficulties, we propose a new framework that allows labels to be propagated reliably on large-scale real-world data. |
669 | Fair DARTS: Eliminating Unfair Advantages in Differentiable Architecture Search | Xiangxiang Chu; Tianbao Zhou; Bo Zhang; Jixiang Li; | Therefore, we present a novel approach called Fair DARTS, in which the exclusive competition is relaxed to be collaborative. |
670 | TANet: Towards Fully Automatic Tooth Arrangement | Guodong Wei; Zhiming Cui; Yumeng Liu; Nenglun Chen; Runnan Chen; Guiqing Li; Wenping Wang; | In this work, we propose a learning-based method for fast and automatic tooth arrangement. |
671 | UnionDet: Union-Level Detector Towards Real-Time Human-Object Interaction Detection | Bumsoo Kim; Taeho Choi; Jaewoo Kang; Hyunwoo J. Kim; | To tackle this problem, we propose UnionDet, a one-stage meta architecture for HOI detection powered by a novel union-level detector that eliminates this additional inference stage by directly capturing the region of interaction. |
672 | GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-aware Supervision | Lei Ke; Shichao Li; Yanan Sun; Yu-Wing Tai; Chi-Keung Tang; | We present a novel end-to-end framework named GSNet (Geometric and Scene-aware Network), which jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from a single urban street view. |
673 | Resolution Switchable Networks for Runtime Efficient Image Recognition | Yikai Wang; Fuchun Sun; Duo Li; Anbang Yao; | We propose a general method to train a single convolutional neural network which is capable of switching image resolutions at inference. |
674 | SMAP: Single-Shot Multi-Person Absolute 3D Pose Estimation | Jianan Zhen; Qi Fang; Jiaming Sun; Wentao Liu; Wei Jiang; Hujun Bao; Xiaowei Zhou; | In this paper, we propose a novel system that first regresses a set of 2.5D representations of body parts and then reconstructs the 3D absolute poses based on these 2.5D representations with a depth-aware part association algorithm. |
675 | Learning to Detect Open Classes for Universal Domain Adaptation | Bo Fu; Zhangjie Cao; Mingsheng Long; Jianmin Wang; | Towards accurate open class detection, we propose Calibrated Multiple Uncertainties (CMU) with a novel transferability measure estimated by a mixture of uncertainty quantities in complementation: entropy, confidence and consistency, defined on conditional probabilities calibrated by a multi-classifier ensemble model. |
676 | Visual Compositional Learning for Human-Object Interaction Detection | Zhi Hou; Xiaojiang Peng; Yu Qiao; Dacheng Tao; | We devise a deep Visual Compositional Learning (VCL) framework, which is a simple yet efficient framework to effectively address this problem. |
677 | Deep Plastic Surgery: Robust and Controllable Image Editing with Human-Drawn Sketches | Shuai Yang; Zhangyang Wang; Jiaying Liu; Zongming Guo; | In this paper, we propose Deep Plastic Surgery, a novel, robust and controllable image editing framework that allows users to interactively edit images using hand-drawn sketch inputs. |
678 | Rethinking Class Activation Mapping for Weakly Supervised Object Localization | Wonho Bae; Junhyug Noh; Gunhee Kim; | We propose three simple but robust techniques that alleviate the problems, including thresholded average pooling, negative weight clamping, and percentile as a standard for thresholding. |
679 | OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features | Anton Osokin; Denis Sumin; Vasily Lomakin; | In this paper, we consider the task of one-shot object detection, which consists in detecting objects defined by a single demonstration. |
680 | Interpretable Neural Network Decoupling | Yuchao Li; Rongrong Ji; Shaohui Lin; Baochang Zhang; Chenqian Yan; Yongjian Wu; Feiyue Huang; Ling Shao; | In this paper, we propose a novel architecture decoupling method to interpret the network from a perspective of investigating its calculation paths. |
681 | Omni-sourced Webly-supervised Learning for Video Recognition | Haodong Duan; Yue Zhao; Yuanjun Xiong; Wentao Liu; Dahua Lin; | We introduce OmniSource, a novel framework for leveraging web data to train video recognition models. |
682 | CurveLane-NAS: Unifying Lane-Sensitive Architecture Search and Adaptive Point Blending | Hang Xu; Shaoju Wang; Xinyue Cai; Wei Zhang; Xiaodan Liang; Zhenguo Li; | In this paper, we propose a novel lane-sensitive architecture search framework named CurveLane-NAS to automatically capture both long-ranged coherent and accurate short-range curve information while unifying both architecture search and post-processing on curve lane predictions via point blending. |
683 | Contextual-Relation Consistent Domain Adaptation for Semantic Segmentation | Jiaxing Huang; Shijian Lu; Dayan Guan; Xiaobing Zhang; | This paper presents an innovative local contextual-relation consistent domain adaptation (CrCDA) technique that aims to achieve local-level consistencies during the global-level alignment. |
684 | Estimating People Flows to Better Count Them in Crowded Scenes | Weizhe Liu; Mathieu Salzmann; Pascal Fua; | In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of regressing the densities directly. |
685 | Generate to Adapt: Resolution Adaption Network for Surveillance Face Recognition | Han Fang; Weihong Deng; Yaoyao Zhong; Jiani Hu; | To avoid this problem, we propose a novel resolution adaption network (RAN) which contains Multi-Resolution Generative Adversarial Networks (MR-GAN) followed by a feature adaption network. |
686 | Learning Feature Embeddings for Discriminant Model based Tracking | Linyu Zheng; Ming Tang; Yingying Chen; Jinqiao Wang; Hanqing Lu; | After observing that the features used in most online discriminatively trained trackers are not optimal, in this paper, we propose a novel and effective architecture to learn optimal feature embeddings for online discriminative tracking. |
687 | WeightNet: Revisiting the Design Space of Weight Networks | Ningning Ma; Xiangyu Zhang; Jiawei Huang; Jian Sun; | We present a conceptually simple, flexible and effective framework for weight generating networks. |
688 | Partially-Shared Variational Auto-encoders for Unsupervised Domain Adaptation with Target Shift | Ryuhei Takahashi; Atsushi Hashimoto; Motoharu Sonogashira; Masaaki Iiyama; | This paper discusses unsupervised domain adaptation (UDA) with target shift, i.e., UDA with the non-identical label distributions of the source and target domains. |
689 | Learning Where to Focus for Efficient Video Object Detection | Zhengkai Jiang; Yu Liu; Ceyuan Yang; Jihao Liu; Peng Gao; Qian Zhang; Shiming Xiang; Chunhong Pan; | Therefore, a novel module called Learnable Spatio-Temporal Sampling (LSTS) has been proposed to learn semantic-level correspondences among frame features accurately. |
690 | Learning Object Permanence from Video | Aviv Shamsian; Ofri Kleinfeld; Amir Globerson; Gal Chechik; | Here we introduce the setup of learning Object Permanence from labeled videos. |
691 | Adaptive Text Recognition through Visual Matching | Chuhan Zhang; Ankush Gupta; Andrew Zisserman; | We introduce a new model that exploits the repetitive nature of characters in languages, and decouples the visual decoding and linguistic modelling stages through intermediate representations in the form of similarity maps. |
692 | Actions as Moving Points | Yixuan Li; Zixu Wang; Limin Wang; Gangshan Wu; | In this paper, we present a conceptually simple, computationally efficient, and more precise action tubelet detection framework, termed MovingCenter Detector (MOC-detector), by treating an action instance as a trajectory of moving points. |
693 | Learning to Exploit Multiple Vision Modalities by Using Grafted Networks | Yuhuang Hu; Tobi Delbruck; Shih-Chii Liu; | This paper proposes a Network Grafting Algorithm (NGA), where a new front end network driven by unconventional visual inputs replaces the front end network of a pretrained deep network that processes intensity frames. |
694 | Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild | Alexander Grabner; Yaming Wang; Peizhao Zhang; Peihong Guo; Tong Xiao; Peter Vajda; Peter M. Roth; Vincent Lepetit; | We present a novel 3D pose refinement approach based on differentiable rendering for objects of arbitrary categories in the wild. |
695 | 3D Fluid Flow Reconstruction Using Compact Light Field PIV | Zhong Li; Yu Ji; Jingyi Yu; Jinwei Ye; | In this paper, we present a PIV solution that uses a compact lenslet-based light field camera to track dense particles floating in the fluid and reconstruct the 3D fluid flow. |
696 | Contextual Diversity for Active Learning | Sharat Agarwal; Himanshu Arora; Saket Anand; Chetan Arora; | Since the context is difficult to evaluate in the absence of ground-truth labels, we introduce the notion of contextual diversity that captures the confusion associated with spatially co-occurring classes. |
697 | Temporal Aggregate Representations for Long-Range Video Understanding | Fadime Sener; Dipika Singhania; Angela Yao; | In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. |
698 | Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition | Zhe Niu; Brian Mak; | In this paper, we propose novel stochastic modeling of various components of a continuous sign language recognition (CSLR) system that is based on the transformer encoder and connectionist temporal classification (CTC). |
699 | General 3D Room Layout from a Single View by Render-and-Compare | Sinisa Stekovic; Shreyas Hampali; Mahdi Rad; Sayan Deb Sarkar; Friedrich Fraundorfer; Vincent Lepetit; | We present a novel method to reconstruct the 3D layout of a room (walls, floors, ceilings) from a single perspective view in challenging conditions, in contrast to previous single-view methods restricted to cuboid-shaped layouts. |
700 | Neural Dense Non-Rigid Structure from Motion with Latent Space Constraints | Vikramjit Sidhu; Edgar Tretschk; Vladislav Golyanik; Antonio Agudo; Christian Theobalt; | We introduce the first dense neural non-rigid structure from motion (N-NRSfM) approach, which can be trained end-to-end in an unsupervised manner from 2D point tracks. |
701 | Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability | Anelise Newman; Camilo Fosco; Vincent Casser; Allen Lee; Barry McNamara; Aude Oliva; | We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays. |
702 | Yet Another Intermediate-Level Attack | Qizhang Li; Yiwen Guo; Hao Chen; | In this paper, we propose a novel method to enhance the black-box transferability of baseline adversarial examples. |
703 | Topology-Change-Aware Volumetric Fusion for Dynamic Scene Reconstruction | Chao Li; Xiaohu Guo; | In this paper, the classic framework is re-designed to enable 4D reconstruction of dynamic scenes under topology changes, by introducing a novel structure of Non-manifold Volumetric Grid to the re-design of both TSDF and EDG, which allows connectivity updates by cell splitting and replication. |
704 | Early Exit Or Not: Resource-Efficient Blind Quality Enhancement for Compressed Images | Qunliang Xing; Mai Xu; Tianyi Li; Zhenyu Guan; | In this paper, we propose a resource-efficient blind quality enhancement (RBQE) approach for compressed images. |
705 | PatchNets: Patch-Based Generalizable Deep Implicit 3D Shape Representations | Edgar Tretschk; Ayush Tewari; Vladislav Golyanik; Michael Zollhöfer; Carsten Stoll; Christian Theobalt; | In this paper, we present a mid-level patch-based surface representation. |
706 | How does Lipschitz Regularization Influence GAN Training? | Yipeng Qin; Niloy Mitra; Peter Wonka; | In this work, we uncover an even more important effect of Lipschitz regularization by examining its impact on the loss function: It degenerates GAN loss functions to almost linear ones by restricting their domain and interval of attainable gradient values. |
707 | Infrastructure-based Multi-Camera Calibration using Radial Projections | Yukai Lin; Viktor Larsson; Marcel Geppert; Zuzana Kukelova; Marc Pollefeys; Torsten Sattler; | In this paper, we propose to fully calibrate a multi-camera system from scratch using an infrastructure-based approach. |
708 | MotionSqueeze: Neural Motion Feature Learning for Video Understanding | Heeseung Kwon; Manjin Kim; Suha Kwak; Minsu Cho; | In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. |
709 | Polarized Optical-Flow Gyroscope | Masada Tzabari; Yoav Y. Schechner; | We merge, by generalization, two principles of passive optical sensing of motion. |
710 | Online Meta-Learning for Multi-Source and Semi-Supervised Domain Adaptation | Da Li; Timothy Hospedales; | In this paper we take an orthogonal perspective and propose a framework to further enhance performance by meta-learning the initial conditions of existing DA algorithms. |
711 | An Ensemble of Epoch-wise Empirical Bayes for Few-shot Learning | Yaoyao Liu; Bernt Schiele; Qianru Sun; | In this paper, we propose to meta-learn the ensemble of epoch-wise empirical Bayes models (E3BM) to achieve robust predictions. |
712 | On the Effectiveness of Image Rotation for Open Set Domain Adaptation | Silvia Bucci; Mohammad Reza Loghmani; Tatiana Tommasi; | We propose a novel method to address both of these problems using the self-supervised task of rotation recognition. |
713 | Combining Task Predictors via Enhancing Joint Predictability | Kwang In Kim; Christian Richardt; Hyung Jin Chang; | We present a new predictor combination algorithm that improves the target by i) measuring the relevance of references based on their capabilities in predicting the target, and ii) strengthening such estimated relevance. |
714 | Multi-Scale Positive Sample Refinement for Few-Shot Object Detection | Jiaxi Wu; Songtao Liu; Di Huang; Yunhong Wang; | To this end, we propose a Multi-scale Positive Sample Refinement (MPSR) approach to enrich object scales in FSOD. |
715 | Single-Image Depth Prediction Makes Feature Matching Easier | Carl Toft; Daniyar Turmukhambetov; Torsten Sattler; Fredrik Kahl; Gabriel J. Brostow; | In this paper, we propose a surprisingly effective enhancement to local feature extraction, which improves matching. |
716 | Deep Reinforced Attention Learning for Quality-Aware Visual Recognition | Duo Li; Qifeng Chen; | In this paper, we build upon the weakly-supervised generation mechanism of intermediate attention maps in any convolutional neural network and disclose the effectiveness of attention modules more straightforwardly to fully exploit their potential. |
717 | CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization | Yuxi Li; Weiyao Lin; John See; Ning Xu; Shugong Xu; Ke Yan; Cong Yang; | In this paper, we propose Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatiotemporal action localization. |
718 | Learning Joint Spatial-Temporal Transformations for Video Inpainting | Yanhong Zeng; Jianlong Fu; Hongyang Chao; | In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. |
719 | Single Path One-Shot Neural Architecture Search with Uniform Sampling | Zichao Guo; Xiangyu Zhang; Haoyuan Mu; Wen Heng; Zechun Liu; Yichen Wei; Jian Sun; | This work proposes a Single Path One-Shot model to address the challenge in training. |
720 | Learning to Generate Novel Domains for Domain Generalization | Kaiyang Zhou; Yongxin Yang; Timothy Hospedales; Tao Xiang; | This paper focuses on domain generalization (DG), the task of learning from multiple source domains a model that generalizes well to unseen domains. |
721 | Continuous Adaptation for Interactive Object Segmentation by Learning from Corrections | Theodora Kontogianni; Michael Gygli; Jasper Uijlings; Vittorio Ferrari; | Instead, we recognize that user corrections can serve as sparse training examples and we propose a method that capitalizes on that idea to update the model parameters on-the-fly to the data at hand. |
722 | Impact of base dataset design on few-shot image classification | Othman Sbai; Camille Couprie; Mathieu Aubry; | In this paper, we systematically study the effect of variations in the training data by evaluating deep features trained on different image sets in a few-shot classification setting. |
723 | Invertible Zero-Shot Recognition Flows | Yuming Shen; Jie Qin; Lei Huang; Li Liu; Fan Zhu; Ling Shao; | To tackle the above limitations, for the first time, this work incorporates a new family of generative models (i.e., flow-based models) into ZSL. |
724 | GeoLayout: Geometry Driven Room Layout Estimation Based on Depth Maps of Planes | Weidong Zhang; Wei Zhang; Yinda Zhang; | In this work, we propose to incorporate geometric reasoning to deep learning for layout estimation. Moreover, we present a new dataset with pixel-level depth annotation of dominant planes. |
725 | Location Sensitive Image Retrieval and Tagging | Raul Gomez; Jaume Gibert; Lluis Gomez; Dimosthenis Karatzas; | In this work, we address the task of image retrieval related to a given tag conditioned on a certain location on Earth. |
726 | Joint 3D Layout and Depth Prediction from a Single Indoor Panorama Image | Wei Zeng; Sezer Karaoglu; Theo Gevers; | In this paper, we propose a method which jointly learns layout prediction and depth estimation from a single indoor panorama image. |
727 | Guessing State Tracking for Visual Dialogue | Wei Pang; Xiaojie Wang; | This paper proposes a guessing state for the Guesser and regards guessing as a process in which the guessing state changes over the course of a dialogue. |
728 | Memory-Efficient Incremental Learning Through Feature Adaptation | Ahmet Iscen; Jeffrey Zhang; Svetlana Lazebnik; Cordelia Schmid; | We introduce an approach for incremental learning that preserves feature descriptors of training images from previously learned classes, instead of the images themselves, unlike most existing work. |
729 | Neural Voice Puppetry: Audio-driven Facial Reenactment | Justus Thies; Mohamed Elgharib; Ayush Tewari; Christian Theobalt; Matthias Nießner; | We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. |
730 | One-Shot Unsupervised Cross-Domain Detection | Antonio D’Innocente; Francesco Cappio Borlino; Silvia Bucci; Barbara Caputo; Tatiana Tommasi; | This paper addresses this setting, presenting an object detection algorithm able to perform unsupervised adaptation across domains by using only one target sample, seen at test time. |
731 | Stochastic Frequency Masking to Improve Super-Resolution and Denoising Networks | Majed El Helou; Ruofan Zhou; Sabine Süsstrunk; | We present an analysis, in the frequency domain, of degradation-kernel overfitting in super-resolution and introduce a conditional learning perspective that extends to both super-resolution and denoising. |
732 | Probabilistic Future Prediction for Video Scene Understanding | Anthony Hu; Fergal Cotter; Nikhil Mohan; Corina Gurau; Alex Kendall; | We present a novel deep learning architecture for probabilistic future prediction from video. |
733 | Suppressing Mislabeled Data via Grouping and Self-Attention | Xiaojiang Peng; Kai Wang; Zhaoyang Zeng; Qing Li; Jianfei Yang; Yu Qiao; | To suppress the impact of mislabeled data, this paper proposes a conceptually simple yet efficient training block, termed Attentive Feature Mixup (AFM), which pays more attention to clean samples and less to mislabeled ones via sample interactions in small groups. |
734 | Class-wise Dynamic Graph Convolution for Semantic Segmentation | Hanzhe Hu; Deyi Ji; Weihao Gan; Shuai Bai; Wei Wu; Junjie Yan; | To avoid the potentially misleading aggregation of contextual information in previous work, we propose a class-wise dynamic graph convolution (CDGC) module to adaptively propagate information. |
735 | Character-Preserving Coherent Story Visualization | Yun-Zhu Song; Zhi Rui Tam; Hung-Jen Chen; Huiao-Han Lu; Hong-Han Shuai; | Therefore, we propose a new framework named Character-Preserving Coherent Story Visualization (CP-CSV) to tackle the challenges. |
736 | GINet: Graph Interaction Network for Scene Parsing | Tianyi Wu; Yu Lu; Yu Zhu; Chuang Zhang; Ming Wu; Zhanyu Ma; Guodong Guo; | In this work, we explore how to incorporate linguistic knowledge to promote context reasoning over image regions by proposing a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss). |
737 | Tensor Low-Rank Reconstruction for Semantic Segmentation | Wanli Chen; Xinge Zhu; Ruoqi Sun; Junjun He; Ruiyu Li; Xiaoyong Shen; Bei Yu; | In this paper, we propose a new approach to model the 3D context representations, which not only avoids the space compression, but also tackles the high-rank difficulty. |
738 | Attentive Normalization | Xilai Li; Wei Sun; Tianfu Wu; | In this paper, we propose a light-weight integration between the two schemes of feature normalization and feature attention. |
739 | Count- and Similarity-aware R-CNN for Pedestrian Detection | Jin Xie; Hisham Cholakkal; Rao Muhammad Anwer; Fahad Shahbaz Khan; Yanwei Pang; Ling Shao; Mubarak Shah; | We propose an approach that leverages pedestrian count and proposal similarity information within a two-stage pedestrian detection framework. |
740 | TRADI: Tracking Deep Neural network Weight Distributions | Gianni Franchi; Andrei Bursuc; Emanuel Aldea; Séverine Dubuisson; Isabelle Bloch; | In this work we propose to make use of this knowledge and leverage it for computing the distributions of the weights of the DNN. |
741 | Spatiotemporal Attacks for Embodied Agents | Aishan Liu; Tairan Huang; Xianglong Liu; Yitao Xu; Yuqing Ma; Xinyun Chen; Stephen J. Maybank; Dacheng Tao; | In this work, we take the first step to study adversarial attacks for embodied agents. |
742 | Caption-Supervised Face Recognition: Training a State-of-the-Art Face Model without Manual Annotation | Qingqiu Huang; Lei Yang; Huaiyi Huang; Tong Wu; Dahua Lin; | In this work, we propose a simple yet effective method, which trains a face recognition model by progressively expanding the labeled set via both selective propagation and caption-driven expansion. |
743 | Unselfie: Translating Selfies to Neutral-pose Portraits in the Wild | Liqian Ma; Zhe Lin; Connelly Barnes; Alexei A. Efros; Jingwan Lu; | To address this issue, we introduce unselfie, a novel photographic transformation that automatically translates a selfie into a neutral-pose portrait. |
744 | Design and Interpretation of Universal Adversarial Patches in Face Detection | Xiao Yang; Fangyun Wei; Hongyang Zhang; Jun Zhu; | We propose new optimization-based approaches to automatic design of universal adversarial patches for varying goals of the attack, including scenarios in which true positives are suppressed without introducing false positives. |
745 | Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild | Yang Xiao; Renaud Marlet; | We propose a meta-learning framework that can be applied to both tasks, possibly including 3D data. |
746 | Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints | Adrian Spurr; Umar Iqbal; Pavlo Molchanov; Otmar Hilliges; Jan Kautz; | Embracing this challenge we propose a set of novel losses that constrain the prediction of a neural network to lie within the range of biomechanically feasible 3D hand configurations. |
747 | Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification | Mang Ye; Jianbing Shen; David J. Crandall; Ling Shao; Jiebo Luo; | In this paper, we propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID. |
748 | Contextual Heterogeneous Graph Network for Human-Object Interaction Detection | Hai Wang; Wei-shi Zheng; Ling Yingbiao; | In this work, we address such a problem for HOI task by proposing a heterogeneous graph network that models humans and objects as different kinds of nodes and incorporates intra-class messages between homogeneous nodes and inter-class messages between heterogeneous nodes. |
749 | Zero-Shot Image Super-Resolution with Depth Guided Internal Degradation Learning | Xi Cheng; Zhenyong Fu; Jian Yang; | In this work, we present a simple yet effective zero-shot image super-resolution model. |
750 | A Closest Point Proposal for MCMC-based Probabilistic Surface Registration | Dennis Madsen; Andreas Morel-Forster; Patrick Kahr; Dana Rahbani; Thomas Vetter; Marcel Lüthi; | We propose to view non-rigid surface registration as a probabilistic inference problem. |
751 | Interactive Video Object Segmentation Using Global and Local Transfer Modules | Yuk Heo; Yeong Jun Koh; Chang-Su Kim; | An interactive video object segmentation algorithm, which takes scribble annotations on query objects as input, is proposed in this paper. |
752 | End-to-end Interpretable Learning of Non-blind Image Deblurring | Thomas Eboli; Jian Sun; Jean Ponce; | We propose to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels. |
753 | Employing Multi-Estimations for Weakly-Supervised Semantic Segmentation | Junsong Fan; Zhaoxiang Zhang; Tieniu Tan; | Instead of struggling to refine a single seed, we propose a novel approach to alleviate the inaccurate seed problem by leveraging the segmentation model’s robustness to learn from multiple seeds. |
754 | Learning Noise-Aware Encoder-Decoder from Noisy Labels by Alternating Back-Propagation for Saliency Detection | Jing Zhang; Jianwen Xie; Nick Barnes; | In this paper, we propose a noise-aware encoder-decoder framework to disentangle a clean saliency predictor from noisy training examples, where the noisy labels are generated by unsupervised handcrafted feature-based methods. |
755 | Rethinking Image Deraining via Rain Streaks and Vapors | Yinglong Wang; Yibing Song; Chao Ma; Bing Zeng; | In this work, we reformulate rain streaks as transmission medium together with vapors to model rain imaging. |
756 | Finding Non-Uniform Quantization Schemes using Multi-Task Gaussian Processes | Marcelo Gennari do Nascimento; Theo W. Costain; Victor Adrian Prisacariu; | We propose a novel method for neural network quantization that casts the neural architecture search problem as one of hyperparameter search to find non-uniform bit distributions throughout the layers of a CNN. |
757 | Is Sharing of Egocentric Video Giving Away Your Biometric Signature? | Daksh Thapar; Chetan Arora; Aditya Nigam; | In this work, we create a novel kind of privacy attack by extracting the wearer’s gait profile, a well-known biometric signature, from such optical flow in the egocentric videos. |
758 | Captioning Images Taken by People Who Are Blind | Danna Gurari; Yinan Zhao; Meng Zhang; Nilavra Bhattacharya; | Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. |
759 | Improving Semantic Segmentation via Decoupled Body and Edge Supervision | Xiangtai Li; Xia Li; Li Zhang; Guangliang Cheng; Jianping Shi; Zhouchen Lin; Shaohua Tan; Yunhai Tong; | In this paper, a new paradigm for semantic segmentation is proposed. |
760 | Conditional Entropy Coding for Efficient Video Compression | Jerry Liu; Shenlong Wang; Wei-Chiu Ma; Meet Shah; Rui Hu; Pranaab Dhawan; Raquel Urtasun; | We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames. |
761 | Differentiable Feature Aggregation Search for Knowledge Distillation | Yushuo Guan; Pengyu Zhao; Bingxuan Wang; Yuanxing Zhang; Cong Yao; Kaigui Bian; Jian Tang; | Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method, motivated by DARTS in neural architecture search, to efficiently find the aggregations. |
762 | Attention Guided Anomaly Localization in Images | Shashanka Venkataramanan; Kuan-Chuan Peng; Rajat Vikram Singh; Abhijit Mahalanobis; | Without the need for anomalous training images, we propose Convolutional Adversarial Variational autoencoder with Guided Attention (CAVGA), which localizes the anomaly with a convolutional latent variable to preserve the spatial information. |
763 | Self-supervised Video Representation Learning by Pace Prediction | Jiangliu Wang; Jianbo Jiao; Yun-Hui Liu; | This paper addresses the problem of self-supervised video representation learning from a new perspective: video pace prediction. |
764 | Full-Body Awareness from Partial Observations | Chris Rockwell; David F. Fouhey; | We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; |
765 | Reinforced Axial Refinement Network for Monocular 3D Object Detection | Lijie Liu; Chufan Wu; Jiwen Lu; Lingxi Xie; Jie Zhou; Qi Tian; | To improve the efficiency of sampling, we propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step. |
766 | Self-Supervised Multi-Task Procedure Learning from Instructional Videos | Ehsan Elhamifar; Dat Huynh; | We address the problem of unsupervised procedure learning from instructional videos of multiple tasks using Deep Neural Networks (DNNs). |
767 | CosyPose: Consistent multi-view multi-object 6D pose estimation | Yann Labbé; Justin Carpentier; Mathieu Aubry; Josef Sivic; | We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. |
768 | In-Domain GAN Inversion for Real Image Editing | Jiapeng Zhu; Yujun Shen; Deli Zhao; Bolei Zhou; | To solve this problem, we propose an in-domain GAN inversion approach, which not only faithfully reconstructs the input image but also ensures that the inverted code is semantically meaningful for editing. |
769 | Key Frame Proposal Network for Efficient Pose Estimation in Videos | Yuexi Zhang; Yin Wang; Octavia Camps; Mario Sznaier; | In this paper, we propose a novel method combining local approaches with global context. |
770 | Exchangeable Deep Neural Networks for Set-to-Set Matching and Learning | Yuki Saito; Takuma Nakamura; Hirotaka Hachiya; Kenji Fukumizu; | In this study, we propose a novel deep learning architecture to address the above-mentioned difficulties and also an efficient training framework for set-to-set matching. |
771 | Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs | Robin Rombach; Patrick Esser; Björn Ommer; | We present an approach based on INNs that (i) recovers the task-specific, learned invariances by disentangling the remaining factor of variation in the data and that (ii) invertibly transforms these recovered invariances combined with the model representation into an equally expressive one with accessible semantic concepts. |
772 | Cross-Modal Weighting Network for RGB-D Salient Object Detection | Gongyang Li; Zhi Liu; Linwei Ye; Yang Wang; Haibin Ling; | In this paper, we propose a novel Cross-Modal Weighting (CMW) strategy to encourage comprehensive interactions between RGB and depth channels for RGB-D SOD. |
773 | Open-set Adversarial Defense | Rui Shao; Pramuditha Perera; Pong C. Yuen; Vishal M. Patel; | In this paper, we show that open-set recognition systems are vulnerable to adversarial attacks. |
774 | Deep Image Compression using Decoder Side Information | Sharon Ayzik; Shai Avidan; | We present a Deep Image Compression neural network that relies on side information, which is only available to the decoder. |
775 | Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation | Jeevan Devaranjan; Amlan Kar; Sanja Fidler; | In this paper, we propose a generative model of synthetic scenes that reduces the distribution gap between the scene structure of generated scenes and a real target image dataset. |
776 | A Generic Visualization Approach for Convolutional Neural Networks | Ahmed Taha; Xitong Yang; Abhinav Shrivastava; Larry Davis; | We formulate attention visualization as a constrained optimization problem. |
777 | Interactive Annotation of 3D Object Geometry using 2D Scribbles | Tianchang Shen; Jun Gao; Amlan Kar; Sanja Fidler; | In this paper, we propose an interactive framework for annotating 3D object geometry from both point cloud data and RGB imagery. |
778 | Hierarchical Kinematic Human Mesh Recovery | Georgios Georgakis; Ren Li; Srikrishna Karanam; Terrence Chen; Jana Košecká; Ziyan Wu; | In this work, we address this gap by proposing a new technique for regression of human parametric model that is explicitly informed by the known hierarchical structure, including joint interdependencies of the model. |
779 | Multi-Loss Rebalancing Algorithm for Monocular Depth Estimation | Jae-Han Lee; Chang-Su Kim; | An algorithm to combine multiple loss terms adaptively for training a monocular depth estimator is proposed in this work. |
780 | 3D Bird Reconstruction: a Dataset, Model, and Shape Recovery from a Single View | Marc Badger; Yufu Wang; Adarsh Modh; Ammon Perkes; Nikos Kolotouros; Bernd G. Pfrommer; Marc F. Schmidt; Kostas Daniilidis; | To address this problem, we first introduce a model and multi-view optimization approach, which we use to capture the unique shape and pose space displayed by live birds. We then introduce a pipeline and experiments for keypoint, mask, pose, and shape regression that recovers accurate avian postures from single views. |
781 | We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos | Alex Andonian; Camilo Fosco; Mathew Monfort; Allen Lee; Rogerio Feris; Carl Vondrick; Aude Oliva; | Here, we propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. |
782 | Joint Optimization for Multi-Person Shape Models from Markerless 3D-Scans | Samuel Zeitvogel; Johannes Dornheim; Astrid Laubenheimer; | We propose a markerless end-to-end training framework for parametric 3D human shape models. |
783 | Accurate RGB-D Salient Object Detection via Collaborative Learning | Wei Ji; Jingjing Li; Miao Zhang; Yongri Piao; Huchuan Lu; | In this paper, we propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a more efficient way, which effectively addresses those problems. |
784 | Finding Your (3D) Center: 3D Object Detection Using a Learned Loss | David Griffiths; Jan Boehm; Tobias Ritschel; | Addressing this disparity, we introduce a new optimization procedure, which allows training for 3D detection with raw 3D scans while using as little as 5% of the object labels and still achieving comparable performance. |
785 | Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection | Ganlong Zhao; Guanbin Li; Ruijia Xu; Liang Lin; | In this paper, we are the first to reveal that the region proposal network (RPN) and region proposal classifier (RPC) in the prevalent two-stage detectors (e.g., Faster R-CNN) demonstrate significantly different transferability when facing a large domain gap. |
786 | Two Stream Active Query Suggestion for Active Learning in Connectomics | Zudi Lin; Donglai Wei; Won-Dong Jang; Siyan Zhou; Xupeng Chen; Xueying Wang; Richard Schalek; Daniel Berger; Brian Matejek; Lee Kamentsky; Adi Peleg; Daniel Haehn; Thouis Jones; Toufiq Parag; Jeff Lichtman; Hanspeter Pfister; | To tackle this, we propose a two-stream active query suggestion approach. |
787 | Pix2Surf: Learning Parametric 3D Surface Models of Objects from Images | Jiahui Lei; Srinath Sridhar; Paul Guerrero; Minhyuk Sung; Niloy Mitra; Leonidas J. Guibas; | We investigate the problem of learning to generate 3D parametric surface representations for novel object instances, as seen from one or more views. |
788 | 6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference | Mai Bui; Tolga Birdal; Haowen Deng; Shadi Albarqouni; Leonidas Guibas; Slobodan Ilic; Nassir Navab; | We present a multimodal camera relocalization framework that captures ambiguities and uncertainties with continuous mixture models defined |
789 | Modeling Artistic Workflows for Image Generation and Editing | Hung-Yu Tseng; Matthew Fisher; Jingwan Lu; Yijun Li; Vladimir Kim; Ming-Hsuan Yang; | Motivated by the above observations, we propose a generative model that follows a given artistic workflow, enabling both multi-stage image generation as well as multi-stage image editing of an existing piece of art. |
790 | A Large-scale Annotated Mechanical Components Benchmark for Classification and Retrieval Tasks with Deep Neural Networks | Sangpil Kim; Hyung-gun Chi; Xiao Hu; Qixing Huang; Karthik Ramani; | We introduce a large-scale annotated mechanical components benchmark for classification and retrieval tasks named Mechanical Components Benchmark (MCB): a large-scale dataset of 3D objects of mechanical components. |
791 | Hidden Footprints: Learning Contextual Walkability from 3D Human Trails | Jin Sun; Hadar Averbuch-Elor; Qianqian Wang; Noah Snavely; | We tackle this problem by leveraging information from existing datasets, without any additional labeling. |
792 | Self-Supervised Learning of Audio-Visual Objects from Video | Triantafyllos Afouras; Andrew Owens; Joon Son Chung; Andrew Zisserman; | Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. |
793 | GAN-based Garment Generation Using Sewing Pattern Images | Yu Shen; Junbang Liang; Ming C. Lin; | We propose a unified method using a generative network. |
794 | Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach | Chaitanya Ahuja; Dong Won Lee; Yukiko I. Nakano; Louis-Philippe Morency; | In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker’s gestures in an end-to-end manner. |
795 | An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds | Rui Huang; Wanyue Zhang; Abhijit Kundu; Caroline Pantofaru; David A Ross; Thomas Funkhouser; Alireza Fathi; | To address this problem, in this paper we propose a sparse LSTM-based multi-frame 3D object detection algorithm. |
796 | Monotonicity Prior for Cloud Tomography | Tamar Loeub; Aviad Levis; Vadim Holodovsky; Yoav Y. Schechner; | We introduce a differentiable monotonicity prior, useful to express signals of monotonic tendency. |
797 | Learning Trailer Moments in Full-Length Movies with Co-Contrastive Attention | Lezi Wang; Dong Liu; Rohit Puri; Dimitris N. Metaxas; | We introduce a novel ranking network that utilizes the Co-Attention between movies and trailers as guidance to generate the training pairs, where the moments highly correlated with trailers are expected to be scored higher than the uncorrelated moments. |
798 | Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval | Christopher Thomas; Adriana Kovashka; | We propose novel within-modality losses which encourage semantic coherency, which does not necessarily align with visual coherency, in both the text and image subspaces. |
799 | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Vishvak Murahari; Dhruv Batra; Devi Parikh; Abhishek Das; | Instead, we present an approach to leverage pretraining on related vision-language datasets before transferring to visual dialog. |
800 | Learning to Generate Grounded Visual Captions without Localization Supervision | Chih-Yao Ma; Yannis Kalantidis; Ghassan AlRegib; Peter Vajda; Marcus Rohrbach; Zsolt Kira; | In this work, we help the model to achieve this via a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it, and then reconstruct the sentence from the localized image region(s) to match the ground-truth. |
801 | Neural Hair Rendering | Menglei Chai; Jian Ren; Sergey Tulyakov; | In this paper, we propose a generic neural-based hair rendering pipeline that can synthesize photo-realistic images from virtual 3D hair models. |
802 | JNR: Joint-based Neural Rig Representation for Compact 3D Face Modeling | Noranart Vesdapunt; Mitch Rundle; HsiangTao Wu; Baoyuan Wang; | In this paper, we introduce a novel approach to learn a 3D face model using a joint-based face rig and a neural skinning network. |
803 | On Disentangling Spoof Trace for Generic Face Anti-Spoofing | Yaojie Liu; Joel Stehouwer; Xiaoming Liu; | This work designs a novel adversarial learning framework to disentangle the spoof traces from input faces as a hierarchical combination of patterns at multiple scales. |
804 | Streaming Object Detection for 3-D Point Clouds | Wei Han; Zhengdong Zhang; Benjamin Caine; Brandon Yang; Christoph Sprunk; Ouais Alsharif; Jiquan Ngiam; Vijay Vasudevan; Jonathon Shlens; Zhifeng Chen; | In this work, we explore how to build an object detector that removes this artificial latency constraint, and instead operates on native streaming data in order to significantly reduce latency. |
805 | NAS-DIP: Learning Deep Image Prior with Neural Architecture Search | Yun-Chun Chen; Chen Gao; Esther Robb; Jia-Bin Huang; | Building upon a generic U-Net architecture, our core contribution lies in designing new search spaces for (1) an upsampling cell and (2) a pattern of cross-scale residual connections. |
806 | Learning to Learn in a Semi-Supervised Fashion | Yun-Chun Chen; Chao-Te Chou; Yu-Chiang Frank Wang; | To address semi-supervised learning from both labeled and unlabeled data, we present a novel meta-learning scheme. |
807 | FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning | Chia-Wen Kuo; Chih-Yao Ma; Jia-Bin Huang; Zsolt Kira; | In this paper, we propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations. |
808 | RadarNet: Exploiting Radar for Robust Perception of Dynamic Objects | Bin Yang; Runsheng Guo; Ming Liang; Sergio Casas; Raquel Urtasun; | To better address this, we propose a new solution that exploits both LiDAR and Radar sensors for perception. |
809 | Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation | Medhini Narasimhan; Erik Wijmans; Xinlei Chen; Trevor Darrell; Dhruv Batra; Devi Parikh; Amanpreet Singh; | We introduce a learning-based approach for room navigation using semantic maps. |
810 | Learning to Separate: Detecting Heavily-Occluded Objects in Urban Scenes | Chenhongyi Yang; Vitaly Ablavsky; Kaihong Wang; Qi Feng; Margrit Betke; | In this work, we propose a novel Non-Maximum-Suppression (NMS) algorithm that dramatically improves the detection recall while maintaining high precision in scenes with heavy occlusions. |
811 | Towards causal benchmarking of bias in face analysis algorithms | Guha Balakrishnan; Yuanjun Xiong; Wei Xia; Pietro Perona; | To address this problem we develop an experimental method for measuring algorithmic bias of face analysis algorithms, which directly manipulates the attributes of interest, e.g., gender and skin tone, in order to reveal causal links between attribute variation and performance change. |
812 | Learning and Memorizing Representative Prototypes for 3D Point Cloud Semantic and Instance Segmentation | Tong He; Dong Gong; Zhi Tian; Chunhua Shen; | To tackle the above issue, we propose a memory-augmented network that learns and memorizes the representative prototypes that encode both geometry and semantic information. |
813 | Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | Noa Garcia; Yuta Nakashima; | Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. |
814 | Transformation Consistency Regularization – A Semi-Supervised Paradigm for Image-to-Image Translation | Aamir Mustafa; Rafal K. Mantiuk; | We propose Transformation Consistency Regularization, which delves into the more challenging setting of image-to-image translation, a task that semi-supervised algorithms have left unexplored. |
815 | LIRA: Lifelong Image Restoration from Unknown Blended Distortions | Jianzhao Liu; Jianxin Lin; Xin Li; Wei Zhou; Sen Liu; Zhibo Chen; | When the input is degraded by a new distortion, inspired by adult neurogenesis in the human memory system, we develop a neural growing strategy where the previously trained model can incorporate a new expert branch and continually accumulate new knowledge without interfering with learned knowledge. |
816 | HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization | Jiahao Lin; Gim Hee Lee; | In this paper, we propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization in the camera coordinate space. |
817 | SOLO: Segmenting Objects by Locations | Xinlong Wang; Tao Kong; Chunhua Shen; Yuning Jiang; Lei Li; | We present a new, embarrassingly simple approach to instance segmentation in images. |
818 | Learning to See in the Dark with Events | Song Zhang; Yu Zhang; Zhe Jiang; Dongqing Zou; Jimmy Ren; Bin Zhou; | In this paper, we propose learning to see in the dark by translating the HDR events in low light to canonical sharp images as if captured in daylight. |
819 | Trajectron++: Dynamically-Feasible Trajectory Forecasting With Heterogeneous Data | Tim Salzmann; Boris Ivanovic; Punarjay Chakravarty; Marco Pavone; | Towards this end, we present Trajectron++, a modular, graph-structured recurrent model that forecasts the trajectories of a general number of diverse agents while incorporating agent dynamics and heterogeneous data (e.g., semantic maps). |
820 | Context-Gated Convolution | Xudong Lin; Lin Ma; Wei Liu; Shih-Fu Chang; | Motivated by this, we propose a novel Context-Gated Convolution (CGC) to explicitly modify the weights of convolutional layers adaptively under the guidance of global context. |
821 | Polynomial Regression Network for Variable-Number Lane Detection | Bingke Wang; Zilei Wang; Yixin Zhang; | In this work, we propose to use polynomial curves to represent traffic lanes and then propose a novel polynomial regression network (PRNet) to directly predict them, where semantic segmentation is not involved. |
822 | Structural Deep Metric Learning for Room Layout Estimation | Wenzhao Zheng; Jiwen Lu; Jie Zhou; | In this paper, we propose a structural deep metric learning (SDML) method for room layout estimation, which aims to recover the 3D spatial layout of a cluttered indoor scene from a monocular RGB image. |
823 | Adaptive Task Sampling for Meta-Learning | Chenghao Liu; Zhihao Wang; Doyen Sahoo; Yuan Fang; Kun Zhang; Steven C.H. Hoi; | In this paper, we propose an adaptive task sampling method to improve the generalization performance. |
824 | Deep Complementary Joint Model for Complex Scene Registration and Few-shot Segmentation on Medical Images | Yuting He; Tiantian Li; Guanyu Yang; Youyong Kong; Yang Chen; Huazhong Shu; Jean-Louis Coatrieux; Jean-Louis Dillenseger; Shuo Li; | We propose a novel Deep Complementary Joint Model (DeepRS) for complex scene registration and few-shot segmentation. |
825 | Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems | Kailai Zhou; Linsen Chen; Xun Cao; | Inspired by this observation, we propose Modality Balance Network (MBNet) which facilitates the optimization process in a much more flexible and balanced manner. |
826 | High-Resolution Image Inpainting with Iterative Confidence Feedback and Guided Upsampling | Yu Zeng; Zhe Lin; Jimei Yang; Jianming Zhang; Eli Shechtman; Huchuan Lu; | To address this challenge, we propose an iterative inpainting method with a feedback mechanism. |
827 | Online Ensemble Model Compression using Knowledge Distillation | Devesh Walawalkar; Zhiqiang Shen; Marios Savvides; | This paper presents a novel knowledge distillation based model compression framework consisting of a student ensemble. |
828 | Deep Learning-based Pupil Center Detection for Fast and Accurate Eye Tracking System | Kang Il Lee; Jung Ho Jeon; Byung Cheol Song; | Thus, we propose more accurate pupil center detection by improving the representation quality of the network in charge of pupil center detection. |
829 | Efficient Residue Number System Based Winograd Convolution | Zhi-Gang Liu; Matthew Mattina; | Our work extends the Winograd algorithm to Residue Number System (RNS). |
830 | Robust Tracking against Adversarial Attacks | Shuai Jia; Chao Ma; Yibing Song; Xiaokang Yang; | We apply the proposed adversarial attack and defense approaches to state-of-the-art deep tracking algorithms. |
831 | Single-Shot Neural Relighting and SVBRDF Estimation | Shen Sang; Manmohan Chandraker; | We present a novel physically-motivated deep network for joint shape and material estimation, as well as relighting under novel illumination conditions, using a single image captured by a mobile phone camera. |
832 | Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement | Qiang Nie; Ziwei Liu; Yunhui Liu; | In this work, we propose a novel Siamese denoising autoencoder to learn a 3D pose representation by disentangling the pose-dependent and view-dependent feature from the human skeleton data, in a fully unsupervised manner. |
833 | Angle-based Search Space Shrinking for Neural Architecture Search | Yiming Hu; Yuding Liang; Zichao Guo; Ruosi Wan; Xiangyu Zhang; Yichen Wei; Qingyi Gu; Jian Sun; | In this work, we present a simple and general search space shrinking method, called Angle-Based search space Shrinking (ABS), for Neural Architecture Search (NAS). |
834 | RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition | Xiaoyu Yue; Zhanghui Kuang; Chenhao Lin; Hongbin Sun; Wayne Zhang; | To suppress such side effects, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. |
835 | Towards Fast, Accurate and Stable 3D Dense Face Alignment | Jianzhu Guo; Xiangyu Zhu; Yang Yang; Fan Yang; Zhen Lei; Stan Z. Li; | In this paper, we propose a novel regression framework that strikes a balance among speed, accuracy and stability. |
836 | Iterative Feature Transformation for Fast and Versatile Universal Style Transfer | Tai-Yin Chiu; Danna Gurari; | We propose a new transformation that iteratively stylizes features with analytical gradient descent. |
837 | CATCH: Context-based Meta Reinforcement Learning for Transferrable Architecture Search | Xin Chen; Yawen Duan; Zewei Chen; Hang Xu; Zihao Chen; Xiaodan Liang; Tong Zhang; Zhenguo Li; | This is the first work to our knowledge that proposes an efficient transferrable NAS solution while maintaining robustness across various settings. |
838 | Toward Faster and Simpler Matrix Normalization via Rank-1 Update | Tan Yu; Yunfeng Cai; Ping Li; | To overcome these limitations, we propose a rank-1 update normalization (RUN), which only needs matrix-vector multiplications and thus is significantly more efficient than NS iteration using matrix-matrix multiplications. |
839 | Accurate Polarimetric BRDF for Real Polarization Scene Rendering | Yuhi Kondo; Taishi Ono; Legong Sun; Yasutaka Hirasawa; Jun Murayama; | In this paper, we propose a new polarimetric BRDF (pBRDF) model. |
840 | Lensless Imaging with Focusing Sparse URA Masks in Long-Wave Infrared and its Application for Human Detection | Ilya Reshetouski; Hideki Oyaizu; Kenichiro Nakamura; Ryuta Satoh; Suguru Ushiki; Ryuichi Tadano; Atsushi Ito; Jun Murayama; | We introduce a lensless imaging framework for contemporary computer vision applications in long-wavelength infrared (LWIR). |
841 | Topology-Preserving Class-Incremental Learning | Xiaoyu Tao; Xinyuan Chang; Xiaopeng Hong; Xing Wei; Yihong Gong; | On this basis, we propose a novel topology-preserving class-incremental learning (TPCIL) framework. |
842 | Inter-Image Communication for Weakly Supervised Localization | Xiaolin Zhang; Yunchao Wei; Yi Yang; | In this paper, we propose to leverage pixel-level similarities across different objects for learning more accurate object locations in a complementary way. |
843 | UFO²: A Unified Framework towards Omni-supervised Object Detection | Zhongzheng Ren; Zhiding Yu; Xiaodong Yang; Ming-Yu Liu; Alexander G. Schwing; Jan Kautz; | In this paper, we present UFO², a unified object detection framework that can handle different forms of supervision simultaneously. |
844 | iCaps: An Interpretable Classifier via Disentangled Capsule Networks | Dahuin Jung; Jonghyun Lee; Jihun Yi; Sungroh Yoon; | In this work, we address these two limitations using a novel class-supervised disentanglement algorithm and an additional regularizer, respectively. |
845 | Detecting Natural Disasters, Damage, and Incidents in the Wild | Ethan Weber; Nuria Marzo; Dim P. Papadopoulos; Aritro Biswas; Agata Lapedriza; Ferda Ofli; Muhammad Imran; Antonio Torralba; | In this work, we present the Incidents Dataset, which contains 446,684 images annotated by humans that cover 43 incidents across a variety of scenes. |
846 | Dynamic ReLU | Yinpeng Chen; Xiyang Dai; Mengchen Liu; Dongdong Chen; Lu Yuan; Zicheng Liu; | In this paper, we propose dynamic ReLU (DY-ReLU), a dynamic rectifier whose parameters are generated by a hyper function over all input elements. |
847 | Acquiring Dynamic Light Fields through Coded Aperture Camera | Kohei Sakai; Keita Takahashi; Toshiaki Fujii; Hajime Nagahara; | We investigate the problem of compressive acquisition of a dynamic light field. |
848 | Gait Recognition from a Single Image using a Phase-Aware Gait Cycle Reconstruction Network | Chi Xu; Yasushi Makihara; Xiang Li; Yasushi Yagi; Jianfeng Lu; | We propose a method of gait recognition just from a single image for the first time, which enables latency-free gait recognition. |
849 | Informative Sample Mining Network for Multi-Domain Image-to-Image Translation | Jie Cao; Huaibo Huang; Yi Li; Ran He; Zhenan Sun; | In this paper, we reveal that improving the sample selection strategy is an effective solution. |
850 | Spherical Feature Transform for Deep Metric Learning | Yuke Zhu; Yan Bai; Yichen Wei; | This work proposes a novel spherical feature transform approach. |
851 | Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering | Ruixue Tang; Chao Ma; Wei Emma Zhang; Qi Wu; Xiaokang Yang; | In this paper, instead of directly manipulating images and questions, we use generated adversarial examples for both images and questions as the augmented data. |
852 | Unsupervised Multi-View CNN for Salient View Selection of 3D Objects and Scenes | Ran Song; Wei Zhang; Yitian Zhao; Yonghuai Liu; | We present an unsupervised 3D deep learning framework based on a ubiquitously true proposition, which we name view-object consistency, stating that a 3D object and its projected 2D views always belong to the same object class. |
853 | Representation Sharing for Fast Object Detector Search and Beyond | Yujie Zhong; Zelu Deng; Sheng Guo; Matthew R. Scott; Weilin Huang; | To enhance such capability, we propose an extremely efficient neural architecture search method, named Fast And Diverse (FAD), to better explore the optimal configuration of receptive fields and convolution types in the sub-networks for one-stage detectors. |
854 | Peeking into occluded joints: A novel framework for crowd pose estimation | Lingteng Qiu; Xuanye Zhang; Yanran Li; Guanbin Li; Xiaojun Wu; Zixiang Xiong; Xiaoguang Han; Shuguang Cui; | Therefore, we thoroughly pursue this problem and propose a novel OPEC-Net framework together with a new Occluded Pose (OCPose) dataset with 9k annotated images. |
855 | RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition | Linxi Fan; Shyamal Buch; Guanzhi Wang; Ryan Cao; Yuke Zhu; Juan Carlos Niebles; Li Fei-Fei; | To this end, we introduce RubiksNet, a new efficient architecture for video action recognition which is based on a proposed learnable 3D spatiotemporal shift operation instead. |
856 | Deep Hashing with Active Pairwise Supervision | Ziwei Wang; Quan Zheng; Jiwen Lu; Jie Zhou; | In this paper, we propose a Deep Hashing method with Active Pairwise Supervision (DH-APS). |
857 | Graph Edit Distance Reward: Learning to Edit Scene Graph | Lichang Chen; Guosheng Lin; Shijie Wang; Qingyao Wu; | In this paper, we propose a new method to edit the scene graph according to the user instructions, which has never been explored. |
858 | Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing | Yajie Xing; Jingbo Wang; Gang Zeng; | In this paper, we propose a novel operator called malleable 2.5D convolution to learn the receptive field along the depth-axis. |
859 | Feature-metric Loss for Self-supervised Learning of Depth and Egomotion | Chang Shu; Kun Yu; Zhixiang Duan; Kuiyuan Yang; | In this work, feature-metric loss is proposed and defined on feature representation, where the feature representation is also learned in a self-supervised manner and regularized by both first-order and second-order derivatives to constrain the loss landscapes to form proper convergence basins. |
860 | Propagating Over Phrase Relations for One-Stage Visual Grounding | Sibei Yang; Guanbin Li; Yizhou Yu; | In this paper, we propose a linguistic structure guided propagation network for one-stage phrase grounding. |
861 | Adversarial Semantic Data Augmentation for Human Pose Estimation | Yanrui Bin; Xuan Cao; Xinya Chen; Yanhao Ge; Ying Tai; Chengjie Wang; Jilin Li; Feiyue Huang; Changxin Gao; Nong Sang; | We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity. |
862 | Free View Synthesis | Gernot Riegler; Vladlen Koltun; | We present a method for novel view synthesis from input images that are freely distributed around a scene. |
863 | Face Anti-Spoofing via Disentangled Representation Learning | Ke-Yue Zhang; Taiping Yao; Jian Zhang; Ying Tai; Shouhong Ding; Jilin Li; Feiyue Huang; Haichuan Song; Lizhuang Ma; | In this paper, motivated by disentangled representation learning, we propose a novel perspective of face anti-spoofing that disentangles the liveness features and content features from images, and the liveness features are further used for classification. |
864 | Prime-Aware Adaptive Distillation | Youcai Zhang; Zhonghao Lan; Yuchen Dai; Fangao Zeng; Yan Bai; Jie Chang; Yichen Wei; | This paper introduces adaptive sample weighting to knowledge distillation (KD). |
865 | Meta-Learning with Network Pruning | Hongduan Tian; Bo Liu; Xiao-Tong Yuan; Qingshan Liu; | To remedy this deficiency, we propose a network pruning based meta-learning approach for overfitting reduction via explicitly controlling the capacity of network. |
866 | Spiral Generative Network for Image Extrapolation | Dongsheng Guo; Hongzhi Liu; Haoru Zhao; Yunhao Cheng; Qingwei Song; Zhaorui Gu; Haiyong Zheng; Bing Zheng; | In this paper, motivated by the natural human ability to perceive unseen surroundings imaginatively, we propose a novel Spiral Generative Network, SpiralNet, to perform image extrapolation in a spiral manner, which regards extrapolation as an evolution process growing from an input sub-image along a spiral curve to an expanded full image. |
867 | SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches | Fang Liu; Changqing Zou; Xiaoming Deng; Ran Zuo; Yu-Kun Lai; Cuixia Ma; Yong-Jin Liu; Hongan Wang; | In this paper, for the first time, we study the fine-grained scene-level SBIR problem which aims at retrieving scene images satisfying the user’s specific requirements via a freehand scene sketch. |
868 | Few-shot Compositional Font Generation with Dual Memory | Junbum Cha; Sanghyuk Chun; Gayoung Lee; Bado Lee; Seonghyeon Kim; Hwalsuk Lee; | In this paper, we focus on compositional scripts, a widely used letter system in the world, where each glyph can be decomposed into several components. |
869 | PUGeo-Net: A Geometry-centric Network for 3D Point Cloud Upsampling | Yue Qian; Junhui Hou; Sam Kwong; Ying He; | In this paper, we propose a novel deep neural network based method, called PUGeo-Net, for upsampling 3D point clouds. |
870 | Handcrafted Outlier Detection Revisited | Luca Cavalli; Viktor Larsson; Martin Ralf Oswald; Torsten Sattler; Marc Pollefeys; | Based on best practices, we propose a hierarchical pipeline for effective outlier detection and integrate novel ideas that together lead to an efficient and competitive approach to outlier rejection. |
871 | The Average Mixing Kernel Signature | Luca Cosmo; Giorgia Minello; Michael Bronstein; Luca Rossi; Andrea Torsello; | We introduce the Average Mixing Kernel Signature (AMKS), a novel signature for points on non-rigid three-dimensional shapes based on the average mixing kernel and continuous-time quantum walks. |
872 | BCNet: Learning Body and Cloth Shape from A Single Image | Boyi Jiang; Juyong Zhang; Yang Hong; Jinhao Luo; Ligang Liu; Hujun Bao; | In this paper, we consider the problem of automatically reconstructing garment and body shapes from a single near-front view RGB image. To train our model, we construct two large-scale datasets with ground truth body and garment geometries as well as paired color images. |
873 | Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos | Umer Rafi; Andreas Doering; Bastian Leibe; Juergen Gall; | To address this issue, we propose an approach that relies on key point correspondences for associating persons in videos. |
874 | Interactive Multi-Dimension Modulation with Dynamic Controllable Residual Learning for Image Restoration | Jingwen He; Chao Dong; Yu Qiao; | To make a step forward, this paper presents a new problem setup, called multi-dimension (MD) modulation, which aims at modulating output effects across multiple degradation types and levels. |
875 | Polysemy Deciphering Network for Human-Object Interaction Detection | Xubin Zhong; Changxing Ding; Xian Qu; Dacheng Tao; | To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net), which decodes the visual polysemy of verbs for HOI detection in three ways. |
876 | PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning | Arthur Douillard; Matthieu Cord; Charles Ollion; Thomas Robert; Eduardo Valle; | In this work, we propose PODNet, a model inspired by representation learning. |
877 | Learning Graph-Convolutional Representations for Point Cloud Denoising | Francesca Pistilli; Giulia Fracastoro; Diego Valsesia; Enrico Magli; | We propose a deep neural network based on graph-convolutional layers that can elegantly deal with the permutation-invariance problem encountered by learning-based point cloud processing methods. |
878 | Semantic Line Detection Using Mirror Attention and Comparative Ranking and Matching | Dongkwon Jin; Jun-Tae Lee; Chang-Su Kim; | A novel algorithm to detect semantic lines is proposed in this paper. |
879 | A Differentiable Recurrent Surface for Asynchronous Event-Based Data | Marco Cannici; Marco Ciccone; Andrea Romanoni; Matteo Matteucci; | In this paper, we propose Matrix-LSTM, a grid of Long Short-Term Memory (LSTM) cells that efficiently process events and learn end-to-end task-dependent event-surfaces. |
880 | Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches | Ruoyi Du; Dongliang Chang; Ayan Kumar Bhunia; Jiyang Xie; Zhanyu Ma; Yi-Zhe Song; Jun Guo; | In this work, we propose a novel framework for fine-grained visual classification to tackle these problems. |
881 | LiteFlowNet3: Resolving Correspondence Ambiguity for More Accurate Optical Flow Estimation | Tak-Wai Hui; Chen Change Loy; | In this paper, we introduce LiteFlowNet3, a deep network consisting of two specialized modules, to address the above challenges. |
882 | Microscopy Image Restoration with Deep Wiener-Kolmogorov Filters | Valeriya Pronina; Filippos Kokkinos; Dmitry V. Dylov; Stamatios Lefkimmiatis; | In this work, we propose a unifying framework of algorithms for Gaussian image deblurring and denoising. |
883 | ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language | Dave Zhenyu Chen; Angel X. Chang; Matthias Nießner; | In order to train and benchmark our method, we introduce a new ScanRefer dataset, containing 46,173 descriptions of 9,943 objects from 703 ScanNet scenes. |
884 | JSENet: Joint Semantic Segmentation and Edge Detection Network for 3D Point Clouds | Zeyu Hu; Mingmin Zhen; Xuyang Bai; Hongbo Fu; Chiew-lan Tai; | In this paper, we tackle the 3D semantic edge detection task for the first time and present a new two-stream fully-convolutional network that jointly performs the two tasks. |
885 | Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior | Hu Zhang; Linchao Zhu; Yi Zhu; Yi Yang; | In this paper, we aim to attack video models by utilizing intrinsic movement patterns and regional relative motion among video frames. |
886 | An Inference Algorithm for Multi-Label MRF-MAP Problems with Clique Size 100 | Ishant Shanu; Siddhant Bharti; Chetan Arora; S. N. Maheshwari; | In this paper, we propose an algorithm for optimal solutions to submodular higher-order multi-label MRF-MAP energy functions which can handle practical computer vision problems with up to 16 labels and cliques of size 100. |
887 | Dual Refinement Underwater Object Detection Network | Baojie Fan; Wei Chen; Yang Cong; Jiandong Tian; | To address these problems, we propose an underwater detection framework with feature enhancement and anchor refinement. |
888 | Multiple Sound Sources Localization from Coarse to Fine | Rui Qian; Di Hu; Heinrich Dinkel; Mengyue Wu; Ning Xu; Weiyao Lin; | To solve this problem, we develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine manner. |
889 | Task-Aware Quantization Network for JPEG Image Compression | Jinyoung Choi; Bohyung Han; | We propose to learn a deep neural network for JPEG image compression, which predicts image-specific optimized quantization tables fully compatible with the standard JPEG encoder and decoder. |
890 | Energy-Based Models for Deep Probabilistic Regression | Fredrik K. Gustafsson; Martin Danelljan; Goutam Bhat; Thomas B. Schön; | We address these issues by proposing a general and conceptually simple regression method with a clear probabilistic interpretation. |
891 | CLOTH3D: Clothed 3D Humans | Hugo Bertiche; Meysam Madadi; Sergio Escalera; | We present CLOTH3D, the first large-scale synthetic dataset of 3D clothed human sequences. |
892 | Encoding Structure-Texture Relation with P-Net for Anomaly Detection in Retinal Images | Kang Zhou; Yuting Xiao; Jianlong Yang; Jun Cheng; Wen Liu; Weixin Luo; Zaiwang Gu; Jiang Liu; Shenghua Gao; | Motivated by this, we propose to leverage the relation between the image texture and structure to design a deep neural network for anomaly detection. |
893 | CLNet: A Compact Latent Network for Fast Adjusting Siamese Trackers | Xingping Dong; Jianbing Shen; Ling Shao; Fatih Porikli; | In this paper, we provide a deep analysis of Siamese-based trackers and find that one core reason for their failure on challenging cases can be attributed to the problem of decisive samples missing during offline training. |
894 | Occlusion-Aware Siamese Network for Human Pose Estimation | Lu Zhou; Yingying Chen; Yunze Gao; Jinqiao Wang; Hanqing Lu; | To conquer this dilemma, we propose an occlusion-aware siamese network to improve the performance. |
895 | Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model | Yufan Liu; Minglang Qiao; Mai Xu; Bing Li; Weiming Hu; Ali Borji; | In this paper, we thoroughly investigate such influences by establishing a large-scale eye-tracking database of Multiple-face Video in Visual-Audio condition (MVVA). |
896 | NormalGAN: Learning Detailed 3D Human from a Single RGB-D Image | Lizhen Wang; Xiaochen Zhao; Tao Yu; Songtao Wang; Yebin Liu; | We propose NormalGAN, a fast adversarial learning-based method to reconstruct the complete and detailed 3D human from a single RGB-D image. |
897 | Model-based occlusion disentanglement for image-to-image translation | Fabio Pizzati; Pietro Cerri; Raoul de Charette; | Our unsupervised model-based learning disentangles scene and occlusions, while benefiting from an adversarial pipeline to regress physical parameters of the occlusion model. |
898 | Rotation-robust Intersection over Union for 3D Object Detection | Yu Zheng; Danyang Zhang; Sinan Xie; Jiwen Lu; Jie Zhou; | In this paper, we propose a Rotation-robust Intersection over Union (RIoU) for 3D object detection, which aims to jointly learn the overlap of rotated bounding boxes. |
899 | New Threats against Object Detector with Non-local Block | Yi Huang; Fan Wang; Adams Wai-Kin Kong; Kwok-Yan Lam; | In this paper, two new threats named disappearing attack and appearing attack against object detectors with a non-local block are investigated. |
900 | Self-Supervised CycleGAN for Object-Preserving Image-to-Image Domain Adaptation | Xinpeng Xie; Jiawei Chen; Yuexiang Li; Linlin Shen; Kai Ma; Yefeng Zheng; | In this paper, we propose a novel GAN (namely OP-GAN) to address the problem, which involves a self-supervised module to enforce the image content consistency during image-to-image translations without any extra annotations. |
901 | On the Usage of the Trifocal Tensor in Motion Segmentation | Federica Arrigoni; Luca Magri; Tomas Pajdla; | In this paper we address motion segmentation in multiple images by combining partial results coming from triplets of images, which are obtained by fitting a number of trifocal tensors to correspondences. |
902 | 3D-Rotation-Equivariant Quaternion Neural Networks | Wen Shen; Binbin Zhang; Shikun Huang; Zhihua Wei; Quanshi Zhang; | This paper proposes a set of rules to revise various neural networks for 3D point cloud processing to rotation-equivariant quaternion neural networks (REQNNs). |
903 | InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image | Gyeongsik Moon; Shoou-I Yu; He Wen; Takaaki Shiratori; Kyoung Mu Lee; | Therefore, we first propose (1) a large-scale dataset, InterHand2.6M, and (2) a baseline network, InterNet, for 3D interacting hand pose estimation from a single RGB image. |
904 | Active Crowd Counting with Limited Supervision | Zhen Zhao; Miaojing Shi; Xiaoxiao Zhao; Li Li; | In the last cycle, when the labeling budget is met, the large amount of unlabeled data is also utilized: a distribution classifier is introduced to align the labeled data with unlabeled data. Furthermore, we propose to mix up the distribution labels and latent representations of data in the network to particularly improve the distribution alignment in-between training samples. |
905 | Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance | Marvin Klingner; Jan-Aike Termöhlen; Jonas Mikolajczyk; Tim Fingscheidt; | In this work we present a new self-supervised semantically-guided depth estimation (SGDepth) method to deal with moving dynamic-class (DC) objects, such as moving cars and pedestrians, which violate the static-world assumptions typically made during training of such models. |
906 | Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language | Shaoxiang Chen; Yu-Gang Jiang; | In this paper, we propose a novel TALL method which builds a Hierarchical Visual-Textual Graph to model interactions between the objects and words as well as among the objects to jointly understand the video contents and the language. |
907 | Do Not Mask What You Do Not Need to Mask: a Parser-Free Virtual Try-On | Thibaut Issenhuth; Jérémie Mary; Clément Calauzènes; | In this paper, we propose a novel student-teacher paradigm where the teacher is trained in the standard way (reconstruction) before guiding the student to focus on the initial task (changing the cloth). |
908 | NODIS: Neural Ordinary Differential Scene Understanding | Yuren Cong; Hanno Ackermann; Wentong Liao; Michael Ying Yang; Bodo Rosenhahn; | In this work, we interpret that formulation as an Ordinary Differential Equation (ODE). |
909 | AssembleNet++: Assembling Modality Representations via Attention Connections | Michael S. Ryoo; AJ Piergiovanni; Juhana Kangaspunta; Anelia Angelova; | We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. |
910 | Learning Propagation Rules for Attribution Map Generation | Yiding Yang; Jiayan Qiu; Mingli Song; Dacheng Tao; Xinchao Wang; | In this paper, we propose a dedicated method to generate attribution maps that allow us to learn the propagation rules automatically, overcoming the flaws of the hand-crafted ones. |
911 | Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference | Menelaos Kanakis; David Bruggemann; Suman Saha; Stamatios Georgoulis; Anton Obukhov; Luc Van Gool; | In this paper, we show that both can be achieved simply by reparameterizing the convolutions of standard neural network architectures into a non-trainable shared part (filter bank) and task-specific parts (modulators), where each modulator has a fraction of the filter bank parameters. |
912 | Learning Predictive Models from Observation and Interaction | Karl Schmeckpeper; Annie Xie; Oleh Rybkin; Stephen Tian; Kostas Daniilidis; Sergey Levine; Chelsea Finn; | We address the first challenge by formulating the corresponding graphical model and treating the action as an observed variable for the interaction data and an unobserved variable for the observation data, and the second challenge by using a domain-dependent prior. |
913 | Unifying Deep Local and Global Features for Image Search | Bingyi Cao; André Araujo; Jack Sim; | In this work, our key contribution is to unify global and local features into a single deep model, enabling accurate retrieval with efficient feature extraction. |
914 | Human Body Model Fitting by Learned Gradient Descent | Jie Song; Xu Chen; Otmar Hilliges; | We propose a novel algorithm for fitting 3D human shape to images. |
915 | DDGCN: A Dynamic Directed Graph Convolutional Network for Action Recognition | Matthew Korban; Xin Li; | We propose a Dynamic Directed Graph Convolutional Network (DDGCN) to model spatial and temporal features of human actions from their skeletal representations. |
916 | Learning latent representations across multiple data domains using Lifelong VAEGAN | Fei Ye; Adrian G. Bors; | In this paper, we propose a novel lifelong learning approach, namely the Lifelong VAEGAN (L-VAEGAN), which not only induces a powerful generative replay network but also learns meaningful latent representations, benefiting representation learning. |
917 | DVI: Depth Guided Video Inpainting for Autonomous Driving | Miao Liao; Feixiang Lu; Dingfu Zhou; Sibo Zhang; Wei Li; Ruigang Yang; | To obtain clear street views and photo-realistic simulation in autonomous driving, we present an automatic video inpainting algorithm that can remove traffic agents from videos and synthesize missing regions with the guidance of depth/point cloud. |
918 | Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation | Kenan E. Ak; Ning Xu; Zhe Lin; Yilin Wang; | To address these limitations, we propose to use Reinforced Adversarial Learning (RAL) based on policy gradient optimization for autoregressive models. |
919 | APRICOT: A Dataset of Physical Adversarial Attacks on Object Detection | A. Braunegg; Amartya Chakraborty; Michael Krumdick; Nicole Lape; Sara Leary; Keith Manville; Elizabeth Merkhofer; Laura Strickhart; Matthew Walmer; | We present APRICOT, a collection of over 1,000 annotated photographs of printed adversarial patches in public locations. |
920 | Visual Question Answering on Image Sets | Ankan Bansal; Yuting Zhang; Rama Chellappa; | We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings. |
921 | Object as Hotspots: An Anchor-Free 3D Object Detection Approach via Firing of Hotspots | Qi Chen; Lin Sun; Zhixin Wang; Kui Jia; Alan Yuille; | We thus argue in this paper for an approach opposite to existing methods using object-level anchors. |
922 | Placepedia: Comprehensive Place Understanding with Multi-Faceted Annotations | Huaiyi Huang; Yuqi Zhang; Qingqiu Huang; Zhengkui Guo; Ziwei Liu; Dahua Lin; | In this work, we contribute Placepedia, a large-scale place dataset with more than 35M photos from 240K unique places. |
923 | DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points | Ayan Sinha; Zak Murez; James Bartolozzi; Vijay Badrinarayanan; Andrew Rabinovich; | Distinct from cost volume approaches, we propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs. |
924 | Dynamic Low-light Imaging with Quanta Image Sensors | Yiheng Chi; Abhiram Gnanasambandam; Vladlen Koltun; Stanley H. Chan; | We propose a solution using Quanta Image Sensors (QIS) and present a new image reconstruction algorithm. |
925 | Disambiguating Monocular Depth Estimation with a Single Transient | Mark Nishimura; David B. Lindell; Christopher Metzler; Gordon Wetzstein; | In this work, we demonstrate how a depth histogram of the scene, which can be readily captured using a single-pixel time-resolved detector, can be fused with the output of existing monocular depth estimation algorithms to resolve the depth ambiguity problem. |
926 | DSDNet: Deep Structured self-Driving Network | Wenyuan Zeng; Shenlong Wang; Renjie Liao; Yun Chen; Bin Yang; Raquel Urtasun; | In this paper, we propose the Deep Structured self-Driving Network (DSDNet), which performs object detection, motion prediction, and motion planning with a single neural network. |
927 | QuEST: Quantized Embedding Space for Transferring Knowledge | Himalaya Jain; Spyros Gidaris; Nikos Komodakis; Patrick Pérez; Matthieu Cord; | In this work, we propose a novel way to achieve this goal: by distilling the knowledge through a quantized visual words space. |
928 | EGDCL: An Adaptive Curriculum Learning Framework for Unbiased Glaucoma Diagnosis | Rongchang Zhao; Xuanlin Chen; Zailiang Chen; Shuo Li; | In this paper, we propose a novel curriculum learning paradigm (EGDCL) to train an unbiased glaucoma diagnosis model with the adaptive dual-curriculum. |
929 | Backpropagated Gradient Representations for Anomaly Detection | Gukyeong Kwon; Mohit Prabhushankar; Dogancan Temel; Ghassan AlRegib; | Hence, we propose the utilization of backpropagated gradients as representations to characterize model behavior on anomalies and, consequently, detect such anomalies. |
930 | Dense RepPoints: Representing Visual Objects with Dense Point Sets | Ze Yang; Yinghao Xu; Han Xue; Zheng Zhang; Raquel Urtasun; Liwei Wang; Stephen Lin; Han Hu; | We present a new object representation, called Dense RepPoints, which utilizes a large number of points to describe multi-grained object representations at both the box level and the pixel level. |
931 | On Dropping Clusters to Regularize Graph Convolutional Neural Networks | Xikun Zhang; Chang Xu; Dacheng Tao; | To effectively regularize GCNs, we devise DropCluster which first randomly zeros some seed entries and then zeros entries that are spatially or depth-wisely correlated to those seed entries. |
932 | Adaptive Video Highlight Detection by Learning from User History | Mrigank Rochan; Mahesh Kumar Krishna Reddy; Linwei Ye; Yang Wang; | In this paper, we propose a simple yet effective framework that learns to adapt highlight detection to a user by exploiting the user’s history in the form of highlights that the user has previously created. |
933 | Improving 3D Object Detection through Progressive Population Based Augmentation | Shuyang Cheng; Zhaoqi Leng; Ekin Dogus Cubuk; Barret Zoph; Chunyan Bai; Jiquan Ngiam; Yang Song; Benjamin Caine; Vijay Vasudevan; Congcong Li; Quoc V. Le; Jonathon Shlens; Dragomir Anguelov; | In this work, we present the first attempt to automate the design of data augmentation policies for 3D object detection. |
934 | DR-KFS: A Differentiable Visual Similarity Metric for 3D Shape Reconstruction | Jiongchao Jin; Akshay Gadi Patil; Zhang Xiong; Hao Zhang; | We introduce a differentiable visual similarity metric to train deep neural networks for 3D reconstruction, aimed at improving reconstruction quality. |
935 | SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization | Xuefeng Hu; Zhihan Zhang; Zhenye Jiang; Syomantak Chaudhuri; Zhenheng Yang; Ram Nevatia; | We present a novel Spatial Pyramid Attention Network (SPAN) for detection and localization of multiple types of image manipulations. |
936 | Adversarial Learning for Zero-shot Domain Adaptation | Jinghua Wang; Jianmin Jiang; | With the hypothesis that the shift between a given pair of domains is shared across tasks, we propose a new method for ZSDA by transferring domain shift from an irrelevant task (IrT) to the task of interest (ToI). |
937 | YOLO in the Dark – Domain Adaptation Method for Merging Multiple Models | Yukihiro Sasagawa; Hajime Nagahara; | We propose a method of domain adaptation for merging multiple models with less effort than creating an additional dataset. |
938 | Identity-Aware Multi-Sentence Video Description | Jae Sung Park; Trevor Darrell; Anna Rohrbach; | We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips. |
939 | VQA-LOL: Visual Question Answering under the Lens of Logic | Tejas Gokhale; Pratyay Banerjee; Chitta Baral; Yezhou Yang; | In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. |
940 | Piggyback GAN: Efficient Lifelong Learning for Image Conditioned Generation | Mengyao Zhai; Lei Chen; Jiawei He; Megha Nawhal; Frederick Tung; Greg Mori; | In contrast, we propose a parameter efficient framework, Piggyback GAN, which learns the current task by building a set of convolutional and deconvolutional filters that are factorized into filters of the models trained on previous tasks. |
941 | TRRNet: Tiered Relation Reasoning for Compositional Visual Question Answering | Xiaofeng Yang; Guosheng Lin; Fengmao Lv; Fayao Liu; | We propose a novel tiered reasoning method that dynamically selects object level candidates based on language representations and generates robust pairwise relations within the selected candidate objects. |
942 | Mining Inter-Video Proposal Relations for Video Object Detection | Mingfei Han; Yali Wang; Xiaojun Chang; Yu Qiao; | To address the limitation, we propose a novel Inter-Video Proposal Relation module. |
943 | TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval | Jie Lei; Licheng Yu; Tamara L. Berg; Mohit Bansal; | We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. |
944 | Minimum Class Confusion for Versatile Domain Adaptation | Ying Jin; Ximei Wang; Mingsheng Long; Jianmin Wang; | To this end, this paper studies Versatile Domain Adaptation (VDA), where one method can handle several different DA scenarios without any modification. |
945 | Large Batch Optimization for Object Detection: Training COCO in 12 Minutes | Tong Wang; Yousong Zhu; Chaoyang Zhao; Wei Zeng; Yaowei Wang; Jinqiao Wang; Ming Tang; | Specifically, we present a novel Periodical Moments Decay LAMB (PMD-LAMB) algorithm to effectively reduce the negative effects of the lagging historical gradients. |
946 | Towards Practical and Efficient High-Resolution HDR Deghosting with CNN | K. Ram Prabhakar; Susmit Agrawal; Durgesh Kumar Singh; Balraj Ashwath; R. Venkatesh Babu; | In this paper, we present a deep neural network-based approach to generate high-quality, ghost-free HDR images at high resolution. |
947 | Monocular Differentiable Rendering for Self-Supervised 3D Object Detection | Deniz Beker; Hiroharu Kato; Mihai Adrian Morariu; Takahiro Ando; Toru Matsuoka; Wadim Kehl; Adrien Gaidon; | To overcome this ambiguity, we present a novel self-supervised method for textured 3D shape reconstruction and pose estimation of rigid objects with the help of strong shape priors and 2D instance masks. |
948 | Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation | Meng Tian; Marcelo H Ang Jr; Gim Hee Lee; | We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image. |
949 | Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction | Chaofan Tao; Qinhong Jiang; Lixin Duan; Ping Luo; | However, unlike previous work that isolated the spatial interaction, temporal coherence, and scene layout, this paper designs a new mechanism, i.e., the Dynamic and Static Context-aware Motion Predictor (DSCMP), to integrate this rich information into a long short-term memory (LSTM) network. |
950 | Image-based table recognition: data, model, and evaluation | Xu Zhong; Elaheh ShafieiBavani; Antonio Jimeno Yepes; | To facilitate image-based table recognition with deep learning, we develop and release the largest publicly available table recognition dataset PubTabNet, containing 568k table images with corresponding structured HTML representation. |
951 | Group Activity Prediction with Sequential Relational Anticipation Model | Junwen Chen; Wentao Bao; Yu Kong; | In this paper, we propose a novel approach to predict group activities given the beginning frames with incomplete activity executions. |
952 | PiP: Planning-informed Trajectory Prediction for Autonomous Driving | Haoran Song; Wenchao Ding; Yuxuan Chen; Shaojie Shen; Michael Yu Wang; Qifeng Chen; | We propose planning-informed trajectory prediction (PiP) to tackle the prediction problem in the multi-agent setting. |
953 | PSConv: Squeezing Feature Pyramid into One Compact Poly-Scale Convolutional Layer | Duo Li; Anbang Yao; Qifeng Chen; | We address this shortcoming by exploiting multi-scale features at a finer granularity. |
954 | Hierarchical Context Embedding for Region-based Object Detection | Zhao-Min Chen; Xin Jin; Borui Zhao; Xiu-Shen Wei; Yanwen Guo; | To address this issue, we present a simple but effective Hierarchical Context Embedding (HCE) framework, which can be applied as a plug-and-play component, to improve the classification ability of a series of region-based detectors by mining contextual cues. |
9 |