README.md

社区模型库

飞桨目前包含470+社区模型，覆盖CV、NLP、推荐等多个领域，详细内容如下表：

社区开发者贡献模型

Abstract

Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https://github.com/szagoruyko/wide-residual-networks CIFAR-10(WRN-28-20-dropout): 96.55% Colorful Image Colorization

Abstract

Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a"colorization Turing test,"asking human participants to choose between a generated and ground truth color image. Our method successfully fools humans on 32% of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. This approach results in state-of-the-art performance on several feature learning benchmarks.

AuC: non-rebal=89.5%rebal=67.3%VGG Top-1 Class Acc=56%AMT Labeled Real=32.3% Prototypical Networks for Few-shot Learning

Abstract

Dropout is a powerful and widely used technique to regularize the training ofdeep neural networks. In this paper, we introduce a simple regularizationstrategy upon dropout in model training, namely R-Drop, which forces the outputdistributions of different sub models generated by dropout to be consistentwith each other. Specifically, for each training sample, R-Drop minimizes thebidirectional KL-divergence between the output distributions of two sub modelssampled by dropout. Theoretical analysis reveals that R-Drop reduces thefreedom of the model parameters and complements dropout. Experiments on$\bf{5}$ widely used deep learning tasks ( $\bf{18}$ datasets in total),including neural machine translation, abstractive summarization, languageunderstanding, language modeling, and image classification, show that R-Drop isuniversally effective. In particular, it yields substantial improvements whenapplied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large,and BART, and achieves state-of-the-art (SOTA) performances with the vanillaTransformer model on WMT14 English$\to$German translation ( $\bf{30.91}$ BLEU)and WMT14 English$\to$French translation ( $\bf{43.95}$ BLEU), even surpassingmodels trained with extra large-scale data and expert-designed advancedvariants of Transformer models. Our code is available atGitHub{\url{this https URL}}.

1-shot=49.42%, 5-shot=68.2% R-Drop: Regularized Dropout for Neural Networks

Abstract

Inspired by recent work in machine translation and object detection, weintroduce an attention based model that automatically learns to describe thecontent of images. We describe how we can train this model in a deterministicmanner using standard backpropagation techniques and stochastically bymaximizing a variational lower bound. We also show through visualization howthe model is able to automatically learn to fix its gaze on salient objectswhile generating the corresponding words in the output sequence. We validatethe use of attention with state-of-the-art performance on three benchmarkdatasets: Flickr8k, Flickr30k and MS COCO.

ViT-B/16+RD=93.29 Weight Uncertainty in Neural Networks

Abstract

We introduce a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network, called Bayes by Backprop. It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood. We show that this principled kind of regularisation yields comparable performance to dropout on MNIST classification. We then demonstrate how the learnt uncertainty in the weights can be used to improve generalisation in non-linear regression problems, and how this weight uncertainty can be used to drive the exploration-exploitation trade-off in reinforcement learning.

MNIST: Test Error=1.32% Matching Networks for One Shot Learning

Abstract

Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.

omniglot k-way=5, n-shot=1, acc = 98.1% Modeling Relational Data with Graph Convolutional Networks

Abstract

Recognizing arbitrary multi-character text in unconstrained naturalphotographs is a hard problem. In this paper, we address an equally hardsub-problem in this domain viz. recognizing arbitrary multi-digit numbers fromStreet View imagery. Traditional approaches to solve this problem typicallyseparate out the localization, segmentation, and recognition steps. In thispaper we propose a unified approach that integrates these three steps via theuse of a deep convolutional neural network that operates directly on the imagepixels. We employ the DistBelief implementation of deep neural networks inorder to train large, distributed neural networks on high quality images. Wefind that the performance of this approach increases with the depth of theconvolutional network, with the best performance occurring in the deepestarchitecture we trained, with eleven hidden layers. We evaluate this approachon the publicly available SVHN dataset and achieve over $96\%$ accuracy inrecognizing complete street numbers. We show that on a per-digit recognitiontask, we improve upon the state-of-the-art, achieving $97.84\%$ accuracy. Wealso evaluate this approach on an even more challenging dataset generated fromStreet View imagery containing several tens of millions of street numberannotations and achieve over $90\%$ accuracy. To further explore theapplicability of the proposed system to broader text recognition tasks, weapply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of themost secure reverse turing tests that uses distorted text to distinguish humansfrom bots. We report a $99.8\%$ accuracy on the hardest category of reCAPTCHA.Our evaluations on both tasks indicate that at specific operating thresholds,the performance of the proposed system is comparable to, and in some casesexceeds, that of human operators.

Accuracy = 95.83% Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks

Abstract

Deep neural networks are typically trained by optimizing a loss function withan SGD variant, in conjunction with a decaying learning rate, untilconvergence. We show that simple averaging of multiple points along thetrajectory of SGD, with a cyclical or constant learning rate, leads to bettergeneralization than conventional training. We also show that this StochasticWeight Averaging (SWA) procedure finds much flatter solutions than SGD, andapproximates the recent Fast Geometric Ensembling (FGE) approach with a singlemodel. Using SWA we achieve notable improvement in test accuracy overconventional SGD training on a range of state-of-the-art residual networks,PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, andImageNet. In short, SWA is extremely easy to implement, improvesgeneralization, and has almost no computational overhead.

Accuracy=95.65% Averaging Weights Leads to Wider Optima and Better Generalization

Abstract

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.

VGG16+SWA 1budget, CIFAR10 top1=93.59 Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation

Abstract

Deep learning (DL) based semantic segmentation methods have been providing state-of-the-art performance in the last few years. More specifically, these techniques have been successfully applied to medical image classification, segmentation, and detection tasks. One deep learning technique, U-Net, has become one of the most popular for these applications. In this paper, we propose a Recurrent Convolutional Neural Network (RCNN) based on U-Net as well as a Recurrent Residual Convolutional Neural Network (RRCNN) based on U-Net models, which are named RU-Net and R2U-Net respectively. The proposed models utilize the power of U-Net, Residual Network, as well as RCNN. There are several advantages of these proposed architectures for segmentation tasks. First, a residual unit helps when training deep architecture. Second, feature accumulation with recurrent residual convolutional layers ensures better feature representation for segmentation tasks. Third, it allows us to design better U-Net architecture with same number of network parameters with better performance for medical image segmentation. The proposed models are tested on three benchmark datasets such as blood vessel segmentation in retina images, skin cancer segmentation, and lung lesion segmentation. The experimental results show superior performance on segmentation tasks compared to equivalent models including U-Net and residual U-Net (ResU-Net).

R2U-Net F1-score=0.8171 Unsupervised Representation Learning by Predicting Image Rotations

Abstract

Over the last years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that are available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method in various unsupervised feature learning benchmarks and we exhibit in all of them state-of-the-art performance. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning. For instance, in PASCAL VOC 2007 detection task our unsupervised pre-trained AlexNet model achieves the state-of-the-art (among unsupervised methods) mAP of 54.4% that is only 2.4 points lower from the supervised case. We get similarly striking results when we transfer our unsupervised learned features on various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. The code and models of our paper will be published on: https://github.com/gidariss/FeatureLearningRotNet .

RotNet+conv, CIFAR-10上top1=91.16 FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Abstract

Over the last decade, Convolutional Neural Network (CNN) models have beenhighly successful in solving complex vision problems. However, these deepmodels are perceived as"black box"methods considering the lack ofunderstanding of their internal functioning. There has been a significantrecent interest in developing explainable deep learning models, and this paperis an effort in this direction. Building on a recently proposed method calledGrad-CAM, we propose a generalized method called Grad-CAM++ that can providebetter visual explanations of CNN model predictions, in terms of better objectlocalization as well as explaining occurrences of multiple object instances ina single image, when compared to state-of-the-art. We provide a mathematicalderivation for the proposed method, which uses a weighted combination of thepositive partial derivatives of the last convolutional layer feature maps withrespect to a specific class score as weights to generate a visual explanationfor the corresponding class label. Our extensive experiments and evaluations,both subjective and objective, on standard datasets showed that Grad-CAM++provides promising human-interpretable visual explanations for a given CNNarchitecture across multiple tasks including classification, image captiongeneration and 3D action recognition; as well as in new settings such asknowledge distillation.

cifar 10, 40label: 93.6%, 250 label 95.31%, 4000 labels 95.77% MLP-Mixer: An all-MLP Architecture for Vision

Abstract

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e."mixing"the per-location features), and one with MLPs applied across patches (i.e."mixing"spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.

Mixer-B/16CIFAR-10upstream: ImageNet96.72%upstream: ImageNet-21k96.82%(官方JAX repo提供) Deep Networks with Stochastic Depth

Abstract

Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).

CIFAR-10 test error=5.25 Recurrent Models of Visual Attention

Abstract

Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.

28*28 Mnist, RAM, 6 glimpses, 8 × 8, 1 scale达到论文指标 Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Abstract

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0\% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3\% top1 accuracy in image resolution 384×384 on ImageNet. (Code: this https URL)

ImageNet1k: T2T-ViT-7, 71.7% Rethinking Spatial Dimensions of Vision Transformers

Abstract

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the transformer-based architecture has been innovative for computer vision modeling, the design convention towards an effective architecture has been less studied yet. From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon the original ViT model. We show that PiT achieves the improved model capability and generalization performance against ViT. Throughout the extensive experiments, we further show PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source codes and ImageNet models are available at https://github.com/naver-ai/pit .

ImageNet1k: pit_ti 73.0% Masked Autoencoders Are Scalable Vision Learners

Abstract

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

ImageNet1k: val 83.6% XCiT: Cross-Covariance Image Transformers

Abstract

Following tremendous success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlyingtransformers yields global interactions between all tokens, i.e. words or image patches, andenables flexible modelling of image data beyond the local interactions of convolutions. Thisflexibility, however, comes with a quadratic complexity in time and memory, hinderingapplication to long sequences and high-resolution images. We propose a “transposed”version of self-attention that operates across feature channels rather than tokens, wherethe interactions are based on the cross-covariance matrix between keys and queries. Theresulting cross-covariance attention (XCA) has linear complexity in the number of tokens,and allows efficient processing of high-resolution images. Our cross-covariance imagetransformer (XCiT) – built upon XCA – combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness andgenerality of XCiT by reporting excellent results on multiple vision benchmarks, including (self-supervised) image classification on ImageNet-1k, object detection and instancesegmentation on COCO, and semantic segmentation on ADE20k.

ImageNet; xcit_nano_12_p8224: top1=73.8224: top1=76.3384: top1=77.8 MicroNet: Improving Image Recognition with Extremely Low FLOPs

Abstract

This paper aims at addressing the problem of substantial performance degradation at extremely low computational cost (e.g. 5M FLOPs on ImageNet classification). We found that two factors, sparse connectivity and dynamic activation function, are effective to improve the accuracy. The former avoids the significant reduction of network width, while the latter mitigates the detriment of reduction in network depth. Technically, we propose micro-factorized convolution, which factorizes a convolution matrix into low rank matrices, to integrate sparse connectivity into convolution. We also present a new dynamic activation function, named Dynamic Shift Max, to improve the non-linearity via maxing out multiple dynamic fusions between an input feature map and its circular channel shift. Building upon these two new operators, we arrive at a family of networks, named MicroNet, that achieves significant performance gains over the state of the art in the low FLOP regime. For instance, under the constraint of 12M FLOPs, MicroNet achieves 59.4% top-1 accuracy on ImageNet classification, outperforming MobileNetV3 by 9.6%. Source code is at https://github.com/liyunsheng13/micronet .

ImageNet1k数据集, MicroNet-M3 top1-acc 62.5% Rethinking Bottleneck Structure for Efficient Mobile Network Design

Abstract

The inverted residual block is dominating architecture design for mobile networks recently. It changes the classic residual bottleneck by introducing two design rules: learning inverted residuals and using linear bottlenecks. In this paper, we rethink the necessity of such design changes and find it may bring risks of information loss and gradient confusion. We thus propose to flip the structure and present a novel bottleneck design, called the sandglass block, that performs identity mapping and spatial transformation at higher dimensions and thus alleviates information loss and gradient confusion effectively. Extensive experiments demonstrate that, different from the common belief, such bottleneck structure is more beneficial than the inverted ones for mobile networks. In ImageNet classification, by simply replacing the inverted residual block with our sandglass block without increasing parameters and computation, the classification accuracy can be improved by more than 1.7% over MobileNetV2. On Pascal VOC 2007 test set, we observe that there is also 0.9% mAP improvement in object detection. We further verify the effectiveness of the sandglass block by adding it into the search space of neural architecture search method DARTS. With 25% parameter reduction, the classification accuracy is improved by 0.13% over previous DARTS models. Code can be found at: this https URL.

ImageNet1k数据集, MobileNeXt-1.00 top1-acc 74.02% CvT: Introducing Convolutions to Vision Transformers

Abstract

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (\ie shift, scale, and distortion invariance) while maintaining the merits of Transformers (\ie dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (\eg ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7\% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at \url{ https://github.com/leoxiaobin/CvT }.

ImageNet1k数据集CvT-13, 81.6% An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection

Abstract

As DenseNet conserves intermediate features with diverse receptive fields by aggregating them with dense connection, it shows good performance on the object detection task. Although feature reuse enables DenseNet to produce strong features with a small number of model parameters and FLOPs, the detector with DenseNet backbone shows rather slow speed and low energy efficiency. We find the linearly increasing input channel by dense connection leads to heavy memory access cost, which causes computation overhead and more energy consumption. To solve the inefficiency of DenseNet, we propose an energy and computation efficient architecture called VoVNet comprised of One-Shot Aggregation (OSA). The OSA not only adopts the strength of DenseNet that represents diversified features with multi receptive fields but also overcomes the inefficiency of dense connection by aggregating all features only once in the last feature maps. To validate the effectiveness of VoVNet as a backbone network, we design both lightweight and large-scale VoVNet and apply them to one-stage and two-stage object detectors. Our VoVNet based detectors outperform DenseNet based ones with 2x faster speed and the energy consumptions are reduced by 1.6x - 4.1x. In addition to DenseNet, VoVNet also outperforms widely used ResNet backbone with faster speed and better energy efficiency. In particular, the small object detection performance has been significantly improved over DenseNet and ResNet.

imagenet VoVNet-39, top1 acc 0.7677 Residual Attention: A Simple but Effective Method for Multi-Label Recognition

Abstract

Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations.

VOC2007 dataset, resnet101, head num=1, mAP 94.7% HashNet: Deep Learning to Hash by Continuation

Abstract

Learning to hash has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval, due to its computation efficiency and retrieval quality. Deep learning to hash, which improves retrieval quality by end-to-end representation learning and hash encoding, has received increasing attention recently. Subject to the ill-posed gradient difficulty in the optimization with sign activations, existing deep learning to hash methods need to first learn continuous representations and then generate binary hash codes in a separated binarization step, which suffer from substantial loss of retrieval quality. This work presents HashNet, a novel deep architecture for deep learning to hash by continuation method with convergence guarantees, which learns exactly binary hash codes from imbalanced similarity data. The key idea is to attack the ill-posed gradient problem in optimizing deep networks with non-smooth binary activations by continuation method, in which we begin from learning an easier network with smoothed activation function and let it evolve during the training, until it eventually goes back to being the original, difficult to optimize, deep network with the sign activation function. Comprehensive empirical evidence shows that HashNet can generate exactly binary hash codes and yield state-of-the-art multimedia retrieval performance on standard benchmarks.

MS COCO 16bits 0.6873, 32bits 0.7184, 48bits 0.7301, 64bits 0.7362 Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN

Abstract

To convert the input into binary code, hashing algorithm has been widely used for approximate nearest neighbor search on large-scale image sets due to its computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural network. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally makes the optimization NP hard. In this work, we adopt the greedy principle to tackle this NP hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intactly to the front layer to avoid the vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks.

cifar10(1) 12bits 0.774, 24bits 0.795, 32bit 0.810, 48bits 0.822 Trusted Multi-View Classification

Abstract

Multi-view classification (MVC) generally focuses on improving classification accuracy by using information from different views, typically integrating them into a unified comprehensive representation for downstream tasks. However, it is also crucial to dynamically assess the quality of a view for different samples in order to provide reliable uncertainty estimations, which indicate whether predictions can be trusted. To this end, we propose a novel multi-view classification method, termed trusted multi-view classification, which provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. The algorithm jointly utilizes multiple views to promote both classification reliability and robustness by integrating evidence from each view. To achieve this, the Dirichlet distribution is used to model the distribution of the class probabilities, parameterized with evidence from different views and integrated with the Dempster-Shafer theory. The unified learning framework induces accurate uncertainty and accordingly endows the model with both reliability and robustness for out-of-distribution samples. Extensive experimental results validate the effectiveness of the proposed model in accuracy, reliability and robustness.

See Better Before Looking Closer: Weakly Supervised Data Augmentation Network for Fine-Grained Visual Classification

Abstract

Data augmentation is usually adopted to increase the amount of training data, prevent overfitting and improve the performance of deep models. However, in practice, random data augmentation, such as random image cropping, is low-efficiency and might introduce many uncontrolled background noises. In this paper, we propose Weakly Supervised Data Augmentation Network (WS-DAN) to explore the potential of data augmentation. Specifically, for each training image, we first generate attention maps to represent the object's discriminative parts by weakly supervised learning. Next, we augment the image guided by these attention maps, including attention cropping and attention dropping. The proposed WS-DAN improves the classification accuracy in two folds. In the first stage, images can be seen better since more discriminative parts' features will be extracted. In the second stage, attention regions provide accurate location of object, which ensures our model to look at the object closer and further improve the performance. Comprehensive experiments in common fine-grained visual classification datasets show that our WS-DAN surpasses the state-of-the-art methods, which demonstrates its effectiveness.

WS-DAN CUB-200-2011 acc 89.4%, FGVC-Aircraft 93.0%, Standford Cars 94.5% CycleMLP: A MLP-like Architecture for Dense Prediction

Abstract

This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. As compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation, CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity to image size by using local windows. In contrast, previous MLPs have O(N2) computations due to fully spatial connections. We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs. We expand the MLP-like models' applicability, making them a versatile backbone for dense prediction tasks. CycleMLP achieves competitive results on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP-Tiny outperforms Swin-Tiny by 1.3% mIoU on ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on ImageNet-C dataset. Code is available at https://github.com/ShoufaChen/CycleMLP .

CycleMLP-B1：78.9 A ConvNet for the 2020s

Abstract

The"Roaring 20s"of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually"modernize"a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

ConvNeXt-T top1 acc 0.821 Visual Attention Network

Abstract

While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN surpasses similar size vision transformers(ViTs) and convolutional neural networks(CNNs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on ImageNet benchmark and set new state-of-the-art performance (58.2 PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T 4% mIoU (50.1 vs. 46.1) for semantic segmentation on ADE20K benchmark, 2.6% AP (48.8 vs. 46.2) for object detection on COCO dataset. It provides a novel method and a simple yet strong baseline for the community. Code is available at https://github.com/Visual-Attention-Network .

VAN-Tiny top1 acc 0.754 Pelee: A Real-Time Object Detection System on Mobile Devices

Abstract

An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and MobileNetV2. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy and over 1.8 times faster speed than MobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system2, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on MS COCO dataset at the speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in consideration of a higher precision, 13.6 times lower computational cost and 11.3 times smaller model size.

peleenet top1 acc 0.726 All Tokens Matter: Token Labeling for Training Better Vision Transformers

Abstract

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling .

LV-ViT-T @ImageNet val top1 acc=79.1% Destruction and Construction Learning for Fine-grained Image Recognition

Abstract

Delicate feature representation about object parts plays a critical role in fine-grained recognition. For example, experts can even distinguish fine-grained objects relying only on object parts according to professional knowledge. In this paper, we propose a novel"Destruction and Construction Learning"(DCL) method to enhance the difficulty of fine-grained recognition and exercise the classification model to acquire expert knowledge. Besides the standard classification backbone network, another"destruction and construction"stream is introduced to carefully"destruct"and then"reconstruct"the input image, for learning discriminative regions and features. More specifically, for"destruction", we first partition the input image into local regions and then shuffle them by a Region Confusion Mechanism (RCM). To correctly recognize these destructed images, the classification network has to pay more attention to discriminative regions for spotting the differences. To compensate the noises introduced by RCM, an adversarial loss, which distinguishes original images from destructed ones, is applied to reject noisy patterns introduced by RCM. For"construction", a region alignment network, which tries to restore the original spatial layout of local regions, is followed to model the semantic correlation among local regions. By jointly training with parameter sharing, our proposed DCL injects more discriminative local details to the classification network. Experimental results show that our proposed framework achieves state-of-the-art performance on three standard benchmarks. Moreover, our proposed method does not need any external knowledge during training, and there is no computation overhead at inference time except the standard classification network feed-forwarding. Source code: https://github.com/JDAI-CV/DCL .

ResNet50, CUB-200-2011 acc 87.8%, Stanford Cars 94.5%, FGVC-Aircraft 93.0% How to Trust Unlabeled Data? Instance Credibility Inference for Few-Shot Learning

Abstract

Deep learning based models have excelled in many computer vision tasks and appear to surpass humans' performance. However, these models require an avalanche of expensive human labeled training data and many iterations to train their large number of parameters. This severely limits their scalability to the real-world long-tail distributed categories, some of which are with a large number of instances, but with only a few manually annotated. Learning from such extremely limited labeled examples is known as Few-shot learning (FSL). Different to prior arts that leverage meta-learning or data augmentation strategies to alleviate this extremely data-scarce problem, this paper presents a statistical approach, dubbed Instance Credibility Inference (ICI) to exploit the support of unlabeled instances for few-shot visual recognition. Typically, we repurpose the self-taught learning paradigm to predict pseudo-labels of unlabeled instances with an initial classifier trained from the few shot and then select the most confident ones to augment the training set to re-train the classifier. This is achieved by constructing a (Generalized) Linear Model (LM/GLM) with incidental parameters to model the mapping from (un-)labeled features to their (pseudo-)labels, in which the sparsity of the incidental parameters indicates the credibility of the corresponding pseudo-labeled instance. We rank the credibility of pseudo-labeled instances along the regularization path of their corresponding incidental parameters, and the most trustworthy pseudo-labeled examples are preserved as the augmented labeled instances. Theoretically, under mild conditions of restricted eigenvalue, irrepresentability, and large error, our approach is guaranteed to collect all the correctly-predicted instances from the noisy pseudo-labeled set.

mini-ImageNet, semi ICIR 1shot 73.12%, 5shot 83.28% 论文名称(链接) EfficientDet: Scalable and Efficient Object Detection

Abstract

Accurate depth estimation from images is a fundamental task in manyapplications including scene understanding and reconstruction. Existingsolutions for depth estimation often produce blurry approximations of lowresolution. This paper presents a convolutional neural network for computing ahigh-resolution depth map given a single RGB image with the help of transferlearning. Following a standard encoder-decoder architecture, we leveragefeatures extracted using high performing pre-trained networks when initializingour encoder along with augmentation and training strategies that lead to moreaccurate results. We show how, even for a very simple decoder, our method isable to achieve detailed high-resolution depth maps. Our network, with fewerparameters and training iterations, outperforms state-of-the-art on twodatasets and also produces qualitatively better results that capture objectboundaries more faithfully. Code and corresponding pre-trained weights are madepublicly available.

efficientdet_d0 mAP: 33.6 High Quality Monocular Depth Estimation via Transfer Learning

Abstract

We show that the YOLOv4 object detection neural network based on the CSPapproach, scales both up and down and is applicable to small and large networkswhile maintaining optimal speed and accuracy. We propose a network scalingapproach that modifies not only the depth, width, resolution, but alsostructure of the network. YOLOv4-large model achieves state-of-the-art results:55.5% AP (73.4% AP50) for the MS COCO dataset at a speed of ~16 FPS on TeslaV100, while with the test time augmentation, YOLOv4-large achieves 56.0% AP(73.3 AP50). To the best of our knowledge, this is currently the highestaccuracy on the COCO dataset among any published work. The YOLOv4-tiny modelachieves 22.0% AP (42.0% AP50) at a speed of 443 FPS on RTX 2080Ti, while byusing TensorRT, batch size = 4 and FP16-precision the YOLOv4-tiny achieves 1774FPS.

NYU Depth v2 δ1: 0.895 参考原论文table1 Scaled-YOLOv4: Scaling Cross Stage Partial Network

Abstract

The highest accuracy object detectors to date are based on a two-stageapproach popularized by R-CNN, where a classifier is applied to a sparse set ofcandidate object locations. In contrast, one-stage detectors that are appliedover a regular, dense sampling of possible object locations have the potentialto be faster and simpler, but have trailed the accuracy of two-stage detectorsthus far. In this paper, we investigate why this is the case. We discover thatthe extreme foreground-background class imbalance encountered during trainingof dense detectors is the central cause. We propose to address this classimbalance by reshaping the standard cross entropy loss such that itdown-weights the loss assigned to well-classified examples. Our novel FocalLoss focuses training on a sparse set of hard examples and prevents the vastnumber of easy negatives from overwhelming the detector during training. Toevaluate the effectiveness of our loss, we design and train a simple densedetector we call RetinaNet. Our results show that when trained with the focalloss, RetinaNet is able to match the speed of previous one-stage detectorswhile surpassing the accuracy of all existing state-of-the-art two-stagedetectors. Code is at: this https URL.

YOLOv4-P5 mAP: 51.2 Focal Loss for Dense Object Detection

Abstract

There are a huge number of features which are said to improve ConvolutionalNeural Network (CNN) accuracy. Practical testing of combinations of suchfeatures on large datasets, and theoretical justification of the result, isrequired. Some features operate on certain models exclusively and for certainproblems exclusively, or only for small-scale datasets; while some features,such as batch-normalization and residual-connections, are applicable to themajority of models, tasks, and datasets. We assume that such universal featuresinclude Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections(CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT)and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation,Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, andcombine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50)for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Sourcecode is at this https URL

RetinaNet R-50-FPN 1x mAP: 35.7 参考github YOLOv4: Optimal Speed and Accuracy of Object Detection

Abstract

Cascade is a classic yet powerful architecture that has boosted performanceon various tasks. However, how to introduce cascade to instance segmentationremains an open question. A simple combination of Cascade R-CNN and Mask R-CNNonly brings limited gain. In exploring a more effective approach, we find thatthe key to a successful instance segmentation cascade is to fully leverage thereciprocal relationship between detection and segmentation. In this work, wepropose a new framework, Hybrid Task Cascade (HTC), which differs in twoimportant aspects: (1) instead of performing cascaded refinement on these twotasks separately, it interweaves them for a joint multi-stage processing; (2)it adopts a fully convolutional branch to provide spatial context, which canhelp distinguishing hard foreground from cluttered background. Overall, thisframework can learn more discriminative features progressively whileintegrating complementary features together in each stage. Without bells andwhistles, a single HTC obtains 38.4 and 1.5 improvement over a strong CascadeMask R-CNN baseline on MSCOCO dataset. Moreover, our overall system achieves48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018Challenge Object Detection Task. Code is available at:this https URL.

input size: 416x416, MS COCO上mAP=41.2 Hybrid Task Cascade for Instance Segmentation

Abstract

We trained a convolutional neural network (CNN) to map raw pixels from asingle front-facing camera directly to steering commands. This end-to-endapproach proved surprisingly powerful. With minimum training data from humansthe system learns to drive in traffic on local roads with or without lanemarkings and on highways. It also operates in areas with unclear visualguidance such as in parking lots and on unpaved roads.The system automatically learns internal representations of the necessaryprocessing steps such as detecting useful road features with only the humansteering angle as the training signal. We never explicitly trained it todetect, for example, the outline of roads.Compared to explicit decomposition of the problem, such as lane markingdetection, path planning, and control, our end-to-end system optimizes allprocessing steps simultaneously. We argue that this will eventually lead tobetter performance and smaller systems. Better performance will result becausethe internal components self-optimize to maximize overall system performance,instead of optimizing human-selected intermediate criteria, e.g., lanedetection. Such criteria understandably are selected for ease of humaninterpretation which doesn't automatically guarantee maximum systemperformance. Smaller networks are possible because the system learns to solvethe problem with the minimal number of processing steps.We used an NVIDIA DevBox and Torch 7 for training and an NVIDIA DRIVE(TM) PXself-driving car computer also running Torch 7 for determining where to drive.The system operates at 30 frames per second (FPS).

HTC R-50-FPN 1x box AP: 42.3 mask AP: 37.4 Holistically-Nested Edge Detection

Abstract

Humans recognize the visual world at multiple levels: we effortlesslycategorize scenes and detect objects inside, while also identifying thetextures and surfaces of the objects along with their different compositionalparts. In this paper, we study a new task called Unified Perceptual Parsing,which requires the machine vision systems to recognize as many visual conceptsas possible from a given image. A multi-task framework called UPerNet and atraining strategy are developed to learn from heterogeneous image annotations.We benchmark our framework on Unified Perceptual Parsing and show that it isable to effectively segment a wide range of concepts from images. The trainednetworks are further applied to discover visual knowledge in natural scenes.Models are available at \url{this https URL}.

BSD500 dataset (ODS F-score of .782) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Abstract

We present an approach to efficiently detect the 2D pose of multiple peoplein an image. The approach uses a nonparametric representation, which we referto as Part Affinity Fields (PAFs), to learn to associate body parts withindividuals in the image. The architecture encodes global context, allowing agreedy bottom-up parsing step that maintains high accuracy while achievingrealtime performance, irrespective of the number of people in the image. Thearchitecture is designed to jointly learn part locations and their associationvia two branches of the same sequential prediction process. Our method placedfirst in the inaugural COCO 2016 keypoints challenge, and significantly exceedsthe previous state-of-the-art result on the MPII Multi-Person benchmark, bothin performance and efficiency.

bleu-1: 67%, bleu-2: 45.7%, bleu-3: 31.4%, bleu-4: 21.3% Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Abstract

Recent work has demonstrated that deep neural networks are vulnerable toadversarial examples---inputs that are almost indistinguishable from naturaldata and yet classified incorrectly by the network. In fact, some of the latestfindings suggest that the existence of adversarial attacks may be an inherentweakness of deep learning models. To address this problem, we study theadversarial robustness of neural networks through the lens of robustoptimization. This approach provides us with a broad and unifying view on muchof the prior work on this topic. Its principled nature also enables us toidentify methods for both training and attacking neural networks that arereliable and, in a certain sense, universal. In particular, they specify aconcrete security guarantee that would protect against any adversary. Thesemethods let us train networks with significantly improved resistance to a widerange of adversarial attacks. They also suggest the notion of security againsta first-order adversary as a natural and broad security guarantee. We believethat robustness against such well-defined classes of adversaries is animportant stepping stone towards fully resistant deep learning models. Code andpre-trained models are available at this https URLand this https URL.

coco 2014 bleu-1=79.8% Pixel Recurrent Neural Networks

Abstract

Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image model that is at once expressive, tractable and scalable. We present a deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. Our method models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. Architectural novelties include fast two-dimensional recurrent layers and an effective use of residual connections in deep recurrent networks. We achieve log-likelihood scores on natural images that are considerably better than the previous state of the art. Our main results also provide benchmarks on the diverse ImageNet dataset. Samples generated from the model appear crisp, varied and globally coherent.

NLL test 81.3 Residual Attention Network for Image Classification

Abstract

In this work, we propose"Residual Attention Network", a convolutional neural network using attention mechanism which can incorporate with state-of-art feed forward network architecture in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers going deeper. Inside each Attention Module, bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers. Extensive analyses are conducted on CIFAR-10 and CIFAR-100 datasets to verify the effectiveness of every module mentioned above. Our Residual Attention Network achieves state-of-the-art object recognition performance on three benchmark datasets including CIFAR-10 (3.90% error), CIFAR-100 (20.45% error) and ImageNet (4.8% single model and single crop, top-5 error). Note that, our method achieves 0.6% top-1 accuracy improvement with 46% trunk depth and 69% forward FLOPs comparing to ResNet-200. The experiment also demonstrates that our network is robust against noisy labels.

Attention-92 top1 error 4.99% Fast R-CNN

Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn .

COCO R50 mAP=37.8 Simple Baselines for Human Pose Estimation and Tracking

Abstract

There has been significant progress on pose estimation and increasing interests on pose tracking in recent years. At the same time, the overall algorithm and system complexity increases as well, making the algorithm analysis and comparison more difficult. This work provides simple and effective baseline methods. They are helpful for inspiring and evaluating new ideas for the field. State-of-the-art results are achieved on challenging benchmarks. The code will be available at https://github.com/leoxiaobin/pose.pytorch .

MPII; 256x256_pose_resnet_50 mean=88.53 VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

Abstract

Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to a RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.

KITTI; VoxelNet 3D detection kitti validation: car easy: 81, 97 moderate: 65.46 hard: 62.85(参考原论文table2 github repo实现精度 easy: 53.43 moderate: 48.78 hard: 48.06) MnasNet: Platform-Aware Neural Architecture Search for Mobile

Abstract

Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8x faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3x faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. Code is at this https URL

ImageNet: MnasNet-A top-1 73.5 Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

Abstract

While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of"predictive coding"from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.

KITTI: MSE 0.007 Gradient Harmonized Single-stage Detector

Abstract

This paper revisits feature pyramids networks (FPN) for one-stage detectors and points out that the success of FPN is due to its divide-and-conquer solution to the optimization problem in object detection rather than multi-scale feature fusion. From the perspective of optimization, we introduce an alternative way to address the problem instead of adopting the complex feature pyramids - {\em utilizing only one-level feature for detection}. Based on the simple and efficient solution, we present You Only Look One-level Feature (YOLOF). In our method, two key components, Dilated Encoder and Uniform Matching, are proposed and bring considerable improvements. Extensive experiments on the COCO benchmark prove the effectiveness of the proposed model. Our YOLOF achieves comparable results with its feature pyramids counterpart RetinaNet while being 2.5× faster. Without transformer layers, YOLOF can match the performance of DETR in a single-level feature manner with 7× less training epochs. With an image size of 608×608, YOLOF achieves 44.3 mAP running at 60 fps on 2080Ti, which is 13% faster than YOLOv4. Code is available at \url{ https://github.com/megvii-model/YOLOF }.

coco: R50 37.0 You Only Look One-level Feature

Abstract

Many modern object detectors demonstrate outstanding performances by using the mechanism of looking and thinking twice. In this paper, we explore this mechanism in the backbone design for object detection. At the macro level, we propose Recursive Feature Pyramid, which incorporates extra feedback connections from Feature Pyramid Networks into the bottom-up backbone layers. At the micro level, we propose Switchable Atrous Convolution, which convolves the features with different atrous rates and gathers the results using switch functions. Combining them results in DetectoRS, which significantly improves the performances of object detection. On COCO test-dev, DetectoRS achieves state-of-the-art 55.7% box AP for object detection, 48.5% mask AP for instance segmentation, and 50.0% PQ for panoptic segmentation. The code is made publicly available.

coco: R50 37.5 YOLOX: Exceeding YOLO Series in 2021

Abstract

In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector — YOLOX. We switch the YOLO detector to an anchor-free manner and conduct other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA to achieve state-of-the-art results across a large scale range of models: For YOLONano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L with roughly the same amount of parameters as YOLOv4- CSP, YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and * Equal contribution. † Corresponding author. researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, NCNN, and Openvino supported. Source code is at https://github.com/ Megvii-BaseDetection/YOLOX.

COCO 2017 test-dev, YOLOX-X (640*640) map: 51.2 论文名称(链接) PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Abstract

Few prior works study deep learning on point sets. PointNet by Qi et al. is apioneer in this direction. However, by design PointNet does not capture localstructures induced by the metric space points live in, limiting its ability torecognize fine-grained patterns and generalizability to complex scenes. In thiswork, we introduce a hierarchical neural network that applies PointNetrecursively on a nested partitioning of the input point set. By exploitingmetric space distances, our network is able to learn local features withincreasing contextual scales. With further observation that point sets areusually sampled with varying densities, which results in greatly decreasedperformance for networks trained on uniform densities, we propose novel setlearning layers to adaptively combine features from multiple scales.Experiments show that our network called PointNet++ is able to learn deep pointset features efficiently and robustly. In particular, results significantlybetter than state-of-the-art have been obtained on challenging benchmarks of 3Dpoint clouds.

ModelNet40: 89.2% PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

Abstract

The ability to perform pixel-wise semantic segmentation in real-time is ofparamount importance in mobile applications. Recent deep neural networks aimedat this task have the disadvantage of requiring a large number of floatingpoint operations and have long run-times that hinder their usability. In thispaper, we propose a novel deep neural network architecture named ENet(efficient neural network), created specifically for tasks requiring lowlatency operation. ENet is up to 18$\times$ faster, requires 75$\times$ lessFLOPs, has 79$\times$ less parameters, and provides similar or better accuracyto existing models. We have tested it on CamVid, Cityscapes and SUN datasetsand report on comparisons with existing state-of-the-art methods, and thetrade-offs between accuracy and processing time of a network. We presentperformance measurements of the proposed architecture on embedded systems andsuggest possible software improvements that could make ENet even faster.

ModelNet40: 90.7% ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Abstract

We develop a new edge detection algorithm that tackles two important issuesin this long-standing vision problem: (1) holistic image training andprediction; and (2) multi-scale and multi-level feature learning. Our proposedmethod, holistically-nested edge detection (HED), performs image-to-imageprediction by means of a deep learning model that leverages fully convolutionalneural networks and deeply-supervised nets. HED automatically learns richhierarchical representations (guided by deep supervision on side responses)that are important in order to approach the human ability resolve thechallenging ambiguity in edge and object boundary detection. We significantlyadvance the state-of-the-art on the BSD500 dataset (ODS F-score of .782) andthe NYU Depth dataset (ODS F-score of .746), and do so with an improved speed(0.4 second per image) that is orders of magnitude faster than some recentCNN-based edge detection algorithms.

Cityscapes mIoU 58.3% Unified Perceptual Parsing for Scene Understanding

Abstract

In this work we address the task of semantic image segmentation with DeepLearning and make three main contributions that are experimentally shown tohave substantial practical merit. First, we highlight convolution withupsampled filters, or 'atrous convolution', as a powerful tool in denseprediction tasks. Atrous convolution allows us to explicitly control theresolution at which feature responses are computed within Deep ConvolutionalNeural Networks. It also allows us to effectively enlarge the field of view offilters to incorporate larger context without increasing the number ofparameters or the amount of computation. Second, we propose atrous spatialpyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPPprobes an incoming convolutional feature layer with filters at multiplesampling rates and effective fields-of-views, thus capturing objects as well asimage context at multiple scales. Third, we improve the localization of objectboundaries by combining methods from DCNNs and probabilistic graphical models.The commonly deployed combination of max-pooling and downsampling in DCNNsachieves invariance but has a toll on localization accuracy. We overcome thisby combining the responses at the final DCNN layer with a fully connectedConditional Random Field (CRF), which is shown both qualitatively andquantitatively to improve localization performance. Our proposed"DeepLab"system sets the new state-of-art at the PASCAL VOC-2012 semantic imagesegmentation task, reaching 79.7% mIOU in the test set, and advances theresults on three other datasets: PASCAL-Context, PASCAL-Person-Part, andCityscapes. All of our code is made publicly available online.

Cityscapes mIoU 80.1% DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Abstract

In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed"DeepLab"system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.

Cityscapes mIoU 71.4% Real-Time High-Resolution Background Matting

Abstract

We introduce a real-time, high-resolution background replacement technique which operates at 30fps in 4K resolution, and 60fps for HD on a modern GPU. Our technique is based on background matting, where an additional frame of the background is captured and used in recovering the alpha matte and the foreground layer. The main challenge is to compute a high-quality alpha matte, preserving strand-level hair details, while processing high-resolution images in real-time. To achieve this goal, we employ two neural networks; a base network computes a low-resolution result which is refined by a second network operating at high-resolution on selective patches. We introduce two largescale video and image matting datasets: VideoMatte240K and PhotoMatte13K/85. Our approach yields higher quality results compared to the previous state-of-the-art in background matting, while simultaneously yielding a dramatic boost in both speed and resolution.

PhotoMatte85 SAD8.65、MSE9.57 Panoptic Feature Pyramid Networks

Abstract

The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.

Cityscapes mIoU 75.8% SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Abstract

We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. We show that SegNet provides good performance with competitive inference time and more efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/ .

image size: 360× 480; Dataset: Camvid; mIOU: 60.1 YOLACT: Real-time Instance Segmentation

Abstract

We develop an algorithm that can detect pneumonia from chest X-rays at alevel exceeding practicing radiologists. Our algorithm, CheXNet, is a 121-layerconvolutional neural network trained on ChestX-ray14, currently the largestpublicly available chest X-ray dataset, containing over 100,000 frontal-viewX-ray images with 14 diseases. Four practicing academic radiologists annotate atest set, on which we compare the performance of CheXNet to that ofradiologists. We find that CheXNet exceeds average radiologist performance onthe F1 metric. We extend CheXNet to detect all 14 diseases in ChestX-ray14 andachieve state of the art results on all 14 diseases.

Image size: 550 Resnet101-FPN FPS=33.5 mAP=29.8 YOLACT++: Better Real-time Instance Segmentation

Abstract

We present a new method for efficient high-quality image segmentation ofobjects and scenes. By analogizing classical computer graphics methods forefficient rendering with over- and undersampling challenges faced in pixellabeling tasks, we develop a unique perspective of image segmentation as arendering problem. From this vantage, we present the PointRend (Point-basedRendering) neural network module: a module that performs point-basedsegmentation predictions at adaptively selected locations based on an iterativesubdivision algorithm. PointRend can be flexibly applied to both instance andsemantic segmentation tasks by building on top of existing state-of-the-artmodels. While many concrete implementations of the general idea are possible,we show that a simple design already achieves excellent results. Qualitatively,PointRend outputs crisp object boundaries in regions that are over-smoothed byprevious methods. Quantitatively, PointRend yields significant gains on COCOand Cityscapes, for both instance and semantic segmentation. PointRend'sefficiency enables output resolutions that are otherwise impractical in termsof memory or computation compared to existing approaches. Code has been madeavailable atthis https URL.

Image size: 550 Resnet50-FPN FPS=33.5 mAP=34.1 PointRend: Image Segmentation as Rendering

Abstract

Recent works have widely explored the contextual dependencies to achieve moreaccurate segmentation results. However, most approaches rarely distinguishdifferent types of contextual dependencies, which may pollute the sceneunderstanding. In this work, we directly supervise the feature aggregation todistinguish the intra-class and inter-class context clearly. Specifically, wedevelop a Context Prior with the supervision of the Affinity Loss. Given aninput image and corresponding ground truth, Affinity Loss constructs an idealaffinity map to supervise the learning of Context Prior. The learned ContextPrior extracts the pixels belonging to the same category, while the reversedprior focuses on the pixels of different classes. Embedded into a conventionaldeep CNN, the proposed Context Prior Layer can selectively capture theintra-class and inter-class contextual dependencies, leading to robust featurerepresentation. To validate the effectiveness, we design an effective ContextPrior Network (CPNet). Extensive quantitative and qualitative evaluationsdemonstrate that the proposed model performs favorably against state-of-the-artsemantic segmentation approaches. More specifically, our algorithm achieves46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU onCityscapes. Code is available at this https URL.

cityscapes resnet50+FPN mIoU 78.3% 参考P5 architecture Context Prior for Scene Segmentation

Abstract

BiSeNet has been proved to be a popular two-stream network for real-timesegmentation. However, its principle of adding an extra path to encode spatialinformation is time-consuming, and the backbones borrowed from pretrainedtasks, e.g., image classification, may be inefficient for image segmentationdue to the deficiency of task-specific design. To handle these problems, wepropose a novel and efficient structure named Short-Term Dense Concatenatenetwork (STDC network) by removing structure redundancy. Specifically, wegradually reduce the dimension of feature maps and use the aggregation of themfor image representation, which forms the basic module of STDC network. In thedecoder, we propose a Detail Aggregation module by integrating the learning ofspatial information into low-level layers in single-stream manner. Finally, thelow-level features and deep features are fused to predict the finalsegmentation results. Extensive experiments on Cityscapes and CamVid datasetdemonstrate the effectiveness of our method by achieving promising trade-offbetween segmentation accuracy and inference speed. On Cityscapes, we achieve71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti,which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0FPS while inferring on higher resolution images.

cityscapes resnet101 mIoU 81.3% 参考论文table6 Rethinking BiSeNet For Real-time Semantic Segmentation

Abstract

BiSeNet has been proved to be a popular two-stream network for real-time segmentation. However, its principle of adding an extra path to encode spatial information is time-consuming, and the backbones borrowed from pretrained tasks, e.g., image classification, may be inefficient for image segmentation due to the deficiency of task-specific design. To handle these problems, we propose a novel and efficient structure named Short-Term Dense Concatenate network (STDC network) by removing structure redundancy. Specifically, we gradually reduce the dimension of feature maps and use the aggregation of them for image representation, which forms the basic module of STDC network. In the decoder, we propose a Detail Aggregation module by integrating the learning of spatial information into low-level layers in single-stream manner. Finally, the low-level features and deep features are fused to predict the final segmentation results. Extensive experiments on Cityscapes and CamVid dataset demonstrate the effectiveness of our method by achieving promising trade-off between segmentation accuracy and inference speed. On Cityscapes, we achieve 71.9% mIoU on the test set with a speed of 250.4 FPS on NVIDIA GTX 1080Ti, which is 45.2% faster than the latest methods, and achieve 76.8% mIoU with 97.0 FPS while inferring on higher resolution images.

imgsize 512 × 1024 cityscapes STDC2-Seg50 mIoU 74.2% 参见论文table6 ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network

Abstract

We introduce a light-weight, power efficient, and general purpose convolutional neural network, ESPNetv2, for modeling visual and sequential data. Our network uses group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters. The performance of our network is evaluated on four different tasks: (1) object classification, (2) semantic segmentation, (3) object detection, and (4) language modeling. Experiments on these tasks, including image classification on the ImageNet and language modeling on the PenTree bank dataset, demonstrate the superior performance of our method over the state-of-the-art methods. Our network outperforms ESPNet by 4-5% and has 2-4x fewer FLOPs on the PASCAL VOC and the Cityscapes dataset. Compared to YOLOv2 on the MS-COCO object detection, ESPNetv2 delivers 4.4% higher accuracy with 6x fewer FLOPs. Our experiments show that ESPNetv2 is much more power efficient than existing state-of-the-art efficient methods including ShuffleNets and MobileNets. Our code is open-source and available at https://github.com/sacmehta/ESPNetv2 cityscapes ESPNetv2-val mIoU 66.4% 参见论文 fig7 Exploring Cross-Image Pixel Contrast for Semantic Segmentation

Abstract

Current semantic segmentation methods focus only on mining"local"context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore"global"context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HR-Net), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff, CamVid). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.

HRNet-W48 Cityscaoes mIOU=80.18 Category-Level Adversarial Adaptation for Semantic Segmentation using Purified Features

Abstract

We target the problem named unsupervised domain adaptive semantic segmentation. A key in this campaign consists inreducing the domain shift, so that a classifier based on labeled data from one domain can generalize well to other domains. With theadvancement of adversarial learning blackmethod, recent works prefer the strategy of aligning the marginal distribution in the featurespaces for minimizing the domain discrepancy. However, based on the observance in experiments, only focusing on aligning globalmarginal distribution but ignoring the local joint distribution alignment fails to be the optimal choice. Other than that, the noisy factorsexisting in the feature spaces, which are not relevant to the target task, entangle with the domain invariant factors improperly and makethe domain distribution alignment more difficult. To address those problems, we introduce two new modules, Significance-awareInformation Bottleneck (SIB) and Category-level alignment (CLA), to construct a purified embedding-based category-level adversarialnetwork. As the name suggests, our designed network, CLAN, can not only disentangle the noisy factors and suppress their influencesfor target tasks but also utilize those purified features to conduct a more delicate level domain calibration, i.e., global marginal distributionand local joint distribution alignment simultaneously. In three domain adaptation tasks, i.e., GTA5 → Cityscapes, SYNTHIA → Cityscapesand Cross Season, we validate that our proposed method matches the state of the art in segmentation accuracy.

Resnet101 Cityscapes mIoU 45.5% Brain Tumor Segmentation with Deep Neural Networks

Abstract

In this paper, we present a fully automatic brain tumor segmentation method based on Deep Neural Networks (DNNs). The proposed networks are tailored to glioblastomas (both low and high grade) pictured in MR images. By their very nature, these tumors can appear anywhere in the brain and have almost any kind of shape, size, and contrast. These reasons motivate our exploration of a machine learning solution that exploits a flexible, high capacity DNN while being extremely efficient. Here, we give a description of different model choices that we've found to be necessary for obtaining competitive performance. We explore in particular different architectures based on Convolutional Neural Networks (CNN), i.e. DNNs specifically adapted to image data. We present a novel CNN architecture which differs from those traditionally used in computer vision. Our CNN exploits both local features as well as more global contextual features simultaneously. Also, different from most traditional uses of CNNs, our networks use a final layer that is a convolutional implementation of a fully connected layer which allows a 40 fold speed up. We also describe a 2-phase training procedure that allows us to tackle difficulties related to the imbalance of tumor labels. Finally, we explore a cascade architecture in which the output of a basic CNN is treated as an additional source of information for a subsequent CNN. Results reported on the 2013 BRATS test dataset reveal that our architecture improves over the currently published state-of-the-art while being over 30 times faster.

BRATS 2013 test: Dice Complete 0.84, Core 0.72, Enhancing 0.57 Dynamic Graph CNN for Learning on Point Clouds

Abstract

Point clouds provide a flexible geometric representation suitable for countless applications in computer graphics; they also comprise the raw output of most 3D data acquisition devices. While hand-designed features on point clouds have long been proposed in graphics and vision, however, the recent overwhelming success of convolutional neural networks (CNNs) for image analysis suggests the value of adapting insight from CNN to the point cloud world. Point clouds inherently lack topological information so designing a model to recover topology can enrich the representation power of point clouds. To this end, we propose a new neural network module dubbed EdgeConv suitable for CNN-based high-level tasks on point clouds including classification and segmentation. EdgeConv acts on graphs dynamically computed in each layer of the network. It is differentiable and can be plugged into existing architectures. Compared to existing modules operating in extrinsic space or treating each point independently, EdgeConv has several appealing properties: It incorporates local neighborhood information; it can be stacked applied to learn global shape properties; and in multi-layer systems affinity in feature space captures semantic characteristics over potentially long distances in the original embedding. We show the performance of our model on standard benchmarks including ModelNet40, ShapeNetPart, and S3DIS.

mIOU=85.2% 参考论文 Table.6 Adaptive Pyramid Context Network for Semantic Segmentation

Abstract

Semi-supervised learning (SSL) provides an effective means of leveragingunlabeled data to improve a model's performance. In this paper, we demonstratethe power of a simple combination of two common SSL methods: consistencyregularization and pseudo-labeling. Our algorithm, FixMatch, first generatespseudo-labels using the model's predictions on weakly-augmented unlabeledimages. For a given image, the pseudo-label is only retained if the modelproduces a high-confidence prediction. The model is then trained to predict thepseudo-label when fed a strongly-augmented version of the same image. Despiteits simplicity, we show that FixMatch achieves state-of-the-art performanceacross a variety of standard semi-supervised learning benchmarks, including94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 -- just4 labels per class. Since FixMatch bears many similarities to existing SSLmethods that achieve worse performance, we carry out an extensive ablationstudy to tease apart the experimental factors that are most important toFixMatch's success. We make our code available atthis https URL.

Cityscapes: mIOU=79.28% CGNet: A Light-weight Context Guided Network for Semantic Segmentation

Abstract

We focus on the challenging task of real-time semantic segmentation in thispaper. It finds many practical applications and yet is with fundamentaldifficulty of reducing a large portion of computation for pixel-wise labelinference. We propose an image cascade network (ICNet) that incorporatesmulti-resolution branches under proper label guidance to address thischallenge. We provide in-depth analysis of our framework and introduce thecascade feature fusion unit to quickly achieve high-quality segmentation. Oursystem yields real-time inference on a single GPU card with decent qualityresults evaluated on challenging datasets like Cityscapes, CamVid andCOCO-Stuff.

Cityscapes valset: M3N21, mIOU=68.27% ICNet for Real-Time Semantic Segmentation on High-Resolution Images

Abstract

Recent deep learning based approaches have shown promising results for thechallenging task of inpainting large missing regions in an image. These methodscan generate visually plausible image structures and textures, but often createdistorted structures or blurry textures inconsistent with surrounding areas.This is mainly due to ineffectiveness of convolutional neural networks inexplicitly borrowing or copying information from distant spatial locations. Onthe other hand, traditional texture and patch synthesis approaches areparticularly suitable when it needs to borrow textures from the surroundingregions. Motivated by these observations, we propose a new deep generativemodel-based approach which can not only synthesize novel image structures butalso explicitly utilize surrounding image features as references during networktraining to make better predictions. The model is a feed-forward, fullyconvolutional neural network which can process images with multiple holes atarbitrary locations and with variable sizes during the test time. Experimentson multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD) andnatural images (ImageNet, Places2) demonstrate that our proposed approachgenerates higher-quality inpainting results than existing ones. Code, demo andmodels are available at: this https URL.

Cityscapes mIOU 69.6% Context Encoding for Semantic Segmentation

Abstract

Recent work has made significant progress in improving spatial resolution for pixelwise labeling with Fully Convolutional Network (FCN) framework by employing Dilated/Atrous convolution, utilizing multi-scale features and refining boundaries. In this paper, we explore the impact of global contextual information in semantic segmentation by introducing the Context Encoding Module, which captures the semantic context of scenes and selectively highlights class-dependent featuremaps. The proposed Context Encoding Module significantly improves semantic segmentation results with only marginal extra computation cost over FCN. Our approach has achieved new state-of-the-art results 51.7% mIoU on PASCAL-Context, 85.9% mIoU on PASCAL VOC 2012. Our single model achieves a final score of 0.5567 on ADE20K test set, which surpass the winning entry of COCO-Place Challenge in 2017. In addition, we also explore how the Context Encoding Module can improve the feature representation of relatively shallow networks for the image classification on CIFAR-10 dataset. Our 14 layer network has achieved an error rate of 3.45%, which is comparable with state-of-the-art approaches with over 10 times more layers. The source code for the complete system are publicly available.

Cityscapes; Cityscapes mIOU = 78.55% BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

Abstract

Semantic segmentation requires both rich spatial information and sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture makes a right balance between the speed and segmentation performance on Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048x1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance.

Cityscapes: resnet18 mIOU=74.8 对应论文 Table.6 中实现 FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation

Abstract

Modern approaches for semantic segmentation usually employ dilated convolutions in the backbone to extract high-resolution feature maps, which brings heavy computation complexity and memory footprint. To replace the time and memory consuming dilated convolutions, we propose a novel joint upsampling module named Joint Pyramid Upsampling (JPU) by formulating the task of extracting high-resolution feature maps into a joint upsampling problem. With the proposed JPU, our method reduces the computation complexity by more than three times without performance loss. Experiments show that JPU is superior to other upsampling modules, which can be plugged into many existing approaches to reduce computation complexity and improve performance. By replacing dilated convolutions with the proposed JPU module, our method achieves the state-of-the-art performance in Pascal Context dataset (mIoU of 53.13%) and ADE20K dataset (final score of 0.5584) while running 3 times faster.

ADE20K: EncNet+JPU (resnet50)mIOU=42.5 Dynamic Multi-Scale Filters for Semantic Segmentation

Abstract

Multi-scale representation provides an effective way to address scale variation of objects and stuff in semantic segmentation. Previous works construct multi-scale representation by utilizing different filter sizes, expanding filter sizes with dilated filters or pooling grids, and the parameters of these filters are fixed after training. These methods often suffer from heavy computational cost or have more parameters, and are not adaptive to the input image during inference. To address these problems, this paper proposes a Dynamic Multi-scale Network (DMNet) to adaptively capture multi-scale contents for predicting pixel-level semantic labels. DMNet is composed of multiple Dynamic Convolutional Modules (DCMs) arranged in parallel, each of which exploits context-aware filters to estimate semantic representation for a specific scale. The outputs of multiple DCMs are further integrated for final segmentation. We conduct extensive experiments to evaluate our DMNet on three challenging semantic segmentation and scene parsing datasets, PASCAL VOC 2012, Pascal-Context, and ADE20K. DMNet achieves a new record 84.4% mIoU on PASCAL VOC 2012 test set without MS COCO pre-trained and post-processing, and also obtains state-of-the-art performance on PascalContext and ADE20K.

Cityscapes: mIOU = 79.64% ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation

Abstract

We introduce a fast and efficient convolutional neural network, ESPNet, for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less. We evaluated ESPNet on a variety of semantic segmentation datasets including Cityscapes, PASCAL VOC, and a breast biopsy whole slide image dataset. Under the same constraints on memory and computation, ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet on both standard metrics and our newly introduced performance metrics that measure efficiency on edge devices. Our network can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.

Cityscapes; 1. mIOU=60.3 对应论文 Table.1(a) 中实现; 2. 训练日志中包含周期性的在 valset 上的评估结果。 Adversarial Learning for Semi-Supervised Semantic Segmentation

Abstract

We propose a method for semi-supervised semantic segmentation using an adversarial network. While most existing discriminators are trained to classify input images as real or fake on the image level, we design a discriminator in a fully convolutional manner to differentiate the predicted probability maps from the ground truth segmentation distribution with the consideration of the spatial resolution. We show that the proposed discriminator can be used to improve semantic segmentation accuracy by coupling the adversarial loss with the standard cross entropy loss of the proposed model. In addition, the fully convolutional discriminator enables semi-supervised learning through discovering the trustworthy regions in predicted results of unlabeled images, thereby providing additional supervisory signals. In contrast to existing methods that utilize weakly-labeled images, our method leverages unlabeled images to enhance the segmentation model. Experimental results on the PASCAL VOC 2012 and Cityscapes datasets demonstrate the effectiveness of the proposed algorithm.

Pascal VOC 2012: data amount= 1/8; DeepLab-v2 with ResNet-101; mIOU=69.5 对应论文 Table.1 V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

Abstract

Convolutional Neural Networks (CNNs) have been recently employed to solve problems from both the computer vision and medical image analysis fields. Despite their popularity, most approaches are only able to process 2D images while most medical data used in clinical practice consists of 3D volumes. In this work we propose an approach to 3D image segmentation based on a volumetric, fully convolutional, neural network. Our CNN is trained end-to-end on MRI volumes depicting prostate, and learns to predict segmentation for the whole volume at once. We introduce a novel objective function, that we optimise during training, based on Dice coefficient. In this way we can deal with situations where there is a strong imbalance between the number of foreground and background voxels. To cope with the limited number of annotated volumes available for training, we augment the data applying random non-linear transformations and histogram matching. We show in our experimental evaluation that our approach achieves good performances on challenging test data while requiring only a fraction of the processing time needed by other previous methods.

Prostate dataset Dice coefficient: 0.869参考论文指标 Polarized Self-Attention: Towards High-quality Pixel-wise Regression

Abstract

Pixel-wise regression is probably the most common problem in fine-grained computer vision tasks, such as estimating keypoint heatmaps and segmentation masks. These regression problems are very challenging particularly because they require, at low computation overheads, modeling long-range dependencies on high-resolution inputs/outputs to estimate the highly nonlinear pixel-wise semantics. While attention mechanisms in Deep Convolutional Neural Networks(DCNNs) has become popular for boosting long-range dependencies, element-specific attention, such as Nonlocal blocks, is highly complex and noise-sensitive to learn, and most of simplified attention hybrids try to reach the best compromise among multiple types of tasks. In this paper, we present the Polarized Self-Attention(PSA) block that incorporates two critical designs towards high-quality pixel-wise regression: (1) Polarized filtering: keeping high internal resolution in both channel and spatial attention computation while completely collapsing input tensors along their counterpart dimensions. (2) Enhancement: composing non-linearity that directly fits the output distribution of typical fine-grained regression, such as the 2D Gaussian distribution (keypoint heatmaps), or the 2D Binormial distribution (binary segmentation masks). PSA appears to have exhausted the representation capacity within its channel-only and spatial-only branches, such that there is only marginal metric differences between its sequential and parallel layouts. Experimental results show that PSA boosts standard baselines by 2−4 points, and boosts state-of-the-arts by 1−2 points on 2D pose estimation and semantic segmentation benchmarks.

数据集 cityscapes valset验收指标：1. HRNetV2-OCR+PSA(s) mIOU= 86.7% 参考论文 Table.52. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg nnFormer: Interleaved Transformer for Volumetric Segmentation

Abstract

Transformer, the model of choice for naturallanguage processing, has drawn scant attention from themedical imaging community. Given the ability to exploitlong-term dependencies, transformers are promising tohelp atypical convolutional neural networks to overcometheir inherent shortcomings of spatial inductive bias. However, most of recently proposed transformer-based segmentation approaches simply treated transformers as assisted modules to help encode global context into convolutional representations. To address this issue, we introducennFormer (i.e., not-another transFormer), a 3D transformerfor volumetric medical image segmentation. nnFormer notonly exploits the combination of interleaved convolutionand self-attention operations, but also introduces localand global volume-based self-attention mechanism to learnvolume representations. Moreover, nnFormer proposes touse skip attention to replace the traditional concatenation/summation operations in skip connections in U-Netlike architecture. Experiments show that nnFormer significantly outperforms previous transformer-based counterparts by large margins on three public datasets. Comparedto nnUNet, nnFormer produces significantly lower HD95and comparable DSC results. Furthermore, we show thatnnFormer and nnUNet are highly complementary to eachother in model ensembling. Codes and models of nnFormerare available at https://git.io/JSf3i .

数据集 ACDC：注册后下载 https://acdc.creatis.insa-lyon.fr/#phase/5846c3ab6a3c7735e84b67f2验收指标：1.Dice = 91.78% 对应论文 Table.3中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Abstract

In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at this https URL.

数据集SYNAPSE：联系 [email protected] 获取处理后数据链接，或者在synapse 官网下载。验收指标：1. Avg Dice = 79.13% 对应论文 Table.1中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg FCCDN: Feature Constraint Network for VHR Image Change Detection

Abstract

Change detection is the process of identifying pixelwise differences in bitemporal co-registered images. It is of great significance to Earth observations. Recently, with the emergence of deep learning (DL), the power and feasibility of deep convolutional neural network (CNN)-based methods have been shown in the field of change detection. However, there is still a lack of effective supervision for change feature learning. In this work, a feature constraint change detection network (FCCDN) is proposed. We constrain features both in bitemporal feature extraction and feature fusion. More specifically, we propose a dual encoder-decoder network backbone for the change detection task. At the center of the backbone, we design a nonlocal feature pyramid network to extract and fuse multiscale features. To fuse bitemporal features in a robust way, we build a dense connection-based feature fusion module. Moreover, a self-supervised learning-based strategy is proposed to constrain feature learning. Based on FCCDN, we achieve state-of-the-art performance on two building change detection datasets (LEVIR-CD and WHU). On the LEVIR-CD dataset, we achieve an IoU of 0.8569 and an F1 score of 0.9229. On the WHU dataset, we achieve an IoU of 0.8820 and an F1 score of 0.9373. Moreover, for the first time, the acquisition of accurate bitemporal semantic segmentation results is achieved without using semantic segmentation labels. This is vital for the application of change detection because it saves the cost of labeling.

数据集 LEVIR-CD验收指标：1.FCCDN (512) F1 = 92.29% 参考论文 Table.32. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS A Transformer-Based Siamese Network for Change Detection

Abstract

This paper presents a transformer-based Siamese network architecture (abbreviated by ChangeFormer) for Change Detection (CD) from a pair of co-registered remote sensing images. Different from recent CD frameworks, which are based on fully convolutional networks (ConvNets), the proposed method unifies hierarchically structured transformer encoder with Multi-Layer Perception (MLP) decoder in a Siamese network architecture to efficiently render multi-scale long-range details required for accurate CD. Experiments on two CD datasets show that the proposed end-to-end trainable ChangeFormer architecture achieves better CD performance than previous counterparts. Our code is available at https://github.com/wgcban/ChangeFormer .

数据集 LEVIR-CD验收指标：1.ChangeFormer F1 = 90.4% 参考论文 Table.12. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Abstract

Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet .

数据集SYNAPSE：联系 [email protected] 获取处理后数据链接，或者在synapse 官网下载验收指标：1. Avg Dice = 77.48% 对应论文 Table.1中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg FactSeg: Foreground Activation Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery

Abstract

The small object semantic segmentation task is aimed at automatically extracting key objects from high-resolution remote sensing (HRS) imagery. Compared with the large-scale coverage areas for remote sensing imagery, the key objects, such as cars and ships, in HRS imagery often contain only a few pixels. In this article, to tackle this problem, the foreground activation (FA)-driven small object semantic segmentation (FactSeg) framework is proposed from perspectives of structure and optimization. In the structure design, FA object representation is proposed to enhance the awareness of the weak features in small objects. The FA object representation framework is made up of a dual-branch decoder and collaborative probability (CP) loss. In the dual-branch decoder, the FA branch is designed to activate the small object features (activation) and suppress the large-scale background, and the semantic refinement (SR) branch is designed to further distinguish small objects (refinement). The CP loss is proposed to effectively combine the activation and refinement outputs of the decoder under the CP hypothesis. During the collaboration, the weak features of the small objects are enhanced with the activation output, and the refined output can be viewed as the refinement of the binary outputs. In the optimization stage, small object mining (SOM)-based network optimization is applied to automatically select effective samples and refine the direction of the optimization while addressing the imbalanced sample problem between the small objects and the large-scale background. The experimental results obtained with two benchmark HRS imagery segmentation datasets demonstrate that the proposed framework outperforms the state-of-the-art semantic segmentation methods and achieves a good tradeoff between accuracy and efficiency. Code will be available at: http://rsidea.whu.edu.cn/FactSeg.htm 数据集 iSAID： https://captain-whu.github.io/iSAID/dataset.html验收指标：1 . FactSeg ResNet-50 mIOU= 64.79% 参考论文Table.4 实现2. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS PSANet: Point-wise Spatial Attention Network for Scene Parsing

Abstract

We notice information flow in convolutional neural networks is restricted inside local neighborhood regions due to the physical design of convolutional filters, which limits the overall understanding of complex scenes. In this paper, we propose the point-wise spatial attention network (PSANet) to relax the local neighborhood constraint. Each position on the feature map is connected to all the other ones through a self-adaptively learned attention mask. Moreover, information propagation in bi-direction for scene parsing is enabled. Information at other positions can be collected to help the prediction of the current position and vice versa, information at the current position can be distributed to assist the prediction of other ones. Our proposed approach achieves top performance on various competitive scene parsing datasets, including ADE20K, PASCAL VOC 2012 and Cityscapes, demonstrating its effectiveness and generality.

数据集 cityscapes valset 验收指标：1. PSANet-resnet50 输入分辨率512x1024 mIOU=77.24% 参考 https://github.com/open-mmlab/mmsegmentation/tree/master/configs/psanet2 . 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg CCNet: Criss-Cross Attention for Semantic Segmentation

Abstract

Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at \url{ https://github.com/speedinghzl/CCNet }.

数据集 cityscapes valset验收指标：1. CCNet-resnet101 R=2+OHEM mIOU= 80.0% 参考 https://github.com/speedinghzl/CCNet/tree/pure-python2 . 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

Abstract

Semantic segmentation is a key technology for autonomous vehicles to understand the surrounding scenes. The appealing performances of contemporary models usually come at the expense of heavy computations and lengthy inference time, which is intolerable for self-driving. Using light-weight architectures (encoder-decoder or two-pathway) or reasoning on low-resolution images, recent methods realize very fast scene parsing, even running at more than 100 FPS on a single 1080Ti GPU. However, there is still a significant gap in performance between these real-time methods and the models based on dilation backbones. To tackle this problem, we proposed a family of efficient backbones specially designed for real-time semantic segmentation. The proposed deep dual-resolution networks (DDRNets) are composed of two deep branches between which multiple bilateral fusions are performed. Additionally, we design a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) to enlarge effective receptive fields and fuse multi-scale context based on low-resolution feature maps. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both Cityscapes and CamVid dataset. In particular, on a single 2080Ti GPU, DDRNet-23-slim yields 77.4% mIoU at 102 FPS on Cityscapes test set and 74.7% mIoU at 230 FPS on CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models and requires much less computation. Codes and trained models are available online.

数据集 cityscapes valset验收指标：1. DDRNet-23 mIOU= 79.5% 参考论文 Table.42. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleSeg nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation

Abstract

The U-Net was presented in 2015. With its straight-forward and successful architecture it quickly evolved to a commonly used benchmark in medical image segmentation. The adaptation of the U-Net to novel problems, however, comprises several degrees of freedom regarding the exact architecture, preprocessing, training and inference. These choices are not independent of each other and substantially impact the overall performance. The present paper introduces the nnU-Net ('no-new-Net'), which refers to a robust and self-adapting framework on the basis of 2D and 3D vanilla U-Nets. We argue the strong case for taking away superfluous bells and whistles of many proposed network designs and instead focus on the remaining aspects that make out the performance and generalizability of a method. We evaluate the nnU-Net in the context of the Medical Segmentation Decathlon challenge, which measures segmentation performance in ten disciplines comprising distinct entities, image modalities, image geometries and dataset sizes, with no manual adjustments between datasets allowed. At the time of manuscript submission, nnU-Net achieves the highest mean dice scores across all classes and seven phase 1 tasks (except class 1 in BrainTumour) in the online leaderboard of the challenge.

数据集 MSD-Lung : https://drive.google.com/drive/folders/1HqEgzS8BV2c7xYNrZdEAnrHk7osJJ--2验收指标：1 . 3DUnet-cascade Avg Dice = 66.85%；ensemble 2DUnet+ 3DUnet Avg Dice = 61.18%；对应论文 Table.2中实现；2. 训练中包含周期性valset的评估结果和损失3. 复现后合入PaddleSeg 中MedicalSeg UNETR: Transformers for 3D Medical Image

Abstract

Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful"U-shaped"network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr 论文名称(链接) Detecting Text in Natural Image with Connectionist Text Proposal Network

Abstract

We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of image, making it powerful to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi- language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-processing. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpass- ing recent results [8, 35] by a large margin. The CTPN is computationally efficient with 0:14s/image, by using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/ .

icdar2015: 0.61 Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Abstract

Attention-based scene text recognizers have gained huge success, whichleverages a more compact intermediate representation to learn 1d- or 2d-attention by a RNN-based encoder-decoder architecture. However, such methodssuffer from attention-drift problem because high similarity among encodedfeatures leads to attention confusion under the RNN-based local attentionmechanism. Moreover, RNN-based methods have low efficiency due to poorparallelization. To overcome these problems, we propose the MASTER, aself-attention based scene text recognizer that (1) not only encodes theinput-output attention but also learns self-attention which encodesfeature-feature and target-target relationships inside the encoder and decoderand (2) learns a more powerful and robust intermediate representation tospatial distortion, and (3) owns a great training efficiency because of hightraining parallelization and a high-speed inference because of an efficientmemory-cache mechanism. Extensive experiments on various benchmarks demonstratethe superior performance of our MASTER on both regular and irregular scenetext. Pytorch code can be found at this https URL,and Tensorflow code can be found at this https URL.

ResNet18 ctw1500 0.806 MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Abstract

Temporal action proposal generation is an important and challenging task invideo understanding, which aims at detecting all temporal segments containingaction instances of interest. The existing proposal generation approaches aregenerally based on pre-defined anchor windows or heuristic bottom-up boundarymatching strategies. This paper presents a simple and efficient framework(RTD-Net) for direct action proposal generation, by re-purposing aTransformer-alike architecture. To tackle the essential visual differencebetween time and space, we make three important improvements over the originaltransformer detection framework (DETR). First, to deal with slowness prior invideos, we replace the original Transformer encoder with a boundary attentivemodule to better capture long-range temporal information. Second, due to theambiguous temporal boundary and relatively sparse annotations, we present arelaxed matching scheme to relieve the strict criteria of single assignment toeach groundtruth. Finally, we devise a three-branch head to further improve theproposal confidence estimation by explicitly predicting its completeness.Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstratethe effectiveness of RTD-Net, on both tasks of temporal action proposalgeneration and temporal action detection. Moreover, due to its simplicity indesign, our framework is more efficient than previous proposal generationmethods, without non-maximum suppression post-processing. The code and modelsare made available at this https URL.

IIIT5K: 95 SVT: 90.6 IC03: 96.4 IC13: 95.3IC15: 79.4 SVTP: 834.5 CT80: 84.5 avg: 89.81 Fourier Contour Embedding for Arbitrary-Shaped Text Detection

Abstract

One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. Most of existing methods model text instances in image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures. We further construct FCENet with a backbone, feature pyramid networks (FPN) and a simple post-processing with the Inverse Fourier Transformation (IFT) and Non-Maximum Suppression (NMS). Different from previous methods, FCENet first predicts compact Fourier signatures of text instances, and then reconstructs text contours via IFT and NMS during test. Extensive experiments demonstrate that FCE is accurate and robust to fit contours of scene texts even with highly-curved shapes, and also validate the effectiveness and the good generalization of FCENet for arbitrary-shaped text detection. Furthermore, experimental results show that our FCENet is superior to the state-of-the-art (SOTA) methods on CTW1500 and Total-Text, especially on challenging highly-curved text subset.

ResNet50 + DCNv2 ctw1500 0.851 Primitive Representation Learning for Scene Text Recognition

Abstract

Scene text recognition is a challenging task due to diverse variations of text instances in natural scene images. Conventional methods based on CNN-RNN-CTC or encoder-decoder with attention mechanism may not fully investigate stable and efficient feature representations for multi-oriented scene texts. In this paper, we propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images. We model elements in feature maps as the nodes of an undirected graph. A pooling aggregator and a weighted aggregator are proposed to learn primitive representations, which are transformed into high-level visual text representations by graph convolutional networks. A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding. Furthermore, by integrating visual text representations into an encoderdecoder model with the 2D attention mechanism, we propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods. Experimental results on both English and Chinese scene text recognition tasks demonstrate that PREN keeps a balance between accuracy and efficiency, while PREN2D achieves state-of-theart performance.

SynthText+Mjsynth; IIIT5k: 86.03%, SVT: 87.17%, IC03: 95.16%, IC13: 93.93%, IC15: 78.52%, SVTP: 81.71%, CUTE80: 75.69%, avg: 85.5% RF-Learning:Reciprocal Feature Learning via Explicit and Implicit Tasks in Scene Text Recognition

Abstract

Text recognition is a popular topic for its broad applications. In this work, we excavate the implicit task, character counting within the traditional text recognition, without additional labor annotation cost. The implicit task plays as an auxiliary branch for complementing the sequential recognition. We design a two-branch reciprocal feature learning framework in order to adequately utilize the features from both the tasks. Through exploiting the complementary effect between explicit and implicit tasks, the feature is reliably enhanced. Extensive experiments on 7 benchmarks show the advantages of the proposed methods in both text recognition and the new-built character counting tasks. In addition, it is convenient yet effective to equip with variable networks and tasks. We offer abundant ablation studies, generalizing experiments with deeper understanding on the tasks. Code is available.

RF-Learning visual IIIT5K: 96, SVT:94.7 IC03:96.2 IC13:95.9 IC15:88.7 SVTP:86.7 CUTE80:88.2 avg: 92.34 Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection

Abstract

Arbitrary shape text detection is a challenging task due to the high variety and complexity of scenes texts. In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection. In our method, an innovative local graph bridges a text proposal model via Convolutional Neural Network (CNN) and a deep relational reasoning network via Graph Convolutional Network (GCN), making our network end-to-end trainable. To be concrete, every text instance will be divided into a series of small rectangular components, and the geometry attributes (e.g., height, width, and orientation) of the small components will be estimated by our text proposal model. Given the geometry attributes, the local graph construction model can roughly establish linkages between different text components. For further reasoning and deducing the likelihood of linkages between the component and its neighbors, we adopt a graph-based network to perform deep relational reasoning on local graphs. Experiments on public available datasets demonstrate the state-of-the-art performance of our method.

ResNet50 ctw1500 0.840 Scene Text Telescope: Text-Focused Scene Image Super-Resolution

Abstract

Image super-resolution, which is often regarded as a preprocessing procedure of scene text recognition, aims to recover the realistic features from a low-resolution text image. It has always been challenging due to large variations in text shapes, fonts, backgrounds, etc. However, most existing methods employ generic super-resolution frameworks to handle scene text images while ignoring text-specific properties such as text-level layouts and character-level details. In this paper, we establish a text-focused super-resolution framework, called Scene Text Telescope (STT). In terms of text-level layouts, we propose a Transformer-Based Super-Resolution Network (TBSRN) containing a Self-Attention Module to extract sequential information, which is robust to tackle the texts in arbitrary orientations. In terms of character-level details, we propose a Position-Aware Module and a Content-Aware Module to highlight the position and the content of each character. By observing that some characters look indistinguishable in low-resolution conditions, we use a weighted cross-entropy loss to tackle this problem. We conduct extensive experiments, including text recognition with pre-trained recognizers and image quality evaluation, on TextZoom and several scene text recognition benchmarks to assess the super-resolution images. The experimental results show that our STT can indeed generate text-focused super-resolution images and outperform the existing methods in terms of recognition accuracy.

CRNN+tbsrn，easy: 0.5979, medium: 0.4507, hard: 0.3418, avg: 0.4634 When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition

Abstract

Recently, most handwritten mathematical expression recognition (HMER) methods adopt the encoder-decoder networks, which directly predict the markup sequences from formula images with the attention mechanism. However, such methods may fail to accurately read formulas with complicated structure or generate long markup sequences, as the attention results are often inaccurate due to the large variance of writing styles or spatial layouts. To alleviate this problem, we propose an unconventional network for HMER named Counting-Aware Network (CAN), which jointly optimizes two tasks: HMER and symbol counting. Specifically, we design a weakly-supervised counting module that can predict the number of each symbol class without the symbol-level position annotations, and then plug it into a typical attention-based encoder-decoder model for HMER. Experiments on the benchmark datasets for HMER validate that both joint optimization and counting results are beneficial for correcting the prediction errors of encoder-decoder models, and CAN consistently outperforms the state-of-the-art methods. In particular, compared with an encoder-decoder model for HMER, the extra time cost caused by the proposed counting module is marginal. The source code is available at https://github.com/LBH1024/CAN .

1.ExpRate=65.89 2.复现后合入PaddleOCR套件，并添加TIPC RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

Abstract

The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences) which is unacceptable in most of real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. Contextual information, which the existing approaches heavily rely on, causes the problem of attention drift. To suppress such side-effect, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to enable the encoder to output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted for more robust feature via an element-wise gate mechanism. Theoretically, our proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks while without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.

IIIT5K: 95.1 SVT:89.2 IC13:93.1 IC15:77.8 SVTP:80.3 CT80:90.3 avg 87.63 SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition

Abstract

Arbitrary text appearance poses a great challenge in scene text recognition tasks. Existing works mostly handle with the problem in consideration of the shape distortion, including perspective distortions, line curvature or other style variations. Therefore, methods based on spatial transformers are extensively studied. However, chromatic difficulties in complex scenes have not been paid much attention on. In this work, we introduce a new learnable geometric-unrelated module, the Structure-Preserving Inner Offset Network (SPIN), which allows the color manipulation of source data within the network. This differentiable module can be inserted before any recognition architecture to ease the downstream tasks, giving neural networks the ability to actively transform input intensity rather than the existing spatial rectification. It can also serve as a complementary module to known spatial transformations and work in both independent and collaborative ways with them. Extensive experiments show that the use of SPIN results in a significant improvement on multiple text recognition benchmarks compared to the state-of-the-arts.

IIIT5K: 94.6, SVT:89, IC03: 93.3, IC13:94.2,IC15:80.7,SVTP:83,CUTE80:84.7,avg: 88.5 论文名称(链接) Deep Image Prior

Abstract

Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them, and to restore images based on flash-no flash input pairs. Apart from its diverse applications, our approach highlights the inductive bias captured by standard generator network architectures. It also bridges the gap between two very popular families of image restoration methods: learning-based methods using deep convolutional networks and learning-free methods based on handcrafted image priors such as self-similarity. Code and supplementary material are available at https://dmitryulyanov.github.io/deep_image_prior .

8× super-resolution, avg psnr=24.15% Progressive Growing of GANs for Improved Quality, Stability, and Variation

Abstract

We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.

CelebA MS-SSIM=0.2838, SWD=2.64(64) Image Inpainting for Irregular Holes Using Partial Convolutions

Abstract

Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but are expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.

CelebA 人眼评估生成的图像(参考论文Figure8) Generative Adversarial Text to Image Synthesis

Abstract

Synthesizing high resolution photorealistic images has been a long-standingchallenge in machine learning. In this paper we introduce new methods for theimproved training of generative adversarial networks (GANs) for imagesynthesis. We construct a variant of GANs employing label conditioning thatresults in 128x128 resolution image samples exhibiting global coherence. Weexpand on previous work for image quality assessment to provide two newanalyses for assessing the discriminability and diversity of samples fromclass-conditional image synthesis models. These analyses demonstrate that highresolution samples provide class information not present in low resolutionsamples. Across 1000 ImageNet classes, 128x128 samples are more than twice asdiscriminable as artificially resized 32x32 samples. In addition, 84.7% of theclasses have samples exhibiting diversity comparable to real ImageNet data.

Oxford-102 人眼评估生成的图像(参考论文中展示的生成图片) Conditional Image Synthesis With Auxiliary Classifier GANs

Abstract

We introduce SinGAN, an unconditional generative model that can be learnedfrom a single natural image. Our model is trained to capture the internaldistribution of patches within the image, and is then able to generate highquality, diverse samples that carry the same visual content as the image.SinGAN contains a pyramid of fully convolutional GANs, each responsible forlearning the patch distribution at a different scale of the image. This allowsgenerating new samples of arbitrary size and aspect ratio, that havesignificant variability, yet maintain both the global structure and the finetextures of the training image. In contrast to previous single image GANschemes, our approach is not limited to texture images, and is not conditional(i.e. it generates samples from noise). User studies confirm that the generatedsamples are commonly confused to be real images. We illustrate the utility ofSinGAN in a wide range of image manipulation tasks.

ImageNet 人眼评估生成的图像(参考论文中展示的生成图片) SinGAN: Learning a Generative Model from a Single Natural Image

Abstract

We propose spatially-adaptive normalization, a simple but effective layer forsynthesizing photorealistic images given an input semantic layout. Previousmethods directly feed the semantic layout as input to the deep network, whichis then processed through stacks of convolution, normalization, andnonlinearity layers. We show that this is suboptimal as the normalizationlayers tend to ``wash away'' semantic information. To address the issue, wepropose using the input layout for modulating the activations in normalizationlayers through a spatially-adaptive, learned transformation. Experiments onseveral challenging datasets demonstrate the advantage of the proposed methodover existing approaches, regarding both visual fidelity and alignment withinput layouts. Finally, our model allows user control over both semantic andstyle. Code is available at this https URL .

任意一张图片人眼评估生成的图像(可参考论文中展示的生成图片Figure6) Semantic Image Synthesis with Spatially-Adaptive Normalization

Abstract

We present a generic image-to-image translation framework, pixel2style2pixel(pSp). Our pSp framework is based on a novel encoder network that directlygenerates a series of style vectors which are fed into a pretrained StyleGANgenerator, forming the extended W+ latent space. We first show that our encodercan directly embed real images into W+, with no additional optimization. Next,we propose utilizing our encoder to directly solve image-to-image translationtasks, defining them as encoding problems from some input domain into thelatent domain. By deviating from the standard invert first, edit latermethodology used with previous StyleGAN encoders, our approach can handle avariety of tasks even when the input image is not represented in the StyleGANdomain. We show that solving translation tasks through StyleGAN significantlysimplifies the training process, as no adversary is required, has bettersupport for solving tasks without pixel-to-pixel correspondence, and inherentlysupports multi-modal synthesis via the resampling of styles. Finally, wedemonstrate the potential of our framework on a variety of facialimage-to-image translation tasks, even when compared to state-of-the-artsolutions designed specifically for a single task, and further show that it canbe extended beyond the human facial domain.

cityscapes mIoU=62.3, accu=81.9, FID=71.8, 及人眼观察可视化效果 Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

Abstract

The task of age transformation illustrates the change of an individual'sappearance over time. Accurately modeling this complex transformation over aninput facial image is extremely challenging as it requires making convincing,possibly large changes to facial features and head shape, while stillpreserving the input identity. In this work, we present an image-to-imagetranslation method that learns to directly encode real facial images into thelatent space of a pre-trained unconditional GAN (e.g., StyleGAN) subject to agiven aging shift. We employ a pre-trained age regression network to explicitlyguide the encoder in generating the latent codes corresponding to the desiredage. In this formulation, our method approaches the continuous aging process asa regression task between the input age and desired target age, providingfine-grained control over the generated image. Moreover, unlike approaches thatoperate solely in the latent space using a prior on the path controlling age,our method learns a more disentangled, non-linear path. Finally, we demonstratethat the end-to-end nature of our approach, coupled with the rich semanticlatent space of StyleGAN, allows for further editing of the generated images.Qualitative and quantitative evaluations show the advantages of our methodcompared to state-of-the-art approaches.

CelebA LPIPS=0.17, similarity=0.56, MSE=0.03 (task of StyleGAN Inversion) Only a Matter of Style: Age Transformation Using a Style-Based Regression Model

Abstract

The task of age transformation illustrates the change of an individual's appearance over time. Accurately modeling this complex transformation over an input facial image is extremely challenging as it requires making convincing, possibly large changes to facial features and head shape, while still preserving the input identity. In this work, we present an image-to-image translation method that learns to directly encode real facial images into the latent space of a pre-trained unconditional GAN (e.g., StyleGAN) subject to a given aging shift. We employ a pre-trained age regression network to explicitly guide the encoder in generating the latent codes corresponding to the desired age. In this formulation, our method approaches the continuous aging process as a regression task between the input age and desired target age, providing fine-grained control over the generated image. Moreover, unlike approaches that operate solely in the latent space using a prior on the path controlling age, our method learns a more disentangled, non-linear path. Finally, we demonstrate that the end-to-end nature of our approach, coupled with the rich semantic latent space of StyleGAN, allows for further editing of the generated images. Qualitative and quantitative evaluations show the advantages of our method compared to state-of-the-art approaches.

CelebA 人眼评估生成的图像(参考论文中展示的生成图片Figure4, 6, 8) ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

Abstract

We aim at accelerating super-resolution (SR) networks on large images (2K-8K). The large images are usually decomposed into small sub-images in practical usages. Based on this processing, we found that different image regions have different restoration difficulties and can be processed by networks with different capacities. Intuitively, smooth areas are easier to super-solve than complex textures. To utilize this property, we can adopt appropriate SR networks to process different sub-images after the decomposition. On this basis, we propose a new solution pipeline -- ClassSR that combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulties, then applies an SR-Module to perform SR for different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions. We further introduce a new classification method with two losses -- Class-Loss and Average-Loss to produce the classification results. After joint training, a majority of sub-images will pass through smaller networks, thus the computational cost can be significantly reduced. Experiments show that our ClassSR can help most existing methods (e.g., FSRCNN, CARN, SRResNet, RCAN) save up to 50% FLOPs on DIV8K datasets. This general framework can also be applied in other low-level vision tasks.

DIV2K PSNR=26.39, FLOPs=21.22G(65%)(Test2K, ClassSR-RCAN) Self-Attention Generative Adversarial Networks

Abstract

In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.

ImageNet FID=18.28 Inception score=52.52 Generative Image Inpainting with Contextual Attention

Abstract

We apply basic statistical reasoning to signal reconstruction by machinelearning -- learning to map corrupted observations to clean signals -- with asimple and powerful conclusion: it is possible to learn to restore images byonly looking at corrupted examples, at performance at and sometimes exceedingtraining using clean data, without explicit image priors or likelihood modelsof the corruption. In practice, we show that a single model learns photographicnoise removal, denoising synthetic Monte Carlo images, and reconstruction ofundersampled MRI scans -- all corrupted by different processes -- based onnoisy data only.

L1Loss=8.6%, L2Loss=2.1%, PSNR=18.91, TVLoss=25.3% Noise2Noise: Learning Image Restoration without Clean Data

Abstract

While humans easily recognize relations between data from different domainswithout any supervision, learning to automatically discover them is in generalvery challenging and needs many ground-truth pairs that illustrate therelations. To avoid costly pairing, we address the task of discoveringcross-domain relations given unpaired data. We propose a method based ongenerative adversarial networks that learns to discover relations betweendifferent domains (DiscoGAN). Using the discovered relations, our proposednetwork successfully transfers style from one domain to another whilepreserving key attributes such as orientation and face identity. Source codefor official implementation is publicly availablethis https URL

Denoised 与clear image PSNR持平 (Gaussian noise (σ = 25)) Learning to Discover Cross-Domain Relations with Generative Adversarial Networks

Abstract

We propose to restore old photos that suffer from severe degradation througha deep learning approach. Unlike conventional restoration tasks that can besolved through supervised learning, the degradation in real photos is complexand the domain gap between synthetic images and real old photos makes thenetwork fail to generalize. Therefore, we propose a novel triplet domaintranslation network by leveraging real photos along with massive syntheticimage pairs. Specifically, we train two variational autoencoders (VAEs) torespectively transform old photos and clean photos into two latent spaces. Andthe translation between these two latent spaces is learned with syntheticpaired data. This translation generalizes well to real photos because thedomain gap is closed in the compact latent space. Besides, to address multipledegradations mixed in one old photo, we design a global branch with apartialnonlocal block targeting to the structured defects, such as scratches and dustspots, and a local branch targeting to the unstructured defects, such as noisesand blurriness. Two branches are fused in the latent space, leading to improvedcapability to restore old photos from multiple defects. Furthermore, we applyanother face refinement network to recover fine details of faces in the oldphotos, thus ultimately generating photos with enhanced perceptual quality.With comprehensive experiments, the proposed pipeline demonstrates superiorperformance over state-of-the-art methods as well as existing commercial toolsin terms of visual quality for old photos restoration.

可视化, 参考论文图7, 8, 9 Old Photo Restoration via Deep Latent Space Translation

Abstract

We present a novel method for constructing Variational Autoencoder (VAE).Instead of using pixel-by-pixel loss, we enforce deep feature consistencybetween the input and the output of a VAE, which ensures the VAE's output topreserve the spatial correlation characteristics of the input, thus leading theoutput to have a more natural visual appearance and better perceptual quality.Based on recent deep learning works such as style transfer, we employ apre-trained deep convolutional neural network (CNN) and use its hidden featuresto define a feature perceptual loss for VAE training. Evaluated on the CelebAface dataset, we show that our model produces better results than other methodsin the literature. We also show that our method can produce latent vectors thatcan capture the semantic information of face expressions and can be used toachieve state-of-the-art performance in facial attribute prediction.

PSNR=23.33, SSIM= 0.69, LPIPS=0.25, FID=134.35(table2) Deep Feature Consistent Variational Autoencoder

Abstract

Scene text detection, an important step of scene text reading systems, haswitnessed rapid development with convolutional neural networks. Nonetheless,two main challenges still exist and hamper its deployment to real-worldapplications. The first problem is the trade-off between speed and accuracy.The second one is to model the arbitrary-shaped text instance. Recently, somemethods have been proposed to tackle arbitrary-shaped text detection, but theyrarely take the speed of the entire pipeline into consideration, which may fallshort in practical this http URL this paper, we propose an efficient andaccurate arbitrary-shaped text detector, termed Pixel Aggregation Network(PAN), which is equipped with a low computational-cost segmentation head and alearnable post-processing. More specifically, the segmentation head is made upof Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM).FPEM is a cascadable U-shaped module, which can introduce multi-levelinformation to guide the better segmentation. FFM can gather the features givenby the FPEMs of different depths into a final feature for segmentation. Thelearnable post-processing is implemented by Pixel Aggregation (PA), which canprecisely aggregate text pixels by predicted similarity vectors. Experiments onseveral standard benchmarks validate the superiority of the proposed PAN. It isworth noting that our method can achieve a competitive F-measure of 79.9% at84.2 FPS on CTW1500.

Average accuracies=88.73 (Table1. VAE-345) Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

Abstract

Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts in the synthesis process. In addition, we employ a U-Net discriminator with spectral normalization to increase discriminator capability and stabilize the training dynamics. Extensive comparisons have shown its superior visual performance than prior works on various real datasets. We also provide efficient implementations to synthesize training pairs on the fly.

DIV2K and Flickr2K and OST; 可视化效果与论文一致 GFP-GAN: Towards Real-World Blind Face Restoration with Generative Facial Prior

Abstract

Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric prior while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN that leverages rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via novel channel-split spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, our GFP-GAN could jointly restore facial details and enhance colors with just a single forward pass, while GAN inversion methods require expensive image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.

CelebA-Test： LPIPS=0.3646， FID=42.62 Aggregated Contextual Transformations for High-Resolution Image Inpainting

Abstract

State-of-the-art image inpainting approaches can suffer from generating distorted structures and blurry textures in high-resolution images (e.g., 512x512). The challenges mainly drive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN), for high-resolution image inpainting. Specifically, to enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block. The AOT blocks aggregate contextual transformations from various receptive fields, allowing to capture both informative distant image contexts and rich patterns of interest for context reasoning. For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task. Such a training objective forces the discriminator to distinguish the detailed appearances of real and synthesized patches, and in turn, facilitates the generator to synthesize clear textures. Extensive comparisons on Places2, the most challenging benchmark with 1.8 million high-resolution images of 365 complex scenes, show that our model outperforms the state-of-the-art by a significant margin in terms of FID with 38.60% relative improvement. A user study including more than 30 subjects further validates the superiority of AOT-GAN. We further evaluate the proposed AOT-GAN in practical applications, e.g., logo removal, face editing, and object removal. Results show that our model achieves promising completions in the real world. We release code and models in https://github.com/researchmm/AOT-GAN-for-Inpainting .

Places365-val(20-30% ): PSNR=26.03, SSIM=0.890 Progressive Image Deraining Networks: A Better and Simpler Baseline

Abstract

Along with the deraining performance improvement of deep networks, their structures and learning become more and more complicated and diverse, making it difficult to analyze the contribution of various network modules when developing new deraining networks. To handle this issue, this paper provides a better and simpler baseline deraining network by considering network architecture, input and output, and loss functions. Specifically, by repeatedly unfolding a shallow ResNet, progressive ResNet (PRN) is proposed to take advantage of recursive computation. A recurrent layer is further introduced to exploit the dependencies of deep features across stages, forming our progressive recurrent network (PReNet). Furthermore, intra-stage recursive computation of ResNet can be adopted in PRN and PReNet to notably reduce network parameters with graceful degradation in deraining performance. For network input and output, we take both stage-wise result and original rainy image as input to each ResNet and finally output the prediction of {residual image}. As for loss functions, single MSE or negative SSIM losses are sufficient to train PRN and PReNet. Experiments show that PRN and PReNet perform favorably on both synthetic and real rainy images. Considering its simplicity, efficiency and effectiveness, our models are expected to serve as a suitable baseline in future deraining research. The source codes are available at https://github.com/csdwren/PReNet .

Rain100H数据集,PReNet模型,psnr=29.46, ssim=0.899 StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Abstract

Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping a text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.

GAN Prior Embedded Network for Blind Face Restoration in the Wild

Abstract

Blind face restoration (BFR) from severely degraded face images in the wild is a very challenging problem. Due to the high illness of the problem and the complex unknown degradation, directly training a deep neural network (DNN) usually cannot lead to acceptable results. Existing generative adversarial network (GAN) based methods can produce better results but tend to generate over-smoothed restorations. In this work, we propose a new method by first learning a GAN for high-quality face image generation and embedding it into a U-shaped DNN as a prior decoder, then fine-tuning the GAN prior embedded DNN with a set of synthesized low-quality face images. The GAN blocks are designed to ensure that the latent code and noise input to the GAN can be respectively generated from the deep and shallow features of the DNN, controlling the global face structure, local face details and background of the reconstructed image. The proposed GAN prior embedded network (GPEN) is easy-to-implement, and it can generate visually photo-realistic results. Our experiments demonstrated that the proposed GPEN achieves significantly superior results to state-of-the-art BFR methods both quantitatively and qualitatively, especially for the restoration of severely degraded face images in the wild. The source code and models can be found at https://github.com/yangxy/GPEN .

FID=31.72（CelebA-HQ-val ）论文名称(链接) Simple Baselines for Image Restoration

Abstract

Although there have been significant advances in the field of image restoration recently, the system complexity of the state-of-the-art (SOTA) methods is increasing as well, which may hinder the convenient analysis and comparison of methods. In this paper, we propose a simple baseline that exceeds the SOTA methods and is computationally efficient. To further simplify the baseline, we reveal that the nonlinear activation functions, e.g. Sigmoid, ReLU, GELU, Softmax, etc. are not necessary: they could be replaced by multiplication or removed. Thus, we derive a Nonlinear Activation Free Network, namely NAFNet, from the baseline. SOTA results are achieved on various challenging benchmarks, e.g. 33.69 dB PSNR on GoPro (for image deblurring), exceeding the previous SOTA 0.38 dB with only 8.4% of its computational costs; 40.30 dB PSNR on SIDD (for image denoising), exceeding the previous SOTA 0.28 dB with less than half of its computational costs. The code and the pre-trained models are released at https://github.com/megvii-research/NAFNet .

SIDD PSNR: 40.3045, SSIM:0.9614 HINet: Half Instance Normalization Network for Image Restoration

Abstract

In this paper, we explore the role of Instance Normalization in low-level vision tasks. Specifically, we present a novel block: Half Instance Normalization Block (HIN Block), to boost the performance of image restoration networks. Based on HIN Block, we design a simple and powerful multi-stage network named HINet, which consists of two subnetworks. With the help of HIN Block, HINet surpasses the state-of-the-art (SOTA) on various image restoration tasks. For image denoising, we exceed it 0.11dB and 0.28 dB in PSNR on SIDD dataset, with only 7.5% and 30% of its multiplier-accumulator operations (MACs), 6.8 times and 2.9 times speedup respectively. For image deblurring, we get comparable performance with 22.5% of its MACs and 3.3 times speedup on REDS and GoPro datasets. For image deraining, we exceed it by 0.3 dB in PSNR on the average result of multiple datasets with 1.4 times speedup. With HINet, we won 1st place on the NTIRE 2021 Image Deblurring Challenge - Track2. JPEG Artifacts, with a PSNR of 29.70. The code is available at https://github.com/megvii-model/HINet .

SIDD PSNR: 39.99, SSIM:0.958 Invertible Denoising Network: A Light Solution for Real Noise Removal

Abstract

Invertible networks have various benefits for image denoising since they are lightweight, information-lossless, and memory-saving during back-propagation. However, applying invertible models to remove noise is challenging because the input is noisy, and the reversed output is clean, following two different distributions. We propose an invertible denoising network, InvDN, to address this challenge. InvDN transforms the noisy input into a low-resolution clean image and a latent representation containing noise. To discard noise and restore the clean image, InvDN replaces the noisy latent representation with another one sampled from a prior distribution during reversion. The denoising performance of InvDN is better than all the existing competitive models, achieving a new state-of-the-art result for the SIDD dataset while enjoying less run time. Moreover, the size of InvDN is far smaller, only having 4.2% of the number of parameters compared to the most recently proposed DANet. Further, via manipulating the noisy latent representation, InvDN is also able to generate noise more similar to the original one. Our code is available at: https://github.com/Yang-Liu1082/InvDN.git .

SIDD PSNR: 39.28, SSIM:0.955 SwinIR: Image Restoration Using Swin Transformer

Abstract

Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model SwinIR for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45dB, while the total number of parameters can be reduced by up to 67%.

CBSD68, average PSNR, noise 15: 34.42 Neighbor2Neighbor: Self-Supervised Denoising from Single Noisy Images

Abstract

In the last few years, image denoising has benefited a lot from the fast development of neural networks. However, the requirement of large amounts of noisy-clean image pairs for supervision limits the wide use of these models. Although there have been a few attempts in training an image denoising model with only single noisy images, existing self-supervised denoising approaches suffer from inefficient network training, loss of useful information, or dependence on noise modeling. In this paper, we present a very simple yet effective method named Neighbor2Neighbor to train an effective image denoising model with only noisy images. Firstly, a random neighbor sub-sampler is proposed for the generation of training image pairs. In detail, input and target used to train a network are images sub-sampled from the same noisy image, satisfying the requirement that paired pixels of paired images are neighbors and have very similar appearance with each other. Secondly, a denoising network is trained on sub-sampled training pairs generated in the first stage, with a proposed regularizer as additional loss for better performance. The proposed Neighbor2Neighbor framework is able to enjoy the progress of state-of-the-art supervised denoising networks in network architecture design. Moreover, it avoids heavy dependence on the assumption of the noise distribution. We explain our approach from a theoretical perspective and further validate it through extensive experiments, including synthetic experiments with different noise distributions in sRGB space and real-world experiments on a denoising benchmark dataset in raw-RGB space.

Gaussion 25, BSD300: PSNR: 30.79, SSIM:0.873 Restormer: Efficient Transformer for High-Resolution Image Restoration

Abstract

Since convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data, these models have been extensively applied to image restoration and related tasks. Recently, another class of neural architectures, Transformers, have shown significant performance gains on natural language and high-level vision tasks. While the Transformer model mitigates the shortcomings of CNNs (i.e., limited receptive field and inadaptability to input content), its computational complexity grows quadratically with the spatial resolution, therefore making it infeasible to apply to most image restoration tasks involving high-resolution images. In this work, we propose an efficient Transformer model by making several key designs in the building blocks (multi-head attention and feed-forward network) such that it can capture long-range pixel interactions, while still remaining applicable to large images. Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks, including image deraining, single-image motion deblurring, defocus deblurring (single-image and dual-pixel data), and image denoising (Gaussian grayscale/color denoising, and real image denoising). The source code and pre-trained models are available at https://github.com/swz30/Restormer .

CBSD68, average PSNR, noise 15: 34.39 Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Abstract

Discriminative model learning for image denoising has been recently attracting considerable attentions due to its favorable denoising performance. In this paper, we take one step forward by investigating the construction of feed-forward denoising convolutional neural networks (DnCNNs) to embrace the progress in very deep architecture, learning algorithm, and regularization method into image denoising. Specifically, residual learning and batch normalization are utilized to speed up the training process as well as boost the denoising performance. Different from the existing discriminative denoising models which usually train a specific model for additive white Gaussian noise (AWGN) at a certain noise level, our DnCNN model is able to handle Gaussian denoising with unknown noise level (i.e., blind Gaussian denoising). With the residual learning strategy, DnCNN implicitly removes the latent clean image in the hidden layers. This property motivates us to train a single DnCNN model to tackle with several general image denoising tasks such as Gaussian denoising, single image super-resolution and JPEG image deblocking. Our extensive experiments demonstrate that our DnCNN model can not only exhibit high effectiveness in several general image denoising tasks, but also be efficiently implemented by benefiting from GPU computing.

BSD68: average PSNR, noise 15: 31.73 Learning Enriched Features for Real Image Restoration and Enhancement

Abstract

With the goal of recovering high-quality image content from its degraded version, image restoration enjoys numerous applications, such as in surveillance, computational photography, medical imaging, and remote sensing. Recently, convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration task. Existing CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations. In the former case, spatially precise but contextually less robust results are achieved, while in the latter case, semantically reliable but spatially less accurate outputs are generated. In this paper, we present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network and receiving strong contextual information from the low-resolution representations. The core of our approach is a multi-scale residual block containing several key elements: (a) parallel multi-resolution convolution streams for extracting multi-scale features, (b) information exchange across the multi-resolution streams, (c) spatial and channel attention mechanisms for capturing contextual information, and (d) attention based multi-scale feature aggregation. In a nutshell, our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Extensive experiments on five real image benchmark datasets demonstrate that our method, named as MIRNet, achieves state-of-the-art results for a variety of image processing tasks, including image denoising, super-resolution, and image enhancement. The source code and pre-trained models are available at https://github.com/swz30/MIRNet .

SIDD PSNR: 39.72, SSIM:0.959 论文名称(链接) Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection

Abstract

Anomaly detection is commonly pursued as a one-class classification problem, where models can only learn from normal training samples, while being evaluated on both normal and abnormal test samples. Among the successful approaches for anomaly detection, a distinguished category of methods relies on predicting masked information (e.g. patches, future frames, etc.) and leveraging the reconstruction error with respect to the masked information as an abnormality score. Different from related methods, we propose to integrate the reconstruction-based functionality into a novel self-supervised predictive architectural building block. The proposed self-supervised block is generic and can easily be incorporated into various state-of-the-art anomaly detection methods. Our block starts with a convolutional layer with dilated filters, where the center area of the receptive field is masked. The resulting activation maps are passed through a channel attention module. Our block is equipped with a loss that minimizes the reconstruction error with respect to the masked area in the receptive field. We demonstrate the generality of our block by integrating it into several state-of-the-art frameworks for anomaly detection on image and video, providing empirical evidence that shows considerable performance improvements on MVTec AD, Avenue, and ShanghaiTech. We release our code as open source at https://github.com/ristea/sspcab .

MVTec AD数据集，结合CutPaste方法，3-way detection AUROC 96.1% CutPaste: Self-Supervised Learning for Anomaly Detection and Localization

Abstract

We aim at constructing a high performance model for defect detection that detects unknown anomalous patterns of an image without anomalous data. To this end, we propose a two-stage framework for building anomaly detectors using normal training data only. We first learn self-supervised deep representations and then build a generative one-class classifier on learned representations. We learn representations by classifying normal data from the CutPaste, a simple data augmentation strategy that cuts an image patch and pastes at a random location of a large image. Our empirical study on MVTec anomaly detection dataset demonstrates the proposed algorithm is general to be able to detect various types of real-world defects. We bring the improvement upon previous arts by 3.1 AUCs when learning representations from scratch. By transfer learning on pretrained representations on ImageNet, we achieve a new state-of-theart 96.6 AUC. Lastly, we extend the framework to learn and extract representations from patches to allow localizing defective areas without annotations during training.

MVTec AD数据集，3-way detection AUROC 95.2% Anomaly Detection via Reverse Distillation from One-Class Embedding

Abstract

Knowledge distillation (KD) achieves promising results on the challenging problem of unsupervised anomaly detection (AD).The representation discrepancy of anomalies in the teacher-student (T-S) model provides essential evidence for AD. However, using similar or identical architectures to build the teacher and student models in previous studies hinders the diversity of anomalous representations. To tackle this problem, we propose a novel T-S model consisting of a teacher encoder and a student decoder and introduce a simple yet effective"reverse distillation"paradigm accordingly. Instead of receiving raw images directly, the student network takes teacher model's one-class embedding as input and targets to restore the teacher's multiscale representations. Inherently, knowledge distillation in this study starts from abstract, high-level presentations to low-level features. In addition, we introduce a trainable one-class bottleneck embedding (OCBE) module in our T-S model. The obtained compact embedding effectively preserves essential information on normal patterns, but abandons anomaly perturbations. Extensive experimentation on AD and one-class novelty detection benchmarks shows that our method surpasses SOTA performance, demonstrating our proposed approach's effectiveness and generalizability.

MVTec AD数据集， 256尺度，wide-resnet50，detection AUROC 98.5%， loc-AUROC 97.8%， loc-PRO 93.9% FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows

Abstract

Unsupervised anomaly detection and localization is crucial to the practical application when collecting and labeling sufficient anomaly data is infeasible. Most existing representation-based approaches extract normal image features with a deep convolutional neural network and characterize the corresponding distribution through non-parametric distribution estimation methods. The anomaly score is calculated by measuring the distance between the feature of the test image and the estimated distribution. However, current methods can not effectively map image features to a tractable base distribution and ignore the relationship between local and global features which are important to identify anomalies. To this end, we propose FastFlow implemented with 2D normalizing flows and use it as the probability distribution estimator. Our FastFlow can be used as a plug-in module with arbitrary deep feature extractors such as ResNet and vision transformer for unsupervised anomaly detection and localization. In training phase, FastFlow learns to transform the input visual feature into a tractable distribution and obtains the likelihood to recognize anomalies in inference phase. Extensive experimental results on the MVTec AD dataset show that FastFlow surpasses previous state-of-the-art methods in terms of accuracy and inference efficiency with various backbone networks. Our approach achieves 99.4% AUC in anomaly detection with high inference efficiency.

MVTec AD数据集，ResNet18 , image-level AUC 97.9%，pixel-leval AUC 97.2% PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization

Abstract

We present a new framework for Patch Distribution Modeling, PaDiM, to concurrently detect and localize anomalies in images in a one-class learning setting. PaDiM makes use of a pretrained convolutional neural network (CNN) for patch embedding, and of multivariate Gaussian distributions to get a probabilistic representation of the normal class. It also exploits correlations between the different semantic levels of CNN to better localize anomalies. PaDiM outperforms current state-of-the-art approaches for both anomaly detection and localization on the MVTec AD and STC datasets. To match real-world visual industrial inspection, we extend the evaluation protocol to assess performance of anomaly localization algorithms on non-aligned dataset. The state-of-the-art performance and low complexity of PaDiM make it a good candidate for many industrial applications.

resnet18 mvtec image-level auc 0.891, pixel-level auc: 0.968 Student-Teacher Feature Pyramid Matching for Anomaly Detection

Abstract

Anomaly detection is a challenging task and usually formulated as an one-class learning problem for the unexpectedness of anomalies. This paper proposes a simple yet powerful approach to this issue, which is implemented in the student-teacher framework for its advantages but substantially extends it in terms of both accuracy and efficiency. Given a strong model pre-trained on image classification as the teacher, we distill the knowledge into a single student network with the identical architecture to learn the distribution of anomaly-free images and this one-step transfer preserves the crucial clues as much as possible. Moreover, we integrate the multi-scale feature matching strategy into the framework, and this hierarchical feature matching enables the student network to receive a mixture of multi-level knowledge from the feature pyramid under better supervision, thus allowing to detect anomalies of various sizes. The difference between feature pyramids generated by the two networks serves as a scoring function indicating the probability of anomaly occurring. Due to such operations, our approach achieves accurate and fast pixel-level anomaly detection. Very competitive results are delivered on the MVTec anomaly detection dataset, superior to the state of the art ones.

resnet18 mvtec image-level auc 0.893, pixel-level auc: 0.951 Multiresolution Knowledge Distillation for Anomaly Detection

Abstract

Unsupervised representation learning has proved to be a critical component of anomaly detection/localization in images. The challenges to learn such a representation are two-fold. Firstly, the sample size is not often large enough to learn a rich generalizable representation through conventional techniques. Secondly, while only normal samples are available at training, the learned features should be discriminative of normal and anomalous samples. Here, we propose to use the"distillation"of features at various layers of an expert network, pre-trained on ImageNet, into a simpler cloner network to tackle both issues. We detect and localize anomalies using the discrepancy between the expert and cloner networks' intermediate activation values given the input data. We show that considering multiple intermediate hints in distillation leads to better exploiting the expert's knowledge and more distinctive discrepancy compared to solely utilizing the last layer activation values. Notably, previous methods either fail in precise anomaly localization or need expensive region-based training. In contrast, with no need for any special or intensive training procedure, we incorporate interpretability algorithms in our novel framework for the localization of anomalous regions. Despite the striking contrast between some test datasets and ImageNet, we achieve competitive or significantly superior results compared to the SOTA methods on MNIST, F-MNIST, CIFAR-10, MVTecAD, Retinal-OCT, and two Medical datasets on both anomaly detection and localization.

mvtec detection AUROC 87.74%, loc AUROC 90.71% Towards Total Recall in Industrial Anomaly Detection

Abstract

Being able to spot defective parts is a critical component in large-scale industrial manufacturing. A particular challenge that we address in this work is the cold-start problem: fit a model using nominal (non-defective) example images only. While handcrafted solutions per class are possible, the goal is to build systems that work well simultaneously on many different tasks automatically. The best performing approaches combine embeddings from ImageNet models with an outlier detection model. In this paper, we extend on this line of work and propose \textbf{PatchCore}, which uses a maximally representative memory bank of nominal patch-features. PatchCore offers competitive inference times while achieving state-of-the-art performance for both detection and localization. On the challenging, widely used MVTec AD benchmark PatchCore achieves an image-level anomaly detection AUROC score of up to 99.6%, more than halving the error compared to the next best competitor. We further report competitive results on two additional datasets and also find competitive results in the few samples regime.\freefootnote{∗ Work done during a research internship at Amazon AWS.} Code: github.com/amazon-research/patchcore-inspection.

resnet18 mvtec image-level auc 0.973, pixel-level auc: 0.976 Semi-orthogonal Embedding for Efficient Unsupervised Anomaly Segmentation

Abstract

We present the efficiency of semi-orthogonal embedding for unsupervised anomaly segmentation. The multi-scale features from pre-trained CNNs are recently used for the localized Mahalanobis distances with significant performance. However, the increased feature size is problematic to scale up to the bigger CNNs, since it requires the batch-inverse of multi-dimensional covariance tensor. Here, we generalize an ad-hoc method, random feature selection, into semi-orthogonal embedding for robust approximation, cubically reducing the computational cost for the inverse of multi-dimensional covariance tensor. With the scrutiny of ablation studies, the proposed method achieves a new state-of-the-art with significant margins for the MVTec AD, KolektorSDD, KolektorSDD2, and mSTC datasets. The theoretical and empirical analyses offer insights and verification of our straightforward yet cost-effective approach.

mvtec pro 0.942, roc 0.982 论文名称(链接) RetinaFace: Single-stage Dense Face Localisation in the Wild

Abstract

We propose a novel Connectionist Text Proposal Network (CTPN) that accuratelylocalizes text lines in natural image. The CTPN detects a text line in asequence of fine-scale text proposals directly in convolutional feature maps.We develop a vertical anchor mechanism that jointly predicts location andtext/non-text score of each fixed-width proposal, considerably improvinglocalization accuracy. The sequential proposals are naturally connected by arecurrent neural network, which is seamlessly incorporated into theconvolutional network, resulting in an end-to-end trainable model. This allowsthe CTPN to explore rich context information of image, making it powerful todetect extremely ambiguous text. The CTPN works reliably on multi-scale andmulti- language text without further post-processing, departing from previousbottom-up methods requiring multi-step post-processing. It achieves 0.88 and0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpass- ing recentresults [8, 35] by a large margin. The CTPN is computationally efficient with0:14s/image, by using the very deep VGG16 model [27]. Online demo is availableat: this http URL.

MAP: 52.318 Real-time Convolutional Neural Networks for Emotion and Gender Classification

Abstract

In this paper we propose an implement a general convolutional neural network (CNN) building framework for designing real-time CNNs. We validate our models by creating a real-time vision system which accomplishes the tasks of face detection, gender classification and emotion classification simultaneously in one blended step using our proposed CNN architecture. After presenting the details of the training procedure setup we proceed to evaluate on standard benchmark sets. We report accuracies of 96% in the IMDB gender dataset and 66% in the FER-2013 emotion dataset. Along with this we also introduced the very recent real-time enabled guided back-propagation visualization technique. Guided back-propagation uncovers the dynamics of the weight changes and evaluates the learned features. We argue that the careful implementation of modern CNN architectures, the use of the current regularization methods and the visualization of previously hidden features are necessary in order to reduce the gap between slow performances and real-time architectures. Our system has been validated by its deployment on a Care-O-bot 3 robot used during RoboCup@Home competitions. All our code, demos and pre-trained architectures have been released under an open-source license in our public repository.

IMDB: 96% PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

Abstract

The primary aim of single-image super-resolution is to construct high-resolution (HR) images from corresponding low-resolution (LR) inputs. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present an algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require supervised training on databases of LR-HR image pairs). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the"downscaling loss,"which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee realistic outputs. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show proof of concept of our approach in the domain of face super-resolution (i.e., face hallucination). We also present a discussion of the limitations and biases of the method as currently implemented with an accompanying model card with relevant metrics. Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.

CelebA HQ 3.6 Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

Abstract

Face detection and alignment in unconstrained environment are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost up their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark location in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmark for face detection, and AFLW benchmark for face alignment, while keeps real time performance.

FDDB: 0.82 Finding Tiny Faces

Abstract

Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different than those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task fashion: they make use of features extracted from multiple layers of single (deep) feature hierarchy. While training detectors for large objects is straightforward, the crucial challenge remains training detectors for small objects. We show that context is crucial, and define templates that make use of massively-large receptive fields (where 99% of the template extends beyond the object of interest). Finally, we explore the role of scale in pre-trained deep networks, providing ways to extrapolate networks tuned for limited scales to rather extreme ranges. We demonstrate state-of-the-art results on massively-benchmarked face datasets (FDDB and WIDER FACE). In particular, when compared to prior art on WIDER FACE, our results reduce error by a factor of 2 (our models produce an AP of 82% while prior art ranges from 29-64%).

WIDER_FACE: resnet50 500*500 easy: 0.902 medium: 0.892 medium: 0.892 论文名称(链接) Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

Abstract

This paper addresses the visualisation of image classification models, learntusing deep Convolutional Networks (ConvNets). We consider two visualisationtechniques, based on computing the gradient of the class score with respect tothe input image. The first one generates an image, which maximises the classscore [Erhan et al., 2009], thus visualising the notion of the class, capturedby a ConvNet. The second technique computes a class saliency map, specific to agiven image and class. We show that such maps can be employed for weaklysupervised object segmentation using classification ConvNets. Finally, weestablish the connection between the gradient-based ConvNet visualisationmethods and deconvolutional networks [Zeiler et al., 2013].

可视化方法 Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Abstract

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch

ucf-101; resnet18 112*112 Kinetics400 : 66.1 ucf-101: 42.4(不加载预训练) Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

Abstract

In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially"slower"at a lower spatial resolution reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.

imagenet: 1.125 MobileNet (v2) 3224224 imagenet top1 73 Learning Spatiotemporal Features with 3D Convolutional Networks

Abstract

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

UCF101: 128x171 TOP1=83.27% MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Abstract

Conventionally, spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones and model complexity is well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework and it can specialize to be existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.

UCF101: 4x16, Top1=96.6% Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Abstract

Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition. In GCNs, graph topology dominates feature aggregation and therefore is the key to extracting representative features. In this work, we propose a novel Channel-wise Topology Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies and effectively aggregate joint features in different channels for skeleton-based action recognition. The proposed CTR-GC models channel-wise topologies through learning a shared topology as a generic prior for all channels and refining it with channel-specific correlations for each channel. Our refinement method introduces few extra parameters and significantly reduces the difficulty of modeling channel-wise topologies. Furthermore, via reformulating graph convolutions into a unified form, we find that CTR-GC relaxes strict constraints of graph convolutions, leading to stronger representation capability. Combining CTR-GC with temporal modeling modules, we develop a powerful graph convolutional network named CTR-GCN which notably outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.

Revisiting Skeleton-based Action Recognition

Abstract

Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt graph convolutional networks (GCN) to extract features on top of human skeletons. Despite the positive results shown in previous works, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseC3D, a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseC3D is more effective in learning spatiotemporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings. Also, PoseC3D can handle multiple-person scenarios without additional computation cost, and its features can be easily integrated with other modalities at early fusion stages, which provides a great design space to further boost the performance. On four challenging datasets, PoseC3D consistently obtains superior performance, when used alone on skeletons and in combination with the RGB modality.

UCF101 split1, top1=87.0

自然语言处理

论文名称(链接) A Structured Self-attentive Sentence Embedding

Abstract

This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.

SNLI: accuracy=84.4%(见论文Table 2) End-To-End Memory Networks

Abstract

This article offers an empirical exploration on the use of character-levelconvolutional networks (ConvNets) for text classification. We constructedseveral large-scale datasets to show that character-level convolutionalnetworks could achieve state-of-the-art or competitive results. Comparisons areoffered against traditional models such as bag of words, n-grams and theirTFIDF variants, and deep learning models such as word-based ConvNets andrecurrent neural networks.

Penn Treebank: ppl=111; Text8: ppl=147 Character-level Convolutional Networks for Text Classification

Abstract

Building open-domain chatbots is a challenging area for machine learningresearch. While prior work has shown that scaling neural models in the numberof parameters and the size of the data they are trained on gives improvedresults, we show that other ingredients are important for a high-performingchatbot. Good conversation requires a number of skills that an expertconversationalist blends in a seamless way: providing engaging talking pointsand listening to their partners, and displaying knowledge, empathy andpersonality appropriately, while maintaining a consistent persona. We show thatlarge scale models can learn these skills when given appropriate training dataand choice of generation strategy. We build variants of these recipes with 90M,2.7B and 9.4B parameter models, and make our models and code publiclyavailable. Human evaluations show our best models are superior to existingapproaches in multi-turn dialogue in terms of engagingness and humannessmeasurements. We then discuss the limitations of this work by analyzing failurecases of our models.

Amazon Review Full: error rate=40.45%; Yahoo! Answers: error rate=28.80% Recipes for building an open-domain chatbot

Abstract

Pre-trained language models like BERT and its variants have recently achievedimpressive performance in various natural language understanding tasks.However, BERT heavily relies on the global self-attention block and thussuffers large memory footprint and computation cost. Although all its attentionheads query on the whole input sequence for generating the attention map from aglobal perspective, we observe some heads only need to learn localdependencies, which means the existence of computation redundancy. We thereforepropose a novel span-based dynamic convolution to replace these self-attentionheads to directly model local dependencies. The novel convolution heads,together with the rest self-attention heads, form a new mixed attention blockthat is more efficient at both global and local context learning. We equip BERTwith this mixed attention design and build a ConvBERT model. Experiments haveshown that ConvBERT significantly outperforms BERT and its variants in variousdownstream tasks, with lower training cost and fewer model parameters.Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher thanELECTRAbase, while using less than 1/4 training cost. Code and pre-trainedmodels will be released.

BlenderbotForConditionalGeneration模型和BlenderbotSmallForConditionalGeneration模型前向推理输出与论文对齐(90M和2.7B distilled to 360M两个权重) ConvBERT: Improving BERT with Span-based Dynamic Convolution

Abstract

Natural Language Processing (NLP) has recently achieved great success byusing huge pre-trained models with hundreds of millions of parameters. However,these models suffer from heavy model sizes and high latency such that theycannot be deployed to resource-limited mobile devices. In this paper, wepropose MobileBERT for compressing and accelerating the popular BERT model.Like the original BERT, MobileBERT is task-agnostic, that is, it can begenerically applied to various downstream NLP tasks via simple fine-tuning.Basically, MobileBERT is a thin version of BERT_LARGE, while equipped withbottleneck structures and a carefully designed balance between self-attentionsand feed-forward networks. To train MobileBERT, we first train a speciallydesigned teacher model, an inverted-bottleneck incorporated BERT_LARGE model.Then, we conduct knowledge transfer from this teacher to MobileBERT. Empiricalstudies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASEwhile achieving competitive results on well-known benchmarks. On the naturallanguage inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7 (0.6lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuADv1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of90.0/79.2 (1.5/2.1 higher than BERT_BASE).

QNLI测试集accuracy=93.2%(见论文Table 3), SQuAD v1.1验证集上Exact Match=84.7%, F1=90.9%, SQuAD v2.0验证集Exact Match=80.6%, F1=83.1%(见论文Table 4) MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Abstract

BERT adopts masked language modeling (MLM) for pre-training and is one of themost successful pre-training models. Since BERT neglects dependency amongpredicted tokens, XLNet introduces permuted language modeling (PLM) forpre-training to address this problem. However, XLNet does not leverage the fullposition information of a sentence and thus suffers from position discrepancybetween pre-training and fine-tuning. In this paper, we propose MPNet, a novelpre-training method that inherits the advantages of BERT and XLNet and avoidstheir limitations. MPNet leverages the dependency among predicted tokensthrough permuted language modeling (vs. MLM in BERT), and takes auxiliaryposition information as input to make the model see a full sentence and thusreducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on alarge-scale dataset (over 160GB text corpora) and fine-tune on a variety ofdown-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNetoutperforms MLM and PLM by a large margin, and achieves better results on thesetasks compared with previous state-of-the-art pre-trained methods (e.g., BERT,XLNet, RoBERTa) under the same model setting. The code and the pre-trainedmodels are available at: this https URL.

MNLI验证集-m/mm accuracy=83.3/82.6(见论文table 4), SQuAD 1.1验证集F1/EM=90.0/82.9, SQuAD 2.0验证集F1/EM=79.2/76.2(见论文table 5) MPNet: Masked and Permuted Pre-training for Language Understanding

Abstract

BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. The code and the pre-trained models are available at: https://github.com/microsoft/MPNet .

QQP验证集accuracy=91.9(见论文Table 3), SQuAD 1.1 F1/EM (dev set)=92.7/86.9, SQuAD 2.0 F1/EM (dev set)=85.7/82.7(见论文Table 4) Reformer: The Efficient Transformer

Abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(LlogL), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

ReformerModel, ReformerForSequenceClassification和ReformerForQuestionAnswering网络前向推理输出与论文对齐 SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

Abstract

Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The SqueezeBERT code will be released.

QQP验证集accuracy=89.4(见论文Table 2); SqueezeBERT模型加速比对比BERT-Base达到4.3x(见论文Table 2) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

GLUE dev set上达到平均指标85.97, CNNDM dev set上达到ROUGE-2=20.90(见论文Table 15) VisualBERT: A Simple and Performant Baseline for Vision and Language

Abstract

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

VQA: Test-Dev=70.80, Test-Std=71.00; NLVR: accuracy=67.4(见论文Table1, Table3) ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

Abstract

Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntax and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the {\it glyph} and {\it pinyin} information of Chinese characters into language model pretraining. The glyph embedding is obtained based on different fonts of a Chinese character, being able to capture character semantics from the visual features, and the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields significant performance boost over baseline models with fewer training steps. The porpsoed model achieves new SOTA performances on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, sentence pair matching, and competitive performances in named entity recognition. Code and pretrained models are publicly available at https://github.com/ShannonAI/ChineseBert .

CMRC dev/test=70.70/78.05(见论文Table 2), XNLI dev/test=82.7/81.6(见论文Table 4), ChnSentiCorp dev/test=95.8/95.9 CTRL: A Conditional Transformer Language Model for Controllable Generation

Abstract

Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://github.com/salesforce/ctrl .

CTRLLMHeadModel模型和CTRLForSequenceClassification模型前向推理输出与论文对齐 Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Abstract

With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at https://github.com/laiguokun/Funnel-Transformer .

QNLI验证集上accuracy=95.1%(见论文table 3), SQuAD v1.1验证集上F1/EM=94.7/89.0, SQuAD v2.0验证集F1/EM=90.4/87.6(见论文table 5) Splinter: Few-Shot Question Answering by Pretraining Span Selection

Abstract

In several question answering benchmarks, pretrained models have reached human parity through fine-tuning on an order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available, and observe that standard models perform poorly, highlighting the discrepancy between current pretraining objectives and question answering. We propose a new pretraining scheme tailored for question answering: recurring span selection. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting.

SQuAD 1.1验证集, 16 examples F1=54.6, 128 examples F1=72.7, 1024 Examples F1=82.8(见论文Table1) UNILMv1: Unified Language Model Pre-training for Natural Language Understanding and Generation

Abstract

This paper presents a new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UniLM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm .

QNLI测试集达到92.7(见论文table11), CoQA验证集F1=82.5(见论文 table 7) BERT for Joint Intent Classification and Slot Filling

Abstract

Intent classification and slot filling are two essential tasks for natural language understanding. They often suffer from small-scale human-labeled training data, resulting in poor generalization capability, especially for rare words. Recently a new language representation model, BERT (Bidirectional Encoder Representations from Transformers), facilitates pre-training deep bidirectional representations on large-scale unlabeled corpora, and has created state-of-the-art models for a wide variety of natural language processing tasks after simple fine-tuning. However, there has not been much effort on exploring BERT for natural language understanding. In this work, we propose a joint intent classification and slot filling model based on BERT. Experimental results demonstrate that our proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on several public benchmark datasets, compared to the attention-based recurrent neural network models and slot-gated models.

在Snips测试集指标达到98.6, 97.0, 92.8; 在ATIS测试集上指标达到97.9, 96.0, 88.6 (见table2) Fastformer: Additive Attention Can Be All You Need

Abstract

Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models and can meanwhile achieve comparable or even better long text modeling performance.

Amazon F1=43.23(见论文table4), Pubmed测试集R-L=34.81(见论文table6), Convolutional Sequence to Sequence Learning

Abstract

The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks.1 Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT’14 English-German and WMT’14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.

在WMT’14 English-German测试集上BLEU=25.16 或者在WMT’16 English-Romanian测试集上BLEU=30.02 FNet: Mixing Tokens with Fourier Transforms

Abstract

We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that"mix"input tokens. These linear mixers, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths, our FNet model is significantly faster: when compared to the"efficient"Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.

QQP验证集Acc=85%, SST-2验证集Acc=95%(见论文Table1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Abstract

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complimentary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).

megatron-bert-cased-345m模型在MNLI验证集上Acc=89.7/90.0; 2. megatron-bert-cased-345m模型在SQuAD-1.1验证集上F1/EM=94.2/88.0, 在SQuAD-2.0验证集上F1/EM=88.1/84.8 LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

Abstract

Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke .

Open Entity: F1=78.2, SQuAD1.1: F1=95.4, EM=90.2(见论文Table1 & Table5) RemBERT: Rethinking embedding coupling in pre-trained language models

Abstract

We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.

XTREME: Sentence-pair Classification: Acc=84.2(见论文Table7) Nyströmformer: A Nystöm-based Algorithm for Approximating Self-Attention

Abstract

Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences -- a topic being actively studied in the community. To address this limitation, we propose Nystr\"{o}mformer -- a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nystr\"{o}m method to approximate standard self-attention with O(n) complexity. The scalability of Nystr\"{o}mformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nystr\"{o}mformer performs comparably, or in a few cases, even slightly better, than standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nystr\"{o}mformer performs favorably relative to other efficient self-attention methods. Our code is available at https://github.com/mlpen/Nystromformer .

IMDB: F1=93.2, LRA benchmark下text任务 acc=65.52; ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training

Abstract

This paper presents a new sequence-tosequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of optimizing one-stepahead prediction in the traditional sequenceto-sequence model, the ProphetNet is optimized by n-step ahead prediction that predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large-scale dataset (160GB), respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.

prophetnet-large-uncased模型在CNN/DailyMail测试集上R-1=44.20, R-2=21.17, R-L=41.30 (见论文Table 4); prophetnet-large-uncased模型在Gigaword测试集上R-1=39.51, R-2=20.42, R-L=36.69 (见论文Table 4); Universal Language Model Fine-tuning for Text Classification

Abstract

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18- 24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100× more data. We opensource our pretrained models and code1 .

AG’s News: Err=5.01% (见论文Table 3) ByT5: Towards a token-free future with pre-trained byte-to-byte models

Abstract

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

在GEM-Xsum验证集上, small model BLEU score=9.1; 在TweetQA验证集上, small model BLEU-1/ROUGE-L=65.7/69.7 (见论文table3) 论文名称(链接) Comprehensive Semi-Supervised Multi-Modal Learning

Abstract

Multi-modal learning refers to the process of learning a precise model to represent the joint representations of different modalities. Despite its promise for multi-modal learning, the co-regularization method is based on the consistency principle with a sufficient assumption, which usually does not hold for real-world multi-modal data. Indeed, due to the modal insufficiency in real-world applications, there are divergences among heterogeneous modalities. This imposes a critical challenge for multi-modal learning. To this end, in this paper, we propose a novel Comprehensive Multi-Modal Learning (CMML) framework, which can strike a balance between the consistency and divergency modalities by considering the insufficiency in one unified framework. Specifically, we utilize an instance level attention mechanism to weight the sufficiency for each instance on different modalities. Moreover, novel diversity regularization and robust consistency metrics are designed for discovering insufficient modalities. Our empirical studies show the superior performances of CMML on real-world data in terms of various criteria.

Coverage: 2.669 Average Precision: 0.914 Ranking Loss: 0.058 Example AUC: 0.942 Micro AUC: 0.94 Macro AUC: 0.932 Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

Abstract

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, pro-cessing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

RefCOCO+-val=72.34 Attention on Attention for Image Captioning

Abstract

Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which could make the decoder give misled results. In this paper, we propose an “Attention on Attention” (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries. AoA first generates an “information vector” and an “attention gate” using the attention result and the current context, then adds another attention by applying element-wise multiplication to them and finally obtains the “attended information”, the expected useful knowledge. We apply AoA to both the encoder and the decoder of our image captioning model, which we name as AoA Network (AoANet). Experiments show that AoANet outperforms all previously published methods and achieves a new state-ofthe-art performance of 129.8 CIDEr-D score on MS COCO “Karpathy” offline test split and 129.6 CIDEr-D (C40) score on the official online testing server. Code is available at https://github.com/husthuaan/AoANet .

COCO; {‘Bleu_1’: 0.8054903453672397, ‘Bleu_2’: 0.6523038976984842, ‘Bleu_3’: 0.5096621263772566, ‘Bleu_4’: 0.39140307771618477, ‘METEOR’: 0.29011216375635934, ‘ROUGE_L’: 0.5890369750273199, ‘CIDEr’: 1.2892294296245852, ‘SPICE’: 0.22680092759866174} VinVL: Revisiting Visual Representations in Vision-Language Models

Abstract

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL .

COCO 2014; Oscar-Large: COCO-Text Retireval: Recall@1=89.8 Recall@5=98.8 Recall@10=99.7 COCO-Image Retireval: Recall@1=78.2 Recall@5=95.8 Recall@10=98.3 From Recognition to Cognition: Visual Commonsense Reasoning

Abstract

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

VQA val, Q->A 63.8%, QA->R: 67.2%, Q-AR: 43.1% Uniter: Learning universal image-text representations

Abstract

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2. Code is available at https://github.com/ChenRocks/UNITER .

IR-flickr30K-R1=73.66 Efficient Low-rank Multimodal Fusion with Modality-Specific Factors

Abstract

Multimodal research is an emerging field of artificial intelligence, and one of the main research problems in this field is multimodal fusion. The fusion of multimodal data is the process of integrating multiple unimodal representations into one compact multimodal representation. Previous research in this field has exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from exponential increase in dimensions and in computational complexity introduced by transformation of input into tensor. In this paper, we propose the Low-rank Multimodal Fusion method, which performs multimodal fusion using low-rank tensors to improve efficiency. We evaluate our model on three different tasks: multimodal sentiment analysis, speaker trait analysis, and emotion recognition. Our model achieves competitive results on all these tasks while drastically reducing computational complexity. Additional experiments also show that our model can perform robustly for a wide range of low-rank settings, and is indeed much more efficient in both training and inference compared to other methods that utilize tensor representations.

F1-Happy, 85.8%, F1-Sad 85.9%, F1-Angry 89.0%, F1-Neutral 71.7% 论文名称(链接) Solving inverse problems using conditional invertible neural networks

Abstract

Inverse modeling for computing a high-dimensional spatially-varying property field from indirect sparse and noisy observations is a challenging problem. This is due to the complex physical system of interest often expressed in the form of multiscale PDEs, the high-dimensionality of the spatial property of interest, and the incomplete and noisy nature of observations. To address these challenges, we develop a model that maps the given observations to the unknown input field in the form of a surrogate model. This inverse surrogate model will then allow us to estimate the unknown input field for any given sparse and noisy output observations. Here, the inverse mapping is limited to a broad prior distribution of the input field with which the surrogate model is trained. In this work, we construct a two- and three-dimensional inverse surrogate models consisting of an invertible and a conditional neural network trained in an end-to-end fashion with limited training data. The invertible network is developed using a flow-based generative model. The developed inverse surrogate model is then applied for an inversion task of a multiphase flow problem where given the pressure and saturation observations the aim is to recover a high-dimensional non-Gaussian permeability field where the two facies consist of heterogeneous permeability and varying length-scales. For both the two- and three-dimensional surrogate models, the predicted sample realizations of the non-Gaussian permeability field are diverse with the predictive mean being close to the ground truth even when the model is trained with limited data.

2D/3D模型下得到与Fig16和Fig17相吻合的结果 Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems

Abstract

Deep learning has been shown to be an effective tool in solving partial differential equations (PDEs) through physics-informed neural networks (PINNs). PINNs embed the PDE residual into the loss function of the neural network, and have been successfully employed to solve diverse forward and inverse PDE problems. However, one disadvantage of the first generation of PINNs is that they usually have limited accuracy even with many training points. Here, we propose a new method, gradient-enhanced physics-informed neural networks (gPINNs), for improving the accuracy and training efficiency of PINNs. gPINNs leverage gradient information of the PDE residual and embed the gradient into the loss function. We tested gPINNs extensively and demonstrated the effectiveness of gPINNs in both forward and inverse PDE problems. Our numerical results show that gPINN performs better than PINN with fewer training points. Furthermore, we combined gPINN with the method of residual-based adaptive refinement (RAR), a method for improving the distribution of training points adaptively during training, to further improve the performance of gPINN, especially in PDEs with solutions that have steep gradients.

完成论文3.2 得到和Fig2 Fig3 Fig3相吻合的结果，论文3.3 得到fig 6 fig7相吻合的结果，论文3.4.1 gPINN fig10 11 DeepCFD: Efficient Steady-State Laminar Flow Approximation with Deep Convolutional Neural Networks

Abstract

Computational Fluid Dynamics (CFD) simulation by the numerical solution of the Navier-Stokes equations is an essential tool in a wide range of applications from engineering design to climate modeling. However, the computational cost and memory demand required by CFD codes may become very high for flows of practical interest, such as in aerodynamic shape optimization. This expense is associated with the complexity of the fluid flow governing equations, which include non-linear partial derivative terms that are of difficult solution, leading to long computational times and limiting the number of hypotheses that can be tested during the process of iterative design. Therefore, we propose DeepCFD: a convolutional neural network (CNN) based model that efficiently approximates solutions for the problem of non-uniform steady laminar flows. The proposed model is able to learn complete solutions of the Navier-Stokes equations, for both velocity and pressure fields, directly from ground-truth data generated using a state-of-the-art CFD code. Using DeepCFD, we found a speedup of up to 3 orders of magnitude compared to the standard CFD approach at a cost of low error rates.

DeepCFD MSE (Ux= 0.773±0.0897，Uy=0.2153±0.0186，P=1.042±0.0431，Total 2.03±0.136） Unsupervised deep learning for super-resolution reconstruction of turbulence

Abstract

Recent attempts to use deep learning for super-resolution reconstruction of turbulent flows have used supervised learning, which requires paired data for training. This limitation hinders more practical applications of super-resolution reconstruction. Therefore, we present an unsupervised learning model that adopts a cycle-consistent generative adversarial network (CycleGAN) that can be trained with unpaired turbulence data for super-resolution reconstruction. Our model is validated using three examples: (i) recovering the original flow field from filtered data using direct numerical simulation (DNS) of homogeneous isotropic turbulence; (ii) reconstructing full-resolution fields using partially measured data from the DNS of turbulent channel flows; and (iii) generating a DNS-resolution flow field from large-eddy simulation (LES) data for turbulent channel flows. In examples (i) and (ii), for which paired data are available for supervised learning, our unsupervised model demonstrates qualitatively and quantitatively similar performance as that of the best supervised learning model. More importantly, in example (iii), where supervised learning is impossible, our model successfully reconstructs the high-resolution flow field of statistical DNS quality from the LES data. Furthermore, we find that the present model has almost universal applicability to all values of Reynolds numbers within the tested range. This demonstrates that unsupervised learning of turbulence data is indeed possible, opening a new door for the wide application of super-resolution reconstruction of turbulent fields.

可复现三个example中的任意一个，若对于example3可得到论文中cycleGAN对应的结果（图11,12,13,14,15） Lettuce: PyTorch-based Lattice Boltzmann Framework

Abstract

The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent studies have not addressed possible connections between machine learning and LBM. Here, we introduce Lettuce, a PyTorch-based LBM code with a threefold aim. Lettuce enables GPU accelerated calculations with minimal source code, facilitates rapid prototyping of LBM models, and enables integrating LBM simulations with PyTorch's deep learning and automatic differentiation facility. As a proof of concept for combining machine learning with the LBM, a neural collision model is developed, trained on a doubly periodic shear layer and then transferred to a different flow, a decaying turbulence. We also exemplify the added benefit of PyTorch's automatic differentiation framework in flow control and optimization. To this end, the spectrum of a forced isotropic turbulence is maintained without further constraining the velocity field. The source code is freely available from https://github.com/lettucecfd/lettuce .

得到图1配置的展示结果与图2的曲线吻合 Systems Biology: Identifiability analysis and parameter identification via systems-biology informed neural networks

Abstract

The dynamics of systems biological processes are usually modeled by a system of ordinary differential equations (ODEs) with many unknown parameters that need to be inferred from noisy and sparse measurements. Here, we introduce systems-biology informed neural networks for parameter estimation by incorporating the system of ODEs into the neural networks. To complete the workflow of system identification, we also describe structural and practical identifiability analysis to analyze the identifiability of parameters. We use the ultridian endocrine model for glucose-insulin interaction as the example to demonstrate all these methods and their implementation.

结果吻合Fig13 TorchMD: A deep learning framework for molecular simulations

Abstract

Molecular dynamics simulations provide a mechanistic description of molecules by relying on empirical potentials. The quality and transferability of such potentials can be improved leveraging data-driven models derived with machine learning approaches. Here, we present TorchMD, a framework for molecular simulations with mixed classical and machine learning potentials. All of force computations including bond, angle, dihedral, Lennard-Jones and Coulomb interactions are expressed as PyTorch arrays and operations. Moreover, TorchMD enables learning and simulating neural network potentials. We validate it using standard Amber all-atom simulations, learning an ab-initio potential, performing an end-to-end training and finally learning and simulating a coarse-grained model for protein folding. We believe that TorchMD provides a useful tool-set to support molecular simulations of machine learning potentials. Code and data are freely available at \url{github.com/torchmd}.

Di-alanine 688 8 min 44 s， Trypsin 3,248 13 min 2 s。 PHYSICS-INFORMED NEURAL NETWORKS WITH HARD CONSTRAINTS FOR INVERSE DESIGN∗

Abstract

We achieve the same objective as conventional PDE-constrained optimization methods based on adjoint methods and numerical PDE solvers, but find that the design obtained from hPINN is often simpler and smoother for problems whose solution is not unique.

得到图6的结果，（the PDE loss is below 10−4 (Fig. 6A), and the L2 relative error of |E|2 between hPINN and FDFD for the final ε is 1.2%）可以展示图7 论文名称(链接) Field-Embedded Factorization Machines for Click-through rate prediction

Abstract

Click-through rate (CTR) prediction models are common in many online applications such as digital advertising and recommender systems. Field-Aware Factorization Machine (FFM) and Field-weighted Factorization Machine (FwFM) are state-of-the-art among the shallow models for CTR prediction. Recently, many deep learning-based models have also been proposed. Among deeper models, DeepFM, xDeepFM, AutoInt+, and FiBiNet are state-of-the-art models. The deeper models combine a core architectural component, which learns explicit feature interactions, with a deep neural network (DNN) component. We propose a novel shallow Field-Embedded Factorization Machine (FEFM) and its deep counterpart Deep Field-Embedded Factorization Machine (DeepFEFM). FEFM learns symmetric matrix embeddings for each field pair along with the usual single vector embeddings for each feature. FEFM has significantly lower model complexity than FFM and roughly the same complexity as FwFM. FEFM also has insightful mathematical properties about important fields and field interactions. DeepFEFM combines the FEFM interaction vectors learned by the FEFM component with a DNN and is thus able to learn higher order interactions. We conducted comprehensive experiments over a wide range of hyperparameters on two large publicly available real-world datasets. When comparing test AUC and log loss, the results show that FEFM and DeepFEFM outperform the existing state-of-the-art shallow and deep models for CTR prediction tasks. We have made the code of FEFM and DeepFEFM available in the DeepCTR library ( https://github.com/shenweichen/DeepCTR ).

criteo auc >0.8 Deep Learning Recommendation Model for Personalization and Recommendation Systems

Abstract

With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model (DLRM) and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design.

criteo auc >0.79 A Dual Input-aware Factorization Machine for CTR Prediction

Abstract

Factorization Machines (FMs) refer to a class of general predictors working with real valued feature vectors, which are well-known for their ability to estimate model parameters under significant sparsity and have found successful applications in many areas such as the click-through rate (CTR) prediction. However, standard FMs only produce a single fixed representation for each feature across different input instances, which may limit the CTR model’s expressive and predictive power. Inspired by the success of Input-aware Factorization Machines (IFMs), which aim to learn more flexible and informative representations of a given feature according to different input instances, we propose a novel model named Dual Input-aware Factorization Machines (DIFMs) that can adaptively reweight the original feature representations at the bit-wise and vector-wise levels simultaneously. Furthermore, DIFMs strategically integrate various components including Multi-Head Self-Attention, Residual Networks and DNNs into a unified end-to-end model. Comprehensive experiments on two real-world CTR prediction datasets show that the DIFM model can outperform several state-of-the-art models consistently.

crito auc >0.799 FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine

Abstract

Click through rate (CTR) estimation is a fundamental task in personalized advertising and recommender systems. Recent years have witnessed the success of both the deep learning based model and attention mechanism in various tasks in computer vision (CV) and natural language processing (NLP). How to combine the attention mechanism with deep CTR model is a promising direction because it may ensemble the advantages of both sides. Although some CTR model such as Attentional Factorization Machine (AFM) has been proposed to model the weight of second order interaction features, we posit the evaluation of feature importance before explicit feature interaction procedure is also important for CTR prediction tasks because the model can learn to selectively highlight the informative features and suppress less useful ones if the task has many input features. In this paper, we propose a new neural CTR model named Field Attentive Deep Field-aware Factorization Machine (FAT-DeepFFM) by combining the Deep Field-aware Factorization Machine (DeepFFM) with Compose-Excitation network (CENet) field attention mechanism which is proposed by us as an enhanced version of Squeeze-Excitation network (SENet) to highlight the feature importance. We conduct extensive experiments on two real-world datasets and the experiment results show that FAT-DeepFFM achieves the best performance and obtains different improvements over the state-of-the-art methods. We also compare two kinds of attention mechanisms (attention before explicit feature interaction vs. attention after explicit feature interaction) and demonstrate that the former one outperforms the latter one significantly.

crito AUC>=0.8099 BERT4Rec：Sequential Recommendation with Bidirectional Encoder Representations from Transformer

Abstract

Top-$N$ sequential recommendation models each user as a sequence of itemsinteracted in the past and aims to predict top-$N$ ranked items that a userwill likely interact in a `near future'. The order of interaction implies thatsequential patterns play an important role where more recent items in asequence have a larger impact on the next item. In this paper, we propose aConvolutional Sequence Embedding Recommendation Model (\emph{Caser}) as asolution to address this requirement. The idea is to embed a sequence of recentitems into an `image' in the time and latent spaces and learn sequentialpatterns as local features of the image using convolutional filters. Thisapproach provides a unified and flexible network structure for capturing bothgeneral preferences and sequential patterns. The experiments on public datasetsdemonstrated that Caser consistently outperforms state-of-the-art sequentialrecommendation methods on a variety of common evaluation metrics.

1、Beauty HR@10=0.30252、Steam HR@10=0.40133、ML-1m HR@10=0.69704、ML-20m HR@10=0.7473 Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding

Abstract

Top-N sequential recommendation models each user as a sequence of items interacted in the past and aims to predict top-N ranked items that a user will likely interact in a `near future'. The order of interaction implies that sequential patterns play an important role where more recent items in a sequence have a larger impact on the next item. In this paper, we propose a Convolutional Sequence Embedding Recommendation Model (\emph{Caser}) as a solution to address this requirement. The idea is to embed a sequence of recent items into an `image' in the time and latent spaces and learn sequential patterns as local features of the image using convolutional filters. This approach provides a unified and flexible network structure for capturing both general preferences and sequential patterns. The experiments on public datasets demonstrated that Caser consistently outperforms state-of-the-art sequential recommendation methods on a variety of common evaluation metrics.

1、MovieLens MAP=0.15072、Gowalla MAP=0.09283、Foursquare MAP=0.09094、Tmall MAP=0.0310 SASRec：Self-Attentive Sequential Recommendation

Abstract

Sequential dynamics are a key feature of many modern recommender systems, which seek to capture the `context' of users' activities on the basis of actions they have performed recently. To capture such patterns, two approaches have proliferated: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). Markov Chains assume that a user's next action can be predicted on the basis of just their last (or last few) actions, while RNNs in principle allow for longer-term semantics to be uncovered. Generally speaking, MC-based methods perform best in extremely sparse datasets, where model parsimony is critical, while RNNs perform better in denser datasets where higher model complexity is affordable. The goal of our work is to balance these two goals, by proposing a self-attention based sequential model (SASRec) that allows us to capture long-term semantics (like an RNN), but, using an attention mechanism, makes its predictions based on relatively few actions (like an MC). At each time step, SASRec seeks to identify which items are `relevant' from a user's action history, and use them to predict the next item. Extensive empirical studies show that our method outperforms various state-of-the-art sequential models (including MC/CNN/RNN-based approaches) on both sparse and dense datasets. Moreover, the model is an order of magnitude more efficient than comparable CNN/RNN-based models. Visualizations on attention weights also show how our model adaptively handles datasets with various density, and uncovers meaningful patterns in activity sequences.

Hit Rate@10(Recall@10; Precision@10) andNDCG@10, Neural Collaborative Reasoning

Abstract

Existing Collaborative Filtering (CF) methods are mostly designed based on the idea of matching, i.e., by learning user and item embeddings from data using shallow or deep models, they try to capture the associative relevance patterns in data, so that a user embedding can be matched with relevant item embeddings using designed or learned similarity functions. However, as a cognition rather than a perception intelligent task, recommendation requires not only the ability of pattern recognition and matching from data, but also the ability of cognitive reasoning in data. In this paper, we propose to advance Collaborative Filtering (CF) to Collaborative Reasoning (CR), which means that each user knows part of the reasoning space, and they collaborate for reasoning in the space to estimate preferences for each other. Technically, we propose a Neural Collaborative Reasoning (NCR) framework to bridge learning and reasoning. Specifically, we integrate the power of representation learning and logical reasoning, where representations capture similarity patterns in data from perceptual perspectives, and logic facilitates cognitive reasoning for informed decision making. An important challenge, however, is to bridge differentiable neural networks and symbolic reasoning in a shared architecture for optimization and inference. To solve the problem, we propose a modularized reasoning architecture, which learns logical operations such as AND (∧), OR (∨) and NOT (¬) as neural modules for implication reasoning (→). In this way, logical expressions can be equivalently organized as neural networks, so that logical reasoning and prediction can be conducted in a continuous space. Experiments on real-world datasets verified the advantages of our framework compared with both shallow, deep and reasoning models.

ML100K : HR@K >0.68 TiSASRec: Time Interval Aware Self-Attention for Sequential Recommendation

Abstract

Sequential recommender systems seek to exploit the order of users’ interactions, in order to predict their next action based on the context of what they have done recently. Traditionally, Markov Chains (MCs), and more recently Recurrent Neural Networks (RNNs) and Self Attention (SA) have proliferated due to their ability to capture the dynamics of sequential patterns. However a simplifying assumption made by most of these models is to regard interaction histories as ordered sequences, without regard for the time intervals between each interaction (i.e., they model the time-order but not the actual timestamp). In this paper, we seek to explicitly model the timestamps of interactions within a sequential modeling framework to explore the influence of different time intervals on next item prediction. We propose TiSASRec (Time Interval aware Self-attention based sequential recommendation), which models both the absolute positions of items as well as the time intervals between them in a sequence. Extensive empirical studies show the features of TiSASRec under different settings and compare the performance of self-attention with different positional encodings. Furthermore, experimental results show that our method outperforms various state-of-the-art sequential models on both sparse and dense datasets and different evaluation metrics.

ml-1m: NDCG@10: 0.5706, Hit@10: 0.8038 Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation

Abstract

To provide more accurate recommendation, it is a trending topic to go beyond modeling user-item interactions and take context features into account. Factorization Machines (FM) with negative sampling is a popular solution for context-aware recommendation. However, it is not robust as sampling may lost important information and usually leads to non-optimal performances in practical. Several recent e_x001d_orts have enhanced FM with deep learning architectures for modelling high-order feature interactions. While they either focus on rating prediction task only, or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic _x001e_uctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for context-aware recommendation. In this paper, we propose to learn FM without sampling for ranking tasks that helps context-aware recommendation particularly. Despite e_x001d_ectiveness, such a non-sampling strategy presents strong challenge in learning e_x001c_ciency of the model. Accordingly, we further design a new ideal framework named E_x001c_cient Non-Sampling Factorization Machines (ENSFM). ENSFM not only seamlessly connects the relationship between FM and Matrix Factorization (MF), but also resolves the challenging e_x001c_ciency issue via novel memorization strategies. Through extensive experiments on three realworld public datasets, we show that 1) the proposed ENSFM consistently and signi_x001b_cantly outperforms the state-of-the-art methods on context-aware Top-K recommendation, and 2) ENSFM achieves signi_x001b_cant advantages in training e_x001c_ciency, which makes it more applicable to real-world large-scale systems. Moreover, the empirical results indicate that a proper learning method is even more important than advanced neural network structures for Top-K recommendation task. Our implementation has been released 1 to facilitate further developments on e_x001c_cient non-sampling methods.

Movielens: HR@5: 0.0601, HR@10: 0.1024, HR@20: 0.1690 (论文table3) AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction

Abstract

Learning feature interactions is crucial for click-through rate (CTR) prediction in recommender systems. In most existing deep learning models, feature interactions are either manually designed or simply enumerated. However, enumerating all feature interactions brings large memory and computation cost. Even worse, useless interactions may introduce noise and complicate the training process. In this work, we propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS). AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence. In the search stage, instead of searching over a discrete set of candidate feature interactions, we relax the choices to be continuous by introducing the architecture parameters. By implementing a regularized optimizer over the architecture parameters, the model can automatically identify and remove the redundant feature interactions during the training process of the model. In the re-train stage, we keep the architecture parameters serving as an attention unit to further boost the performance. Offline experiments on three large-scale datasets (two public benchmarks, one private) demonstrate that AutoFIS can significantly improve various FM based models. AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service, where a 10-day online A/B test demonstrated that AutoFIS improved the DeepFM model by 20.3% and 20.1% in terms of CTR and CVR respectively.

Criteo; (DeepFM)auc: 0.8009, logloss: 0.5404 (table1) Training Deep AutoEncoders for Collaborative Filtering

Abstract

�is paper proposes a model for the rating prediction task in recommender systems which signi�cantly outperforms previous stateof-the art models on a time-split Net�ix data set. Our model is based on deep autoencoder with 6 layers and is trained end-to-end without any layer-wise pre-training. We empirically demonstrate that: a) deep autoencoder models generalize much be�er than the shallow ones, b) non-linear activation functions with negative parts are crucial for training deep models, and c) heavy use of regularization techniques such as dropout is necessary to prevent over��ing. We also propose a new training algorithm based on iterative output re-feeding to overcome natural sparseness of collaborate �ltering. �e new algorithm signi�cantly speeds up training and improves model performance. Our code is available at h�ps://github.com/NVIDIA/DeepRecommender.

RMSE: Netflix 3 months: 0.9373, Netflix 6 months : 0.9207, Netflix 1 year : 0.9225, Netflix full: 0.9099; DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Abstract

The Mixture-of-experts (MoE) architecture is showing promising results in multitask learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable “sparse gate” to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: the first, continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. Our gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k in the context of MTL, on both synthetic and real datasets with up to 128 tasks. Our experiments indicate that MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance. Notably, on a real-world large-scale recommender system, DSelect-k achieves over 22% average improvement in predictive performance compared to the Top-k gate. We provide an open-source TensorFlow implementation of our gate1 .

MNIST: Accuracy1 92.56% Accuracy2: 90.98 % Self-Supervised Multi-Channel Hypergraph Convolutional Network for Social Recommendation

Abstract

Social relations are often used to improve recommendation quality when user-item interaction data is sparse in recommender systems. Most existing social recommendation models exploit pairwise relations to mine potential user preferences. However, real-life interactions among users are very complicated and user relations can be high-order. Hypergraph provides a natural way to model complex high-order relations, while its potentials for improving social recommendation are under-explored. In this paper, we fill this gap and propose a multi-channel hypergraph convolutional network to enhance social recommendation by leveraging high-order user relations. Technically, each channel in the network encodes a hypergraph that depicts a common high-order user relation pattern via hypergraph convolution. By aggregating the embeddings learned through multiple channels, we obtain comprehensive user representations to generate recommendation results. However, the aggregation operation might also obscure the inherent characteristics of different types of high-order connectivity information. To compensate for the aggregating loss, we innovatively integrate self-supervised learning into the training of the hypergraph convolutional network to regain the connectivity information with hierarchical mutual information maximization. The experimental results on multiple real-world datasets show that the proposed model outperforms the SOTA methods, and the ablation study verifies the effectiveness of the multi-channel setting and the selfsupervised task. The implementation of our model is available via https://github.com/Coder-Yu/RecQ .

LastFM: Precision@10: 20.052, Recall@10: 20.375, NDCG@10: 24.395 DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems

Abstract

Learning effective feature crosses is the key behind building recommender systems. However, the sparse and large feature space requires exhaustive search to identify effective crosses. Deep & Cross Network (DCN) was proposed to automatically and efficiently learn bounded-degree predictive feature interactions. Unfortunately, in models that serve web-scale traffic with billions of training examples, DCN showed limited expressiveness in its cross network at learning more predictive feature interactions. Despite significant research progress made, many deep learning models in production still rely on traditional feed-forward neural networks to learn feature crosses inefficiently. In light of the pros/cons of DCN and existing feature interaction learning approaches, we propose an improved framework DCN-V2 to make DCN more practical in large-scale industrial settings. In a comprehensive experimental study with extensive hyper-parameter search and model tuning, we observed that DCN-V2 approaches outperform all the state-of-the-art algorithms on popular benchmark datasets. The improved DCN-V2 is more expressive yet remains cost efficient at feature interaction learning, especially when coupled with a mixture of low-rank architecture. DCN-V2 is simple, can be easily adopted as building blocks, and has delivered significant offline accuracy and online business metrics gains across many web-scale learning to rank systems at Google.

Logloss: 0.4406, AUC: 0.8115 FLEN: Leveraging Field for Scalable CTR Prediction

Abstract

Click-Through Rate (CTR) prediction systems are usually based on multi-field categorical features, i.e., every feature is categorical and belongs to one and only one field. Modeling feature conjunctions is crucial for CTR prediction accuracy. However, it usually requires a massive number of parameters to explicitly model all feature conjunctions, which is not scalable for real-world production systems. In this paper, we describe a novel Field-Leveraged Embedding Network (FLEN) which has been deployed in the commercial recommender systems in Meitu and serves the main traffic. FLEN devises a field-wise bi-interaction pooling technique. By suitably exploiting field information, the field-wise bi-interaction pooling layer captures both inter-field and intra-field feature conjunctions with a small number of model parameters and an acceptable time complexity for industrial applications. We show that some classic shallow CTR models can be regarded as special cases of this technique, i.e., MF, FM and FwFM. We identify a unique challenge in this technique, i.e., the FM module in our model may suffer from the coupled gradient issue, which will damage the performance of the model. To solve this challenge, we develop Dicefactor: a novel dropout method to prevent independent latent features from co-adapting. Extensive experiments, including offline evaluations and online A/B testing on real production systems, demonstrate the effectiveness and efficiency of FLEN against the state-of-the-art models. In particular, compared to the previous version deployed on the system (i.e. NFM), FLEN has obtained 5.19% improvement on CTR with 1/6 of memory usage and computation time.

AUC: 0.7519, Logloss: 0.3944; Deep Position-wise Interaction Network For CTR Prediction

Abstract

Click-through rate (CTR) prediction plays an important role in online advertising and recommender systems. In practice, the training of CTR models depends on click data which is intrinsically biased towards higher positions since higher position has higher CTR by nature. Existing methods such as actual position training with fixed position inference and inverse propensity weighted training with no position inference alleviate the bias problem to some extend. However, the different treatment of position information between training and inference will inevitably lead to inconsistency and sub-optimal online performance. Meanwhile, the basic assumption of these methods, i.e., the click probability is the product of examination probability and relevance probability, is oversimplified and insufficient to model the rich interaction between position and other information. In this paper, we propose a Deep Position-wise Interaction Network (DPIN) to efficiently combine all candidate items and positions for estimating CTR at each position, achieving consistency between offline and online as well as modeling the deep non-linear interaction among position, user, context and item under the limit of serving performance. Following our new treatment to the position bias in CTR prediction, we propose a new evaluation metrics named PAUC (position-wise AUC) that is suitable for measuring the ranking quality at a given position. Through extensive experiments on a real world dataset, we show empirically that our method is both effective and efficient in solving position bias problem. We have also deployed our method in production and observed statistically significant improvement over a highly optimized baseline in a rigorous A/B test.

按照论文数据，预计以DIN模型作为对比，AUC获得性能提升 AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks

Abstract

Click-through rate (CTR) prediction, which aims to predict the probability of a user clicking on an ad or an item, is critical to many online applications such as online advertising and recommender systems. The problem is very challenging since (1) the input features (e.g., the user id, user age, item id, item category) are usually sparse and high-dimensional, and (2) an effective prediction relies on high-order combinatorial features (\textit{a.k.a.} cross features), which are very time-consuming to hand-craft by domain experts and are impossible to be enumerated. Therefore, there have been efforts in finding low-dimensional representations of the sparse and high-dimensional raw features and their meaningful combinations. In this paper, we propose an effective and efficient method called the \emph{AutoInt} to automatically learn the high-order feature interactions of input features. Our proposed algorithm is very general, which can be applied to both numerical and categorical input features. Specifically, we map both the numerical and categorical features into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled. The whole model can be efficiently fit on large-scale raw data in an end-to-end fashion. Experimental results on four real-world datasets show that our proposed approach not only outperforms existing state-of-the-art approaches for prediction but also offers good explainability. Code is available at: \url{ https://github.com/DeepGraphLearning/RecommenderSystems }.

验收标准：1.按照论文数据Criteo，AUC 80.61% 2.复现后合入PaddleRec套件，并添加TIPC Personalized News Recommendation with Knowledge-aware Interactive Matching

Abstract

The most important task in personalized news recommendation is accurate matching between candidate news and user interest. Most of existing news recommendation methods model candidate news from its textual content and user interest from their clicked news in an independent way. However, a news article may cover multiple aspects and entities, and a user usually has different kinds of interest. Independent modeling of candidate news and user interest may lead to inferior matching between news and users. In this paper, we propose a knowledge-aware interactive matching method for news recommendation. Our method interactively models candidate news and user interest to facilitate their accurate matching. We design a knowledge-aware news co-encoder to interactively learn representations for both clicked news and candidate news by capturing their relatedness in both semantic and entities with the help of knowledge graphs. We also design a user-news co-encoder to learn candidate news-aware user interest representation and user-aware candidate news representation for better interest matching. Experiments on two real-world datasets validate that our method can effectively improve the performance of news recommendation.

AUC 67.13 ,MRR 32.08,NDCG@5 35.49, NDCG@10 41.79 Package Recommendation with Intra- and Inter-Package Attention Networks

Abstract

With the booming of online social networks in the mobile internet, an emerging recommendation scenario has played a vital role in information acquisition for user, where users are no longer recommended with a single item or item list, but a combination of heterogeneous and diverse objects (called a package, e.g., a package including news, publisher, and friends viewing the news). Different from the conventional recommendation where users are recommended with the item itself, in package recommendation, users would show great interests on the explicitly displayed objects that could have a significant influence on the user behaviors. However, to the best of our knowledge, few effort has been made for package recommendation and existing approaches can hardly model the complex interactions of diverse objects in a package. Thus, in this paper, we make a first study on package recommendation and propose an Intra- and inter-package attention network for Package Recommendation (IPRec). Specifically, for package modeling, an intra-package attention network is put forward to capture the object-level intention of user interacting with the package, while an inter-package attention network acts as a package-level information encoder that captures collaborative features of neighboring packages. In addition, to capture users preference representation, we present a user preference learner equipped with a fine-grained feature aggregation network and coarse-grained package aggregation network. Extensive experiments on three real-world datasets demonstrate that IPRec significantly outperforms the state of the arts. Moreover, the model analysis demonstrates the interpretability of our IPRec and the characteristics of user behaviors. Codes and datasets can be obtained at https://github.com/LeeChenChen/IPRec .

论文搜集的真实数据集，3-dayAUC：0.66915-dayAUC：0.6754， 10-dayAUC：0.6853 Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising

Abstract

In most real-world large-scale online applications (e.g., e-commerce or finance), customer acquisition is usually a multi-step conversion process of audiences. For example, an impression->click->purchase process is usually performed of audiences for e-commerce platforms. However, it is more difficult to acquire customers in financial advertising (e.g., credit card advertising) than in traditional advertising. On the one hand, the audience multi-step conversion path is longer. On the other hand, the positive feedback is sparser (class imbalance) step by step, and it is difficult to obtain the final positive feedback due to the delayed feedback of activation. Multi-task learning is a typical solution in this direction. While considerable multi-task efforts have been made in this direction, a long-standing challenge is how to explicitly model the long-path sequential dependence among audience multi-step conversions for improving the end-to-end conversion. In this paper, we propose an Adaptive Information Transfer Multi-task (AITM) framework, which models the sequential dependence among audience multi-step conversions via the Adaptive Information Transfer (AIT) module. The AIT module can adaptively learn what and how much information to transfer for different conversion stages. Besides, by combining the Behavioral Expectation Calibrator in the loss function, the AITM framework can yield more accurate end-to-end conversion identification. The proposed framework is deployed in Meituan app, which utilizes it to real-timely show a banner to the audience with a high end-to-end conversion rate for Meituan Co-Branded Credit Cards. Offline experimental results on both industrial and public real-world datasets clearly demonstrate that the proposed framework achieves significantly better performance compared with state-of-the-art baselines.

AUC：0.6043 purchase AUC：0.6525 Detecting Beneficial Feature Interactions for Recommender Systems

Abstract

Feature interactions are essential for achieving high accuracy in recommender systems. Many studies take into account the interaction between every pair of features. However, this is suboptimal because some feature interactions may not be that relevant to the recommendation result, and taking them into account may introduce noise and decrease recommendation accuracy. To make the best out of feature interactions, we propose a graph neural network approach to effectively model them, together with a novel technique to automatically detect those feature interactions that are beneficial in terms of recommendation accuracy. The automatic feature interaction detection is achieved via edge prediction with an L0 activation regularization. Our proposed model is proved to be effective through the information bottleneck principle and statistical interaction theory. Experimental results show that our model (i) outperforms existing baselines in terms of accuracy, and (ii) automatically identifies beneficial feature interactions.

Movielens；AUC：0.9407，ACC：0.8970 Deep Session Interest Network for Click-Through Rate Prediction

Abstract

Easy-to-use,Modular and Extendible package of deep-learning based CTR models.DeepFM,DeepInterestNetwork(DIN),DeepInterestEvolutionNetwork(DIEN),DeepCrossNetwork(DCN),AttentionalFactorizationMachine(AFM),Neural Factorization Machine(NFM),AutoInt,Deep Session Interest Network(DSIN)

advertising-challenge-datase logloss; AUC >0.63 Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising

Abstract

In recommender systems and advertising platforms, marketers always want to deliver products, contents, or advertisements to potential audiences over media channels such as display, video, or social. Given a set of audiences or customers (seed users), the audience expansion technique (look-alike modeling) is a promising solution to identify more potential audiences, who are similar to the seed users and likely to finish the business goal of the target campaign. However, look-alike modeling faces two challenges: (1) In practice, a company could run hundreds of marketing campaigns to promote various contents within completely different categories every day, e.g., sports, politics, society. Thus, it is difficult to utilize a common method to expand audiences for all campaigns. (2) The seed set of a certain campaign could only cover limited users. Therefore, a customized approach based on such a seed set is likely to be overfitting. In this paper, to address these challenges, we propose a novel two-stage framework named Meta Hybrid Experts and Critics (MetaHeac) which has been deployed in WeChat Look-alike System. In the offline stage, a general model which can capture the relationships among various tasks is trained from a meta-learning perspective on all existing campaign tasks. In the online stage, for a new campaign, a customized model is learned with the given seed set based on the general model. According to both offline and online experiments, the proposed MetaHeac shows superior effectiveness for both content marketing campaigns in recommender systems and advertising campaigns in advertising platforms. Besides, MetaHeac has been successfully deployed in WeChat for the promotion of both contents and advertisements, leading to great improvement in the quality of marketing. The code has been available at \url{ https://github.com/easezyc/MetaHeac }.

AUC>=0.7239 Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction

Abstract

Easy-to-use, Modular and Extendible package of deep-learning based CTR models. DeepFM, DeepInterestNetwork(DIN), DeepInterestEvolutionNetwork(DIEN), DeepCrossNetwork(DCN), AttentionalFactorizationMachine(AFM), Neural Factorization Machine(NFM), AutoInt, Deep Session Interest Network(DSIN)

AUC:80.22% ,Log Loss:0.5388 论文名称(链接) End to End Learning for Self-Driving Cars

Abstract

Point cloud is an important type of geometric data structure. Due to itsirregular format, most researchers transform such data to regular 3D voxelgrids or collections of images. This, however, renders data unnecessarilyvoluminous and causes issues. In this paper, we design a novel type of neuralnetwork that directly consumes point clouds and well respects the permutationinvariance of points in the input. Our network, named PointNet, provides aunified architecture for applications ranging from object classification, partsegmentation, to scene semantic parsing. Though simple, PointNet is highlyefficient and effective. Empirically, it shows strong performance on par oreven better than state of the art. Theoretically, we provide analysis towardsunderstanding of what the network has learnt and why the network is robust withrespect to input perturbation and corruption.

能在模拟器上运行不偏离路面, 模拟器地址https: //github.com/udacity/self-driving-car-sim Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Abstract

Top-down visual attention mechanisms have been used extensively in imagecaptioning and visual question answering (VQA) to enable deeper imageunderstanding through fine-grained analysis and even multiple steps ofreasoning. In this work, we propose a combined bottom-up and top-down attentionmechanism that enables attention to be calculated at the level of objects andother salient image regions. This is the natural basis for attention to beconsidered. Within our approach, the bottom-up mechanism (based on FasterR-CNN) proposes image regions, each with an associated feature vector, whilethe top-down mechanism determines feature weightings. Applying this approach toimage captioning, our results on the MSCOCO test server establish a newstate-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability ofthe method, applying the same approach to VQA we obtain first place in the 2017VQA Challenge.

full test set 75.6% Towards Deep Learning Models Resistant to Adversarial Attacks

Abstract

Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge .

PGD-steps100-restarts20-sourceA: 89.3%PGD-steps100-restarts20-sourceA: 95.7%PGD-steps40-restarts1-sourceB: 96.4% Stacked Hourglass Networks for Human Pose Estimation

Abstract

This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a"stacked hourglass"network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.

MPII Human Pose Dataset, hourglass52, size: 384x384, [email protected]: 0.366 size: 256x256, [email protected]: 0.317 Learning to See in the Dark

Abstract

Imaging in low light is challenging due to low photon count and low SNR. Short-exposure images suffer from noise, while long exposure can induce blur and is often impractical. A variety of denoising, deblurring, and enhancement techniques have been proposed, but their effectiveness is limited in extreme conditions, such as video-rate imaging at night. To support the development of learning-based pipelines for low-light image processing, we introduce a dataset of raw short-exposure low-light images, with corresponding long-exposure reference images. Using the presented dataset, we develop a pipeline for processing low-light images, based on end-to-end training of a fully-convolutional network. The network operates directly on raw sensor data and replaces much of the traditional image processing pipeline, which tends to perform poorly on such data. We report promising results on the new dataset, analyze factors that affect performance, and highlight opportunities for future work. The results are shown in the supplementary video at https://youtu.be/qWKUFK7MWvg

PSNR/SSIM: 28.88/0.787 Adversarial Autoencoders

Abstract

In this paper, we propose the"adversarial autoencoder"(AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We performed experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.

MNIST, Log-likelihood(10K): 340±2 Supervised Contrastive Learning

Abstract

Contrastive learning applied to self-supervised representation learning has seen a resurgence in recent years, leading to state of the art performance in the unsupervised training of deep image models. Modern batch contrastive approaches subsume or significantly outperform traditional contrastive losses such as triplet, max-margin and the N-pairs loss. In this work, we extend the self-supervised batch contrastive approach to the fully-supervised setting, allowing us to effectively leverage label information. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss. On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture. We show consistent outperformance over cross-entropy on other datasets and two ResNet variants. The loss shows benefits for robustness to natural corruptions and is more stable to hyperparameter settings such as optimizers and data augmentations. Our loss function is simple to implement, and reference TensorFlow code is released at https://t.ly/supcon .

基于Contrastive loss, CIFAR10数据集在ResNet-50上, Top-1 Acc=96% CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

Abstract

We present a simple, fully-convolutional model for real-time (>30 fps)instance segmentation that achieves competitive results on MS COCO evaluated ona single Titan Xp, which is significantly faster than any previousstate-of-the-art approach. Moreover, we obtain this result after training ononly one GPU. We accomplish this by breaking instance segmentation into twoparallel subtasks: (1) generating a set of prototype masks and (2) predictingper-instance mask coefficients. Then we produce instance masks by linearlycombining the prototypes with the mask coefficients. We find that because thisprocess doesn't depend on repooling, this approach produces very high-qualitymasks and exhibits temporal stability for free. Furthermore, we analyze theemergent behavior of our prototypes and show they learn to localize instanceson their own in a translation variant manner, despite beingfully-convolutional. We also propose Fast NMS, a drop-in 12 ms fasterreplacement for standard NMS that only has a marginal performance penalty.Finally, by incorporating deformable convolutions into the backbone network,optimizing the prediction head with better anchor scales and aspect ratios, andadding a novel fast mask re-scoring branch, our YOLACT++ model can achieve 34.1mAP on MS COCO at 33.5 fps, which is fairly close to the state-of-the-artapproaches while still running at real-time.

Resnet50 AUC: Atelectasis=0.707 Cardiomegaly=0.81 Effusion=0.73 Infiltration =0.61 Mass=0.56Nodule=0.71 Pneumonia=0.63 Pneumothorax=0.78 TabNet: Attentive Interpretable Tabular Learning

Abstract

Though tremendous strides have been made in uncontrolled face detection,accurate and efficient face localisation in the wild remains an open challenge.This paper presents a robust single-stage face detector, named RetinaFace,which performs pixel-wise face localisation on various scales of faces bytaking advantages of joint extra-supervised and self-supervised multi-tasklearning. Specifically, We make contributions in the following five aspects:(1) We manually annotate five facial landmarks on the WIDER FACE dataset andobserve significant improvement in hard face detection with the assistance ofthis extra supervision signal. (2) We further add a self-supervised meshdecoder branch for predicting a pixel-wise 3D shape face information inparallel with the existing supervised branches. (3) On the WIDER FACE hard testset, RetinaFace outperforms the state of the art average precision (AP) by 1.1%(achieving AP equal to 91.4%). (4) On the IJB-C test set, RetinaFace enablesstate of the art methods (ArcFace) to improve their results in faceverification (TAR=89.59% for FAR=1e-6). (5) By employing light-weight backbonenetworks, RetinaFace can run real-time on a single CPU core for aVGA-resolution image. Extra annotations and code have been made available at:this https URL.

Forest Cover Type: acc=96.99% Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Abstract

Relational reasoning is a central component of generally intelligentbehavior, but has proven difficult for neural networks to learn. In this paperwe describe how to use Relation Networks (RNs) as a simple plug-and-play moduleto solve problems that fundamentally hinge on relational reasoning. We testedRN-augmented networks on three tasks: visual question answering using achallenging dataset called CLEVR, on which we achieve state-of-the-art,super-human performance; text-based question answering using the bAbI suite oftasks; and complex reasoning about dynamic physical systems. Then, using acurated dataset called Sort-of-CLEVR we show that powerful convolutionalnetworks do not have a general capacity to solve relational questions, but cangain this capacity when augmented with RNs. Our work shows how a deep learningarchitecture equipped with an RN module can implicitly discover and learn toreason about entities and their relations.

可视化方法 A simple neural network module for relational reasoning

Abstract

We introduce the variational graph auto-encoder (VGAE), a framework forunsupervised learning on graph-structured data based on the variationalauto-encoder (VAE). This model makes use of latent variables and is capable oflearning interpretable latent representations for undirected graphs. Wedemonstrate this model using a graph convolutional network (GCN) encoder and asimple inner product decoder. Our model achieves competitive results on a linkprediction task in citation networks. In contrast to most existing models forunsupervised learning on graph-structured data and link prediction, our modelcan naturally incorporate node features, which significantly improvespredictive performance on a number of benchmark datasets.

CLEVR: Acc = 95.5% Variational Graph Auto-Encoders

Abstract

This paper presents a self-supervised framework for training interest pointdetectors and descriptors suitable for a large number of multiple-view geometryproblems in computer vision. As opposed to patch-based neural networks, ourfully-convolutional model operates on full-sized images and jointly computespixel-level interest point locations and associated descriptors in one forwardpass. We introduce Homographic Adaptation, a multi-scale, multi-homographyapproach for boosting interest point detection repeatability and performingcross-domain adaptation (e.g., synthetic-to-real). Our model, when trained onthe MS-COCO generic image dataset using Homographic Adaptation, is able torepeatedly detect a much richer set of interest points than the initialpre-adapted deep model and any other traditional corner detector. The finalsystem gives rise to state-of-the-art homography estimation results on HPatcheswhen compared to LIFT, SIFT and ORB.

CiteSeer AUC: 90.8, AP: 92 SuperPoint: Self-Supervised Interest Point Detection and Description

Abstract

The demand of applying semantic segmentation model on mobile devices has beenincreasing rapidly. Current state-of-the-art networks have enormous amount ofparameters hence unsuitable for mobile devices, while other small memoryfootprint models follow the spirit of classification network and ignore theinherent characteristic of semantic segmentation. To tackle this problem, wepropose a novel Context Guided Network (CGNet), which is a light-weight andefficient network for semantic segmentation. We first propose the ContextGuided (CG) block, which learns the joint feature of both local feature andsurrounding context, and further improves the joint feature with the globalcontext. Based on the CG block, we develop CGNet which captures contextualinformation in all stages of the network and is specially tailored forincreasing segmentation accuracy. CGNet is also elaborately designed to reducethe number of parameters and save memory footprint. Under an equivalent numberof parameters, the proposed CGNet significantly outperforms existingsegmentation networks. Extensive experiments on Cityscapes and CamVid datasetsverify the effectiveness of the proposed approach. Specifically, without anypost-processing and multi-scale testing, the proposed CGNet achieves 64.8% meanIoU on Cityscapes with less than 0.5 M parameters. The source code for thecomplete system can be found at this https URL.

MS-COCO 2014 HPatches Homography Estimation, e=1 0.460 Relaxed Transformer Decoders for Direct Action Proposal Generation

Abstract

Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. The existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-alike architecture. To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR). First, to deal with slowness prior in videos, we replace the original Transformer encoder with a boundary attentive module to better capture long-range temporal information. Second, due to the ambiguous temporal boundary and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criteria of single assignment to each groundtruth. Finally, we devise a three-branch head to further improve the proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net, on both tasks of temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at https://github.com/MCG-NJU/RTD-Action .

THUMOS14, AR@50=41.52 Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

Abstract

Multi-horizon forecasting problems often contain a complex mix of inputs -- including static (i.e. time-invariant) covariates, known future inputs, and other exogenous time series that are only observed historically -- without any prior information on how they interact with the target. While several deep learning models have been proposed for multi-step prediction, they typically comprise black-box models which do not account for the full range of inputs present in common scenarios. In this paper, we introduce the Temporal Fusion Transformer (TFT) -- a novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, the TFT utilizes recurrent layers for local processing and interpretable self-attention layers for learning long-term dependencies. The TFT also uses specialized components for the judicious selection of relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of regimes. On a variety of real-world datasets, we demonstrate significant performance improvements over existing benchmarks, and showcase three practical interpretability use-cases of TFT.

Dataset: Electricity P90 loss: 0.027 Learning to Adapt Structured Output Space for Semantic Segmentation

Abstract

Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor intensive, developing algorithms that can adapt source ground truth labels to the target domain is of great interest. In this paper, we propose an adversarial learning method for domain adaptation in the context of semantic segmentation. Considering semantic segmentations as structured outputs that contain spatial similarities between the source and target domains, we adopt adversarial learning in the output space. To further enhance the adapted model, we construct a multi-level adversarial network to effectively perform output space domain adaptation at different feature levels. Extensive experiments and ablation study are conducted under various domain adaptation settings, including synthetic-to-real and cross-city scenarios. We show that the proposed method performs favorably against the state-of-the-art methods in terms of accuracy and visual quality.

Cityscapes: resnet101, mIOU=42.4 Unsupervised Monocular Depth Estimation with Left-Right Consistency

Abstract

Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.

KiTTI: Abs Rel 0.130 Digging Into Self-Supervised Monocular Depth Estimation

Abstract

Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.

KiTTI: Abs Rel 0.106 Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation

Abstract

Remarkable results have been achieved by DCNN based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity loss, where the domain gap can be alleviated, thus better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach.

IMDb测试集error rates=4.6%, TREC-6测试集error rates=3.6% , AG’s News测试集 error rates=5.01%(见论文Table 2 & Table 3) PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds

Abstract

We introduce Position Adaptive Convolution (PAConv), a generic convolution operation for 3D point cloud processing. The key of PAConv is to construct the convolution kernel by dynamically assembling basic weight matrices stored in Weight Bank, where the coefficients of these weight matrices are self-adaptively learned from point positions through ScoreNet. In this way, the kernel is built in a data-driven manner, endowing PAConv with more flexibility than 2D convolutions to better handle the irregular and unordered point cloud data. Besides, the complexity of the learning process is reduced by combining weight matrices instead of brutally predicting kernels from point positions. Furthermore, different from the existing point convolution operators whose network architectures are often heavily engineered, we integrate our PAConv into classical MLP-based point cloud pipelines without changing network configurations. Even built on simple networks, our method still approaches or even surpasses the state-of-the-art models, and significantly improves baseline performance on both classification and segmentation tasks, yet with decent efficiency. Thorough ablation studies and visualizations are provided to understand PAConv. Code is released on https://github.com/CVMI-Lab/PAConv .

Classification accuracy (%) on ModelNet40：93.9 Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery

Abstract

Geospatial object segmentation, as a particular semantic segmentation task, always faces with larger-scale variation, larger intra-class variance of background, and foreground-background imbalance in the high spatial resolution (HSR) remote sensing imagery. However, general semantic segmentation methods mainly focus on scale variation in the natural scene, with inadequate consideration of the other two problems that usually happen in the large area earth observation scene. In this paper, we argue that the problems lie on the lack of foreground modeling and propose a foreground-aware relation network (FarSeg) from the perspectives of relation-based and optimization-based foreground modeling, to alleviate the above two problems. From perspective of relation, FarSeg enhances the discrimination of foreground features via foreground-correlated contexts associated by learning foreground-scene relation. Meanwhile, from perspective of optimization, a foreground-aware optimization is proposed to focus on foreground examples and hard examples of background during training for a balanced optimization. The experimental results obtained using a large scale dataset suggest that the proposed method is superior to the state-of-the-art general semantic segmentation methods and achieves a better trade-off between speed and accuracy. Code has been made available at: \url{this https URL}.

数据集 iSAID 验收指标：1. FarSeg ResNet-50 mIOU= 63.71% 参考论文Table.62. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS PYSKL: Towards Good Practices for Skeleton Action Recognition

Abstract

We present PYSKL: an open-source toolbox for skeleton-based action recognition based on PyTorch. The toolbox supports a wide variety of skeleton action recognition algorithms, including approaches based on GCN and CNN. In contrast to existing open-source skeleton action recognition projects that include only one or two algorithms, PYSKL implements six different algorithms under a unified framework with both the latest and original good practices to ease the comparison of efficacy and efficiency. We also provide an original GCN-based skeleton action recognition model named ST-GCN++, which achieves competitive recognition performance without any complicated attention schemes, serving as a strong baseline. Meanwhile, PYSKL supports the training and testing of nine skeleton-based action recognition benchmarks and achieves state-of-the-art recognition performance on eight of them. To facilitate future research on skeleton action recognition, we also provide a large number of trained models and detailed benchmark results to give some insights. PYSKL is released at https://github.com/kennymckormick/pyskl and is actively maintained. We will update this report when we add new features or benchmarks. The current version corresponds to PYSKL v0.2.

1.joint top1=97.4 2.复现后合入PaddleVideo套件，并添加TIPC STANet for remote sensing image change detection

Abstract

Remote sensing image change detection (CD) is done to identify desired significant changes between bitemporal images. Given two co-registered images taken at different times, the illumination variations and misregistration errors overwhelm the real object changes. Exploring the relationships among different spatial–temporal pixels may improve the performances of CD methods. In our work, we propose a novel Siamese-based spatial–temporal attention neural network. In contrast to previous methods that separately encode the bitemporal images without referring to any useful spatial–temporal dependency, we design a CD self-attention mechanism to model the spatial–temporal relationships. We integrate a new CD self-attention module in the procedure of feature extraction. Our self-attention module calculates the attention weights between any two pixels at different times and positions and uses them to generate more discriminative features. Considering that the object may have different scales, we partition the image into multi-scale subregions and introduce the self-attention in each subregion. In this way, we could capture spatial–temporal dependencies at various scales, thereby generating better representations to accommodate objects of various sizes. We also introduce a CD dataset LEVIR-CD, which is two orders of magnitude larger than other public datasets of this field. LEVIR-CD consists of a large set of bitemporal Google Earth images, with 637 image pairs (1024 _x0002_ 1024) and over 31 k independently labeled change instances. Our proposed attention module improves the F1-score of our baseline model from 83.9 to 87.3 with acceptable computational overhead. Experimental results on a public remote sensing image CD dataset show our method outperforms several other state-of-the-art methods.

数据集 LEVIR-CD验收指标：1. STANet-PAM F1=87.3% 参考论文 Table.42. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images

Abstract

Change detection is an important task in remote sensing (RS) image analysis. It is widely used in natural disaster monitoring and assessment, land resource planning, and other fields. As a pixel-to-pixel prediction task, change detection is sensitive about the utilization of the original position information. Recent change detection methods always focus on the extraction of deep change semantic feature, but ignore the importance of shallow-layer information containing high-resolution and fine-grained features, this often leads to the uncertainty of the pixels at the edge of the changed target and the determination miss of small targets. In this letter, we propose a densely connected siamese network for change detection, namely SNUNet-CD (the combination of Siamese network and NestedUNet). SNUNet-CD alleviates the loss of localization information in the deep layers of neural network through compact information transmission between encoder and decoder, and between decoder and decoder. In addition, Ensemble Channel Attention Module (ECAM) is proposed for deep supervision. Through ECAM, the most representative features of different semantic levels can be refined and used for the final classification. Experimental results show that our method improves greatly on many evaluation criteria and has a better tradeoff between accuracy and calculation amount than other state-of-the-art (SOTA) change detection methods.

数据集 CDD https://drive.google.com/file/d/1GX656JqqOyBi_Ef0w65kDGVto-nHrNs9/edit验收指标：1 . SNUNet-c32 F1-Score=95.3% 参考 https://paperswithcode.com/sota/change-detection-for-remote-sensing-images-on2 . 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS Remote Sensing Image Change Detection with Transformers

Abstract

Modern change detection (CD) has achieved remarkable success by the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD remains challenging due to the complexity of objects in the scene. Objects with the same semantic concept may show distinct spectral characteristics at different times and spatial locations. Most recent CD pipelines using pure convolutions are still struggling to relate long-range concepts in space-time. Non-local self-attention approaches show promising performance via modeling dense relations among pixels, yet are computationally inefficient. Here, we propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain. Our intuition is that the high-level concepts of the change of interest can be represented by a few visual words, i.e., semantic tokens. To achieve this, we express the bitemporal image into a few tokens, and use a transformer encoder to model contexts in the compact token-based space-time. The learned context-rich tokens are then feedback to the pixel-space for refining the original features via a transformer decoder. We incorporate BIT in a deep feature differencing-based CD framework. Extensive experiments on three CD datasets demonstrate the effectiveness and efficiency of the proposed method. Notably, our BIT-based model significantly outperforms the purely convolutional baseline using only 3 times lower computational costs and model parameters. Based on a naive backbone (ResNet18) without sophisticated structures (e.g., FPN, UNet), our model surpasses several state-of-the-art CD methods, including better than four recent attention-based methods in terms of efficiency and accuracy. Our code is available at https://github.com/justchenhao/BIT\_CD .

数据集 LEVIR-CD验收指标：1. STANet-PAM F1 = 89.31% 参考论文 Table.12. 日志中包含周期 validation 和损失结果3. 复现后合入PaddleRS Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition

Abstract

In skeleton-based action recognition, graph convolutional networks (GCNs), which model the human body skeletons as spatiotemporal graphs, have achieved remarkable performance. However, in existing GCN-based methods, the topology of the graph is set manually, and it is fixed over all layers and input samples. This may not be optimal for the hierarchical GCN and diverse samples in action recognition tasks. In addition, the second-order information (the lengths and directions of bones) of the skeleton data, which is naturally more informative and discriminative for action recognition, is rarely investigated in existing methods. In this work, we propose a novel two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition. The topology of the graph in our model can be either uniformly or individually learned by the BP algorithm in an end-to-end manner. This data-driven method increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples. Moreover, a two-stream framework is proposed to model both the first-order and the second-order information simultaneously, which shows notable improvement for the recognition accuracy. Extensive experiments on the two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the state-of-the-art with a significant margin.

NTU-RGBD数据集, X-Sub=88.5%, X-view=95.1% MLDA-Net: Multi-Level Dual Attention-BasedNetwork for Self-Supervised MonocularDepth Estimation

Abstract

The success of supervised learning-based single image depth estimation methods critically depends on the availability of large-scale dense per-pixel depth annotations, which requires both laborious and expensive annotation process. Therefore, the self-supervised methods are much desirable, which attract significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with shaper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy which can learn rich hierarchical representation. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to conduct effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.

KITTI， ResNet18，RMSE：4.690 FFA-Net: Feature Fusion Attention Network for Single Image Dehazing

Abstract

In this paper, we propose an end-to-end feature fusion at-tention network (FFA-Net) to directly restore the haze-free image. The FFA-Net architecture consists of three key components: 1) A novel Feature Attention (FA) module combines Channel Attention with Pixel Attention mechanism, considering that different channel-wise features contain totally different weighted information and haze distribution is uneven on the different image pixels. FA treats different features and pixels unequally, which provides additional flexibility in dealing with different types of information, expanding the representational ability of CNNs. 2) A basic block structure consists of Local Residual Learning and Feature Attention, Local Residual Learning allowing the less important information such as thin haze region or low-frequency to be bypassed through multiple local residual connections, let main network architecture focus on more effective information. 3) An Attention-based different levels Feature Fusion (FFA) structure, the feature weights are adaptively learned from the Feature Attention (FA) module, giving more weight to important features. This structure can also retain the information of shallow layers and pass it into deep layers. The experimental results demonstrate that our proposed FFA-Net surpasses previous state-of-the-art single image dehazing methods by a very large margin both quantitatively and qualitatively, boosting the best published PSNR metric from 30.23db to 36.39db on the SOTS indoor test dataset. Code has been made available at GitHub.

YOWO: You only watch once: A unified cnn architecture for real-time spatiotemporal action localization

Abstract

Spatiotemporal action localization requires the incorporation of two sources of information into the designed architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract these information with separate networks and use an extra mechanism for fusion to get detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. The YOWO architecture is fast providing 34 frames-per-second on 16-frames input clips and 62 frames-per-second on 8-frames input clips, which is currently the fastest state-of-the-art architecture on spatiotemporal action localization task. Remarkably, YOWO outperforms the previous state-of-the art results on J-HMDB-21 and UCF101-24 with an impressive improvement of ~3% and ~12%, respectively. Moreover, YOWO is the first and only single-stage architecture that provides competitive results on AVA dataset. We make our code and pretrained models publicly available.

UCF101-24数据集，YOWO (16-frame)模型，frame-mAP under IoU threshold of 0.5=80.4 Token Shift Transformer for Video Classification

Abstract

Transformer achieves remarkable successes in understanding 1 and 2-dimensional signals (e.g., NLP and Image Content Understanding). As a potential alternative to convolutional neural networks, it shares merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing varying length inputs. However, its encoders naturally contain computational intensive operations such as pair-wise self-attention, incurring heavy computational burden when being applied on the complex 3-dimensional video signals. This paper presents Token Shift Module (i.e., TokShift), a novel, zero-parameter, zero-FLOPs operator, for modeling temporal relations within each transformer encoder. Specifically, the TokShift barely temporally shifts partial [Class] token features back-and-forth across adjacent frames. Then, we densely plug the module into each encoder of a plain 2D vision transformer for learning 3D video representation. It is worth noticing that our TokShift transformer is a pure convolutional-free video transformer pilot with computational efficiency for video understanding. Experiments on standard benchmarks verify its robustness, effectiveness, and efficiency. Particularly, with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision: 79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101 datasets, comparable or better than existing SOTA convolutional counterparts. Our code is open-sourced in: https://github.com/VideoNetworks/TokShift-Transformer .

UCF101数据集，无预训练模型条件下，8x256x256输入尺寸，Top1=91.60 XLM: Cross-lingual Language Model Pretraining

Abstract

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.

XNLI测试集average accuracy=75.1%（见论文Table 1） A Closer Look at Few-shot Classification

Abstract

Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the \miniI and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.

Matching Networks for One Shot Learning

Abstract

Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.

omniglot k-way=5, n-shot=1, 精度98.1 Few-Shot Learning with Graph Neural Networks

Abstract

We propose to study the problem of few-shot learning with the prism of inference on a partially observed graphical model, constructed from a collection of input images whose label can be either observed or not. By assimilating generic message-passing inference algorithms with their neural-network counterparts, we define a graph neural network architecture that generalizes several of the recently proposed few-shot learning models. Besides providing improved numerical performance, our framework is easily extended to variants of few-shot learning, such as semi-supervised or active learning, demonstrating the ability of graph-based models to operate well on ‘relational’ tasks.

Exploring Simple Siamese Representation Learning

Abstract

Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our"SimSiam"method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.

下面二者之一即可1. bs 256的情况下，top1-acc 68.3%2. bs512的情况下，top1-acc 68.1% Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Abstract

Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.

swav 2x224 + 6x96 ImageNet-1k 100epoch linear top1 acc 72.1% Dense Contrastive Learning for Self-Supervised Visual Pre-Training

Abstract

To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Code is available at: https://git.io/AdelaiDet

densecl_resnet50_8xb32-coslr-200epoch ImageNet-1k linear top1 acc 63.62% EfficientGCN: Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition

Abstract

One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the complexity of the recent State-Of-The-Art (SOTA) models for this task tends to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures in large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on such the baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, termed EfficientGCN-Bx, where"x"denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch version and the pretrained models are available at https://github.com/yfsong0709/EfficientGCNv1 .

NTU RGB+D 60数据集，EfficientGCN-B0模型，X-sub=90.2% X-view=94.9% CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.

CANINE-S模型，TYDI-QA Passage Selection Task上F1=66.0, TYDI-QA Minimal Answer Span Task上F1=52.5（见论文Table2） INFOXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training

Abstract

In this work, we present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, inspired by the framework, we propose a new pre-training task based on contrastive learning. Specifically, we regard a bilingual sentence pair as two views of the same meaning and encourage their encoded representations to be more similar than the negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available at https://aka.ms/infoxlm .

Taboeba测试集 cross-lingual sentence retrival avg(xx to en; en to xx)分别达到77.8, 80.6(见论文table2); XNLI测试集 transfer gap score=10.3(见论文table 5)

北理工贡献社区开源模型

计算机视觉

论文名称(链接) Deep Supervised Hashing for Fast Image Retrieval

Abstract

针对卷积神经网络（CNNs）在学习各种视觉任务的鲁棒图像表示方面的改进，提出的一种新的深度监督散列方法来学习庞大的图像数据的紧凑相似保持二进制代码。

HashNet HashNet: Deep Learning to Hash by Continuation

Abstract

一种新颖的、具有收敛保证的基于连续法进行深度学习哈希架构，在失衡相似性数据中学习精确二进制哈希代码

Wav2Lip A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

Abstract

一种基于GAN的说话人脸生成模型，实现了生成视频中嘴唇和音频的高度同步。在音频嘴唇同步指标LSE-D上达到了6.652。

A2L_IP_LAP Identity-Preserving Talking Face Generation with Landmark and Appearance Priors

Abstract

基于transformer的模型，融合了语音、面部关键点等多模态信息，实现了生成面部关键点的动作的生动自然准确。在LPIPS指标上打到了0.0303。

L2V_IP_LAP Identity-Preserving Talking Face Generation with Landmark and Appearance Priors

Abstract

融合面部关键点信息和面部纹理特征信息，实现了身份信息保留的从面部关键点到人脸视频渲染的过程。在SSIM指标上达到了0.9399

JoMoLD Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Abstract

对每个模态中小批量的所有实例的损失进行排序，然后根据模态内和模态间损失的关系选择噪声样本。此外，提出了一种简单但有效的噪声比率估计方法，通过计算置信度低于预设阈值的实例比例来实现。该方法在前沿技术上取得了显著进步（例如，在片段级视觉度量中从60.0%提高到63.8%），证明了该方法的有效性。

CM-Co-Occurrence-AVVP Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Abstract

提出了一种方法，通过探索跨视频和跨模态的监督信号来辅助弱监督音视频视频解析。这种方法利用了视频之间的共同和不同的事件语义，来识别音频或视觉事件。此外，还探索了音频、视觉和音视频流之间的事件共现情况，利用这种跨模态共现来定位目标事件的片段，同时排除无关片段。通过不同视频和模态之间发现的监督信号，可以极大地促进仅使用视频级标注的训练。在Type@AV, and Event@AV达到了60.5和59.5

MLS3RDUH MLS3RDUH: Deep Unsupervised Hashing via Manifold based Local Semantic Similarity Structure Reconstructing

Abstract

利用流形相似性和余弦相似性来重构局部语义相似结构，进而定义了新的相似性矩阵，并使用新的对数双曲哈希损失函数来优化哈希网络

Asymmetric Deep Supervised Hashing

Abstract

通过仅对查询点进行特征学习并直接学习数据库点的二进制哈希码，实现对查询点和数据库点的不对称处理，实现了更高效的训练和更佳的检索性能

Feature Learning based Deep Supervised Hashing with Pairwise Labels

Abstract

提出了首个端到端的深度哈希方法DPSH，实现了同时进行特征学习和哈希码学习的功能，并在图像检索应用中取得了 state-of-the-art 的性能

Semantic Structure-based Unsupervised Deep Hashing

Abstract

通过构建语义结构并设计成对损失函数来保留语义关系，在无监督情况下学习哈希码，显著优于当前的无监督深度哈希方法

BoxeR BoxeR: Box-Attention for 2D and 3D Transformers

Abstract

BoxeR预测输入特征的兴趣框相对于参考窗口的平移和尺寸变换，从而注意到变换后兴趣框内的网格特征。

SAM-DETR Accelerating DETR Convergence via Semantic-Aligned Matching

Abstract

SAM-DETR是一种语义对齐匹配的 DETR，极大地加速了 DETR 的收敛而不牺牲其准确性。首先，它将对象查询投影到与编码图像特征相同的嵌入空间中，其中可以通过对齐的语义有效地完成匹配。其次，它显式搜索具有最具辨别力的特征的显着点以进行语义对齐匹配，这进一步加快了收敛速度并提高了检测精度。

TCTrack TCTrack: Temporal Contexts for Aerial Tracking

Abstract

TCTrack是一个充分利用时间上下文进行空中跟踪的综合框架，对于特征提取，提出了一种在线时间自适应卷积，以利用时间信息增强空间特征；对于相似性图细化，提出了一种自适应时间变换器，它首先以内存有效的方式有效地编码时间知识，然后对时间知识进行解码以精确调整相似性图。

Omni-DETR Omni-DETR: Omni-Supervised Object Detection with Transformers

Abstract

Omni-DETR是一个使用 Transformer 进行全方位监督的目标检测的架构，该架构基于学生-教师框架和基于端到端变压器的对象检测的最新进展，可以利用不同类型的弱标签，通过基于二分匹配的过滤机制生成准确的伪标签供模型学习。

自然语言处理

论文名称(链接) Variational Deep Semantic Hashing for Text Documents

Abstract

融合深度生成模型自然地将概率生成模型的表达能力与深度神经网络的高容量方法，以提高文本建模任务中的表现力。

VDSH-S Variational Deep Semantic Hashing for Text Documents

Abstract

使用相同标签与单词的图像对和相应的相似性标签来训练网络，以高效保留原始数据结构相似性。

VDNSH-SP Variational Deep Semantic Hashing for Text Documents

Abstract

区分化训练私有文档变量以及非共享标签信息，结合深度模型对于文本建模任务中的表征能力构建新兴的神经网络模型。

W2NER Unified Named Entity Recognition as Word-Word Relation Classification

Abstract

一个统一的NER神经网络框架，提出了一种多粒度二维卷积方法来充分捕获近距词和远距词之间的相互作用，论文达到14个NER数据集的sota结果。

CLNNER Unified Named Entity Recognition as Word-Word Relation Classification

Abstract

一种bert-based 自定义编码及解码的拓展性极高的模型，基于词头尾链接与多种嵌入表示融合，在多个主流NER任务中接近sota结果。

MRN: A Locally and Globally Mention-Based Reasoning Network for Document-Level Relation Extraction

Abstract

提出了一种新的基于提及的推理(MRN)模块，该模块基于显式和协作的局部和全局推理，联合预测实体关系，在三个广泛使用的关系抽取数据集上达到当时的sota结果。

LMBRN MRN: A Locally and Globally Mention-Based Reasoning Network for Document-Level Relation Extraction

Abstract

提出了一种基于提及的推理网络，在关系提取中区分远近实体提及的影响，同时考虑局部上下文之间的相互作用，显式推理提及之间的关联关系，在DocRED关系抽取数据集上接近sota结果。

GMBRN MRN: A Locally and Globally Mention-Based Reasoning Network for Document-Level Relation Extraction

Abstract

提出了一个协同预测器来与推理网络中的提及推理块协同工作，并且使用全局特征预测一对实体之间的关系，通过多跳提及级推理块和协作预测器捕获全局上下文信息以及近距离提及交互，在DocRED关系抽取数据集上接近sota结果。

HiTrans HiTrans: A Transformer-Based Context- and Speaker-Sensitive Model for Emotion Detection in Conversations

Abstract

利用BERT作为低级转换器来生成局部话语表示，并将它们馈送到另一个高级转换器中，以便话语表示可以对对话的全局上下文敏感，在MELD情感抽取数据集上达到了接近sota的结果。

Emory HiTrans: A Transformer-Based Context- and Speaker-Sensitive Model for Emotion Detection in Conversations

Abstract

在传统情感抽取模型中，增加辅助任务来使模型对说话人敏感，增加话语说话人验证模块(PUSV)，实现对话过程中的不同对话者的情感继承，在Emory情感抽取数据集上达到了接近sota的结果。

HiTrans: A Transformer-Based Context- and Speaker-Sensitive Model for Emotion Detection in Conversations

Abstract

设计了一个transformer双重模型框架，有效地捕捉对话和说话人的整个语境信息，并设置多任务学习模式来辅助情感抽取的连续判断，在IEMOCAP情感抽取数据集上达到了接近sota的结果。

PURE-entity A Frustratingly Easy Approach for Entity and Relation Extraction

Abstract

一种基于transformers预训练模型的实体抽取模型，在模型早期融合全局信息，极大提高了模型推理的速度。

PURE-relation A Frustratingly Easy Approach for Entity and Relation Extraction

Abstract

一种简单有效的端到端关系提取方法，学习两个独立的编码器分别用于实体识别和关系提取，在关系模型的早期融合实体信息并结合全局上下文，在三个数据集下接近sota结果并极大提高推理速度。

EEQA-ED Event Extraction by Answering (Almost) Natural Questions

Abstract

一种用于事件抽取的trigger模型，提出了一种基于问题回答（QA）的框架，通过将事件触发词识别任务转化为QA问题来进行事件抽取。该模型利用BERT等预训练模型来生成问题和候选答案的表示，充分捕捉上下文信息，并在ACE 2005数据集上达到了SOTA（state-of-the-art）结果。

RCEE-ED Event Extraction as Machine Reading Comprehension

Abstract

采用基于依赖解析的结构，通过利用依赖树来捕捉句法信息，从而更好地识别事件触发词。该模型引入了图神经网络（GNN）来处理依赖树中的节点和边，增强了模型对复杂句法结构的理解能力，并在ACE 2005数据集上达到了SOTA（state-of-the-art）结果。

MLEE-ED Event Extraction across Multiple Levels of Biological Organization

Abstract

提出了一种基于预定义的事件触发词的模板问答机制，利用该机制对事件触发词进行模板问答抽取，在mlee数据集上实现了接近sota的效果。

EEQA-EAE Event Extraction by Answering (Almost) Natural Questions

Abstract

一个针对事件抽取的argument模型，采用与EEQA-ED相同的基于问题回答（QA）框架，将事件论元识别任务转化为QA问题来进行事件抽取。通过生成问题和候选答案的表示，该模型能够有效地捕捉事件论元的上下文信息，并在ACE 2005数据集上取得了接近SOTA的表现。

RCEE-EAE Event Extraction as Machine Reading Comprehension

Abstract

采用基于依赖解析的结构，通过利用依赖树来识别事件论元。该模型引入了图神经网络（GNN），以处理依赖树中的节点和边，从而更准确地捕捉事件论元的上下文信息，并在ACE 2005数据集上取得了接近SOTA的表现。

MLEE-EAE Event Extraction across Multiple Levels of Biological Organization

Abstract

提出了基于事件触发词和预定义事件论元模板的问答方法，通过与原文拼接实现针对性地事件论元抽取，在mlee数据集上实现了接近sota的效果。

QAnet QANET: COMBINING LOCAL CONVOLUTION WITH GLOBAL SELF-ATTENTION FOR READING COMPREHENSION

Abstract

QANet不依赖循环神经网络，仅使用卷积和自注意力来分别建模局部和全局交互，这使得在SQuAD数据集上的训练速度比传统模型快3到13倍，推理速度快4到9倍。

SongNet Rigid Formats Controlled Text Generation

Abstract

提出了一种名为SongNet的框架，基于Transformer的自回归语言模型，通过特别设计的符号集和改进的注意力机制，以及预训练和微调框架，有效地处理了需要严格遵循预定义格式、韵律方案，并保证句子完整性的特殊文本生成任务（如歌词、十四行诗、宋词等）。

An Empirical Investigation of Pre-Trained Transformer Language Models for Open-Domain Dialogue Generation

Abstract

结合新闻和维基百科中英文语料对基于Transformer的自回归语言模型在开放域对话生成任务上进行预训练。

GPT2（系列） Language Models are Unsupervised Multitask Learners

Abstract

GPT-2是一个至多可达1.5B参数的Transformer，它在零样本设置下，在8个测试语言建模数据集中的7个上实现了当时最先进的结果。

OPT-125M OPT: Open Pre-trained Transformer Language Models

Abstract

OPT是一套开源的仅解码器的预训练Transformer，参数范围从125M到175B不等，并且OPT-175B与GPT-3相当，而模型开发过程只需要七分之一的碳足迹。

GPT-Neo-125M GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Abstract

GPT-NeoX-20B是一种特别强大的少样本推理模型，在适配只有5个样本的下游任务时，其性能远高于类似大小的GPT-3和FairSeq模型，GPT-Neo是与其架构相似的小规模版本。

BLOOM-396M-zh BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Abstract

BLOOM是一个176B参数的开放访问语言模型，它是在ROOT语料库上训练的，在各种基准测试中都能获得有竞争力的性能，在经过多任务提示的微调后，效果会更强。

TinyLlama-1.1B LLaMA: Open and Efficient Foundation Language Models

Abstract

TinyLlama项目旨在在3万亿词元上预训练一个11亿参数的Llama模型。通过适当的优化，可以使用16个A100-40G GPU在“仅”90天内实现这一目标。TinyLlama的紧凑性使其能够满足大量需要有限计算和内存占用的应用场景。

Unlikelihood Training Token-Level Neural Text Generation with Unlikelihood Training

Abstract

Unlikelihood Training迫使现实文本中不太可能出现的序列被模型分配较低的概率。Token-Level的Unlikelihood Training在保持模型困惑度的同时提供了更低重复度的文本，使用标准贪心搜索或束搜索提供了更高级的生成文本。

Unlikelihood Training Sequence-Level Neural Text Generation with Unlikelihood Training

Abstract

Unlikelihood Training迫使现实文本中不太可能出现的序列被模型分配较低的概率。Sequence-Level的Unlikelihood Training在保持模型困惑度的同时提供了更低重复度的文本，使用标准贪心搜索或束搜索提供了更高级的生成文本。

ScaleGrad Straight to the Gradient: Learning to Use Novel Tokens for Neural Text Generation

Abstract

ScaleGrad过直接操纵梯度信息，使模型学会使用新的词元，它不仅在开放式生成中有效，而且在定向生成任务中也有效。由于结构简单，ScaleGrad可以作为一个通用的训练目标，适用于大多数文本生成任务。

SimCTG A Contrastive Framework for Neural Text Generation

Abstract

SimCTG作为一种对比训练目标，可以用于校准模型的表示空间，有效缓解模型退化问题。

DITTO Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation

Abstract

DITTO方法可以使模型学习从伪重复数据中惩罚句子级重复的概率，在不牺牲困惑的情况下减少了重复问题，而且实现了更好的生成质量。

Qwen/Qwen1.5-0.5B Qwen Technical Report

Abstract

大型语言模型（LLM）彻底改变了人工智能领域，实现了以前被认为是人类独有的自然语言处理任务。在这项工作中，我们将介绍Qwen，这是我们的大型语言模型系列的第一部分。Qwen是一个综合性的语言模型系列，包含不同参数计数的不同模型。它包括Qwen，基本的预训练语言模型，以及Qwen Chat，这些聊天模型使用人类对齐技术进行了微调。基本语言模型在许多下游任务中始终表现出优异的性能，而聊天模型，尤其是那些使用人类反馈强化学习（RLHF）训练的聊天模型，具有很强的竞争力。聊天模型具有创建代理应用程序的高级工具使用和规划功能，即使与使用代码解释器等复杂任务的大型模型相比，也能显示出令人印象深刻的性能。此外，我们还开发了专门的编码模型Code Qwen和Code Qwen Chat，以及基于基本语言模型的数学模型Math Qwen Chat。与开源模型相比，这些模型的性能显著提高，略落后于专有模型。

AI-Sweden-Models/gpt-sw3-126m GPT-SW3: An Autoregressive Language Model for the Nordic Languages

Abstract

本文详细介绍了为北欧语言开发第一个本地大型生成语言模型GPT-SW3的过程。我们涵盖了开发过程的所有部分，从数据收集和处理、培训配置和指令微调，到发布策略的评估和考虑。我们希望本文能为其他从事小型语言大型生成模型开发的研究人员提供指导和参考。

facebook/galactica-125m Galactica: A Large Language Model for Science

Abstract

信息过载是科学进步的主要障碍。科学文献和数据的爆炸性增长使得在大量信息中发现有用的见解变得越来越困难。今天，科学知识是通过搜索引擎获取的，但它们无法单独组织科学知识。在本文中，我们介绍了Galactica：一个可以存储、组合和推理科学知识的大型语言模型。我们在大量科学论文、参考资料、知识库和许多其他来源的基础上进行培训。我们在一系列科学任务上优于现有模型。在LaTeX方程等技术知识探针上，Galactica的性能优于最新的GPT-3，分别为68.2%和49.0%。Galactica在推理方面也表现出色，在数学MMLU方面比Chinchilla高41.3%至35.7%，在MATH方面比PaLM 540B高20.4%至8.8%。它还为PubMedQA和MedMCQA等下游任务设定了77.6%和52.9%的新技术。

deepseek-ai/deepseek-coder-1.3b-base DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Abstract

大型语言模型的快速发展使软件开发中的代码智能发生了革命性的变化。然而，闭源模型的优势限制了广泛的研究和开发。为了解决这一问题，我们推出了DeepSeek编码器系列，这是一系列大小从1.3B到33B的开源代码模型，在2万亿代币上从头开始训练。这些模型是在高质量的项目级代码语料库上预先训练的，并使用具有16K窗口的填空任务来增强代码生成和填充。我们的广泛评估表明，DeepSeek Coder不仅在多个基准测试的开源代码模型中实现了最先进的性能，而且超过了Codex和GPT-3.5等现有的闭源代码模型。此外，DeepSeek编码器模型受许可，允许研究和不受限制的商业使用。

internlm/internlm2-1_8b InternLM2 Technical Report

Abstract

像ChatGPT和GPT-4这样的大型语言模型（LLM）的发展引发了关于通用人工智能（AGI）出现的讨论。然而，在开源模型中复制这些进步一直是一项挑战。本文介绍了InternetLM2，这是一种开源LLM，通过创新的预训练和优化技术，在6个维度和30个基准的综合评估、长上下文建模和开放式主观评估方面优于其前身。InternetLM2的预训练过程非常详细，强调了各种数据类型的准备，包括文本、代码和长上下文数据。InternetLM2有效地捕获了长期依赖性，使用监督微调（SFT）和一种新的从人类反馈中进行条件在线强化学习（COOL RLHF）策略进行了进一步调整，该策略解决了人类偏好冲突和奖励黑客攻击。通过发布InternetLM2模型，我们为社区提供了对模型演变的见解。

EleutherAI/pythia-70m Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Abstract

Pythia是一套16个LLM，所有LLM都是在以完全相同的顺序看到的公共数据上训练的，大小从70M到12B不等。我们为16个模型中的每一个提供了154个检查点的公共访问权限，以及下载和重建其精确训练数据加载器的工具，以供进一步研究。Pythia能够促进许多领域的研究，包括记忆方面的新结果、术语频率对少镜头表现的影响，以及减少性别偏见。我们证明，这种高度受控的设置可以用于产生对LLM及其训练动态的新见解。

microsoft/phi-1_5 Textbooks Are All You Need II: phi-1.5 technical report

Abstract

phi-1是一个1000万参数的模型，可以生成连贯的英语文本；phi-2是13亿参数的模型。Python编码性能接近最先进水平。后一项工作建议使用现有的大型语言模型（LLM）生成“教科书质量”数据，作为与传统网络数据相比增强学习过程的一种方式。我们遵循“教科书就是你所需要的一切”方法，这次重点关注自然语言中的常识推理，并创建了一个新的13亿参数模型phi-1.5，在自然语言任务中的性能相当到大5倍的模型，在更复杂的推理任务（如小学数学和基本编码）上超过了大多数非前沿LLM。

Dense passage retrieval for open-domain question answering

Abstract

DPR是一种用于信息检索的模型，它通过将问题和文档嵌入到同一向量空间中，然后通过计算它们之间的向量相似性来检索相关文档。

Contriever Unsupervised dense information retrieval with contrastive learning

Abstract

Contriever模型是一种先进的、无监督的密集信息检索模型，由Gautier Izacard等人提出，其详细描述见于论文《Unsupervised Dense Information Retrieval with Contrastive Learning》。Contriever模型使用对比学习方法，在没有任何监督信息的情况下训练密集检索器，并在各种检索设置中展现出了强大的性能。

ColBERT ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Abstract

ColBERT是一种基于BERT的高效信息检索模型，旨在解决大规模检索任务中的准确性和效率问题。通过引入一种名为“Late Interaction”的技术，ColBERT能够在保持深度语义理解能力的同时，显著提高检索速度。ColBERT独特地对查询和文档的每个词生成独立的向量表示，而不是将整个查询或文档映射到单一的密集向量。这种细粒度表示允许模型在词级别上进行更精细的相似度计算。

COIL-tok COIL: Revisit exact lexical match in information retrieval with contextualized inverted list

Abstract

COIL-tok (Contextualized Inverted List with token-only model) 是一种信息检索模型，主要侧重于利用BERT等预训练语言模型对查询和文档中的单个词汇进行编码，然后计算这些词汇之间的相似性得分。COIL-tok不涉及文档级的语义信息，它通过计算查询和文档中每个词汇的点积相似度，并通过最大池化操作提取最有价值的特征，以此来评估查询和文档之间的相关性。这种模型结构简单，主要用于那些侧重词汇级匹配的场景，能够有效捕捉查询和文档之间的细粒度相似性。

COIL-full COIL: Revisit exact lexical match in information retrieval with contextualized inverted list

Abstract

COIL-full (Contextualized Inverted List with full context model) 计算词汇级的相似性，还加入了文档级的语义匹配。在COIL-full模型中，除了使用预训练语言模型对查询和文档中的词汇进行编码和计算相似度外，还特别引入了对CLS向量的处理。CLS向量通常被用作句子或文档的整体语义表示，在COIL-full中，通过计算查询和文档的CLS向量之间的点积来评估它们的整体语义相关性。此外，COIL-full还包括了层归一化（Layer Normalization）操作，以提高模型的训练稳定性和性能。这使得COIL-full在处理包含丰富语义信息的复杂查询时表现更为出色，能够更全面地理解查询意图和文档内容。

Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval

Abstract

DCE模型（Dual Cross Encoder）是一种用于密集文档检索的框架。该模型旨在通过深度的查询交互提升文档的表示多样性，从而改进检索质量和效率。DCE模型特别关注于如何通过交叉编码器来捕捉查询和文档间的深层交互信息。不同于传统的Bi-encoder，该模型能够同时处理查询和文档，通过交叉编码器生成的丰富表示来优化检索效果。通过这种结合了深度语义理解和高效检索机制的方法，DCE模型在多个标准信息检索和问答数据集上都显示出了优越的性能，证明了其在实际应用中的潜力和有效性。

Approximate nearest neighbor negative contrastive learning for dense text retrieval

Abstract

ANCE（Approximate Nearest Neighbor Negative Contrastive Estimation）是一种用于密集文本检索的学习方法。其核心思想是从整个语料库中选择负样本（即不相关的文档），利用异步更新的最近邻索引（ANN）。这种方法能够从索引中检索出对当前密集检索（DR）模型具有挑战性的“最难”负样本。这些负样本因具有较高的训练损失和梯度范数上限，从而有助于模型的训练收敛。在训练过程中，ANCE使用BERT Siamese/Dual Encoder结构，采用点积相似性和负对数似然（NLL）损失函数。这种方法首先使用预训练的BM25模型生成初始训练数据，然后进行模型训练和ANN索引的周期性更新，以维护索引的实时性。

CLIM: Contrastive Language-Image Mosaic for Region Representation

Abstract

CLIM 是一个使用对比学习进行开放词汇物体检测的创新架构。该架构基于图像-文本对的对比学习，通过将多个图像组合成拼图图像并将每个图像视为“伪区域”，实现区域和文本表示的对齐。CLIM 结合了最新的对比学习技术和大规模图像-文本对，通过对比损失机制生成高质量的伪标签供模型学习，避免了昂贵的框注释需求。该方法适用于多种开放词汇物体检测模型，并在 OV-COCO 和 OV-LVIS 基准测试中展示了显著的性能提升。

C2AM: Contrastive Learning of Class-Agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation

Abstract

C²AM 是一个创新的架构，通过对比学习生成类无关的激活图。该架构仅使用未标记的图像数据，不涉及图像级别的监督。C²AM 的核心思想源自以下观察：前景物体的语义信息通常不同于其背景；外观相似的前景物体或颜色/纹理相似的背景在特征空间中有相似的表示。基于这些关系，我们构建正负样本对，并通过一种新的对比损失机制迫使网络用类无关的激活图分离前景和背景。由于网络被引导识别跨图像的前景和背景，我们的方法生成的类无关激活图覆盖了更完整的物体区域。我们成功地从 C²AM 提取了用于物体定位的类无关物体边界框，并利用背景线索来优化分类网络生成的 CAM 用于语义分割。在 CUB-200-2011、ImageNet-1K 和 PASCAL VOC2012 数据集上的大量实验表明，WSOL 和 WSSS 都能从 C²AM 中受益。

FIFO: Learning Fog-invariant Features for Foggy Scene Segmentation

Abstract

FogNet 是一种针对雾霾条件的鲁棒语义分割模型，大大提高了在雾霾条件下的分割性能而不牺牲晴朗天气下的准确性。首先，它将雾霾条件视为图像的风格，并将不同雾霾条件的图像在神经风格空间中进行对齐，从而实现风格间的有效匹配。其次，它引入了雾霾通滤波模块，该模块显式提取与雾霾相关的特征进行对齐，这进一步加快了模型的收敛速度并提高了分割精度。

论文名称(链接) DSCMR Deep Supervised Cross-modal Retrieval

Abstract

结合应用线性分类器对公共表示空间中的样本进行分类和具有两个具有权重共享约束的子网的新型网络的方法有效学习到不同模态数据之间的公共的表达

NDSCMR Deep Supervised Cross-modal Retrieval

Abstract

针对现有方法双通道处理多模态数据减少多种模态间的两个特征属于同一个类别的概率损失计算，以降低两个特征属于同一个类别的概率。

Exploring Graph-Structured Semantics for Cross-Modal Retrieval

Abstract

利用图潜空间中最大限度地探索标注数据的结构，并将其作为语义约束来进行语义匹配，以全面探索跨模态检索的模态内和模态间信息

GDLCR Exploring Graph-Structured Semantics for Cross-Modal Retrieval

Abstract

一种基于 GAN 的双重学习（GDL）网络，并结合利用文本投影器来限制图像投影器的训练，来实现多模态数据之间的检索任务。

DeepCCA Deep Canonical Correlation Analysis

Abstract

一种学习两个数据视图的复杂非线性变换的方法，其结果是数据表示高度线性相关。两种变换的参数被共同学习，以最大化（正则化）总相关性。

Deep Cross-modal Proxy Hashing

Abstract

一种新颖的深度跨模态代理哈希算法，称为 DCPH。具体来说，DCPH 首先学习一个代理散列网络，为每个类别生成一个区分性代理散列码。

DCNPH Deep Cross-modal Proxy Hashing

Abstract

一种利用学习到的代理哈希码作为监督信息的方法，提出了一种新的类似于 Margin-SoftMax 的损耗，而无需定义数据点之间至少有一个相似度。

MCAN-small Deep Modular Co-Attention Networks for Visual Question Answering

Abstract

提出了一种深度模块化共注意网络（MCAN），通过级联模块化共注意（MCA）层来模拟问题和图像的自注意力和引导注意力，实现了在视觉问答（VQA）领域的显著性能提升。

MCAN-large Deep Modular Co-Attention Networks for Visual Question Answering

Abstract

在原来模型的基础上，将隐藏层数目增大一倍，变为1024，并且在测试开发分割上表现略优，特别是在"Yes/No"和"Other"类问题上，其综合准确率达到了70.93%。

BAN-4 Bilinear Attention Networks

Abstract

BAN_4是双线性注意力网络（BAN）的一个变体，它利用四个注意力图来有效地整合视觉和语言信息，通过考虑两组输入通道间的双线性交互和低秩双线性池化来提取每对通道的联合表示。

BAN-8 Bilinear Attention Networks

Abstract

BAN_8采用八个注意力图来进一步优化多模态输入的交互处理，通过提出的多模态残差网络变体来更高效地利用这些注意力图，从而在视觉问答和Flickr30k Entities等数据集上实现更高的性能表现。

Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering

Abstract

提出了一种结合多模态因子化双线性（MFB）池化和共注意力机制的新型网络架构，有效地融合视觉和文本特征，为视觉问答任务提供了一个统一的模型。

Beyond Bilinear: Generalized Multimodal Factorized High-order Pooling for Visual Question Answering

Abstract

提出了一种综合使用共注意力机制、多模态因子化高阶池化（MFH）和KL散度损失函数的深度神经网络架构，有效地整合了图像和问题的精细特征表达、多模态特征融合和自动答案预测，从而在视觉问答任务上达到了最新的最佳性能。

MMNasNet Deep Multimodal Neural Architecture Search

Abstract

提出了一个通用的深度多模态神经架构搜索（MMnas）框架，通过定义一系列基础操作并构建一个统一的编解码器骨干网络，结合梯度导向的NAS算法，高效地学习最优架构，显著提升了包括视觉问答、图像-文本匹配和视觉定位在内的多种多模态学习任务的性能。

TRAR_VQA TRAR: Routing the Attention Spans in Transformerfor Visual Question Answering

Abstract

提出了一种称为Transformer路由（TRAR）的样本依赖路由方案来解决Transformer中全局和局部依赖性建模问题。

TRAR_CLEVR TRAR: Routing the Attention Spans in Transformerfor Visual Question Answering

Abstract

提出TRAR来解决视觉推理任务，为每个视觉Transformer层都配备了一个具有不同注意力跨度的路由模块。该模型可以基于前一推理步骤的输出动态选择相应的注意力，从而为每个样本制定最佳路由路径，效果比传统的Transformer模型更加优越。

Bit-aware Semantic Transformer Hashing for Multi-modal Retrieval

Abstract

通过自注意力机制挖掘逐位语义概念，同时在概念级别上对异构模态进行对齐，解决了传统方法的语义表达能力、特征级多模态融合和语义相关性三个问题

GCIMH Graph Convolutional Incomplete Multi-modal Hashing

Abstract

利用图卷积自编码器和多模态、标签网络，处理多模态数据缺失模态问题，通过师生学习传递知识，实现离线训练和在线查询的有效多模态哈希编码。

Partial Multi-Modal Hashing via Neighbor-Aware Completion Learning

Abstract

结合了跨模态补全学习和多模态哈希学习，支持不完整数据训练和查询，引入邻居感知补全学习模块。

KNNCH Partial Multi-Modal Hashing via Neighbor-Aware Completion Learning

Abstract

利用k近邻算法找出同缺失实例最近的k个锚点，并基于平均化思想直接补全缺失数据

Deep Adversarial Discrete Hashing for Cross-Modal Retrieval

Abstract

采用对抗训练实现跨模态特征学习，确保特征分布一致性。引入加权余弦三元组约束利用多标签语义知识，采用离散哈希策略最小化量化损失，保留标签中的语义知识。

Data-Aware Proxy Hashing for Cross-modal Retrieval

Abstract

通过训练数据感知代理网络生成基于类别、图像、文本的哈希码，引入新型哈希损失，有效监督模态特定哈希网络，确保语义信息得以充分保留。

DJSRH Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval

Abstract

一种无监督深度跨模态哈希编码方法，通过构建联合语义相似性矩阵和重建框架，学习保留原始数据邻域结构的二进制码，最大程度地重建联合语义关系。

Deep semantic-alignment hashing for unsupervised cross-modal retrieval

Abstract

一种无监督跨模态检索的深度哈希方法，通过充分利用图像-文本对的共现信息，设计语义对齐损失函数，并创新性地提出模态间的特征重构。

PSLDH Partial-Softmax Loss based Deep Hashing

Abstract

通过训练分类哈希网络生成每个类别的区分性哈希码，直接用于图像哈希网络的学习过程。

WGLHH Weighted Gaussian Loss based Hamming Hashing

Abstract

通过引入加权高斯损失优化哈希模型，显著惩罚汉明球内不相似的数据对，生成更紧凑的哈希码。

Deep Cross-Modal Hashing

Abstract

将特征学习和哈希码学习集成到同一个端到端的深度学习框架中，实现高效的跨模态哈希检索

Joint-modal Distribution-based Similarity Hashing for Large-scale Unsupervised Deep Cross-modal Retrieval

Abstract

通过构建联合模态相似性矩阵来融合跨模态相似性信息，并提出了基于分布的相似性决策和加权方法来改进无监督跨模态哈希的采样和加权方案，以生成更具区分性的哈希码

Adversary Guided Asymmetric Hashing for Cross-Modal Retrieval

Abstract

通过对抗学习指导的多标签注意力模块增强特征学习，同时采用不对称哈希方法生成高质量的二进制码，有效提高了跨模态检索的性能。

Cross-Modal Hashing Method with Properties of Hamming Space: A New Perspective

Abstract

通过将样本对分类并根据其相似性施加不同的约束，来有效地利用汉明空间并缓解损失函数振荡问题。

论文名称(链接) Fastspeech2-LibriTTS FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Abstract

在LibriTTS数据集上训练的fastspeech2模型，实现了直接从文本并行生成语音波形。在MOS指标上达到了3.83

glow-tts Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Abstract

基于流式模型，在提高速度的基础上保证了合成语音的质量。在MOS指标上达到了3.97

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Abstract

结合变分推理、规范化流程和对抗性训练，生成生动自然的语音。在MOS指标上达到了4.43

AutoVC AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Abstract

不依赖于复杂的GAN或CVAE，使用精心设计瓶颈的自编码器来实现分布匹配的风格转换。在MOS指标上达到了3.56

AdainVC One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Abstract

不依赖于成对训练数据的语音转换模型，拓展了语音转换的应用场景。在风格相似度上达到了80%

AssemVC Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques

Abstract

通过分解和重新组装非并行VC系统的不同部分，创建了一个新的任意到多个的语音转换系统。在MOS指标上达到了3.91

Cotatron Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data

Abstract

提供一个说话者独立的语言表示，将多说话人TTS的数据集应用于语音转换模型中。在MOS指标上达到了3.75

西电贡献社区开源模型

论文名称(链接) SMT-T Scale-Aware Modulation Meet Transformer

Abstract

一种基于Transformer的神经网络。充分结合CNN和Transformer的优势减轻了SA的运算负担，同时又解决了浅层的CNN局部特征捕捉能力的痛点。

SMT-S Scale-Aware Modulation Meet Transformer

Abstract

一种基于Transformer的神经网络。充分结合CNN和Transformer的优势减轻了SA的运算负担，同时又解决了浅层的CNN局部特征捕捉能力的痛点。

SMT-B Scale-Aware Modulation Meet Transformer

Abstract

一种基于Transformer的神经网络。充分结合CNN和Transformer的优势减轻了SA的运算负担，同时又解决了浅层的CNN局部特征捕捉能力的痛点。

SMT-L_224_1K Scale-Aware Modulation Meet Transformer

Abstract

一种基于Transformer的神经网络。充分结合CNN和Transformer的优势减轻了SA的运算负担，同时又解决了浅层的CNN局部特征捕捉能力的痛点。

SMT-L_384_1K Scale-Aware Modulation Meet Transformer

Abstract

一种基于Transformer的神经网络。充分结合CNN和Transformer的优势减轻了SA的运算负担，同时又解决了浅层的CNN局部特征捕捉能力的痛点。

SMT-L_224_22K Scale-Aware Modulation Meet Transformer

Abstract

一种基于Transformer的神经网络。充分结合CNN和Transformer的优势减轻了SA的运算负担，同时又解决了浅层的CNN局部特征捕捉能力的痛点。

BiFormer-T BiFormer: Vision Transformer with Bi-Level Routing Attention

Abstract

一种基于Transformer的神经网络。通过设计一种自适应的分区注意力方式，使token不会关注到低相关度的token，从而提高计算效率。

BiFormer-S BiFormer: Vision Transformer with Bi-Level Routing Attention

Abstract

一种基于Transformer的神经网络。通过设计一种自适应的分区注意力方式，使token不会关注到低相关度的token，从而提高计算效率。

BiFormer-B BiFormer: Vision Transformer with Bi-Level Routing Attention

Abstract

一种基于Transformer的神经网络。通过设计一种自适应的分区注意力方式，使token不会关注到低相关度的token，从而提高计算效率。

FLatten-PVT-T FLatten Transformer: Vision Transformer using Focused Linear Attention

Abstract