Contributions of Shape, Texture, and Color in Visual Recognition Abstract

  • /zotero
  • Yunhao Ge, Yao Xiao, Zhi Xu, Xingrui Wang, and Laurent Itti
     

Abstract

  • [humanoid vision engine](humanoid vision engine.md) (HVE) that explicitly and separately computes shape, texture, and color features from images
  • resulting feature vectors are then concatenated to support the final classification
  • HVE can summarize and rankorder the contributions of the three features to object recognition.
  • We use human experiments to confirm that both HVE and humans predominantly use some specific features to support the classification of specific classes
  • To demonstrate more usefulness of HVE, we use it to simulate the open-world zeroshot learning ability of humans with no attribute labeling
  • Finally, we show that HVE can also simulate human imagination ability with the combination of different features.

Introduction

  • A widely accepted intuition about the success of CNNs on perceptual tasks is that CNNs are the most predictive models for the human ventral stream object recognition
  • To understand which feature is more important for CNN-based recognition, recent paper shows promising results: ImageNet-trained CNNs are biased towards texture while increasing shape bias improves accuracy and robustness [33]
  • Here, inspired by HVS, we wish to find a general way to understand how shape, texture, and color contribute to a recognition task by pure data-driven learning.
  • It has been shown by neuroscientists that there are separate neural pathways to process these different visual features in primate
  • Among the many kinds of features crucial to visual recognition in humans, the shape property is the one that we primarily rely on in static object recognition [16]. Meanwhile, some previous studies show that surface-based cues also play a key role in our vision system
  • For example, [21] shows that scene recognition is faster for color images compared with grayscale ones
  • [Humanoid Vision Engine](Humanoid Vision Engine.md)

Image Parsing and Foreground Identification.

  • we use the entity segmentation method [41] to simulate the process of parsing objects from a scene in our brain.
  • Entity segmentation is an open-world model and can segment the object from the image without labels.
  • This method aligns with human behavior, which can (at least in some cases; e.g., autostereograms [29]) segment an object without deciding what it is
  • After we get the segmentation of the image, we use a pre-trained CNN and Grad-CAM [47] to find the foreground object among all masks.
  • We design three different feature extractors after identifying the foreground object segment: shape extractor, texture extractor, and color extractor, similar to the separate neural pathways in the human brain which focus on specific property

Shape Feature Extractor

  • want to keep both 2D and 3D shape information while eliminating the information of texture and color
  • first use a 3D depth prediction model [44,43] to obtain the 3D depth information of the whole image
  • After element-wise multiplying the 3D depth estimation and 2D mask of the object, we obtain our shape feature
  • We can notice that this feature only contains 2D shape and 3D structural information (the 3D depth) and without color or texture information

Texture Feature Extractor

  • want to keep both local and global texture information while eliminating shape and color information.
  • to remove the color information, we convert the RGB object segmentation to a grayscale image
  • cut this image into several square patches with an adaptive strategy (the patch size and location are adaptive with object sizes to cover more texture information)
  • If the overlap ratio between the patch and the original 2D object segment is larger than a threshold τ, we add that patch to a patch pool (we set τ to be 0.99 in our experiments, which means the over 99% of the area of the patch belongs to the object
  • Since we want to extract both local (one patch) and global (whole image) texture information, we randomly select 4 patches from the patch pool and concatenate them into a new texture image

Color Feature Extractor

  • The first method is phase scrambling

Phase Scrambling

  • transforms the image into the frequency domain using the fast Fourier transform (FFT)
  • In the frequency domain, the phase of the signal is then randomly scrambled, which destroys shape information while preserving color statistics
  • Then we use IFFT to transfer back to image space
  • We also used simple color histograms (see suppl.) as an alternative, but the results were not as good, hence we focus here on the phase scrambling approach for color representation.

Humanoid Neural Network

  • After preprocessing, we have three features
  • To simulate the separate neural pathways in humans’ brains for different feature information [1,11], we design three feature representation encoders for shape, texture, and color, respectively
  • ResNet-18 [24] as the backbone for all feature encoders to project the three types of features to the corresponding well-separated embedding spaces.
  • hard to define the ground-truth label of the distance between features.
  • Given that the objects from the same class are relatively consistent in shape, texture, and color, the encoders can be trained in the classification problem independently instead, with the supervision of class labels.
  • fter training our encoders as classifiers, the feature map of the last convolutional layer will serve as the final feature representation
  • We also propose a gradient-based contribution attribution method to interpret the contributions of shape, texture, and color to the classification decision,
  • Take the shape feature as an example, given a prediction p and the probability of
  • class k, namely pk, we compute the gradient of pk with respect to the shape feature Vs
  • gradient as shape importance weights ↵sk
  • In other words, Ssk represents the “contribution” of shape feature to classifying this
  • image as class k

Effectiveness of Feature Encoders

  • handcrafted three subsets of ImageNet
  • Shape-biased dataset containing 12 classes, where the classes were chosen which intuitively are strongly determined by shape
  • Texture-biased dataset uses 14 classes which we believed are more strongly determined by texture
  • Color-biased dataset includes 17 classes
  • After pre-processing the original images and getting their feature images, we input the feature images into feature encoders and get the T-SNE
  • Each row represents one feature-biased dataset and each column is bounded with one feature encoder, each image shows the results of one combination

Effectiveness of Humanoid Neural Network

  • As these classifiers classify images based on corresponding feature representation, we call them feature nets.
  • If we combine these three feature nets with the interpretable aggregation module, the classification accuracy is very close to the upper bound, which means our vision system can classify images based on these three features almost as well as based on the full original color images.

More Humanoid Applications with HVE Open-world Zero-shot Learning with HVE

  • Most current methods [37,32,13] need humans to provide detailed attribute labels for each image, which is costly in time and energy. However, given an image from an unseen class, humans can still describe it with their learned knowledge
  • First, to represent learnt knowledge, we use feature extractors
  • To retrieve learnt classes as description, we calculate the average distance dkm
  • between Iun and images of other class k in the latent space on feature m Open-world classification
  • To further predict the actual class of Iun based on the feature-wise description, we use ConceptNet as common knowledge to conduct reasoning
  • We form a reasoning root pool R⇤ consisting of feature roots Rs, Rt, Rc obtained during image description, and shared attribute roots Ras , Rat , Rac . The reasoning roots will be our evidence for reasoning
  • We humans can intuitively imagine an object when seeing one aspect of a feature, especially when this feature is prototypical (contribute most to classification)
  • For instance, we can imagine a zebra when seeing its stripe (texture). This process is similar but harder than the classical image generation task since the input features Modality here dynamic which can be any feature among shape, texture, or color

Cross Feature Retrieval

  • In order to reasonably retrieve the most possible other two corresponding features given only one feature (among shape, texture, or color), we learn a feature agnostic encoder that projects the three features into one same feature space and makes sure that the features belonging to the same class are in the nearby regions.
  • In the retrieval process, given any feature of any object, we can map it into the cross feature embedding space by the corresponding encoder net and the feature agnostic net
  • Then we apply the 2 norm to find the other two features closest to the input one as output. The output is correct if they belong to the same class as the input.

Cross Feature Imagination

  • To stimulate imagination, we propose a crossfeature imagination model to generate a plausible final image with the input and retrieved features
  • Inspired by the pixel2pixel GAN[26] and AdaIN[25] in the style transfer, we design a crossfeature pixel2pixel GAN model to generate the final image.

Pictures