Explaining the Implicit Neural Canvas (XINC): Connecting Pixels to Neurons by Tracing their Contributions

University of Maryland, College Park

XINC dissects Implicit Neural Representation (INR) models to understand how neurons represent images and videos and to reveal the inner workings of INRs.


Abstract

The many variations of Implicit Neural Representations (INRs), where a neural network is trained as a continuous representation of a signal, have tremendous practical utility for downstream tasks including novel view synthesis, video compression, and image super-resolution. Unfortunately, the inner workings of these networks are seriously understudied. Our work, eXplaining the Implicit Neural Canvas (XINC), is a unified framework for explaining properties of INRs by examining the strength of each neuron’s contribution to each output pixel. We call the aggregate of these contribution maps the Implicit Neural Canvas and we use this concept to demonstrate that the INRs we study learn to “see” the frames they represent in surprising ways. For example, INRs tend to have highly distributed representations. While lacking high-level object semantics, they have a significant bias for color and edges, and are almost entirely space-agnostic. We arrive at our conclusions by examining how objects are represented across time in video INRs and by using clustering to visualize similar neurons across layers and architectures, showing that changes in representation over time are dominated by motion. These insights demonstrate the general usefulness of our analysis framework.

How does XINC work?

XINC dissects an INR to create contribution maps for each “neuron” (group of weights), connecting neurons to individual pixels in terms of their activations. Together, these contribution maps comprise the “implicit neural canvas” for a visual signal. XINC unveils surprising characteristics: highly distributed representations, a bias for low-level features like color and edges rather than spatial location, a lack of high-level object semantics, and, in video INRs, the dominance of motion in object representation.

XINC framework.

Left: We dissect MLP-based INRs such as FFN [1] by aggregating their activations (weights multiplied by previous-layer outputs) for each pixel at each neuron. Right: We extend this core idea of pixel-to-neuron mapping to CNN-based INRs such as NeRV [2] by computing intermediate feature maps that are not yet summed over the input dimension.
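
To make the mapping concrete, below is a minimal PyTorch sketch of both computations. The function names, tensor shapes, and the choice to aggregate absolute per-connection activations over the output dimension are illustrative assumptions, not the paper's exact implementation; NeRV details such as PixelShuffle upsampling and activation functions are omitted.

    import torch
    import torch.nn.functional as F

    def mlp_layer_contributions(H, W):
        # H: (P, D_in)      previous-layer outputs for P pixels
        # W: (D_out, D_in)  layer weight matrix
        # Keep the per-connection activations w_kj * h_pj separate per input
        # neuron j, then aggregate over the output dimension k.
        contrib = torch.einsum('pj,kj->pjk', H, W)  # (P, D_in, D_out)
        return contrib.abs().sum(dim=-1)            # (P, D_in): one map per neuron

    def conv_kernel_contributions(x, weight):
        # x: (1, C_in, H, W)            input feature maps to a conv layer
        # weight: (C_out, C_in, kH, kW) convolution kernels
        # Convolve each input channel separately, so the resulting feature
        # maps are not yet summed over the input dimension.
        per_input = [
            F.conv2d(x[:, i:i + 1], weight[:, i:i + 1], padding='same')
            for i in range(x.shape[1])
        ]
        return torch.cat(per_input, dim=0)  # (C_in, C_out, H, W)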


Dissecting MLP-based Image INRs

FFN Maps.

Using XINC to dissect an FFN, we can obtain neuron contribution maps and group contributions on the basis of Gabor filter feature clusters in the input image. The figure above shows the average contribution map of the neurons in each of two clusters, revealing how early FFN layers manifest strong Fourier patterns and how the last layers tend to resemble the image.
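
One plausible implementation of this grouping, sketched below under our own assumptions: describe each neuron's contribution map by the mean magnitudes of its responses to a small Gabor filter bank, cluster the neurons with k-means, and average the maps per cluster. This reading, the filter bank, and the scikit-image/scikit-learn calls are illustrative choices, not the paper's exact pipeline.

    import numpy as np
    from skimage.filters import gabor
    from sklearn.cluster import KMeans

    def cluster_maps_by_gabor(maps, n_clusters=2, freqs=(0.1, 0.25, 0.4)):
        # maps: (N, H, W) array with one contribution map per neuron
        feats = []
        for m in maps:
            f = []
            for freq in freqs:
                for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
                    real, imag = gabor(m, frequency=freq, theta=theta)
                    f.append(np.hypot(real, imag).mean())  # mean response magnitude
            feats.append(f)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(feats))
        # average contribution map per cluster, as visualized in the figure
        return [maps[labels == c].mean(axis=0) for c in range(n_clusters)], labels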



Dissecting CNN-based Video INRs


NeRV Head Layer Neuron Contributions

Source Video · Neuron (i) · Neuron (ii) · Neuron (iii) · Neuron (iv)


Each of the videos above shows the input frames of a source video and, for each frame, the contribution maps of a few randomly sampled neurons from the outermost (head) layer of NeRV. The last layer's maps tend to resemble the image, and the various neurons capture a variety of features, including edges, textures, colors, and depth. These maps show that a neuron does not simply represent a pixel or set of pixels; instead, it learns attributes of the scene.



NeRV Penultimate Layer Neuron Contributions

Source Video · Neuron (i) · Neuron (ii) · Neuron (iii) · Neuron (iv)


Each video above shows the input frames of a source video and, for each frame, the contribution maps of a few randomly sampled neurons from the penultimate layer of NeRV. In general, these maps are reminiscent of classical image processing filters.



Correlation between Motion and Neuron Contributions


Source Video · Flow · Head Layer Fluctuation · Penultimate Layer Fluctuation


Each video above shows the correlation between motion in the source video and changes in NeRV neuron contributions over time, visualized by computing optical flow between adjacent frames and the difference in contribution maps between those same frames. Fluctuation is driven by motion, and the areas revealed by motion appear to matter as much as the objects that are actually moving: when a moving object reveals new background, the contributions for spatially proximal pixels fluctuate. Thus, in spite of its lack of high-level object semantics, the changes in how NeRV represents a video over time are dominated by the motion of the entities in the video.
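
As a rough quantitative companion to these videos, one could correlate per-pixel flow magnitude with per-pixel contribution change for each adjacent frame pair. The sketch below uses OpenCV's Farneback flow; the per-pair Pearson correlation is our simplification, since the videos themselves only display the two maps side by side.

    import cv2
    import numpy as np

    def motion_vs_fluctuation(frames, contribs):
        # frames:   list of (H, W, 3) uint8 video frames
        # contribs: list of (H, W) float contribution maps aligned with frames
        corrs = []
        for t in range(len(frames) - 1):
            g0 = cv2.cvtColor(frames[t], cv2.COLOR_BGR2GRAY)
            g1 = cv2.cvtColor(frames[t + 1], cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            motion = np.linalg.norm(flow, axis=-1)         # per-pixel flow magnitude
            fluct = np.abs(contribs[t + 1] - contribs[t])  # per-pixel contribution change
            corrs.append(np.corrcoef(motion.ravel(), fluct.ravel())[0, 1])
        return corrs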

BibTeX


        @InProceedings{Padmanabhan_2024_CVPR,
          author    = {Padmanabhan, Namitha and Gwilliam, Matthew and Kumar, Pulkit and Maiya, Shishira R and Ehrlich, Max and Shrivastava, Abhinav},
          title     = {Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions},
          booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
          month     = {June},
          year      = {2024},
          pages     = {10957-10967}
        }