Group Equivariant Convolutional Networks (G-CNNs) have gained significant traction in recent years owing to their ability to generalize the translation equivariance of convolutional layers in CNNs to other symmetry groups. With equivariance, the network can exploit symmetries in the data, and a direct consequence is that it generally needs less data to perform well. However, incorporating such knowledge into the network is not always advantageous, especially when the data itself does not exhibit full equivariance. To address this issue, the concept of relaxed equivariance was introduced, offering a means to adaptively learn the degree of equivariance imposed on the network, enabling it to operate anywhere between full equivariance and no equivariance.
Interestingly, for rotational symmetries on fully equivariant data, Wang et al. (2023) found that a fully equivariant network performs worse than a relaxed equivariant network. This is surprising because fully equivariant models are designed precisely for this setting and usually perform best when the data is fully equivariant. One plausible rationale is that the training dynamics benefit from relaxing the equivariance constraint. To examine this proposition, we use the framework described in Park & Kim (2022) for measuring convexity and flatness using Hessian spectra.
Since relaxed equivariance adaptively learns the amount of equivariance from the data, it is important to understand this process when investigating the training dynamics. To do this, we investigate how the equivariance possessed by the data at hand influences the training dynamics. Importantly, Gruver et al. (2022) show that imposing more equivariance on a network does not necessarily mean more equivariance is learned. As such, we will use more refined methods to discover the equivariance the model actually expresses.
Inspired by the aforementioned observations, the purpose of this blog post is to investigate the following question: how does the equivariance imposed on a network affect its training dynamics? Supported by these observations, we hypothesize that the less equivariance is imposed, the better the training dynamics will be. To answer our research question, we identify the following subquestions:
- How effective is regularization for imposing equivariance on a network?
- How does the amount of equivariance imposed on the network affect the amount of equivariance learned?
- How does equivariance imposed on a network influence generalization?
- How does equivariance imposed on a network influence the convexity of the loss landscape?
We answer these questions by:
- Reproducing results to establish common ground
- Performing experiments to investigate learned equivariance
- Analyzing trained models to investigate their training dynamics
First, we introduce some theory needed for the reproduction. After that, we will go over the exact experiments we reproduced and our results.
Consider the segmentation task depicted in the picture below.
Naturally, applying segmentation to a rotated 2D image should give the same result as rotating the segmentation of the original image. Mathematically, for a neural network $\Phi$ and any rotation $g$, we want
$$\Phi(g \cdot x) = g \cdot \Phi(x)$$
for all images $x$.
To build such a network, it is sufficient that each of its layers is equivariant in the same sense. Recall that a CNN achieves equivariance to translations by sharing weights in kernels that are translated across the input in each of its convolution layers. A G-CNN extends this concept of weight sharing to achieve equivariance w.r.t. an arbitrary (finite) group $G$.
Consider any 2D image as an input signal $f: \mathbb{Z}^2 \rightarrow \mathbb{R}^{c}$, with $c$ channels. Suppose $G$ is a finite group of transformations of the plane, for example the four $90°$ rotations $C_4$. Now, the group convolution shares one kernel $\psi$ across all group elements: the first (lifting) layer computes
$$(f \star \psi)(g) = \sum_{x \in \mathbb{Z}^2} f(x)\, \psi(g^{-1} x),$$
producing a signal on $G$, and subsequent layers convolve signals on the group itself,
$$(f \star \psi)(g) = \sum_{h \in G} f(h)\, \psi(g^{-1} h).$$
For Relaxed Equivariant Networks, we define a kernel that is no longer shared exactly: it is a linear combination $\psi_h = \sum_{l=1}^{L} w_l(h)\, \psi_l$ of $L$ filter banks whose coefficients vary with the group element $h$. Note that for the group convolution to be practically feasible, $G$ must be finite (or discretized), since the sums above run over group elements.
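To make the lifting convolution concrete, here is a minimal sketch for $C_4$ in PyTorch (our own illustration under simplified assumptions, not the gconv implementation): weight sharing over the group is realized by convolving the input with every rotated copy of the same kernel.

```python
import torch
import torch.nn.functional as F

def c4_lifting_conv(f, psi):
    """Lifting convolution for the rotation group C4.

    f:   input image, shape (batch, c_in, H, W)
    psi: kernel, shape (c_out, c_in, k, k)
    Returns a signal on C4, shape (batch, 4, c_out, H', W'):
    one response map per 90-degree rotation of the shared kernel.
    """
    responses = []
    for r in range(4):  # the four rotations in C4
        psi_r = torch.rot90(psi, k=r, dims=(-2, -1))  # rotate the kernel
        responses.append(F.conv2d(f, psi_r))
    return torch.stack(responses, dim=1)

f = torch.randn(1, 3, 32, 32)
psi = torch.randn(8, 3, 3, 3)
print(c4_lifting_conv(f, psi).shape)  # torch.Size([1, 4, 8, 30, 30])
```

Rotating the input by $90°$ then permutes the four group channels (and rotates each response map), which is exactly the equivariance property we want.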
First, consider representations of rotations: maps $\rho_{in}$ and $\rho_{out}$ that describe how a rotation $g \in G$ acts on the input and output feature channels. Steerable G-CNNs then differ from regular G-CNNs in the following ways:
- The input signal becomes a feature field $f: \mathbb{R}^2 \rightarrow \mathbb{R}^{in}$.
- The kernel $\psi: \mathbb{R}^2 \rightarrow \mathbb{R}^{out \times in}$ used must satisfy the following constraint for all $g \in G$: $$\psi(gx) = \rho_{out}(g)\, \psi(x)\, \rho_{in}(g^{-1})$$
- Standard convolution is performed only over $\mathbb{R}^2$ and not over $G$.
To secure kernel equivariance, $\psi$ is parameterized as a linear combination of basis kernels $\psi_l$ that each satisfy this constraint (Weiler & Cesa, 2019).

Therefore, the convolution is of the form:
$$(f \star \psi)(x) = \int_{\mathbb{R}^2} \psi(y - x)\, f(y)\, dy.$$

Whenever both the kernel constraint and this form of convolution are respected in every layer, the network as a whole is equivariant w.r.t. $G$.
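As a small illustration of the kernel constraint, consider $G = C_4$ with trivial representations $\rho_{in} = \rho_{out} = 1$ (scalar fields): the constraint then reduces to rotation invariance of the kernel, which we can check numerically. This is our own toy example, not part of the original papers.

```python
import torch

def check_c4_constraint_trivial(psi):
    """Check the steerability constraint psi(g x) = rho_out(g) psi(x) rho_in(g^-1)
    for G = C4 with trivial representations rho_in = rho_out = 1.
    The constraint then reduces to rotation invariance of the kernel:
    rotating the sampling grid by 90 degrees must leave psi unchanged.

    psi: kernel sampled on a square grid, shape (c_out, c_in, k, k)
    """
    for r in range(1, 4):
        if not torch.allclose(torch.rot90(psi, k=r, dims=(-2, -1)), psi, atol=1e-6):
            return False
    return True

# An isotropic (radially symmetric) kernel satisfies the constraint ...
xx, yy = torch.meshgrid(torch.linspace(-1, 1, 5), torch.linspace(-1, 1, 5), indexing="ij")
iso = torch.exp(-(xx**2 + yy**2)).reshape(1, 1, 5, 5)
print(check_c4_constraint_trivial(iso))                      # True
# ... while a random kernel does not.
print(check_c4_constraint_trivial(torch.randn(1, 1, 5, 5)))  # False
```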
The desirability of equivariance in a network depends on the amount of equivariance possessed by the data of interest. To this end, relaxed equivariant networks are built on top of G-CNNs using a modified (relaxed) kernel consisting of a linear combination of standard G-CNN kernels. Define the relaxed group convolution as
$$(f \,\hat{\star}\, \psi)(g) = \sum_{h \in G} \sum_{l=1}^{L} w_l(h)\, f(h)\, \psi_l(g^{-1} h),$$
where the weights $w_l(h)$ depend on the group element $h$ over which we sum. The equivariance error increases with the number of kernels $L$ and with how much the weights vary over the group; the closer the weights are to constants, the more equivariant the model. We can therefore impose equivariance by adding a regularization term that penalizes this variation, such as
$$\alpha \sum_{l=1}^{L} \sum_{h, h' \in G} \big( w_l(h) - w_l(h') \big)^2,$$
to the loss function. A higher value of the hyperparameter $\alpha$ thus imposes more equivariance on the network.
Therefore, using relaxed group convolutions allows the network to relax strict symmetry constraints, offering greater flexibility at the cost of reduced equivariance.
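The following sketch shows one way such a penalty could be implemented; the exact form and normalization used by Wang et al. (2022) may differ, and the shapes here are our own assumption.

```python
import torch

def rgroup_equivariance_penalty(w):
    """Push a relaxed group convolution towards full equivariance by
    penalizing variation of the combination weights over the group
    (illustrative pairwise-difference form).

    w: weights w_l(h), shape (|G|, L) -- one weight per group element
       and filter bank.
    """
    diffs = w.unsqueeze(0) - w.unsqueeze(1)  # pairwise differences, (|G|, |G|, L)
    return (diffs ** 2).sum()

w = torch.randn(4, 3)  # |G| = 4 (e.g. C4), L = 3 filter banks
alpha = 0.1
loss_reg = alpha * rgroup_equivariance_penalty(w)
```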
Relaxed steerable G-CNNs are defined using a similar idea: again we let the weights depend on the variable of integration,
$$(f \,\hat{\star}\, \psi)(x) = \int_{\mathbb{R}^2} \sum_{l=1}^{L} w_l(y)\, \psi_l(y - x)\, f(y)\, dy,$$
which leads to a loss of equivariance. Not unlike the previous case, the closer the weights are to constant functions the more equivariant the model is, and thus we can impose equivariance by adding the following term to the loss function:
$$\alpha \sum_{l=1}^{L} \left( \left\| \partial_1 w_l \right\|^2 + \left\| \partial_2 w_l \right\|^2 \right).$$
Here the partial derivatives are discrete and simply represent differences of neighbouring weight values over spatial locations.
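A minimal sketch of this smoothness penalty, assuming the weights are stored as one spatial map per filter bank (shapes and normalization are our own):

```python
import torch

def rsteer_smoothness_penalty(w):
    """Penalize the discrete spatial derivatives of the combination
    weights so they approach constant functions.

    w: weights w_l(x), shape (L, H, W) -- one weight map per filter bank.
    """
    dx = w[:, 1:, :] - w[:, :-1, :]  # differences of vertical neighbours
    dy = w[:, :, 1:] - w[:, :, :-1]  # differences of horizontal neighbours
    return (dx ** 2).sum() + (dy ** 2).sum()

w = torch.randn(3, 7, 7)  # L = 3 filter banks on a 7x7 grid
loss_reg = 0.1 * rsteer_smoothness_penalty(w)
```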
We perform two reproduction studies in this blog post.
Our first objective is to reproduce the experiment demonstrating that a relaxed equivariant model can outperform a fully equivariant model on fully equivariant data. Specifically, we reproduce the Super-resolution of 3D Turbulence experiment (Experiment 5.5) in the paper "Discovering Symmetry Breaking in Physical Systems with Relaxed Group Convolution" (Wang et al., 2023). This reproduction provides grounds for expecting a relaxed equivariant model to outperform a model that is properly equivariant to the symmetries of the data.
Additionally, we reproduce results from Wang et al. (2022) that introduced relaxed group convolutions. In this paper, relaxed group convolutions are compared to other methods on the 2D smoke simulation data. We intend to do the same, focusing on the experiments that fit our first objective the closest, namely the ones involving rotational symmetries.
In Experiment 5.5 of Wang et al. (2023), the authors evaluate one network architecture with three variations: 1) convolutional blocks, 2) group equivariant blocks, and 3) relaxed group equivariant blocks. All networks are tasked with upscaling 3D isotropic turbulence data.
The data consists of fluid flowing in 3D space and is produced by a high-resolution state-of-the-art simulation hosted by Johns Hopkins University (Li et al., 2008). Importantly, this dataset is forced to be isotropic, i.e. fully equivariant to rotations, by design.
For the experiment, a subset of 50 timesteps is taken, each downsampled from the original high-resolution simulation grid; the models are trained to reconstruct the higher-resolution velocity field from the lower-resolution one.
The architecture of the models is visualized in Figure 1 and is mostly described in Wang et al. (2023), with the following additions that are not specified in the paper: the first layers of the relaxed and the regular G-CNNs are lifting layers, and to preserve spatial size, appropriate padding is used in every non-Upconv convolution.
- For isotropic turbulence, networks with full equivariance should either outperform or be on par with those with relaxed equivariance, as the data fully adheres to isotropic symmetry. However, as shown in Wang et al. (2023), the opposite happens.
For this experiment in Wang et al. (2022), a specialized 2D smoke simulation was generated using PhiFlow (Holl et al., 2020). In these simulations, smoke flows into the scene from an inflow position and moves in the direction of the buoyant force (see Figure 2); in everyday life, for instance, the buoyant force opposes gravity and thus smoke floats upwards. Additionally, every inflow position has a slightly different buoyant force, varying in either strength or angle of dispersion. Furthermore, for a given inflow position, the simulation is run multiple times, each time with the buoyant force modified by a scalar factor to increase the differences between the buoyant forces. In total, this results in a collection of simulations spanning multiple inflow positions and buoyancy factors.
With this dataset, we are able to control the amount of equivariance it possesses. On a small scale, smoke will flow the same way regardless of rotation, as it is mostly influenced by the smoke particles around it. However, on a larger scale, the influence of the specific angle of the buoyant force on movement is larger, since some directions have stronger buoyant forces than others. Note that since all tested models are based on CNNs, they are equivariant with respect to translation, meaning that the exact location of the inflow position is generally unimportant.
Using this dataset, the task is to predict the upcoming frames based on the previous ones. Evaluation on this task is done in the following settings:
- Domain: the model is tested on inflow locations it was not trained on.
- Future: the model is tested on timesteps that are further in the simulation than what it was trained on.
For this experiment, two different model architectures are used: a relaxed steerable G-CNN (rsteer) and a relaxed regular G-CNN (rgroup). For rsteer we use the hyperparameters provided in the repository of Wang et al. (2022); for rgroup, no hyperparameters are provided, so we selected our own (see Reproduction Results).
- Since the dataset is partially equivariant to rotation, the partial equivariant model is expected to perform the best.
Both of our reproduction studies corroborate the conclusion drawn from the results in the original papers.
We compare our results with those of Wang et al. (2023) for the CNN (SuperResCNN), regular group equivariant network (GCNNOhT3), and relaxed regular group equivariant network (RGCNNOhT3). The reconstruction mean absolute error (MAE) is presented in the table below.
Results from original paper (MAE × 1e-1):

| cnn | gcnn | rgcnn |
|---|---|---|
| 1.22 (0.04) | 1.12 (0.02) | 1.00 (0.01) |

Reproduction results (MAE × 1e-1):

| cnn | gcnn | rgcnn |
|---|---|---|
| 0.992 (0.03) | 0.928 (0.04) | 0.915 (0.04) |
We see that although all our results are slightly better than the original ones, the trend of the relaxed equivariant model outperforming the fully equivariant model on fully equivariant data remains.
Additionally, we investigate the parameter efficiency of the networks below.
Number of learnable parameters:

| cnn | gcnn | rgcnn |
|---|---|---|
| 132795 | 123801 | 130175 |

Parameter efficiency (MAE per 1e6 parameters):

| cnn | gcnn | rgcnn |
|---|---|---|
| 0.747 (0.03) | 0.750 (0.03) | 0.703 (0.03) |
We observe that the relaxed equivariant network is more parameter-efficient than the fully equivariant network.
We compare our results with those in Wang et al. (2022) for the relaxed regular and steerable GCNNs. The reconstruction RMSE for both methods is shown in the table below.
Results from original paper (RMSE):

| | rgroup | rsteer |
|---|---|---|
| Domain | 0.73 (0.02) | 0.67 (0.01) |
| Future | 0.82 (0.01) | 0.80 (0.00) |

Reproduction results (RMSE):

| | rgroup | rsteer |
|---|---|---|
| Domain | 0.90 (0.04) | 0.67 (0.00) |
| Future | 0.88 (0.03) | 0.82 (0.00) |
Again, although there are some discrepancies in values, we observe the same trend, with comparable performance for the relaxed steerable GCNN. On the other hand, our results for the relaxed regular GCNN differ somewhat from those reported in the original paper. One potential reason is that the original paper did not provide the hyperparameters used to obtain their results, and we did not perform a grid search over the provided grid of parameter values. Another reason might be that we used a different early-stopping metric.
To maximize reproducibility and future usability, we provide config files for all the experiments, models, datasets, trainers, etc. using Hydra and PyTorch Lightning (more information in the README). This means that all the models are wrapped in Lightning Modules and all datasets (SmokePlume, JHTDB) are uploaded to HuggingFace and have a corresponding Lightning DataModule. We reuse and upgrade the data generation scripts for the SmokePlume datasets from Wang et al. (2022) and implement a configurable data generation and HuggingFace-compatible data loading script from scratch for the JHTDB dataset. Furthermore, we integrate our code with Weights and Biases and publish all the relevant runs and plots in publicly accessible reports ([11], [12], [13]). Finally, for the relaxed regular group convolutional networks, we implement all components on our fork of the gconv library ([14]).
To summarize the missing/added reproduction code:
- For Wang et al. (2023):
- All models (rgcnn, gcnn, cnn), where we implement 3D relaxed separable convolutions, octahedral group convolutions, 3D equivariant transposed convolutions, and 3D group upsampling; we made educated guesses on which activations and normalizations to use and where to place them (along skip and upsampling residual connections).
- The JHTDB dataset, where we implement all the subsampling, preprocessing and loading logic for the 3D turbulence velocity fields.
- For Wang et al. (2022), we added the missing weight constraint and hyperparameters for rgroup.
All the experimentation code can be found at: https://github.com/dgcnz/dl2.
As the results of the reproduction match those in their respective papers, we are free to conduct several analyses using approximate equivariance. For these experiments, we introduce a dataset that is very similar to the 2D smoke dataset seen in Reproduction. Additionally, we analyze trained models to learn about their training dynamics. The techniques used for this are explained in Theory for Analysis. With these results, we answer the research questions posed in the Introduction and ultimately shed light on how equivariance imposed on a model affects its training dynamics.
In this section, we introduce the necessary definitions of measuring quantities of interest for our additional experiments.
It is natural to measure the amount of equivariance a network $\Phi$ possesses by its expected equivariance error,
$$\mathbb{E}_{x} \left[ \frac{1}{|G|} \sum_{g \in G} \big\| \Phi(g \cdot x) - g \cdot \Phi(x) \big\| \right].$$
We can estimate this expectation by computing the average over a series of batches from our test set. However, this approach has downsides, which we can tackle using the Lie derivative.
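Before moving on, here is a sketch of this batch estimate for $G = C_4$, assuming scalar feature fields so that the group acts by simply rotating the spatial grid (for the vector fields in the smoke data, the rotation would also have to act on the velocity channels; the normalization by the output norm is our choice):

```python
import torch

def c4_equivariance_error(model, x):
    """Monte-Carlo estimate of the C4 equivariance error
    E_x[ 1/|G| sum_g ||model(g.x) - g.model(x)|| ] on one batch.

    model: maps (B, C, H, W) -> (B, C', H, W); rotations act by
           rotating the spatial grid (scalar fields, for simplicity).
    x:     input batch, shape (B, C, H, W).
    """
    y = model(x)
    err = 0.0
    for r in range(1, 4):  # the non-identity rotations of C4
        y_of_rotated = model(torch.rot90(x, k=r, dims=(-2, -1)))
        rotated_y = torch.rot90(y, k=r, dims=(-2, -1))
        err += (y_of_rotated - rotated_y).norm() / y.norm()
    return err / 3

# Example: a 1x1 convolution is pointwise and hence exactly C4 equivariant.
model = torch.nn.Conv2d(2, 2, kernel_size=1)
print(c4_equivariance_error(model, torch.randn(8, 2, 16, 16)))  # ~0
```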
In practice, even though we are imposing equivariance w.r.t. a discrete group such as $C_4$, we are often also interested in how equivariant the network is to the full, continuous group of rotations, which the discrete equivariance error above cannot capture.

Gruver et al. (2022) proposed the use of Lie derivatives, which focus on the equivariance of the network towards very small transformations in a continuous (Lie) group, such as the rotation group $SO(2)$.

Specifically, if $\Phi$ is the network and $g_t$ denotes the transformation with parameter $t$ (e.g., rotation by angle $t$), the Lie derivative is
$$(\mathcal{L}\Phi)(x) = \frac{d}{dt} \Big|_{t=0} \left[ g_t^{-1} \cdot \Phi(g_t \cdot x) \right].$$
Having small Lie derivatives (in norm) therefore implies that the network is approximately equivariant to the whole continuous group, not just to the discrete subgroup imposed on it.
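A finite-difference sketch of this rotational Lie derivative (Gruver et al. (2022) compute it with autograd; bilinear interpolation and boundary effects make this version noisy, but it conveys the idea):

```python
import math
import torch
import torch.nn.functional as F

def rotate(x, angle):
    """Rotate a batch of images by `angle` radians with bilinear resampling."""
    c, s = math.cos(angle), math.sin(angle)
    theta = torch.tensor([[c, -s, 0.0], [s, c, 0.0]])
    theta = theta.unsqueeze(0).expand(x.size(0), -1, -1)
    grid = F.affine_grid(theta, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def rotation_lie_derivative(model, x, eps=1e-3):
    """Finite-difference estimate of d/dt [g_t^{-1} . model(g_t . x)] at t = 0."""
    perturbed = rotate(model(rotate(x, eps)), -eps)  # g_t^{-1} . model(g_t . x)
    return (perturbed - model(x)) / eps

model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
x = torch.randn(4, 1, 32, 32)
print(rotation_lie_derivative(model, x).norm())  # smaller norm = more equivariant
```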
To assess the training dynamics of a network, we are interested in the final performance and the generalizability of the learned parameters, which we quantify by the final RMSE and by the sharpness of the loss landscape near the final weights (Zhao et al., 2023).
To measure the sharpness of the loss landscape after training, we consider changes in the loss averaged over random directions. Let $L$ denote the loss, $\theta^*$ the final weights, and $D$ a set of random perturbations of fixed norm $T$; we define sharpness as
$$\frac{1}{|D|} \sum_{d \in D} \frac{L(\theta^* + d) - L(\theta^*)}{L(\theta^*)}.$$
This definition is an adaptation of the one in Zhao et al. (2023), which does not normalize by $L(\theta^*)$; normalizing makes the values comparable across models whose losses are on different scales.
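A sketch of how this sharpness measure might be evaluated, assuming `loss_fn(model, data)` returns the scalar loss (names and step size are illustrative):

```python
import torch

def sharpness(model, loss_fn, data, n_dirs=16, T=0.01):
    """Average normalized loss increase over random weight-space
    directions of norm T around the current weights."""
    params = list(model.parameters())
    base = loss_fn(model, data).item()
    total = 0.0
    for _ in range(n_dirs):
        # Sample a random direction and scale it to norm T.
        dirs = [torch.randn_like(p) for p in params]
        scale = T / torch.sqrt(sum((d ** 2).sum() for d in dirs))
        with torch.no_grad():
            for p, d in zip(params, dirs):
                p.add_(scale * d)                   # perturb the weights
            perturbed = loss_fn(model, data).item()
            for p, d in zip(params, dirs):
                p.sub_(scale * d)                   # restore the weights
        total += (perturbed - base) / base
    return total / n_dirs
```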
Finally, the Hessian eigenvalue spectrum (Park & Kim, 2022) sheds light on both the efficiency and efficacy of neural network training. Negative Hessian eigenvalues indicate a non-convex loss landscape, which can disturb the optimization process, whereas very large eigenvalues indicate training instability, sharp minima and consequently poor generalization.
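The extreme eigenvalues can be estimated without ever forming the Hessian, using Hessian-vector products; below is a minimal power-iteration sketch of the idea (full spectra as in Park & Kim (2022) require heavier tooling, e.g. stochastic Lanczos quadrature):

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=50):
    """Estimate the largest-magnitude Hessian eigenvalue of `loss`
    w.r.t. `params` by power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = None
    for _ in range(iters):
        norm = torch.sqrt(sum((x ** 2).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad . v) once more.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v))  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.item()

# Usage: a tiny regression model and MSE loss.
model = torch.nn.Linear(5, 1)
x, y = torch.randn(16, 5), torch.randn(16, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
print(top_hessian_eigenvalue(loss, list(model.parameters())))
```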
The purpose of our extensions is twofold. First, we examine the impact of equivariance imposed and data equivariance on the amount of equivariance learned, answering subquestions 1 and 2 posed in Introduction. We do this by computing the equivariance error and Lie derivative. We plot these measures for varying levels of imposed equivariance and data equivariance.
Second, we examine how equivariance imposed on a network influences the convexity of the loss landscape and generalization, answering subquestions 3 and 4 posed in the Introduction. We can strongly impose equivariance on a network through architecture design, and weakly impose it through a regularization term in the loss of the relaxed models. We train multiple models with different levels of imposed equivariance on two fully equivariant datasets, namely the super resolution dataset and the 2D Smoke Plume with varying levels of equivariance, both introduced in Reproduction Methodology. Note, however, that the 2D Smoke Plume dataset we use for this additional study is modified, with the changes described in Smoke Plume with Varying Equivariance. For these models, we examine the convexity of their loss landscapes and their generalizability with the measures defined in Extension Theory.
For this experiment, we use a synthetic 2D Smoke Plume dataset generated with PhiFlow, analogous to the one used in the reproduction, but with inflow from four directions (up, right, down, left) whose buoyant forces we vary to control the amount of rotational equivariance the data possesses.
To quantify different levels of data equivariance, we use the following metric proposed in Wang et al. (2022): First, we rotate the right, down, and left directions back to the upward position. Then, we compare these rotated directions against the original upward direction. The MAE of these comparisons is considered as the equivariance error the dataset possesses.
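A sketch of this dataset-level metric for velocity fields (the rotation conventions and tensor shapes here are our own assumptions):

```python
import torch

def data_equivariance_error(flows):
    """Equivariance error of the dataset itself, in the spirit of
    Wang et al. (2022): rotate the right-, down- and left-inflow
    simulations back to the upward orientation and take the MAE
    against the upward one.

    flows: dict mapping 'up'/'right'/'down'/'left' to velocity fields
           of shape (T, 2, H, W).
    """
    def rot90_field(v, k):
        # Rotating a vector field rotates both the grid and the vectors.
        v = torch.rot90(v, k=k, dims=(-2, -1))
        for _ in range(k):
            vx, vy = v[:, 0], v[:, 1]
            v = torch.stack((-vy, vx), dim=1)  # 90-degree vector rotation
        return v

    up = flows["up"]
    err = 0.0
    for name, k in (("right", 1), ("down", 2), ("left", 3)):
        err += (rot90_field(flows[name], k) - up).abs().mean()
    return err / 3
```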
In total, we experiment with several levels of data equivariance, ranging from fully rotation equivariant to strongly non-equivariant.

We use two model architectures: the relaxed steerable network (rsteer) from Wang et al. (2022) and the fully equivariant steerable network E2CNN (Weiler & Cesa, 2019).

Using these models means we have the following two ways of imposing equivariance on a network:

- Strongly, by using the E2CNN model architecture, which is strictly equivariant and has a very similar architecture to the rsteer model.
- Weakly, by adding a regularization term to the loss in the rsteer model; namely, the parameter $\alpha$ scales the regularization term and thereby controls how strongly equivariance is imposed.
We investigate this on two different datasets.
We use the same Smoke Plume dataset. We analyze the model checkpoints corresponding to the third epoch and the best epoch during training, where best means lowest validation RMSE. As we will see, the regularization using $\alpha$ has little effect on the amount of equivariance the model learns.
We first investigate the models' generalizability by looking at the training and validation curves obtained by running the reproduction experiments in Reproduction for both the relaxed and the non-relaxed models. Additionally, we evaluate the sharpness metric for both models and a CNN.
For this, we again use the third epoch and the epoch with the best loss. Although we wanted to also compute the Hessian spectra, this was unfortunately not possible because the second derivative of the 3D grid sampler used in both equivariant networks is not yet implemented in PyTorch.
Figure 3: Impact of equivariance imposed on the model's equivariance error, for rsteer and E2CNN models

Figure 4: Impact of equivariance imposed on the model's Lie derivative, for rsteer and E2CNN models
Figure 3 shows the equivariance error of different model specifications. The Equivariant Net is the E2CNN model; it is positioned at the right of the x-axis because a fully equivariant net can be thought of as a relaxed equivariant net with $\alpha$ set to infinity.
For rsteer, we observe that the data equivariance has a large effect on how equivariant the model learns to be. This shows that the relaxed architecture can adapt well to the equivariance of the data, which matches the findings in Wang et al. (2022). However, we see that the hyperparameter $\alpha$ has comparatively little influence on the equivariance error the model ends up with.
Figure 4 shows the Lie derivative for different model specifications. A lower Lie derivative means the model is more equivariant to the full rotation group. For rsteer we see results similar to Figure 3. However, for E2CNN, we do not see a zero Lie derivative because the architecture only guarantees equivariance w.r.t. the $C_4$ group.
Interestingly, rsteer exhibits a lower Lie derivative than E2CNN when trained on fully equivariant data. This could be due to rsteer's greater flexibility, allowing it to learn equivariance w.r.t. a broader group of rotations beyond C4. In contrast, E2CNN achieves perfect C4 equivariance but struggles to generalize to all rotations.
First, we examine the training, validation and test RMSE for the E2CNN and rsteer models on the fully equivariant Smoke Plume dataset.
Figure 5: Train RMSE curve for rsteer and E2CNN models

Figure 6: Validation RMSE curve for rsteer and E2CNN models
Figures 5 and 6 show the train and validation RMSE curves, respectively. We see that on the training data, rsteer and E2CNN have similar performance. However, on the validation data, the curve for rsteer lies strictly below the one for E2CNN. Therefore, the relaxed steerable GCNN, i.e. rsteer, seems to generalize better. This again might be attributed to its flexibility compared to the vanilla steerable GCNN.
Figure 7 shows the test set RMSE for the two models averaged over five seeds. We find that the relaxed equivariant model performs better, even though the data is fully C4 equivariant, reaffirming the observation we validated on the Isotropic Flow dataset.
To obtain insight into why the relaxed equivariant models outperform the fully equivariant ones on these datasets, we inspect the Hessian spectra and the sharpness of the loss landscape of these models.
Figure 8: Hessian spectra at an early epoch for rsteer and E2CNN models

Figure 9: Hessian spectra at the best epoch for rsteer and E2CNN models
Figures 8 and 9 show the Hessian spectra for the same early and best checkpoints of E2CNN and rsteer used in the previous analysis. These plots support a similar conclusion about the flatness of the loss landscape: for both checkpoints, E2CNN has much larger eigenvalues than rsteer, which can lead to training instability, sharper minima, and consequently poorer generalization for E2CNN.
To evaluate the convexity of the loss landscape, we focus on the negative eigenvalues in the Hessian spectra. Neither spectrum shows any negative eigenvalues, suggesting that both the fully equivariant E2CNN and the relaxed rsteer model traverse "convex" regions of the loss landscape. Thus, in this case, the convexity of the loss landscape does not seem to play a large role in the performance difference.
Next, we examine checkpoints for the two models trained on the Smoke Plume dataset with only partially equivariant data (Figure 10).
Similarly, we also analyze the training dynamics of the super-resolution models on the isotropic JHTDB dataset as a potential explanation for the superiority of the relaxed equivariant model over the fully equivariant one.
First, we examine the training and validation MAE curves for the Relaxed Equivariant (RGCNN), Fully Equivariant (GCNN), and non-equivariant (CNN) models (run on 6 different seeds).
Figure 11: Training MAE curve for RGCNN, GCNN and CNN models

Figure 12: Validation MAE curve for RGCNN, GCNN and CNN models
Here, we observe that differences between the models are already visible early in the training (around the third epoch); we therefore analyze an early checkpoint alongside the best one.
Figure 13: Sharpness of the loss landscape on the super resolution dataset. Run over 6 seeds; error bars represent the standard deviation. For early, the third epoch was chosen, while for best the epoch with the best validation loss was chosen.
In any case, as seen in Figure 13, the sharpness value of the loss landscape was the lowest for the relaxed model in both the early and best checkpoints. This again indicates that the relaxed steerable GCNN has better generalizability during its training and at its convergence, matching our previous findings in our extensions on the Smoke Plume dataset and the reproduction study on the super resolution dataset.
We reproduced two experiments: (1) on the Smoke Plume dataset in Wang et al. (2022) and (2) on the super resolution dataset in Wang et al. (2023). Our reproduction results align with the findings of the authors of the original papers, reaffirming the effectiveness of relaxed equivariant models and demonstrating that they are able to outperform fully equivariant models even on perfectly equivariant datasets. We extend our findings from the reproduction in (2) to the fully equivariant Smoke Plume dataset and find that the same conclusion holds there.
We furthermore investigated the authors' speculation that this superior performance could be due to relaxed models having enhanced training dynamics. Our experiments empirically support this hypothesis, showing that relaxed models exhibit lower validation error, a flatter loss landscape around the final weights, and smaller Hessian eigenvalues, all of which are indicators of improved training dynamics and better generalization.
Finally, we demonstrated that the amount of equivariance in the training data predominantly influences the amount of equivariance learned by relaxed equivariant models. Datasets with higher degrees of equivariance yield models with higher degrees of internalized equivariance. Conversely, adding regularization terms to the loss function has negligible effects on the amount of learned equivariance.
Our results suggest that replacing fully equivariant networks with relaxed equivariant networks could be advantageous in all application domains where some level of model equivariance is desired, including those where full equivariance is beneficial. Future research should investigate different variants of the relaxed model to find out which hyperparameters, such as the number of filter banks, correlate with sharpness. Additionally, the method should be applied to other types of data to see whether the same observations hold there.
- Nesta: Reproduction of Wang et al. (2022), including porting models to Lightning and creating configurations. Implementation of experiment scripts using the Wandb API. Implementation of the equivariance error, parts of the Hessian spectra and the sharpness metric. Writing of the analysis in the results section for the experiments using the Smoke Plume dataset.
- Sebastian: Research of Lie derivatives, Hessians, Sharpness and Writing.
- Jiapeng: Research and implementation of Lie derivatives and Sharpness, Research on Hessians, Writing.
- Thijs: Research on the octahedral group, Implementation of Super-Resolution models and 3D separable group upsampling on gconv. Reproduction code from Wang et al. (2023). Writing.
- Diego: Integration with Hydra, Integration with W&B, Implementation of Hessian Spectra, Reproduction code for Wang et al. (2023), Implementation of the JHTDB dataloader, Implementation of octahedral relaxed separable, lifting and regular group convolutions on gconv library, SLURM setup, hyperparameter search.
[1] Wang, R., Walters, R., & Smidt, T. E. (2023). Relaxed Octahedral Group Convolution for Learning Symmetry Breaking in 3D Physical Systems. arXiv preprint arXiv:2310.02299.
[2] Gruver, N., Finzi, M., Goldblum, M., & Wilson, A. G. (2022). The lie derivative for measuring learned equivariance. arXiv preprint arXiv:2210.02984.
[3] Park, N., & Kim, S. (2022). How do vision transformers work?. arXiv preprint arXiv:2202.06709.
[4] Zhao, B., Gower, R. M., Walters, R., & Yu, R. (2023). Improving Convergence and Generalization Using Parameter Symmetries. arXiv preprint arXiv:2305.13404.
[5] Wang, R., Walters, R., & Yu, R. (2022, June). Approximately equivariant networks for imperfectly symmetric dynamics. In International Conference on Machine Learning (pp. 23078-23091). PMLR.
[6] Holl, P., Koltun, V., Um, K., & Thuerey, N. (2020). phiflow: A differentiable pde solving framework for deep learning via physical simulations. In NeurIPS workshop (Vol. 2).
[7] Y. Li, E. Perlman, M. Wan, Y. Yang, C. Meneveau, R. Burns, S. Chen, A. Szalay & G. Eyink. "A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence". Journal of Turbulence 9, No. 31, 2008.
[8] E. Perlman, R. Burns, Y. Li, and C. Meneveau. "Data Exploration of Turbulence Simulations using a Database Cluster". Supercomputing SC07, ACM, IEEE, 2007.
[9] Super-resolution of Velocity Fields in Three-dimensional Fluid Dynamics: https://huggingface.co/datasets/dl2-g32/jhtdb
[10] Weiler, M., & Cesa, G. (2019). General E(2)-Equivariant Steerable CNNs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 14334–14345.
[11] Turbulence SuperResolution Replication W&B Report: https://api.wandb.ai/links/uva-dl2/hxj68bs1
[12] Equivariance and Training Stability W&B Report: https://api.wandb.ai/links/uva-dl2/yu9a85jn
[13] Rotation SmokePlume Replication W&B Report: https://api.wandb.ai/links/uva-dl2/hjsmj1u7
[14] gconv library for regular group convnets: https://github.com/dgcnz/gconv
[15] Bekkers, E. J., Vadgama, S., Hesselink, R. D., van der Linden, P. A., & Romero, D. W. (2023). Fast, Expressive SE(n) Equivariant Networks through Weight-Sharing in Position-Orientation Space. arXiv preprint arXiv:2310.02970.