Our implementation of a simplified Vision Transformer (ViT) adopts a patch size of 16 and features an encoder with 4 layers, each equipped with 4 attention heads.
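A minimal sketch of such a model is shown below, assuming a 224x224 RGB input, a 256-dimensional embedding, and a 10-class output head; these values, and the use of PyTorch's built-in `nn.TransformerEncoder`, are illustrative assumptions rather than details of our actual implementation.

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=256, num_layers=4, num_heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution splits the image into
        # non-overlapping 16x16 patches and projects each to embed_dim.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Encoder: 4 layers, each with 4 attention heads, as described above.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                   # (B, embed_dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                 # classify from the [CLS] token

model = SimpleViT()
logits = model(torch.randn(2, 3, 224, 224))       # -> shape (2, 10)
```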
The model has been evaluated using standard classification metrics:
- Accuracy: 58.96%
- Hamming Loss: 0.0680
These metrics reflect the preliminary results obtained under our current experimental setup.
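For reference, the two metrics could be computed with scikit-learn as in the sketch below; the indicator matrices are placeholders, and the exact label format used in our evaluation is an assumption here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Illustrative binary label matrices (3 samples, 4 labels), not our real data.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])

print(f"Accuracy:     {accuracy_score(y_true, y_pred):.4f}")  # exact-match ratio
print(f"Hamming Loss: {hamming_loss(y_true, y_pred):.4f}")    # fraction of wrong labels
```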