The task was to identify the camera model with which an image was taken.
- Images in the training set were captured with 10 different camera models, a single device per model, with 275 full images from each device.
- The list of camera models is as follows:
- Sony NEX-7
- Motorola Moto X
- Motorola Nexus 6
- Motorola DROID MAXX
- LG Nexus 5x
- Apple iPhone 6
- Apple iPhone 4s
- HTC One M7
- Samsung Galaxy S4
- Samsung Galaxy Note 3
- Images in the test set were captured with the same 10 camera models, but using a second device.
- While the train data includes full images, the test data contains only 512 x 512 pixel blocks cropped from the center of a single image taken with the device.
- Half of the images in the test set have been altered. The image names indicate whether or not they were manipulated. The set of possible processing operations is as follows:
- JPEG compression with quality factor \in {70, 90}
- resizing (via bicubic interpolation) by a factor of {0.5, 0.8, 1.5, 2.0}
- gamma correction using gamma \in {0.8, 1.2}
- Random samples from the training set:
The evaluation metric was weighted accuracy, with weight 0.7 for unaltered images and 0.3 for altered images.
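In code, the metric looks roughly like this (my reading of the description, not the official evaluation script):

```python
import numpy as np

def weighted_accuracy(y_true, y_pred, is_altered):
    """Unaltered images weigh 0.7, altered images 0.3."""
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    weights = np.where(np.asarray(is_altered), 0.3, 0.7)
    return float((weights * correct).sum() / weights.sum())
```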
- fine-tune CNNs pre-trained on ImageNet, starting from DenseNet-121 and ResNet-{34, 50}
- add a 2-layer FC head with `PReLU` activation
- random crops 256x256 with horizontal mirroring
- naturally, apply the above manipulations as augmentations (random JPG compression, resizing, or gamma correction) to make the model robust to them; see the sketch below
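A rough sketch of such on-the-fly manipulation augmentation (OpenCV-based; the actual implementation in the repo may differ):

```python
import random
import cv2
import numpy as np

def random_manipulation(img):
    """Randomly apply one of the test-set manipulations to a uint8 image."""
    op = random.choice(["jpeg", "resize", "gamma", "none"])
    if op == "jpeg":
        quality = random.choice([70, 90])
        ok, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
        img = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    elif op == "resize":
        factor = random.choice([0.5, 0.8, 1.5, 2.0])
        img = cv2.resize(img, None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC)
    elif op == "gamma":
        gamma = random.choice([0.8, 1.2])
        img = np.uint8(np.clip(((img / 255.0) ** gamma) * 255.0, 0, 255))
    return img
```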
- losses:
- standard categorical cross-entropy (logloss)
- multi-class classification hinge loss (`nn.MultiMarginLoss`)
- `nn.MultiLabelSoftMarginLoss`
- optimization:
- SGD with momentum worked better (converged to deeper minima) for these ResNet-like architectures; Adam was also tried
- implement and use stratified mini-batch variants of SGD (stratified by class and eventually also by `is_manip`)
- use early stopping
- implement and use `ReduceLROnPlateau`
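A simplified sketch of class-stratified batch sampling (the real sampler also stratifies by `is_manip`; function and argument names are illustrative):

```python
import numpy as np

def stratified_batches(labels, batch_size, n_classes=10, rng=None):
    """Yield index batches containing an (almost) equal number of samples per class."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    per_class = batch_size // n_classes
    pools = {c: rng.permutation(np.where(labels == c)[0]).tolist() for c in range(n_classes)}
    n_batches = min(len(p) for p in pools.values()) // per_class
    for _ in range(n_batches):
        batch = []
        for c in range(n_classes):
            batch.extend(pools[c][:per_class])
            pools[c] = pools[c][per_class:]
        yield rng.permutation(batch)
```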
- after building a simple baseline with DenseNet-121 + ResNet-{34,50} that scores 0.913 (public LB), build a stronger validation set:
- use the least confident predictions on the training set
- use the same number of examples from each of the classes
- also generate pseudo-labels: use the most confident predictions on a test set as "ground truth" and add them to the validation set (helps when there is some train/test data distribution mismatch, as in this case)
- the resulting validation set correlated with public LB very well
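Roughly, the selection logic (a sketch; array names and the cut-off counts are illustrative, not the actual values used):

```python
import numpy as np

def build_validation_and_pseudo(proba_train, y_train, proba_test,
                                n_val_per_class=50, n_pseudo=500, n_classes=10):
    """Least confident train samples per class -> validation set;
    most confident test samples -> pseudo-labelled extra data."""
    conf_train = proba_train.max(axis=1)
    val_idx = []
    for c in range(n_classes):
        idx_c = np.where(y_train == c)[0]
        val_idx.extend(idx_c[np.argsort(conf_train[idx_c])[:n_val_per_class]])

    conf_test = proba_test.max(axis=1)
    pseudo_idx = np.argsort(-conf_test)[:n_pseudo]
    pseudo_labels = proba_test[pseudo_idx].argmax(axis=1)
    return np.array(val_idx), pseudo_idx, pseudo_labels
```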
- try larger architectures: DenseNet-201, ResNet-{101, 152}, ResNeXt-101 ({32, 64})
- add Dropout to the "head" (FC layer)
- implement and use Cyclical LR for faster transfer learning, [arXiv]
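For reference, recent PyTorch versions ship an equivalent scheduler, so a minimal usage sketch could be:

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual CNN
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=2000, mode="triangular2")

for batch in range(10000):
    # forward / backward pass and optimizer.step() go here
    scheduler.step()  # one scheduler step per mini-batch
```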
- use external data :P from Flickr and other resources (+ check the image metadata to find out the device) to construct a much larger dataset, [kaggle discussion]; filter the data in the [notebook]
- construct a new strong, balanced validation set from the least confident predictions on the new dataset (notebook)
- the new training dataset is imbalanced, so use:
- stratified undersampling
- class-weighted loss function (class weight = `1/n_images_in_train_per_class`)
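For example, with PyTorch the weighting can be as simple as (sketch; `train_labels` is an illustrative name, and the renormalization is optional):

```python
import numpy as np
import torch
import torch.nn as nn

train_labels = np.random.randint(0, 10, size=5000)   # integer class id per training image
counts = np.bincount(train_labels, minlength=10)
class_weights = 1.0 / counts                          # weight = 1 / n_images_per_class
class_weights = class_weights / class_weights.sum() * len(counts)

criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float32))
```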
- add new pseudo-labels, all in stratified manner
- generate and save random crops in large blocks for faster reading from HDD (other approaches, like saving to an `lmdb` database, can be found in the notebook)
- add a `--bootstrap` option for bagged runs
- the last day:
- add new data (from `artgor`), generate a new validation set and update pseudo-labels [notebook]
- rot90 all patches during training
- add the `is_manip` flag to the FC head
- ‼️ implement and use Distillation learning [arXiv], in order to train some of the new models really fast by matching logits with strong models early in training
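By "matching logits" I mean a plain regression of the new model's logits onto a strong model's logits; schematically (a sketch, not the exact training code, `alpha` is illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """Match the strong model's logits (MSE) while still fitting the true labels."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.mse_loss(student_logits, teacher_logits)
    return alpha * soft + (1.0 - alpha) * hard
```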
- test-time augmentation (TTA): 2/3 * `FiveCrops` (center and corners) + 1/3 * `rot90`; the idea is that almost nobody takes a photo upside down (so `rot180` and `rot270` are quite unlikely)
- combine predictions of multiple models using arithmetic averaging of the logits
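Schematically, the TTA weighting and the logit averaging (a sketch; assumes 512x512 input patches and torchvision's `five_crop`/`center_crop` helpers):

```python
import torch
from torchvision.transforms import functional as TF

def predict_tta(model, patch):
    """patch: (C, 512, 512) tensor; returns TTA-averaged logits."""
    crops = TF.five_crop(patch, 256)                              # center + four corners
    five = torch.stack([model(c.unsqueeze(0)) for c in crops]).mean(dim=0)
    rot = model(TF.center_crop(torch.rot90(patch, 1, dims=(1, 2)), 256).unsqueeze(0))
    return 2.0 / 3.0 * five + 1.0 / 3.0 * rot

# ensemble: arithmetic mean of logits across models
# ensemble_logits = torch.stack([predict_tta(m, patch) for m in models]).mean(dim=0)
```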
- "equalize" predictions using a simple greedy algorithm, as we know that each class has the same fraction in the private test set:
- this was concluded after a couple of submissions predicting one class only, each scoring exactly 0.1 on the public LB
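The greedy equalization can be sketched as: take the most confident remaining prediction and assign it to its best class that still has quota left (a sketch, not the exact code):

```python
import numpy as np

def equalize_predictions(proba, n_classes=10):
    """Greedily relabel so that every class ends up with ~n/n_classes predictions."""
    n = len(proba)
    quota = np.full(n_classes, int(np.ceil(n / n_classes)))
    labels = np.full(n, -1)
    order = np.argsort(-proba.max(axis=1))        # most confident samples first
    for i in order:
        for c in np.argsort(-proba[i]):           # best still-available class for sample i
            if quota[c] > 0:
                labels[i] = c
                quota[c] -= 1
                break
    return labels
```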
- train models from scratch:
- simple "toy" CNNs
- AlexNet-like CNNs with wide receptive field
- ResNet-like architectures (took too long)
- other preprocessings:
- special preprocessing, or special ordering of patches according to their complexity or informativeness (a variant of "curriculum learning"); for more details please refer to [paper] and [notebook]
- central crops 256x256 (even with rotations and/or horizontal mirroring)
- larger or smaller patches (e.g. 512x512, 384x384, 224x224, 128x128)
- central 1024x1024 patches followed by random 256x256 crops
- apply convolution with a 5x5 edge-detection kernel (`--kernel`) prior to training
- random "optical" crops (rotate patches such that the optical center is always in the same, predetermined corner)
- align crops by two pixels (`--align`)
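The `--kernel` option boils down to a fixed 5x5 high-pass convolution; something along these lines (the exact kernel values here are an assumption):

```python
import cv2
import numpy as np

# an illustrative 5x5 high-pass kernel (the exact values behind --kernel are an assumption)
kernel = -np.ones((5, 5), dtype=np.float32)
kernel[2, 2] = 24.0
kernel /= 24.0

def edge_preprocess(img):
    """Convolve each channel with the fixed 5x5 kernel before feeding the CNN."""
    return cv2.filter2D(img.astype(np.float32), -1, kernel)
```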
- other variants of test-time augmentation:
- 16 random crops (no rotation)
- JPG compression (some of the images are already compressed)
- gamma correction {0.8, 1.0, 1.2}
- `FiveCrops` (center and corners), no rotations
- `FiveCrops` + `rot90`
- `TenCrops` (center and corners + horizontally mirrored), no rotations
- `TenCrops` + rotations
- `TenCrops` + `rot90`
- 2/3 * `TenCrops` + 1/3 * `rot90`
- 2/5 * `TenCrops` + 1/5 * `rot90` + 1/5 * `rot180` + 1/5 * `rot270`
- the whole D4 group of transformations
- other variants of combining predictions from multiple models:
- arithmetic, geometric average of probabilities
- median of probabilities, logits
- weighted median of probabilities, logits
- arithmetic average of `sqrt(proba)`
- arithmetic average of `proba ** 2`
- arithmetic average of `softmax(logits * C)`, C \in {0.5, 2.0}
- arithmetic average of `g(logits)`, where `g(x) = sqrt(|x|) * sign(x)`
- arithmetic average of `softmax(g(logits))`, where `g(x) = sqrt(|x|) * sign(x)`
- stacking (blending): train Logistic regression or SVM on their logits or probabilities
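Stacking here means fitting a simple second-level model on out-of-fold first-level outputs; e.g. (a sketch with illustrative array names):

```python
from sklearn.linear_model import LogisticRegression

def stack(oof_logits, y_train, test_logits):
    """oof_logits / test_logits: (n_samples, n_models * n_classes) first-level outputs."""
    blender = LogisticRegression(max_iter=1000)
    blender.fit(oof_logits, y_train)
    return blender.predict_proba(test_logits)
```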
- best single model: 0.970 (public LB)
- final model:
- 7 bagged DenseNet-121's
- ensemble of various architectures trained using different initialization, hyperparameters, preprocessing, TTA, losses, optimizers, LR schedules, stages of training (checkpoints), etc. throughout the project (most of which are not as powerful as the best single one), in total 33 models: 0.979 (private LB)
- best private LB was 0.981 (:arrow_right: 14th place)
- top1 solution: 0.989, using 350GB of data and 20 GPUs 😱
- placed 17/581 🎉
To get the most out of my limited resources, I implemented a Telegram bot that can quickly:
- display current checkpoints
- plot learning curves
- plot confusion matrices
- stratified split not only by class (and `is_manip`) but also by scene
- try freezing for the first epoch (update only the affine layers in the "head"), then "release" with a small learning rate
- try Mixup: [arXiv]
- use Snapshot ensembles for CNNs trained with Cyclical LR: [arXiv]
- or simply average the top-K best checkpoints (similar effect if learning curves oscillate)
- play around with low-level features from CNNs (e.g. train k-NN on top of those)
- incorporate FFT-based features (e.g. on `image - smooth(image)`)
- stack with `xgboost` 💪
- more advanced equalization methods:
- Hungarian algorithm
- force a uniform distribution over the test-set predictions in an information-theoretic sense (maximize the entropy)
- try `np.memmap` in case of large datasets on HDD