Kaggle review
Planet: Understanding the Amazon from Space
Tyantov Eduard
Challenge overview
#1 Challenge
Numbers
– Images of land surface of Earth
– Goal: help to detect deforestation of Amazon rainforest
– Classification task, 17 classes: atmospheric conditions, common land cover and rare land cover
– 3 months
– 938 teams
– $60k in prizes
#2 Scene examples: atmospheric
Cloudy
Haze
Partly cloudy
#2 Scene examples: common
Primary
Habitation
Water (river)
#2 Scene examples: common
Bare ground
Cultivation
Agriculture + road
#2 Scene examples: rare
Selective logging
Conventional Mining
Slash & burn
#2 Scene examples: rare
Blooming
Blow down
"Artisinal" Mining
(small & illegal)
#3 Data acquiring
Process
– GeoTIFF format (red, green, blue, near infrared)
– TIFF -> JPG using “Planet visual product processor”
– 1,600 panoramas split into 150k chips
– 17 chosen labels
– Labeling: CrowdFlower platform
– Assessors used only JPG data!
#4 Data characteristics
– 256x256 chips
– TIF: 4 channels, 16-bit values
– JPG converted from TIF
– Train: 40.5k, test: 62k
#5 Evaluation
Score
F2 score, averaged across rows (per-image F2, averaged over all images)
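A minimal sketch of this metric using scikit-learn (the official evaluation script is Kaggle's, so treat this as an approximation):

```python
# Competition metric sketch: F2 per image, averaged over images.
# Assumes y_true / y_pred are binary indicator matrices of shape (n_images, 17).
import numpy as np
from sklearn.metrics import fbeta_score

def mean_f2(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2, average='samples')

y_true = np.array([[1, 0, 1, 0], [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 1, 1]])
print(mean_f2(y_true, y_pred))  # mean of the two per-image F2 scores
```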
Baseline
#6 Data problems
Resolved problems
– Test data leakage (geo data in tif) -> new test set
– New test set was heavily muddled (jpg != tif)
Unresolved
– Misalignment of jpg and tif
– Different signals in tif & jpg (for small percentage of the data)
– Very noisy labels
• In some cases the atmospheric label appears random
(Figure: JPG-BGR / TIF-BGR / TIF-NIR alignment examples)
#7 Label distribution
(Figures: label frequencies and co-occurrence matrix)
#8 Baseline
Aspects
– Code: Pytorch
– Model
• resnet18 pretrained from Imagenet
• sigmoid + cross-entropy
• trainable: block3 + block4 + FC
– Training aspects:
• SGD, lr=0.1
• Augmentation: typical ImageNet
• Test-time augmentation (TTA) - same
– Decision: p > 0.5
Result (F2-score): 90.06%
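A minimal PyTorch sketch of this baseline, mapping “block3/block4” to torchvision's layer3/layer4 (momentum and other training details are assumptions, not the author's code):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 17)        # 17-label multi-label head

# Train only layer3, layer4 and the new FC; keep earlier layers frozen.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(('layer3', 'layer4', 'fc'))

criterion = nn.BCEWithLogitsLoss()                    # sigmoid + cross-entropy
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1, momentum=0.9)

# Inference decision rule: a label is present if sigmoid(logit) > 0.5
```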
#9 Baseline: overall threshold
Aspects
– Code: Pytorch
– Model
• resnet18 pretrained from Imagenet
• sigmoid + cross-entropy
• trainable: block3 + block4 + FC
– Training aspects:
• SGD, lr=0.1
• Augmentation: standard ImageNet
• Test-time augmentation (TTA) - same
– Decision: p > 0.2
Result (F2-score): 91.79% (+1.73)
#10: Choosing F2-thresholds
We can find quasi-optimal thresholds per class!
It’s OK to use the validation set. Two methods:
1. Per class
– Brute-force thresholds for each class independently
– Metric = F2-score per class
2. Joint optimization
– Gibbs sampling
• Starting from thresholds=[0.2]*17
– Metric = F2-score averaged across valid set
Summary
– Per class is more coherent (sketch below)
– Joint yields better results if averaged over folds
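A sketch of the per-class brute-force search (method 1); the joint Gibbs-style optimization applies the same idea while optimizing the mean F2 and cycling over classes. Array shapes and the search grid are assumptions:

```python
import numpy as np
from sklearn.metrics import fbeta_score

def per_class_thresholds(probs, labels, grid=np.arange(0.05, 0.50, 0.01)):
    """probs, labels: arrays of shape (n_samples, 17); returns one threshold per class."""
    thresholds = np.full(probs.shape[1], 0.2)
    for c in range(probs.shape[1]):
        scores = [fbeta_score(labels[:, c], (probs[:, c] > t).astype(int), beta=2)
                  for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds
```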
#11 Baseline: optimal thresholds
Aspects
– Code: Pytorch
– Model
• resnet18 pretrained from Imagenet
• sigmoid + cross-entropy
• trainable: block3 + block4 + FC
– Training aspects:
• SGD, lr=0.1
• Augmentation: standard ImageNet
• Test-time augmentation (TTA) - same
– Decision: p > p_class (0.1 .. 0.3)
– Conclusion: choosing thresholds is crucial for the leaderboard
Result (F2-score): 92% (+0.21)
#12 Enhancing baseline
Aspects
1. Plateau scheduler
• Start from the highest lr
• lr = lr/10 after N=3 epochs without improvement
• After changing lr: load the best model so far
2. How to finetune? Experiments:
• conv layers LR = LR/10
• warm-up: several epochs training only the FC
• Best:
– FC, L3, L4 at the same lr
– FC, {L4, L3} * 0.1, {L2, L1} * 0.05 with warm-up until the loss degrades
3. Model tweak (sketch after this slide):
• + FC (256 units) + BN + ReLU
– adding BN yields much better results
Result (F2-score): 92.53% (+0.53)
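A sketch of these enhancements, assuming torchvision's resnet18 and the learning-rate multipliers listed above (other hyperparameters are guesses):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
model.fc = nn.Sequential(                         # model tweak: FC-256 + BN + ReLU
    nn.Linear(model.fc.in_features, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 17),
)

base_lr = 0.1
optimizer = torch.optim.SGD([
    {'params': model.fc.parameters(), 'lr': base_lr},
    {'params': list(model.layer3.parameters()) + list(model.layer4.parameters()),
     'lr': base_lr * 0.1},
    {'params': list(model.layer1.parameters()) + list(model.layer2.parameters()),
     'lr': base_lr * 0.05},
], momentum=0.9)

# lr / 10 after 3 epochs without validation improvement; call scheduler.step(val_loss)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3)
```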
#13: Other zoo models
Models
– Resnet34 + FCBN -> 92.65%
– Densenet121 -> 92.76%
– Densenet169 yields the best result with standard augmentation
Result (F2-score): 92.79% (+0.26)
#14: Heng’s activity
– He organized kagglers to experiment and share results/insights
– Created Slack channel (later it was prohibited)
– Shared code (until some time )
– Posted all ideas during the competition
– A lot of top-finishers used his code as a baseline
– Finished at 19th
#15 Some results of this activity
Best result: 93.015
Diving deep
#16: Best single jpg model
Models
– Migrated from PIL to CV2 (easier to augment)
– Model: Resnet18 + FCBN
– Resolution: 256x256 (instead of 224x224)
– Train augmentation: random shift/zoom(+-10%)/rotate/flip/transpose
• Zooming’s crucial for fighting overfitting
• But cuts roads/cultivations/… off
– 6 TTA: 4 rotations + 2 flips, averaged
– Fixed avgpool in all zoo models: AvgPool(7) -> AvgPool(7, stride=1) + Avg (sketch below)
• otherwise avgpool uses only a 224x224 crop of the original image
Result (F2-score): 92.975% (+0.185)
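With 256x256 inputs the final resnet feature map is 8x8, so the stock AvgPool2d(7) covers just one 7x7 window, i.e. an effective 224x224 crop. A sketch using adaptive global pooling as a simpler stand-in for the AvgPool(7, stride=1) + Avg fix above:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
model.avgpool = nn.AdaptiveAvgPool2d(1)    # pools the full 8x8 map for 256x256 inputs
model.fc = nn.Linear(model.fc.in_features, 17)

x = torch.randn(2, 3, 256, 256)
print(model(x).shape)                      # torch.Size([2, 17])
```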
#17 Using tif data
Pros
– NIR channel - additional info
– TIF has 16 bits per channel (JPG: 8 bits)
– Domain specific features (various indexes)
Cons
– No pretrained models
– Assessors used only JPG
– Misalignments
#18 TIF from scratch
Aspects
– 4 channels: RGB + NIR
– Same Resnet18 + FCBN setup
– Training from scratch
Result (F2-score): 91.72% (-1.26)
#19 Various indexes
Indexes
– NDVI - Normalized difference vegetation index
• detects live green vegetation
– NDWI - Normalized Difference Water Index
• water ;)
– SAVI - Soil-Adjusted Vegetation Index
– EVI - Enhanced vegetation index
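A sketch of computing these indexes from the TIF channels; the band order, the SAVI soil factor L=0.5, and reflectance-scaled inputs are assumptions, the formulas are the standard definitions:

```python
import numpy as np

def spectral_indexes(tif):
    """tif: float array of shape (H, W, 4) with bands ordered R, G, B, NIR (assumed)."""
    r, g, b, nir = tif[..., 0], tif[..., 1], tif[..., 2], tif[..., 3]
    eps = 1e-6
    ndvi = (nir - r) / (nir + r + eps)                                 # live green vegetation
    ndwi = (g - nir) / (g + nir + eps)                                 # open water
    savi = (1 + 0.5) * (nir - r) / (nir + r + 0.5 + eps)               # soil-adjusted NDVI
    evi = 2.5 * (nir - r) / (nir + 6 * r - 7.5 * b + 1 + eps)          # enhanced vegetation index
    return ndvi, ndwi, savi, evi
```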
#19 Various indexes: examples
(Figure: RGB chip and the corresponding index maps)
#20 Mix model
Aspects
– Use all available data: RGB + NIR + 2 best Indexes
– Split 6 input channels into 3 + 3
– Model
• JPG-branch: best jpg model
• TIF-branch: Resnet18/WideResnet/ResNext from scratch
– Learning rates
• JPG: lr * 0.05
– This setup’s best: WideResnet
(Architecture diagram: RGB -> Resnet18 (JPG) -> FC-256; NIR + NDWI + SAVI -> some Resnet (TIF) -> FC-256; concatenated -> FC-17 -> prediction)
Result (F2-score): 93.00% (+0.025)
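A minimal sketch of the two-branch mix model in the diagram above (backbones, feature sizes and the fusion layer are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

class MixNet(nn.Module):
    def __init__(self, n_classes=17):
        super().__init__()
        self.jpg = models.resnet18(pretrained=True)    # RGB branch (JPG)
        self.tif = models.resnet18(pretrained=False)   # NIR + NDWI + SAVI branch (TIF)
        feat = self.jpg.fc.in_features
        self.jpg.fc = nn.Linear(feat, 256)             # FC-256 per branch
        self.tif.fc = nn.Linear(feat, 256)
        self.head = nn.Linear(512, n_classes)          # FC-17 on the concatenation

    def forward(self, rgb, tif3):                      # both inputs: (N, 3, H, W)
        return self.head(torch.cat([self.jpg(rgb), self.tif(tif3)], dim=1))
```

In training, the JPG branch would get a reduced learning rate (lr * 0.05 above) via a separate optimizer parameter group.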
#21 Enhancing mix model: NIR only
Insight
We can use pretrained ImageNet weights for the TIF branch, and it’s better than training from scratch!
Only NIR
– Used pretrained resnet18
– Cut the first conv layer: 3 x ... -> 1 x ... (sketch below)
(Architecture diagram: RGB -> Resnet18 (JPG) -> FC-256; NIR -> Resnet18 (NIR) -> FC-256; concatenated -> FC-17 -> prediction)
Result (F2-score): 93.01% (+0.01)
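A sketch of reusing ImageNet weights for the 1-channel NIR branch: the first conv is rebuilt for one input channel and seeded with the mean of the pretrained RGB filters (this particular re-initialisation is an assumption; the slide only says the layer was cut from 3 channels to 1):

```python
import torch
import torch.nn as nn
from torchvision import models

nir_net = models.resnet18(pretrained=True)
old = nir_net.conv1                                    # Conv2d(3, 64, 7x7, stride 2)
new = nn.Conv2d(1, old.out_channels, kernel_size=old.kernel_size,
                stride=old.stride, padding=old.padding, bias=False)
with torch.no_grad():
    new.weight.copy_(old.weight.mean(dim=1, keepdim=True))   # average the RGB filters
nir_net.conv1 = new                                    # network now takes (N, 1, H, W)
```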
#22 Enhancing mix model: best single model
Aspects
– Use all available data: RGB + NIR + Indexes
– Model
• JPG-branch: best jpg model
• TIF-branch: Resnet18 pretrained
– Learning rates
• JPG: lr*0.05
• TIF: FC * 1, {L4,L3} * 1, {L2, L1}*0.1
– Results
• public: 93.071, private: 92.905
• Most competitors had roughly the same: public: 93.143, private: 92.915 (overfit)
• 1st place: local ~93.3
(Architecture diagram: RGB -> Resnet18 (JPG) -> FC-256; NIR + NDWI + SAVI -> Resnet18 (TIF) -> FC-256; concatenated -> FC-17 -> prediction)
Result (F2-score): 93.071% (+0.061)
Ensembling
#23 It’s time to stack !
Guides
– Kaggle ensemble guide
– An Introduction to StackNet
#24 Ensembles from submission files
Correlated case
submissions
1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy
result
1111111100 = 80% accuracy
Less correlated
submissions
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy
result
1111111101 = 90% accuracy
Pick less correlated submission files and vote (sketch below). I just averaged the last 20 submissions.
Result (F2-score): 93.095% (+0.024)
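A sketch of the voting idea, assuming each submission file has already been expanded into a binary (n_images x 17) matrix with rows in the same order (file parsing is omitted):

```python
import numpy as np

def majority_vote(binary_submissions):
    """binary_submissions: list of 0/1 arrays, each of shape (n_images, 17)."""
    stacked = np.stack(binary_submissions)             # (n_subs, n_images, 17)
    return (stacked.mean(axis=0) >= 0.5).astype(int)   # keep a label if most subs agree
```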
#25 Stacking: out-of-fold predictions
Idea: train on k folds, predict each fold's validation part, concatenate (sketch below)
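A sketch of producing the out-of-fold matrix; `train_model` and `predict` are hypothetical placeholders for whatever level-1 model is being stacked:

```python
import numpy as np
from sklearn.model_selection import KFold

def oof_predictions(X, y, n_folds=5):
    """Each fold's model predicts only the samples it never saw during training."""
    oof = np.zeros((len(X), 17))
    for train_idx, valid_idx in KFold(n_folds, shuffle=True, random_state=0).split(X):
        model = train_model(X[train_idx], y[train_idx])   # hypothetical helper
        oof[valid_idx] = predict(model, X[valid_idx])     # hypothetical helper
    return oof
```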
#26 Stacking: idea
#27 Blending
Steps
1. Construct holdout set
2. For each «layer1» model:
1. Train the model on the train set
2. Predict the holdout
3. Train the blending model on the «layer1» holdout predictions
Pros (vs stacking)
– Simpler
– No information leak
– Teammates can throw any model into the blender; no shared seed/folds needed
Cons
– Less data
– May overfit to holdout
– Only 2 layers of models
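A sketch of the blending step on the holdout set, with one logistic regression per class on top of the level-1 model probabilities (the per-class arrangement and array shapes are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_blender(holdout_preds, holdout_labels):
    """holdout_preds: (n_models, n_images, 17) probabilities; holdout_labels: (n_images, 17)."""
    blenders = []
    for c in range(holdout_labels.shape[1]):
        X = holdout_preds[:, :, c].T          # features: one column per level-1 model
        blenders.append(LogisticRegression().fit(X, holdout_labels[:, c]))
    return blenders
```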
#28 Stack & blend submissions
Problems
– F2 thresholds! They differ between models and may overfit (I ran into this)
– Overfitted on holdout ;)
Results
– Blending works better for me (logistic regression on top)
– Best submission based on simple weighting by the F2 holdout score
• weight = ((score - min) / (max - min)) ** 0.5
– Models: 10 ensembles (59 models in total)
• jpg: densenet121, 5 folds
• jpg: densenet169, 5 folds
• jpg: resnet18, 8 folds
• mix: mixnet, 5,6,7 folds
• mix: wideresnet, 7 folds
• mix: wideresnet with selu, 5 folds
– Submission of 130 models scored worse
Final result (F2-score): 93.217% (+0.146), private: 93.015
#29 Last day: vain attempts (1/2)
Problem to solve
– There are a lot of wrong labels in the train data
– Label noise
Solution: purge
– Semi-automatic blacklist of ~1% of images & label fixes
Examples were images with seemingly random labels (clear, primary, …)
#29 Last day: vain attempts (2/2)
Results
– Purging significantly improved validation scores but left the HOLDOUT score untouched
– But it’s a trap: overfitting somehow
Submission fuss
– Was very confident in the improvement
– Waited till 2:30 to assemble all model results (3:00 deadline)
– Had an exact plan for 5 submissions, but results got worse and there was no time to think
Lessons
– Not much sense in purging if the test set is just as noisy
– Spare more than 2 days for stacking
– Plenty of room for stacking to improve results according to leaders’ posts
Final result (F2-score): 93.186% (-0.031)
#30 What didn’t work
List
– Hierarchical final layer (cloudy excludes all other labels)
• out_1 = sigmoid(cloud_activation)
• out_2 = out_1 * FC [only rounding works, floating point does not]
– Loss weighting using class distribution (to balance classes)
• oversampling also didn’t work
– YellowFin optimizer – gradient explosion
– Pretrained from Auto-Encoders, Split-Brain
#31 What didn’t work: AAE concept
Concept
– Adversarial Auto-Encoders: almost the same as a VAE, but with a discriminator instead of the KL distance
– Trained resnet18 for the decoder and a transposed resnet18 (transposed convs instead of strided convs) for the encoder
#32 What didn’t work: Split Brain
Concept
– Splits the input into 2 parts; 2 models predict each other's part
#33 Technical
– Ubuntu 16, CUDA 8, cuDNN 5, Anaconda, PyTorch
– TitanX
– Single model training time:
• mixnet: 2-3 hours
• jpg resnet18: 1-1.5 hours
– Ensembles: 12-24 hours
– Code: https://github.com/EdwardTyantov/pytorch-kaggle-amazon-space
#34 Shake-up: worst case
Messed up the sorting while merging submission files
– dropped from top-15 to the bottom
#34 Shake-up: leaders
#35 Other competitors: panoramas
Used CNN predictions averaged over 4 or 8 neighbouring chips as features for the central chip. Link
#36 Other competitors: dehazing
– Single Image Haze Removal Using Dark Channel Prior (paper)
#37 Other competitors: tricks
List
– Different input sizes: 64x64, 224x224, 256x256 (64x64 gives good performance on label «clear»)
– Hard example mining (1/3 with largest loss)
– Averaging TTA using XGBoost (learn mapping)
– XGBoost/Ridge on top of CNNs’ out-of-fold predictions
– cross-entropy + F2-loss (?!)