Research project investigating how teacher network architecture affects knowledge distillation quality for automated pain detection from facial expressions. Implemented and extended the DeiT-PNP architecture from El Morabit & Rivenq (2022), adapting it to the SynPain synthetic dataset and introducing a comparative teacher study — ResNet-50 vs. Swin Transformer — not present in the original paper.
Research Question: How do different teacher network architectures (ResNet-50 and Swin Transformer) compare in terms of accuracy and knowledge transfer efficiency when distilling to DeiT for binary pain/no-pain classification from facial expressions?
Automated pain detection from facial expressions has real clinical applications, such as assessing pain in patients who cannot self-report (post-surgical patients, neonates, or patients with cognitive impairments). Deep learning models trained for this task tend to be large and computationally expensive. Knowledge distillation addresses this by training a smaller, efficient student model to mimic a larger, more capable teacher, enabling deployment in resource-constrained clinical settings.
The architectural choice of teacher model is an open research question. CNN-based teachers (like ResNet-50) encode spatial features hierarchically through convolutions. Transformer-based teachers (like Swin Transformer) encode global attention patterns. Whether this architectural difference affects what the student learns, and how well it learns it, was the core question explored.
The SynPain dataset contains AI-generated synthetic facial images in a side-by-side format. Each image shows two faces, with the label of each face determined by the filename:
A custom "prepare_dataset.py" script splits each image and applies this labeling rule, producing a structured "Pain/" and "NoPain/" directory with a manifest CSV. The synthetic dataset was generated using Ideogram and Runway, covering diverse demographics (age, gender, ethnicity).
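The splitting step can be sketched roughly as follows. This is not the actual `prepare_dataset.py`: the function names, the midpoint split, the output file naming, and the manifest columns are all assumptions made for illustration, and the filename-based labeling rule is deliberately left to the caller, who passes the two labels in explicitly.

```python
import csv
from pathlib import Path

from PIL import Image


def split_side_by_side(image_path, out_root, left_label, right_label):
    """Cut a two-face image at its horizontal midpoint and save each half.

    Assumption: both faces occupy equal halves of the image. The labels
    (e.g. "Pain" / "NoPain") are supplied by the caller.
    """
    image_path = Path(image_path)
    img = Image.open(image_path).convert("RGB")
    width, height = img.size
    mid = width // 2

    halves = [
        ("left", img.crop((0, 0, mid, height)), left_label),
        ("right", img.crop((mid, 0, width, height)), right_label),
    ]

    rows = []
    for side, face, label in halves:
        out_dir = Path(out_root) / label
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / f"{image_path.stem}_{side}.png"
        face.save(out_path)
        rows.append({"source": str(image_path), "side": side,
                     "label": label, "path": str(out_path)})
    return rows


def write_manifest(rows, manifest_path):
    """Write one CSV row per extracted face (hypothetical column layout)."""
    with open(manifest_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["source", "side", "label", "path"])
        writer.writeheader()
        writer.writerows(rows)
```

Collecting the rows returned for every source image and passing them to `write_manifest` once would produce a single manifest for the whole dataset.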
All faces were processed using MTCNN (Multi-Task Cascaded CNN) from facenet-pytorch:
Training augmentations included random resized crops, horizontal flip, rotation (±10°), and color jitter, all normalized with ImageNet statistics.
Hard distillation combines two Binary Cross Entropy terms:
L_total = L_BCE(class_token, ground_truth) + L_BCE(distill_token, teacher_hard_label)
The class token is supervised by the ground truth label. The distillation token is supervised by the teacher's prediction (hard argmax), not a soft probability distribution.
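A minimal PyTorch sketch of this loss is shown below. It assumes that each student head emits a single logit per image and that the teacher emits two-class logits; the function and argument names are illustrative and are not taken from the project code.

```python
import torch
import torch.nn.functional as F


def hard_distillation_loss(cls_logit, dist_logit, teacher_logits, target):
    """Sum of two BCE terms, one per student token.

    cls_logit, dist_logit: (B,) raw logits from the student's class and
        distillation tokens (assumed single-logit heads).
    teacher_logits: (B, 2) teacher logits (assumed two-class output).
    target: (B,) ground-truth labels, 0 = no pain, 1 = pain.
    """
    # Hard label: the teacher's argmax, not its softened probabilities.
    teacher_hard = teacher_logits.argmax(dim=1).float()

    loss_cls = F.binary_cross_entropy_with_logits(cls_logit, target.float())
    loss_dist = F.binary_cross_entropy_with_logits(dist_logit, teacher_hard)
    return loss_cls + loss_dist
```

Because `argmax` is not differentiable, no gradient flows back into the teacher through the hard labels, so only the student is updated by this loss.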
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 1e-5 |
| LR schedule | StepLR (×0.5 every 10 epochs) |
| Epochs | 30 |
| Batch size | 64 |
| Split | 70 / 15 / 15 (train / val / test) |
| Mixed precision | Optional (AMP) |
| Reproducibility | Seeded + deterministic mode |
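The optimizer, schedule, and seeding settings could be wired up in PyTorch roughly as follows. The seed value, the helper names, and the tiny stand-in model are assumptions for illustration; the real training loop would run forward and backward passes where the comment indicates.

```python
import random

import torch


def seed_everything(seed=42):
    """Seed Python and PyTorch RNGs and request deterministic cuDNN kernels.

    The seed value 42 is an arbitrary placeholder.
    """
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def lr_per_epoch(epochs=30):
    """Return the learning rate in effect at the start of each epoch."""
    seed_everything()
    model = torch.nn.Linear(8, 1)  # stand-in for the DeiT student
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

    lrs = []
    for _ in range(epochs):
        lrs.append(optimizer.param_groups[0]["lr"])
        # ... one epoch of training would go here ...
        optimizer.step()
        scheduler.step()
    return lrs
```

With these settings the learning rate is 1e-5 for epochs 0 to 9, 5e-6 for epochs 10 to 19, and 2.5e-6 for epochs 20 to 29.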
El Morabit, S., & Rivenq, A. (2022). Pain Detection From Facial Expressions Based on Transformers and Distillation. 2022 11th International Symposium on Signal, Image, Video and Communications (ISIVC), IEEE.