Pain Detection via Knowledge Distillation


Project information

Overview

Research project investigating how teacher network architecture affects knowledge distillation quality for automated pain detection from facial expressions. The project implements and extends the DeiT-PNP architecture from El Morabit & Rivenq (2022), adapts it to the SynPain synthetic dataset, and adds a comparative teacher study (ResNet-50 vs. Swin Transformer) that is not present in the original paper.


Research Question: How do different teacher network architectures (ResNet-50 and Swin Transformer) compare in terms of accuracy and knowledge transfer efficiency when distilling to DeiT for binary pain/no-pain classification from facial expressions?


Problem & Motivation

Automated pain detection from facial expressions has real clinical applications: assessing pain in patients who cannot self-report (post-surgical patients, neonates, or patients with cognitive impairments). Deep learning models trained for this task tend to be large and computationally expensive. Knowledge distillation addresses this by training a smaller, efficient student model to mimic a larger, more capable teacher, enabling deployment in resource-constrained clinical settings.


The architectural choice of teacher model is an open research question. CNN-based teachers (like ResNet-50) encode spatial features hierarchically through convolutions. Transformer-based teachers (like Swin Transformer) encode global attention patterns. Whether this architectural difference affects what the student learns, and how well it learns, was the core question explored.


Technical Implementation


Dataset: SynPain


The SynPain dataset contains AI-generated synthetic facial images in a side-by-side format: each image shows two faces, and the labels are determined by the filename:


  • Filename contains "NoPain" → both halves = NoPain
  • Filename contains "Pain" but not "NoPain" → left half = NoPain, right half = Pain


A custom "prepare_dataset.py" script splits each image and applies this labeling rule, producing a structured "Pain/" and "NoPain/" directory with a manifest CSV. The synthetic dataset was generated using Ideogram and Runway, covering diverse demographics (age, gender, ethnicity).
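
The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the contents of prepare_dataset.py: the function names `labels_for` and `half_boxes` are hypothetical, and checking for "NoPain" first is an assumption made because the substring "Pain" also occurs inside "NoPain".

```python
from typing import Tuple

# (left, upper, right, lower), the crop-box convention used by PIL's Image.crop
Box = Tuple[int, int, int, int]


def half_boxes(width: int, height: int) -> Tuple[Box, Box]:
    """Return crop boxes for the left and right halves of a side-by-side image."""
    mid = width // 2
    return (0, 0, mid, height), (mid, 0, width, height)


def labels_for(filename: str) -> Tuple[str, str]:
    """Return (left_label, right_label) for one side-by-side image.

    Assumption: "NoPain" is tested first, because a plain check for
    "Pain" would also match every "NoPain" filename.
    """
    if "NoPain" in filename:
        return ("NoPain", "NoPain")
    if "Pain" in filename:
        return ("NoPain", "Pain")
    raise ValueError(f"cannot infer labels from filename: {filename!r}")


if __name__ == "__main__":
    # Hypothetical filenames, used only to exercise the rule.
    for name in ("Pain_0001.png", "NoPain_0001.png"):
        print(name, labels_for(name), half_boxes(512, 256))
```

With this rule, an image whose filename contains only "Pain" contributes one NoPain sample and one Pain sample, while a "NoPain" image contributes two NoPain samples.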


Preprocessing: MTCNN Face Alignment

All faces were processed using MTCNN (Multi-Task Cascaded CNN) from facenet-pytorch:

  1. Face detection with 5 facial landmark points
  2. Alignment based on eye positions
  3. Cropping with margin
  4. Resize to 256×256
  5. Fallback to center-crop if MTCNN fails
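
The crop-or-fallback logic in steps 3 and 5 can be sketched without any detector dependency. This is a hedged illustration: `choose_crop`, `expand_box`, and `center_crop_box` are hypothetical names, the `detect` callable stands in for whatever MTCNN wrapper the project uses, and the 20-pixel margin is an assumed value. Landmark-based alignment (step 2) is not shown.

```python
from typing import Callable, Optional, Tuple

# (left, upper, right, lower) in pixel coordinates
Box = Tuple[int, int, int, int]


def center_crop_box(width: int, height: int) -> Box:
    """Largest centered square that fits inside the image."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Grow a face box by `margin` pixels per side, clamped to the image."""
    left, top, right, bottom = box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


def choose_crop(
    width: int,
    height: int,
    detect: Callable[[], Optional[Box]],
    margin: int = 20,  # assumed value, not taken from the project
) -> Box:
    """Use the detected face box if there is one, else fall back to a center crop."""
    box = detect()
    if box is None:
        return center_crop_box(width, height)
    return expand_box(box, margin, width, height)
```

The returned box would then be cropped and resized to 256×256 as in step 4.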

Training augmentations included random resized crops, horizontal flip, rotation (±10°), and color jitter, all normalized with ImageNet statistics.
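
For reference, normalizing with ImageNet statistics means scaling each channel to [0, 1] and then subtracting the per-channel mean and dividing by the per-channel standard deviation. The sketch below shows the arithmetic for a single RGB pixel; the constants are the commonly used ImageNet values, and `normalize_pixel` is a hypothetical helper (in practice this is usually done on whole tensors, for example with torchvision's `transforms.Normalize`).

```python
from typing import Tuple

# Commonly used ImageNet per-channel statistics (RGB order)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Map an 8-bit RGB pixel to ImageNet-normalized float values."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


if __name__ == "__main__":
    print(normalize_pixel((0, 0, 0)))        # darkest possible pixel
    print(normalize_pixel((255, 255, 255)))  # brightest possible pixel
```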


Model Architecture: DeiT with Knowledge Distillation


Student - DeiT-Base


  • "deit_base_distilled_patch16_224" from "timm", adapted for 256×256 input
  • Binary classification head (Pain / NoPain)
  • Distillation token alongside the class token - the architectural feature that enables knowledge distillation natively
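
One consequence of adapting a patch-16 model to 256×256 input is a longer token sequence, which is why the positional embeddings pretrained at 224×224 have to be resized (typically by interpolation; how this project did it is not stated here). The arithmetic can be checked with a small helper; `deit_token_count` is a hypothetical name.

```python
def deit_token_count(image_size: int, patch_size: int = 16, extra_tokens: int = 2) -> int:
    """Sequence length seen by a distilled DeiT encoder.

    extra_tokens=2 accounts for the class token and the distillation token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


if __name__ == "__main__":
    print(deit_token_count(224))  # 14 x 14 patches + 2 tokens
    print(deit_token_count(256))  # 16 x 16 patches + 2 tokens
```

At 224×224 the encoder sees 198 tokens; at 256×256 it sees 258.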


Teacher - Two Configurations Compared


  • ResNet-50: ImageNet-pretrained CNN, frozen during distillation. Provides hard labels (argmax) to supervise the student's distillation token
  • Swin Transformer: Transformer-based teacher, fine-tuned on SynPain before being used for distillation


Distillation Loss


Hard distillation combines two Binary Cross Entropy terms:


L_total = L_BCE(class_token, ground_truth) + L_BCE(distill_token, teacher_hard_label)


The class token is supervised by the ground truth label. The distillation token is supervised by the teacher's prediction (hard argmax), not a soft probability distribution.
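
The loss above can be written out in plain Python to make the two terms explicit. This is a sketch under stated assumptions, not the project's training code: each student token is assumed to emit a single logit where a positive value means Pain, the teacher is assumed to emit two logits ordered [NoPain, Pain], and all function names are hypothetical. In PyTorch the same per-term computation corresponds to `torch.nn.BCEWithLogitsLoss`.

```python
import math
from typing import Sequence


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit.

    Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def hard_label(teacher_logits: Sequence[float]) -> int:
    """Teacher's hard decision: index of the largest logit (argmax)."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def hard_distillation_loss(
    cls_logit: float,
    dist_logit: float,
    ground_truth: int,
    teacher_logits: Sequence[float],
) -> float:
    """L_total = BCE(class token, ground truth) + BCE(distill token, teacher argmax)."""
    teacher_target = float(hard_label(teacher_logits))
    loss_cls = bce_with_logits(cls_logit, float(ground_truth))
    loss_dist = bce_with_logits(dist_logit, teacher_target)
    return loss_cls + loss_dist


if __name__ == "__main__":
    # Student is undecided on both tokens; teacher says Pain, label is Pain.
    print(hard_distillation_loss(0.0, 0.0, 1, [0.2, 0.8]))
```

Because the teacher's output is reduced to an argmax, the distillation term ignores how confident the teacher was; only its decision reaches the student.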


Training Setup


| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 1e-5 |
| LR schedule | StepLR (×0.5 every 10 epochs) |
| Epochs | 30 |
| Batch size | 64 |
| Split | 70 / 15 / 15 (train / val / test) |
| Mixed precision | Optional (AMP) |
| Reproducibility | Seeded + deterministic mode |
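
To make the schedule in the table concrete, the learning rate at each epoch can be computed directly. This sketch assumes the scheduler is stepped once per epoch and that epochs are counted from 0; `step_lr` is a hypothetical helper that mirrors the closed form of a step schedule rather than calling PyTorch.

```python
def step_lr(epoch: int, base_lr: float = 1e-5, step_size: int = 10, gamma: float = 0.5) -> float:
    """Learning rate at a 0-indexed epoch: halved every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    for epoch in (0, 9, 10, 19, 20, 29):
        print(epoch, step_lr(epoch))
```

Over 30 epochs this gives three plateaus: 1e-5 for epochs 0 to 9, 5e-6 for epochs 10 to 19, and 2.5e-6 for epochs 20 to 29.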


Key Features


  • Unified training entrypoint (train_multi_teacher.py) with "--teacher resnet50" or "--teacher swin" for side-by-side comparison under identical conditions
  • Comprehensive evaluation: accuracy, precision, recall, F1-score, confusion matrix, ROC/AUC
  • Reproducible pipeline: seeded operations, deterministic mode, stratified splits
  • Full inference support: single image (auto-detects side-by-side format) and batch inference with confidence scores
  • Research posters included in the repository (Posters/ directory)
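
The threshold-based metrics listed above all derive from the four cells of the binary confusion matrix. The sketch below computes them from scratch for illustration, treating Pain as the positive class (label 1); `binary_metrics` is a hypothetical helper, and the project itself lists scikit-learn, which provides equivalent functions.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with label 1 (Pain) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only, not project results.
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a clinical setting recall on the Pain class is often the metric to watch, since a false negative corresponds to a missed pain episode.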


Tech Stack


  • Deep Learning: PyTorch, timm (DeiT, Swin Transformer), torchvision (ResNet-50)
  • Face Detection: facenet-pytorch (MTCNN)
  • Data & Evaluation: Scikit-learn, Pandas, NumPy
  • Visualization: Matplotlib, Seaborn
  • Tooling: Python 3.8+, CUDA, AMP (mixed precision)


Paper Reference


El Morabit, S., & Rivenq, A. (2022). Pain Detection From Facial Expressions Based on Transformers and Distillation. 2022 11th International Symposium on Signal, Image, Video and Communications (ISIVC), IEEE.
