
Anxiety Detection in Classrooms


Computer vision system based on YOLO11-pose to identify anxious states through student movement patterns in classrooms at Universidad del Bío-Bío.

Python, YOLO11, PyTorch, OpenCV, Roboflow

Problem and Motivation

Despite the DDE’s extensive support catalog (clinical care, workshops, recreational activities), all of these services operate reactively. Yet anxiety has observable physical manifestations: self-touching gestures (rubbing the neck, touching the face) increase in frequency and duration under internal tension. The project coins the term “anxious state” to delimit its scope: it does not diagnose clinical anxiety; it detects transient body patterns.

A systematic review in Web of Science, Scopus, and IEEE Xplore (1,428 initial papers, 5 relevant after filtering) found no prior work that addresses anxiety detection in classrooms through body posture, which supports the originality of the approach.


Dataset Construction

Capture

Five recording sessions in real UBB courses (~3 h of footage), with informed consent, face anonymization, and labeling in Roboflow.

Custom 16-keypoint Topology

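The standard COCO topology has 17 keypoints, 5 of them facial. Here the 5 facial points are merged into a single head keypoint, and neck, central torso, and torso base points are added (see the table below), for 16 keypoints in total. The design prioritizes the anatomical regions relevant to the task (arms, hands, and head) over facial detail.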

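| Keypoint | Description |
|---|---|
| 0 | Head |
| 1 | Neck |
| 2 | Left shoulder |
| 3 | Left elbow |
| 4 | Left hand |
| 5 | Right shoulder |
| 6 | Right elbow |
| 7 | Right hand |
| 8 | Central torso |
| 9 | Torso base |
| 10 | Left hip |
| 11 | Left knee |
| 12 | Left foot |
| 13 | Right hip |
| 14 | Right knee |
| 15 | Right foot |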

Two classes handle occlusion: person-full-body (all 16 keypoints visible) and person-half-body (only keypoints 0–9, from the torso upward). Labeling criteria: a tight bounding box; a keypoint is marked as occluded if its position can be inferred, and omitted if it cannot.
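
The topology above can be written down as a small Python structure. This is a minimal sketch, not the project's code: the snake_case names and the helper `mirror_name` are illustrative. It also derives a left/right swap list of the kind Ultralytics pose datasets declare as `flip_idx` when horizontal flip augmentation is enabled. Whether the project configured it this way is an assumption.

```python
# Keypoint names in index order, following the table above (names are illustrative).
KEYPOINTS = [
    "head", "neck",
    "left_shoulder", "left_elbow", "left_hand",
    "right_shoulder", "right_elbow", "right_hand",
    "central_torso", "torso_base",
    "left_hip", "left_knee", "left_foot",
    "right_hip", "right_knee", "right_foot",
]

# person-half-body instances only carry keypoints 0-9 (torso upward).
HALF_BODY_INDICES = list(range(10))


def mirror_name(name):
    """Return the name of the keypoint on the opposite side of the body."""
    if name.startswith("left_"):
        return "right_" + name[len("left_"):]
    if name.startswith("right_"):
        return "left_" + name[len("right_"):]
    return name  # midline keypoints map to themselves


# Index permutation that swaps left and right keypoints under a horizontal flip.
FLIP_IDX = [KEYPOINTS.index(mirror_name(name)) for name in KEYPOINTS]

if __name__ == "__main__":
    print(FLIP_IDX)
```

Deriving the permutation from the names, rather than typing it by hand, keeps it consistent if the keypoint order ever changes.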

Dataset evolution:

| Version | Images | Instances | Contribution |
|---|---|---|---|
| v1 | 51 | 1,344 | Baseline |
| v2 | 51 | 2,838 | + augmentation (mosaic, mixup, erasing) |
| v3 | 70 | 4,344 | + unseen angles, variable lighting |

The dataset was grown incrementally (51 → 61 → 70 images) while monitoring the metrics, and growth stopped once convergence showed signs of saturation. The high density (~36 students/image) compensates for the reduced volume: 4,300+ annotated skeletons.


Training: 4-Phase KDD Methodology

Training followed the KDD methodology (Fayyad et al., 1996), an iterative process that feeds findings from one phase back into the others. Each phase isolates a critical variable: architecture (F1), resolution (F2), data (F3), and hardware (F4).

| Phase | Setup | Main Finding |
|---|---|---|
| F1 | Nano/Small/Medium/Large at 640 px, dataset v1, 200 epochs, Tesla T4 | Box OK (~0.73) but Pose mAP50-95 ≤ 0.42. Spatial quantization error: at 640 px, distant students occupy very few pixels. The bottleneck is resolution, not network depth. |
| F2 | YOLO11s at 960 px + aggressive augmentation (mosaic, mixup, erasing, rotation, scaling, flip) | Pose mAP50-95 increases from 0.419 to 0.478 (+14%). Small (9.9M) outperforms Large (26.1M) with 62% fewer parameters. Half-body: +20%. Spatial resolution is more critical than depth. |
| F3 | YOLO11s at 960 px, dataset v3 (70 img, 4,344 instances) | Pose mAP50-95 reaches 0.521 (+9%). Half-body: 0.416 (+11%). Stable curves without divergence. Asymptotic convergence: the learning limit for this data volume was reached. |
| F4 | YOLO11l at 960 px, dataset v3, Google Colab Pro + NVIDIA A100 (40 GB VRAM) | Comparative benchmark on local GPU (RTX 3060). F1-Confidence calibration → optimal threshold = 0.393 (prioritizes sensitivity, delegates FP reduction to temporal filter). Final model selection. |
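
The F1-Confidence calibration in F4 can be read as picking the confidence threshold at which F1 peaks. The sketch below assumes the curve is available as (threshold, precision, recall) triples. The function name and the toy values are illustrative and are not the project's measurements.

```python
def best_f1_threshold(curve):
    """Return (threshold, f1) for the point with the highest F1 score.

    curve: iterable of (confidence_threshold, precision, recall) tuples.
    """
    best_conf, best_f1 = None, -1.0
    for conf, precision, recall in curve:
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall (0 when both are 0).
        f1 = 0.0 if denom == 0 else 2 * precision * recall / denom
        if f1 > best_f1:
            best_conf, best_f1 = conf, f1
    return best_conf, best_f1


if __name__ == "__main__":
    # Made-up curve: precision rises and recall falls as the threshold grows.
    toy_curve = [
        (0.1, 0.55, 0.95),
        (0.3, 0.70, 0.88),
        (0.5, 0.82, 0.70),
        (0.7, 0.90, 0.50),
    ]
    print(best_f1_threshold(toy_curve))
```

A relatively low threshold such as the reported 0.393 keeps recall high. That fits the stated design, in which false positives are removed later by the temporal filter.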

Inference System

Two-stage modular pipeline, decoupling GPU (extraction) from CPU (analysis):

Video → [Extraction: YOLO + ByteTrack + geometric logic] → CSV → [Analysis: temporal filter] → Report

Stage 1 — Tracking and contact detection

ByteTrack assigns a persistent ID to each student, even under occlusion (600-frame buffer). For each person, the 16 keypoints are extracted and a geometric rule is applied: the torso length (distance from neck to torso base) serves as a scale reference, and a dynamic threshold radius = torso × 0.5 is derived from it. If the distance from either hand to the head is less than that radius, touching_head = True is recorded. The rule scales automatically with student depth: a torso that is large in pixels (front row) yields a proportionally large radius, while a small torso (back rows) yields a small one. The criterion therefore stays equivalent across the entire scene and avoids false positives and false negatives caused by perspective.
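
The geometric rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's implementation: keypoints are assumed to arrive as 16 (x, y) pixel pairs in the index order of the topology table, the function and constant names are hypothetical, and missing or low-confidence keypoints are not handled.

```python
import math

# Keypoint indices from the 16-point topology table.
HEAD, NECK, LEFT_HAND, RIGHT_HAND, TORSO_BASE = 0, 1, 4, 7, 9
RADIUS_FACTOR = 0.5  # radius = torso length x 0.5, as described above


def touching_head(kpts, radius_factor=RADIUS_FACTOR):
    """Apply the torso-scaled contact rule to one person.

    kpts: sequence of 16 (x, y) pixel coordinates.
    Returns (touching, torso_len, (dist_left_hand, dist_right_hand)).
    """
    torso_len = math.dist(kpts[NECK], kpts[TORSO_BASE])
    radius = torso_len * radius_factor
    d_left = math.dist(kpts[LEFT_HAND], kpts[HEAD])
    d_right = math.dist(kpts[RIGHT_HAND], kpts[HEAD])
    # Contact if either hand lies inside the radius around the head.
    return min(d_left, d_right) < radius, torso_len, (d_left, d_right)


def make_person(scale, hand_offset):
    """Build a synthetic skeleton whose size in pixels is set by `scale`."""
    kpts = [(0.0, 0.0)] * 16
    kpts[NECK] = (0.0, 20.0 * scale)
    kpts[TORSO_BASE] = (0.0, 120.0 * scale)        # torso length = 100 * scale
    kpts[LEFT_HAND] = (hand_offset * scale, 0.0)   # left hand beside the head
    kpts[RIGHT_HAND] = (0.0, 150.0 * scale)        # right hand far below
    return kpts


if __name__ == "__main__":
    for scale in (0.9, 0.2):  # front-row and back-row student, same pose
        print(scale, touching_head(make_person(scale, hand_offset=30.0))[0])
```

Because both the hand-to-head distance and the radius shrink with the same scale factor, the same pose gives the same decision for a front-row and a back-row student.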

Results are stored in CSV (frame, timestamp, person_id, touching_head, torso_len, distances, 48 raw coordinates). Visual rendering is suppressed to prioritize processing speed.
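
One possible layout for that CSV is sketched below with the standard `csv` module. The exact column names are hypothetical, and reading the 48 raw coordinates as 16 keypoints × (x, y, confidence) is an assumption rather than something the project states.

```python
import csv
import io

NUM_KEYPOINTS = 16

# Per-frame, per-person fields listed above; the two distance names are illustrative.
BASE_COLUMNS = [
    "frame", "timestamp", "person_id", "touching_head",
    "torso_len", "dist_left_hand", "dist_right_hand",
]


def build_header(num_keypoints=NUM_KEYPOINTS):
    """Base columns followed by x, y, confidence for every keypoint."""
    header = list(BASE_COLUMNS)
    for i in range(num_keypoints):
        header += [f"kp{i}_x", f"kp{i}_y", f"kp{i}_conf"]
    return header


def write_rows(rows, stream):
    """Write dict rows to an open text stream; missing fields are left empty."""
    writer = csv.DictWriter(stream, fieldnames=build_header())
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rows([{"frame": 0, "timestamp": 0.0, "person_id": 3,
                 "touching_head": True}], buffer)
    print(buffer.getvalue().splitlines()[0])
```

Writing one flat row per person per frame keeps the GPU stage free of analysis logic, so the CPU stage can be rerun with different filter settings without reprocessing the video.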

Stage 2 — Temporal filtering and report

Once extraction completes, the analysis module segments the data by person and detects blocks of consecutive frames with contact. Only events lasting ≥ 2.0 seconds are kept, a threshold based on Shreve et al. (1988): self-soothing gestures are sustained, not instantaneous touches. The output is a tabular report with the student ID, the number of anxious events, and their cumulative duration.
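
The temporal filter can be sketched as a run-length pass over each person's contact frames. This is a minimal sketch under assumptions: rows are (frame, person_id, touching_head) tuples, the video has a constant frame rate `fps`, a run breaks as soon as one frame is missing, and the function names are hypothetical.

```python
from collections import defaultdict

MIN_EVENT_SECONDS = 2.0  # minimum sustained contact, as described above


def anxious_events(rows, fps, min_seconds=MIN_EVENT_SECONDS):
    """Return {person_id: [event durations in seconds]} for sustained contact.

    rows: iterable of (frame, person_id, touching_head) tuples.
    """
    frames_by_person = defaultdict(list)
    for frame, person_id, touching in rows:
        if touching:
            frames_by_person[person_id].append(frame)

    events = defaultdict(list)
    for person_id, frames in frames_by_person.items():
        frames.sort()
        run_start = prev = frames[0]
        # The trailing None is a sentinel that closes the last run.
        for frame in frames[1:] + [None]:
            if frame is not None and frame == prev + 1:
                prev = frame  # still inside the same block of consecutive frames
                continue
            duration = (prev - run_start + 1) / fps
            if duration >= min_seconds:
                events[person_id].append(duration)
            if frame is not None:
                run_start = prev = frame
    return dict(events)


def report(events):
    """Tabular summary: (person_id, number of events, cumulative duration)."""
    return [(pid, len(d), sum(d)) for pid, d in sorted(events.items())]


if __name__ == "__main__":
    rows = [(f, 1, True) for f in range(75)]          # 2.5 s at 30 fps: kept
    rows += [(f, 1, True) for f in range(100, 120)]   # ~0.67 s: discarded
    rows += [(f, 2, True) for f in range(30)]         # 1.0 s: discarded
    print(report(anxious_events(rows, fps=30)))
```

A production version would probably tolerate short gaps caused by missed detections. That tolerance is deliberately left out here to keep the rule easy to follow.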

Inference hardware: NVIDIA RTX 3060 GPU (8 GB VRAM), AMD Ryzen 5 9600X CPU, 32 GB RAM. Local consumer hardware, no cloud dependency.


Results

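| Phase | Model | Res. (px) | Pose mAP50-95 (global) | Full-body | Half-body |
|---|---|---|---|---|---|
| 1 | YOLO11l | 640 | 0.419 | 0.526 | 0.312 |
| 2 | YOLO11s | 960 | 0.478 | 0.581 | 0.375 |
| 3 | YOLO11s | 960 | 0.521 | 0.626 | 0.416 |
| 4 | YOLO11l | 960 | 0.579 | 0.704 | 0.454 |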
  • Global progression (phase 1 → phase 4): +38.19%.
  • Full-body (phase 1 → phase 4): +33.84%.
  • Half-body (phase 1 → phase 4): +45.51%.
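
These percentages are relative gains between the phase 1 and phase 4 rows of the table. A short check, with hypothetical variable names, reproduces them from the tabulated values:

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Pose mAP50-95 values taken from the results table.
phase1 = {"global": 0.419, "full_body": 0.526, "half_body": 0.312}
phase4 = {"global": 0.579, "full_body": 0.704, "half_body": 0.454}

gains = {key: round(relative_gain(phase1[key], phase4[key]), 2) for key in phase1}

if __name__ == "__main__":
    print(gains)  # {'global': 38.19, 'full_body': 33.84, 'half_body': 45.51}
```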

The torso length histogram (~15 to ~95 px) confirmed that the dynamic threshold adapts across scene depths. The 2 s filter eliminated postural noise (<1 s), retaining only events compatible with an anxious state. The model processes video at a functional speed on the RTX 3060.


Libraries

| Library | Version |
|---|---|
| Ultralytics | 8.3.223 |
| torch | 2.9.0 |
| OpenCV | 4.12.0 |
| pandas | 2.3.3 |
| NumPy | 2.2.6 |
| Roboflow | 1.2.11 |
| Matplotlib | 3.10.7 |

Conclusions

  • Feasibility: the project demonstrated that detecting anxious states through human pose estimation (HPE) in real classrooms is feasible, covering the complete cycle: ethical data → iterative training → inference on consumer hardware.
  • Class design: the full-body/half-body split was a well-founded decision that enabled occlusion handling and drove the progress in metrics.
  • Dynamic threshold: the radius proportional to torso length correctly normalizes detection across different scene depths.
  • Temporal filter: requiring 2 seconds of continuous contact links the geometry to its psychological interpretation and eliminates postural noise.
  • Limitations: despite having over 4,300 labeled instances, these came from few images (70), captured in only two classrooms and from only two camera angles. This low visual variability limits the model’s ability to generalize pose estimation to new scenarios, furniture layouts, or unseen perspectives. Further limitations are the monocular (single-camera) capture and the absence of concurrent clinical validation.
  • Future work: streaming deployment with real-time alerts, multi-camera 3D estimation, incorporation of multimodal signals, and integration with DDE systems for consultation by mental health professionals.

References

  • Shreve, E. G. et al. (1988). Nonverbal expressions of anxiety in physician-patient interactions. Psychiatry, 51(4).
  • Cao, Z. et al. (2017). Realtime multi-person 2D pose estimation using part affinity fields. CVPR.
  • Fayyad, U. et al. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3).
  • Ultralytics. (2025). YOLO11 Documentation. https://docs.ultralytics.com