Anxiety Detection in Classrooms — Alejandro Yáñez

Computer vision system based on YOLO11-pose to identify anxious states through student movement patterns in classrooms at Universidad del Bío-Bío.

Problem and Motivation

Despite the DDE’s extensive support catalog (clinical care, workshops, recreational activities), all of it operates reactively. Anxiety has observable physical manifestations: self-touching gestures (rubbing the neck, touching the face) increase in frequency and duration under internal tension. The project coins the term “anxious state” to delimit the scope: it does not diagnose clinical anxiety; it detects transient body patterns.

A systematic review in Web of Science, Scopus, and IEEE Xplore (1,428 initial papers, 5 relevant after filtering) found no prior work that addresses anxiety detection in classrooms through body postures, which supports the originality of the approach.

Dataset Construction

Capture

Five recording sessions in real UBB courses (~3 h of footage), with informed consent, face anonymization, and labeling in Roboflow.

Custom 16-keypoint Topology

Unlike the standard COCO topology (17 keypoints), the custom topology merges the 5 facial points into a single keypoint (head) and adds neck, central-torso, and torso-base points, for 16 keypoints in total. This prioritizes the relevant anatomical region (arms, hands, and head) over facial details.

Keypoint	Description
0	Head
1	Neck
2	Left shoulder
3	Left elbow
4	Left hand
5	Right shoulder
6	Right elbow
7	Right hand
8	Central torso
9	Torso base
10	Left hip
11	Left knee
12	Left foot
13	Right hip
14	Right knee
15	Right foot

Two classes handle occlusion: person-full-body (all 16 keypoints visible) and person-half-body (only keypoints 0–9, torso upward). Labeling criteria: tight bounding boxes; a keypoint is marked as occluded if its position can be inferred and omitted if it cannot.
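As a rough illustration, the topology and the half-body subset can be written down as plain Python constants. The snake_case names are assumptions chosen for this sketch. The left/right swap table corresponds to what Ultralytics pose dataset files call `flip_idx`, which is needed when horizontal flip augmentation is enabled; whether the project configured it exactly this way is not stated in the source.

```python
# Keypoint order as listed in the table above (names are illustrative).
KEYPOINTS = [
    "head", "neck",
    "left_shoulder", "left_elbow", "left_hand",
    "right_shoulder", "right_elbow", "right_hand",
    "central_torso", "torso_base",
    "left_hip", "left_knee", "left_foot",
    "right_hip", "right_knee", "right_foot",
]

# person-half-body keeps only keypoints 0-9 (torso upward).
HALF_BODY_INDICES = list(range(10))


def flip_indices(names):
    """Return the index permutation that swaps left/right keypoints."""
    def mirror(name):
        if name.startswith("left_"):
            return "right_" + name[len("left_"):]
        if name.startswith("right_"):
            return "left_" + name[len("right_"):]
        return name  # midline keypoints map to themselves

    return [names.index(mirror(name)) for name in names]


FLIP_IDX = flip_indices(KEYPOINTS)
print(FLIP_IDX)
# [0, 1, 5, 6, 7, 2, 3, 4, 8, 9, 13, 14, 15, 10, 11, 12]
```

Deriving the permutation from the names, rather than typing it by hand, keeps it consistent if the keypoint order ever changes.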

Dataset evolution:

Version	Images	Instances	Contribution
v1	51	1,344	Baseline
v2	51	2,838	+ augmentation (mosaic, mixup, erasing)
v3	70	4,344	+ unseen angles, variable lighting

Images were added incrementally (51 → 61 → 70), and collection stopped once the metrics showed convergence saturation. The high density (~36 students/image) compensates for the reduced volume: 4,300+ annotated skeletons.

Training: 4-Phase KDD Methodology

The project followed the KDD methodology (Fayyad et al., 1996), an iterative process in which findings from one phase feed back into the others. Each phase isolates a critical variable: architecture (F1), resolution (F2), data (F3), and hardware (F4).
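One way to keep the four phases comparable is to encode each setup as data. The sketch below is an assumption about how that could look, not the project's actual training script. The dataset YAML paths are hypothetical, the epoch count is stated in the source only for F1, and the dataset used in F2 is assumed to be v2. The checkpoint names follow the Ultralytics naming for YOLO11 pose weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    name: str
    weights: tuple     # Ultralytics pose checkpoints compared in this phase
    imgsz: int         # training resolution in pixels
    dataset: str       # hypothetical dataset YAML path
    epochs: int = 200  # stated for F1 only; assumed for the later phases


PHASES = [
    PhaseConfig("F1", ("yolo11n-pose.pt", "yolo11s-pose.pt",
                       "yolo11m-pose.pt", "yolo11l-pose.pt"), 640, "v1.yaml"),
    PhaseConfig("F2", ("yolo11s-pose.pt",), 960, "v2.yaml"),  # v2 is assumed
    PhaseConfig("F3", ("yolo11s-pose.pt",), 960, "v3.yaml"),
    PhaseConfig("F4", ("yolo11l-pose.pt",), 960, "v3.yaml"),
]


def train_phase(cfg):
    """Train every checkpoint of one phase (requires the ultralytics package)."""
    from ultralytics import YOLO  # lazy import: configs load without it

    for weights in cfg.weights:
        YOLO(weights).train(data=cfg.dataset, imgsz=cfg.imgsz, epochs=cfg.epochs)
```

Calling `train_phase(PHASES[2])`, for example, would launch the F3 run, provided the dataset file exists and a GPU is available.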

Phase	Setup	Main Finding
F1	Nano/Small/Medium/Large at 640 px, dataset v1, 200 epochs, Tesla T4	Box OK (~0.73) but Pose mAP50-95 ≤0.42. Spatial quantization error: at 640 px, distant students occupy very few pixels. The bottleneck is resolution, not network depth.
F2	YOLO11s at 960 px + aggressive augmentation (mosaic, mixup, erasing, rotation, scaling, flip)	Pose mAP50-95 increases from 0.419 to 0.478 (+14%). Small (9.9M) outperforms Large (26.1M) with 62% fewer parameters. Half-body: +20%. Spatial resolution is more critical than depth.
F3	YOLO11s at 960 px, dataset v3 (70 img, 4,344 instances)	Pose mAP50-95 reaches 0.521 (+9%). Half-body: 0.416 (+11%). Stable curves without divergence. Asymptotic convergence: the learning limit for this data volume was reached.
F4	YOLO11l at 960 px, dataset v3, Google Colab Pro + NVIDIA A100 (40 GB VRAM)	Comparative benchmark on the local GPU (RTX 3060). Calibration on the F1-score vs. confidence curve gives an optimal threshold of 0.393 (prioritizes sensitivity and delegates false-positive reduction to the temporal filter). Final model selection.
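The confidence threshold in the last row comes from an F1-score curve (the metric, not phase F1). The idea can be sketched in a few lines: sweep candidate thresholds, compute F1 over the detections that survive each one, and keep the threshold with the highest score. The data below is a toy example invented for illustration, and the function names are hypothetical; Ultralytics produces this curve itself during validation.

```python
def f1_at(threshold, scores, matched, n_ground_truth):
    """F1 score when only detections with score >= threshold are kept."""
    kept = [ok for score, ok in zip(scores, matched) if score >= threshold]
    tp = sum(kept)                 # kept detections that match a ground truth
    fp = len(kept) - tp            # kept detections that match nothing
    fn = n_ground_truth - tp       # ground truths left undetected
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(scores, matched, n_ground_truth, thresholds):
    """Threshold with the highest F1 (the first one wins on ties)."""
    return max(thresholds, key=lambda t: f1_at(t, scores, matched, n_ground_truth))


# Toy data: detection confidences and whether each matched a ground truth.
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
MATCHED = [True, True, False, True, True, False, False]
N_GROUND_TRUTH = 5
THRESHOLDS = [i / 10 for i in range(1, 10)]

best = best_threshold(SCORES, MATCHED, N_GROUND_TRUTH, THRESHOLDS)
print(best, f1_at(best, SCORES, MATCHED, N_GROUND_TRUTH))  # 0.4 0.8
```

A low optimum such as 0.393 keeps more borderline detections, which is consistent with the stated choice to favor sensitivity and let the temporal filter remove false positives.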

Inference System

Two-stage modular pipeline, decoupling GPU (extraction) from CPU (analysis):

Video → [Extraction: YOLO + ByteTrack + geometric logic] → CSV → [Analysis: temporal filter] → Report

Stage 1 — Tracking and contact detection

ByteTrack assigns persistent IDs to each student even under occlusions (600-frame buffer). For each person, the 16 keypoints are extracted and geometric logic is applied: the torso length (neck → torso-base distance) is computed as a scale reference and a dynamic threshold radius = torso × 0.5 is defined. If the distance from either hand to the head is less than that radius, touching_head = True is recorded. This mechanism automatically scales with student depth: a large torso in pixels (front row) generates a proportionally large radius; a small one (background) generates a small one, maintaining criterion equivalence across the entire scene and avoiding false positives/negatives due to perspective.
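The geometric rule described above fits in one small function. This is a minimal sketch using only the standard library; the function name, the toy skeleton helper, and the returned tuple layout are assumptions made for illustration. It also ignores keypoint confidence, which a production version would likely check first.

```python
import math

# Indices from the custom 16-keypoint topology.
HEAD, NECK, LEFT_HAND, RIGHT_HAND, TORSO_BASE = 0, 1, 4, 7, 9
RADIUS_FACTOR = 0.5  # radius = torso length x 0.5, as stated in the text


def touching_head(keypoints):
    """keypoints: 16 (x, y) pixel pairs in topology order.

    Returns (touching, torso_len, (left_dist, right_dist)).
    """
    torso_len = math.dist(keypoints[NECK], keypoints[TORSO_BASE])
    radius = RADIUS_FACTOR * torso_len
    left = math.dist(keypoints[LEFT_HAND], keypoints[HEAD])
    right = math.dist(keypoints[RIGHT_HAND], keypoints[HEAD])
    return min(left, right) < radius, torso_len, (left, right)


def toy_skeleton(torso_px, hand_offset_px):
    """Hypothetical skeleton: given torso length, right hand offset from head."""
    kps = [(0.0, 0.0)] * 16
    kps[HEAD] = (100.0, 100.0)
    kps[NECK] = (100.0, 110.0)
    kps[TORSO_BASE] = (100.0, 110.0 + torso_px)
    kps[LEFT_HAND] = (100.0 - 2 * torso_px, 100.0 + torso_px)  # far from head
    kps[RIGHT_HAND] = (100.0 + hand_offset_px, 100.0)
    return kps


# Same 30 px hand offset, different depth in the scene.
print(touching_head(toy_skeleton(90, 30))[0])  # True: radius is 45 px
print(touching_head(toy_skeleton(20, 30))[0])  # False: radius is 10 px
```

The two calls show the scaling effect: a 30 px gap counts as contact for a front-row student with a 90 px torso but not for a distant student with a 20 px torso.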

Results are stored in CSV (frame, timestamp, person_id, touching_head, torso_len, distances, 48 raw coordinates). Visual rendering is suppressed to prioritize processing speed.
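A possible layout for that CSV is sketched below with the standard `csv` module. The column names are assumptions, and so is the reading of "48 raw coordinates" as 16 keypoints times (x, y, confidence); the source does not spell out either.

```python
import csv
import io

N_KEYPOINTS = 16
BASE_COLUMNS = ["frame", "timestamp", "person_id", "touching_head",
                "torso_len", "dist_left_hand", "dist_right_hand"]
# Assumption: 48 raw coordinates = 16 keypoints x (x, y, confidence).
KEYPOINT_COLUMNS = [f"kp{i}_{field}" for i in range(N_KEYPOINTS)
                    for field in ("x", "y", "conf")]
HEADER = BASE_COLUMNS + KEYPOINT_COLUMNS


def make_row(frame, fps, person_id, touching, torso_len, dists, keypoints):
    """Build one CSV row for one tracked person in one frame."""
    row = {
        "frame": frame,
        "timestamp": round(frame / fps, 3),  # seconds from video start
        "person_id": person_id,
        "touching_head": touching,
        "torso_len": torso_len,
        "dist_left_hand": dists[0],
        "dist_right_hand": dists[1],
    }
    for i, (x, y, conf) in enumerate(keypoints):
        row[f"kp{i}_x"], row[f"kp{i}_y"], row[f"kp{i}_conf"] = x, y, conf
    return row


def write_rows(stream, rows):
    """Write the header and all rows to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=HEADER)
    writer.writeheader()
    writer.writerows(rows)


# Usage with an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_keypoints = [(float(i), float(i) + 0.5, 0.9) for i in range(N_KEYPOINTS)]
write_rows(buffer, [make_row(30, 30.0, 7, True, 42.0, (12.5, 80.0), toy_keypoints)])
```

Keeping the raw coordinates in the file means the contact rule can be re-evaluated later with a different radius factor without rerunning the GPU stage.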

Stage 2 — Temporal filtering and report

After extraction completes, the analysis module segments the data by person and detects blocks of consecutive frames with contact. Only events lasting ≥ 2.0 seconds are counted, a threshold based on Shreve et al. (1988): self-pacifying gestures are sustained, not instantaneous touches. The output is a tabular report with student ID, number of anxious events, and cumulative total duration.
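The filtering step can be sketched as run-length detection per person. The version below uses only the standard library for self-containment and takes rows as plain tuples, both of which are assumptions of this sketch. It measures an event as the number of consecutive frames divided by the frame rate.

```python
from collections import defaultdict

MIN_EVENT_SECONDS = 2.0  # threshold stated in the text


def anxious_events(rows, fps, min_seconds=MIN_EVENT_SECONDS):
    """rows: iterable of (frame, person_id, touching_head) tuples.

    Returns {person_id: [duration_s, ...]} for events of at least min_seconds.
    """
    contact_frames = defaultdict(list)
    for frame, person_id, touching in rows:
        if touching:
            contact_frames[person_id].append(frame)

    events = {}
    for person_id, frames in contact_frames.items():
        runs, start, previous = [], None, None
        for frame in sorted(frames):
            if previous is None or frame != previous + 1:
                if start is not None:
                    runs.append(previous - start + 1)  # close the previous run
                start = frame
            previous = frame
        if start is not None:
            runs.append(previous - start + 1)  # close the final run
        durations = [n / fps for n in runs if n / fps >= min_seconds]
        if durations:
            events[person_id] = durations
    return events


def report(events):
    """Rows of (person_id, number of events, cumulative duration in seconds)."""
    return [(pid, len(d), sum(d)) for pid, d in sorted(events.items())]


# Toy input at 10 fps: person 1 touches for 3.0 s then 1.0 s,
# person 2 for 1.5 s, person 3 for 2.0 s twice.
rows = [(f, 1, f < 30 or f >= 35) for f in range(45)]
rows += [(f, 2, True) for f in range(15)]
rows += [(f, 3, f < 20 or f >= 25) for f in range(45)]
print(report(anxious_events(rows, fps=10)))  # [(1, 1, 3.0), (3, 2, 4.0)]
```

In this simple form a single missed detection splits one long gesture into two shorter runs, so a real implementation might tolerate short gaps before applying the 2-second rule.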

Inference hardware: NVIDIA RTX 3060 GPU (8 GB VRAM), AMD Ryzen 5 9600X CPU, 32 GB RAM. Local consumer hardware, no cloud dependency.

Results

Phase	Model	Resolution (px)	Pose mAP50-95 (global)	Full-body	Half-body
F1	YOLO11l	640	0.419	0.526	0.312
F2	YOLO11s	960	0.478	0.581	0.375
F3	YOLO11s	960	0.521	0.626	0.416
F4	YOLO11l	960	0.579	0.704	0.454

Global progression: +38.19%.
Full-body: +33.84%.
Half-body: +45.51%.
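These percentages are relative gains from the first phase to the fourth. They can be reproduced directly from the first and last rows of the results table, as in this short check (the dictionary layout and function name are just for illustration):

```python
# (phase 1 value, phase 4 value) for Pose mAP50-95, from the results table.
RESULTS = {
    "global": (0.419, 0.579),
    "full-body": (0.526, 0.704),
    "half-body": (0.312, 0.454),
}


def relative_gain_pct(first, last):
    """Relative improvement of `last` over `first`, in percent."""
    return (last - first) / first * 100.0


GAINS = {name: round(relative_gain_pct(first, last), 2)
         for name, (first, last) in RESULTS.items()}
print(GAINS)
# {'global': 38.19, 'full-body': 33.84, 'half-body': 45.51}
```

The half-body class gains the most in relative terms but starts from the lowest baseline, so its absolute score remains the weakest of the three.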

The torso length histogram (~15 to ~95 px) validated the dynamic threshold’s adaptability. The 2 s filter successfully eliminated postural noise (<1 s), retaining only events compatible with an anxious state. The model processes video at functional speed on the RTX 3060.

Libraries

Library	Version
Ultralytics	8.3.223
torch	2.9.0
OpenCV	4.12.0
pandas	2.3.3
NumPy	2.2.6
Roboflow	1.2.11
Matplotlib	3.10.7

Conclusions

Feasibility: the project demonstrated that anxious states can be detected through human pose estimation (HPE) in real classrooms, covering the complete cycle: ethical data → iterative training → inference on consumer hardware.
Class design: the full-body/half-body split was a well-founded decision that enabled occlusion handling and was a key driver of metric progress.
Dynamic threshold: the radius proportional to torso length correctly normalizes detection across different scene depths.
Temporal filter: requiring 2 seconds of continuous contact links the geometric criterion to its psychological interpretation and eliminates postural noise.
Limitations: although the dataset has over 4,300 labeled instances, they come from few images (70), captured in only two classrooms and from only two camera angles. This low visual variability limits the model’s ability to generalize pose estimation to new scenarios, furniture layouts, or unseen perspectives. Further limitations are the monocular (single-camera) capture and the absence of concurrent clinical validation.
Future work: streaming deployment with real-time alerts, multi-camera 3D estimation, incorporation of multimodal signals, and integration with DDE systems for consultation by mental health professionals.

References

Shreve, E. G. et al. (1988). Nonverbal expressions of anxiety in physician-patient interactions. Psychiatry, 51(4).
Cao, Z. et al. (2017). Realtime multi-person 2D pose estimation using part affinity fields. CVPR.
Fayyad, U. et al. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3).
Ultralytics. (2025). YOLO11 Documentation. https://docs.ultralytics.com