Introduction
Recent advances in large language models (LLMs) have enabled substantial progress in detecting and classifying psychopathology from linguistic data. However, their clinical applicability remains uncertain for two reasons: (1) their ability to generalize to samples whose characteristics differ from the training data, or to non-standard linguistic inputs, has not been sufficiently investigated, and (2) performance variation across computational environments, such as GPU types, raises concerns about model consistency.
Purpose
This study examined the consistency and generalizability of language-based psychopathology classification models. Specifically, it investigated (1) whether GPU type influences the performance of BERTurk-based models trained for psychopathology detection, and (2) how well these models generalize when tested on data differing from their training sample.
Method
BERTurk-based models were trained on the dataset from Eyrikaya and Dağ (2025) with identical hyperparameters on each of four GPUs (V100, A100, L4, and T4) to isolate hardware-related effects. As in the previous study, the data were filtered by excluding empty or meaningless responses and by applying age (18–43 years) and word-count criteria, which reduced the sample from 2551 to 1901 participants. After training, all models were evaluated on the four external validation sets used in the prior study and on one newly constructed test set. This additional dataset (Outlier_Test_Set, n=450), drawn from the previously excluded participants (n=650), was used to assess out-of-sample generalizability, that is, model robustness beyond the training distribution.
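The inclusion step described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the `Response` record, the `is_meaningful` flag, the function name `split_by_inclusion`, and the word-count thresholds are all hypothetical, since the abstract does not report how meaningless responses were identified or which word-count limits were applied. Only the age range (18 to 43) comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Response:
    participant_id: int
    age: int
    text: str
    # Hypothetical flag: the abstract does not say how "meaningless"
    # responses were identified, so it is supplied by the caller here.
    is_meaningful: bool = True


def split_by_inclusion(
    responses: List[Response],
    min_words: int,                    # assumed threshold, not given in the abstract
    max_words: Optional[int] = None,   # assumed optional upper bound
    min_age: int = 18,                 # age range stated in the abstract
    max_age: int = 43,
) -> Tuple[List[Response], List[Response]]:
    """Split responses into (retained, excluded) by the inclusion criteria."""
    retained, excluded = [], []
    for r in responses:
        n_words = len(r.text.split())  # an empty response has 0 words
        meets_criteria = (
            r.is_meaningful
            and n_words >= min_words
            and (max_words is None or n_words <= max_words)
            and min_age <= r.age <= max_age
        )
        (retained if meets_criteria else excluded).append(r)
    return retained, excluded
```

Under these assumptions, the retained list corresponds to the training sample, and the excluded list is the pool from which an out-of-sample set such as Outlier_Test_Set could be drawn.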
Results
Model performance remained stable across all validation sets (mean baseline AUC ≈ .80). The Outlier_Test_Set yielded the highest AUC (≈ .84) across GPUs. Performance differences across GPU types were minimal (ΔAUC ≈ .02), indicating that hardware variation had a negligible impact on model outcomes.
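For readers unfamiliar with the metric, the sketch below shows one way a per-GPU AUC and the spread between GPUs could be computed. It assumes that Δ is the difference between the highest and lowest AUC across GPU types, which the abstract does not state explicitly. The function names `roc_auc` and `auc_spread` are hypothetical, and the labels and scores are toy values, not study data.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win
    (the Mann-Whitney formulation of the area under the ROC curve).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_spread(auc_by_gpu: Dict[str, float]) -> float:
    """Assumed definition of the spread: highest minus lowest AUC across GPUs."""
    values = list(auc_by_gpu.values())
    return max(values) - min(values)


# Toy example: the same four cases scored by models trained on two GPUs.
labels = [0, 0, 1, 1]
auc_by_gpu = {
    "V100": roc_auc(labels, [0.10, 0.40, 0.35, 0.80]),
    "T4": roc_auc(labels, [0.10, 0.30, 0.35, 0.80]),
}
print(auc_by_gpu, auc_spread(auc_by_gpu))
```

The pairwise loop is quadratic in sample size and is meant only to make the definition explicit; a rank-based implementation would be preferable for large test sets.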
Conclusions
These findings indicate that the BERTurk-based psychopathology models examined here are consistent across GPU hardware and generalize across sample characteristics. Performance was maintained, and even improved, on data beyond the original training distribution, highlighting the models' reliability and potential for scalable psychological assessment.