1801 - EVALUATION OF LLM TO LIKERT-TYPE PSYCHOMETRIC SCALE ASSESSMENT

Session: D02S001 - AI-Driven Psychological Assessment 1

AUTHORS:

Yasui Ayano (Bunkyo University ~ Koshigaya ~ Japan), Komamizu Takahiro (Nagoya University ~ Nagoya ~ Japan), Masuda Tomohiro (Bunkyo University ~ Koshigaya ~ Japan)

Abstract text:

AI technologies have advanced rapidly, and some studies have used large language models (LLMs) to respond to psychological scales. However, the psychometric validity of such responses has rarely been tested. As far as is known, LLM outputs may contain distortions, and the use of LLMs with Likert-type scales in psychometrics remains underexplored. Therefore, more direct analyses, such as comparisons with the consistency of human responses or confirmatory factor analyses, are required.
To address this issue, this study examined how LLMs with assigned personas respond to Likert-type scales. Using GPT-4o-mini, 300 responses were generated for each of 15 conditions defined by three dimensions: type of persona information (Big Five, time perspective [TP], or both), granularity (two-level, five-level, or factor scores), and format (numeric or textual, with factor scores provided numerically only). The model was instructed to complete the Big Five and TP scales, both in a 5-point Likert format.
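The count of 15 conditions follows from the design as described: three information types crossed with five granularity-format combinations (two-level and five-level in both formats, factor scores in numeric format only). A minimal Python sketch of that enumeration is given below; the label strings and the function name are hypothetical, chosen only for illustration, and are not taken from the study's materials.

```python
# Illustrative enumeration of the experimental conditions.
# All labels below are hypothetical names, not the study's own identifiers.

INFO_TYPES = ["big_five", "time_perspective", "both"]

# Formats available at each granularity level; factor scores are
# described as being provided numerically only.
FORMATS_BY_GRANULARITY = {
    "two_level": ["numeric", "textual"],
    "five_level": ["numeric", "textual"],
    "factor_scores": ["numeric"],
}


def enumerate_conditions():
    """Return every (information type, granularity, format) combination."""
    conditions = []
    for info in INFO_TYPES:
        for granularity, formats in FORMATS_BY_GRANULARITY.items():
            for fmt in formats:
                conditions.append((info, granularity, fmt))
    return conditions


conditions = enumerate_conditions()
print(len(conditions))  # 3 information types x 5 granularity-format pairs
```

Reading the design as 300 responses per condition, this would correspond to 15 x 300 = 4,500 generated responses in total.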
The results showed that the type of persona information, granularity, and presentation format produced minimal differences. However, when factor scores were provided, both scales yielded inconsistent responses, with reliability coefficients dropping sharply (Big Five: α = .33 to .41; TP: α = .19 to .27). In the other conditions, the Big Five scale showed moderately acceptable internal consistency (α = .70 to .83), whereas the TP scale remained unstable (α = −.04 to .31). Validity analyses revealed more serious issues: confirmatory factor analyses consistently indicated poor model fit, with indices such as CFI (.30 to .85) ranging from marginal to very poor.
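For readers less familiar with the reliability coefficient reported here, Cronbach's α for k items is k/(k−1) × (1 − Σ item variances / variance of the total score). The sketch below computes it with the Python standard library only; the function name and the toy data are assumptions made for illustration and do not reproduce the study's analysis code or data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents.

    rows: list of equal-length lists, one list of item scores per respondent.
    Uses sample variances (n - 1 denominator).
    """
    n_items = len(rows[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*rows))  # transpose to one tuple per item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(r) for r in rows])  # variance of sum scores
    return (n_items / (n_items - 1)) * (1 - item_var_sum / total_var)


# Toy data (hypothetical): two items that move together vs. in opposition.
consistent = [[1, 1], [2, 2], [3, 3]]
opposed = [[1, 5], [2, 3], [3, 1]]

alpha_consistent = cronbach_alpha(consistent)
alpha_opposed = cronbach_alpha(opposed)
print(alpha_consistent, alpha_opposed)
```

The second toy case shows why a negative coefficient is possible: when items covary negatively, the summed item variances exceed the variance of the total score, and α falls below zero.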
In conclusion, Likert-type responses generated by LLMs cannot substitute for human responses in psychometric validation. Even when internal consistency appears adequate, validity is fundamentally lacking. While this study focused on Likert scaling, future work should examine whether alternative scaling methods yield different outcomes. Until the mechanisms underlying LLM "personality" are better understood, the psychometric use of LLMs should be approached with caution.