Psycholinguistic norming studies traditionally rely on human participants to rate individual words on lexical variables such as valence, arousal, and concreteness. These ratings enable researchers to control or manipulate lexical variables when examining word recognition processes. Given the time-intensive nature of collecting human ratings, recent research has explored whether Large Language Models (LLMs) can approximate human judgments through conversational probing across various languages (Martínez et al., 2025; Trott, 2024). However, our previous work revealed that LLM-generated ratings for Chinese two-character words showed only moderate correlations with human ratings across several lexico-semantic variables (arousal: r = .62; familiarity: r = .53; concreteness: r = .67; imageability: r = .65; Huang et al., under review).
The present study extends this work by examining whether variable ambiguity moderates the relationship between LLM and human ratings. We analyzed human ratings of valence, arousal, familiarity, concreteness, and imageability for over 25,000 Chinese two-character words (Chan & Tse, 2024) alongside ratings generated by two LLMs (GPT-4o-Turbo and DeepSeek-R1-FW). Variable ambiguity was operationalized as the standard deviation of a word's ratings across human raters, reflecting individual differences in word interpretation. For instance, "dog" may evoke positive or negative valence depending on personal experience. We hypothesized that higher variable ambiguity would weaken LLM-human correlations, as LLMs may struggle to capture the contextual and individual variability inherent in lexico-semantic ratings.
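The operationalization and the moderation test described above can be illustrated with a minimal sketch. It is not the analysis pipeline used in the study: the ordinary least squares specification, the mean-centering of predictors, the function names (`rating_ambiguity`, `moderation_fit`, `simulate_ratings`), and the simulated data are all assumptions introduced here for illustration. The sketch computes each word's mean rating and its standard deviation across raters, then regresses the human mean on the LLM rating, the ambiguity score, and their product.

```python
import numpy as np


def rating_ambiguity(ratings):
    """Per-word mean and sample SD across raters.

    `ratings` is a words x raters array; NaN marks a missing rating.
    """
    ratings = np.asarray(ratings, dtype=float)
    return np.nanmean(ratings, axis=1), np.nanstd(ratings, axis=1, ddof=1)


def moderation_fit(human_mean, llm, ambiguity):
    """OLS of human mean on LLM rating, ambiguity, and their interaction.

    Predictors are mean-centered so the main effects are interpretable
    at the average level of the other predictor (an assumed convention).
    """
    llm_c = llm - llm.mean()
    amb_c = ambiguity - ambiguity.mean()
    X = np.column_stack([np.ones_like(llm_c), llm_c, amb_c, llm_c * amb_c])
    coef, *_ = np.linalg.lstsq(X, human_mean, rcond=None)
    names = ["intercept", "llm", "ambiguity", "llm_x_ambiguity"]
    return dict(zip(names, coef))


def simulate_ratings(n_words=2000, n_raters=30, seed=0):
    """Hypothetical data: rater disagreement weakens the LLM-human link."""
    rng = np.random.default_rng(seed)
    llm = rng.normal(size=n_words)                 # simulated LLM ratings
    sigma = rng.uniform(0.5, 2.0, size=n_words)    # per-word rater spread
    slope = 1.0 - 0.3 * sigma                      # assumed attenuation
    noise = rng.normal(size=(n_words, n_raters))
    human = slope[:, None] * llm[:, None] + sigma[:, None] * noise
    return human, llm


if __name__ == "__main__":
    human, llm = simulate_ratings()
    human_mean, ambiguity = rating_ambiguity(human)
    fit = moderation_fit(human_mean, llm, ambiguity)
    for name, value in fit.items():
        print(f"{name}: {value:.3f}")
```

In this simulated setting, a negative `llm_x_ambiguity` coefficient corresponds to the hypothesized pattern: the slope relating LLM ratings to human means becomes shallower as rater disagreement increases.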
Our results revealed significant LLM × ambiguity interactions for both LLMs. For valence, arousal, concreteness, and imageability, these interactions were negative, indicating that greater ambiguity attenuated LLM-human correlations. In contrast, familiarity showed positive interactions (with the exception of the DeepSeek familiarity rating × familiarity ambiguity term). These findings largely support our hypothesis that variable ambiguity compromises LLMs' ability to predict human lexico-semantic ratings for Chinese two-character words, highlighting an important boundary condition for using LLMs in psycholinguistic research.