This paper presents one methodological strand of the broader Digital Maktaba project, which explores AI-assisted workflows for Arabic-script digital libraries. It details a case study centred on the Giorgio La Pira Library in Palermo, addressing the challenge of generating semantic metadata for scholarly works when only minimal contextual information is available: a digitized frontispiece carrying key data such as the title, author, and publisher. The study relies on a curated subset of 5,900 Arabic-language records extracted from the La Pira Library's catalogue, 2,200 of which include digitized frontispieces, used as a controlled evaluation dataset. On this basis, a constrained evaluation protocol tests whether large language models can generate accurate subject topics for a work using only the textual information extracted from its frontispiece, without access to external retrieval mechanisms. Model outputs are assessed with an LLM-as-a-judge protocol anchored to a curated ground-truth topic dataset, enabling systematic comparison of generated topics against expert-derived annotations. The analysis measures topic alignment and systematises recurring hallucination typologies, including semantic distortion, fabrication, overgeneralisation of subject categories, and misattribution driven by authorial or canonical associations.
Overall, this work proposes a replicable evaluation approach for assessing LLM behaviour in low-context metadata-generation tasks and contributes a benchmark dataset and analytical strategy relevant to digital libraries managing heterogeneous, resource-constrained collections.
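The LLM-as-a-judge protocol summarised above can be sketched in skeletal form. All names below are illustrative assumptions: the paper does not specify its prompt wording or scoring scheme, and a real pipeline would send the prompt to an LLM. Here a deterministic string-match stand-in plays the judge so the skeleton stays runnable.

```python
# Hypothetical sketch of a low-context topic-evaluation loop:
# frontispiece text -> generated topics -> judge verdicts -> alignment score.
# The function and variable names are assumptions for illustration only.

def build_judge_prompt(frontispiece_text, generated, ground_truth):
    """Assemble a judge prompt asking whether each generated topic
    aligns with the expert-derived ground-truth topics."""
    return (
        "Frontispiece text:\n" + frontispiece_text + "\n\n"
        "Generated topics: " + ", ".join(generated) + "\n"
        "Ground-truth topics: " + ", ".join(ground_truth) + "\n"
        "For each generated topic, answer ALIGNED or HALLUCINATED."
    )

def mock_judge(generated, ground_truth):
    """Stand-in for the LLM judge: exact match after normalisation.
    A real judge would also catch paraphrases and near-synonyms."""
    truth = {t.strip().lower() for t in ground_truth}
    return ["ALIGNED" if g.strip().lower() in truth else "HALLUCINATED"
            for g in generated]

def alignment_score(verdicts):
    """Fraction of generated topics the judge marked as aligned."""
    if not verdicts:
        return 0.0
    return sum(v == "ALIGNED" for v in verdicts) / len(verdicts)

verdicts = mock_judge(["Islamic law", "Astronomy"],
                      ["islamic law", "fiqh"])
print(verdicts, alignment_score(verdicts))
# → ['ALIGNED', 'HALLUCINATED'] 0.5
```

Anchoring the judge to a curated ground-truth topic set, as the study does, is what makes per-topic verdicts like these comparable across models and runs.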