Recent advances in large language models (LLMs) have extended their use to specialized fields such as religious studies. Customized models, built with tools such as GPT Builder and grounded in authoritative sources such as Sahih al-Bukhari or the Qur'an, have been explored for answering questions about Islamic teachings. However, evaluations reveal significant limitations, including hallucinations, inaccurate references, and difficulty adhering strictly to the designated sources, especially when the models are presented with fabricated Ahadith.
This study proposes a novel framework to assess and improve the ability of pre-trained LLMs to rely exclusively on Sahih al-Bukhari as a source. The evaluation tests the models with queries built on fabricated Ahadith to determine whether they recognize these as absent from Sahih al-Bukhari and keep to the source-only constraint. The findings underscore the necessity of anchoring AI systems in structured, authenticated datasets for sensitive domains like religious studies.
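The fabricated-Ahadith evaluation described above can be illustrated with a minimal sketch. It is not the study's implementation: the `Probe` type, the `ABSENCE_MARKERS` phrases, the keyword-matching heuristic, and the `ask` callable that wraps a model are all assumptions introduced here for illustration, and a real evaluation would likely need human or more robust automated judgment of each response.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical phrases taken as evidence that the model declined to
# attribute the queried text to Sahih al-Bukhari (an assumption, not
# a list from the study).
ABSENCE_MARKERS = (
    "not found in sahih al-bukhari",
    "not in sahih al-bukhari",
    "cannot find",
    "no such hadith",
)


@dataclass
class Probe:
    """One evaluation query; `fabricated` marks an invented hadith."""
    query: str
    fabricated: bool


def flags_absence(response: str) -> bool:
    """Return True if the response contains any absence marker."""
    text = response.lower()
    return any(marker in text for marker in ABSENCE_MARKERS)


def rejection_rate(probes: Iterable[Probe], ask: Callable[[str], str]) -> float:
    """Fraction of fabricated probes that the model flags as absent.

    `ask` is a hypothetical callable that sends a query to the model
    under test and returns its text response.
    """
    fabricated = [p for p in probes if p.fabricated]
    if not fabricated:
        raise ValueError("at least one fabricated probe is required")
    correct = sum(flags_absence(ask(p.query)) for p in fabricated)
    return correct / len(fabricated)
```

In use, `ask` would wrap whichever model is being tested, and the probe set would mix fabricated narrations with genuine ones so that over-refusal on authentic Ahadith can be measured separately with the same helper functions.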