This paper investigates how specific terms in prompts influence relevance evaluation with large language models (LLMs) such as GPT-3.5 and GPT-4. The study finds that prompts using the term 'answer' yield more effective relevance assessments than those using 'relevant,' and it emphasizes the importance of balancing the scope of 'relevance' in prompt design. By comparing the performance of different prompts in both few-shot and zero-shot settings, the research provides practical insights into prompt engineering for improved relevance evaluation.
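For illustration, the sketch below shows what the two zero-shot prompt phrasings contrasted in the study might look like when issued through the OpenAI Python client. The exact template wording, the `judge` helper, and the choice of model are assumptions for demonstration only, not the paper's actual prompts.

```python
# Minimal sketch (assumed wording, not the paper's templates) contrasting a
# prompt that asks whether a passage "answers" the query with one that asks
# whether it is "relevant" to it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = {
    # Phrasing the study found more effective for relevance assessment.
    "answer": (
        "Does the following passage answer the query? Reply Yes or No.\n\n"
        "Query: {query}\nPassage: {passage}"
    ),
    # Conventional phrasing built around the term 'relevant'.
    "relevant": (
        "Is the following passage relevant to the query? Reply Yes or No.\n\n"
        "Query: {query}\nPassage: {passage}"
    ),
}

def judge(query: str, passage: str, variant: str = "answer",
          model: str = "gpt-4") -> str:
    """Request a zero-shot relevance judgment using the chosen phrasing."""
    prompt = PROMPTS[variant].format(query=query, passage=passage)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judgments for evaluation runs
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(judge("who wrote Hamlet",
                "Hamlet is a tragedy written by William Shakespeare."))
```

Running both variants over the same query-passage pairs and comparing the judgments against human labels is one way such a phrasing comparison could be carried out in practice.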