Leveraging Mental Simulation for Dynamic Experience Retrieval in LLM Agents & Rethinking Self-Enhancement Bias in LLMs
Roger Chang, Hung-Ru Lin
NTU CSIE
Time: 2025-03-05, 12:00 - 13:00 ET
Location: Olsson 105 and Zoom
Abstract
Roger Chang
Title: MENSA: Leveraging Mental Simulation for Dynamic Experience Retrieval in LLM Agents
MENSA enhances decision-making in interactive text environments by enabling LLM agents to anticipate and satisfy action preconditions, a challenge that often hinders their performance. Inspired by human mental simulation, MENSA forecasts the preconditions of upcoming actions and retrieves relevant past experiences to guide decisions, all without requiring fine-tuning. This approach yields significant improvements, surpassing previous SOTA methods with larger models such as GPT-4o-mini while also boosting the performance of smaller models such as Phi-3-mini. This talk explores the role of mental simulation in LLMs and how MENSA advances sequential decision-making.
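To make the loop described above concrete, here is a minimal Python sketch of a mental-simulation step. It is an illustration of the idea only, not the authors' implementation: the names (`llm`, `ExperienceStore`, `mensa_step`) and the word-overlap retrieval are assumptions for the sake of a runnable example.

```python
# Hypothetical sketch of a MENSA-style step: forecast preconditions via
# "mental simulation", retrieve matching past experiences, then act.
from dataclasses import dataclass

@dataclass
class Experience:
    situation: str   # an earlier observation
    action: str      # the action taken then
    outcome: str     # what happened as a result

class ExperienceStore:
    """Toy store that ranks past experiences by word overlap with a query."""
    def __init__(self, experiences):
        self.experiences = experiences

    def retrieve(self, query, k=3):
        words = set(query.lower().split())
        def overlap(exp):
            return len(words & set(exp.situation.lower().split()))
        return sorted(self.experiences, key=overlap, reverse=True)[:k]

def mensa_step(llm, store, observation, goal):
    # 1. Mental simulation: forecast what the next action will require
    #    (its preconditions) before committing to it.
    forecast = llm(
        f"Goal: {goal}\nObservation: {observation}\n"
        "Predict the preconditions the next action must satisfy."
    )
    # 2. Retrieve past experiences relevant to the forecast, not just the
    #    raw observation, so precondition-satisfying steps surface.
    examples = "\n".join(
        f"- {e.situation} -> {e.action} -> {e.outcome}"
        for e in store.retrieve(forecast)
    )
    # 3. Decide on the next action, conditioned on forecast and experiences.
    return llm(
        f"Goal: {goal}\nObservation: {observation}\n"
        f"Forecast: {forecast}\nRelevant experiences:\n{examples}\n"
        "Next action:"
    )
```

Here `llm` stands for any prompt-to-text callable; note that nothing in the loop updates model weights, consistent with the abstract's claim that no fine-tuning is required.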
Hung-Ru Lin
Title: Rethinking Self-Enhancement Bias in LLMs: A More Reliable Measurement Method
Large language models (LLMs) are increasingly used as evaluators, but can we trust their self-assessments? Self-enhancement bias, where models overrate their own responses, poses a serious challenge to reliable evaluation. Accurately measuring this bias is crucial to prevent misleading assessments when biased models are used for evaluation tasks. However, we found that existing methods for measuring self-enhancement bias are confounded by response quality and model capability: lower-quality responses tend to receive higher bias scores, and the problem worsens in less capable models. As a result, current approaches may not capture true self-enhancement bias. To address this, we introduce SALIERI, a method that pairs responses of similar quality to isolate the true bias. Experiments on the SummEval dataset show that SALIERI significantly reduces these measurement distortions, leading to more reliable assessments across models such as Gemma 7B and Llama 3 8B. In this talk, I will discuss the implications of these findings for LLM evaluation and future research directions.
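The core idea, pairing responses of similar quality before comparing the evaluator's scores, can be sketched in a few lines of Python. This is a simplified illustration under assumed names (`Response`, `measure_bias`, `human_quality`), not the SALIERI implementation itself.

```python
# Hypothetical sketch of quality-matched bias measurement: compare how an
# LLM judge scores its own responses vs. others' responses of similar
# reference quality, so quality differences cannot masquerade as bias.
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    author: str           # which model produced it
    human_quality: float  # reference quality score (e.g., human annotation)

def measure_bias(evaluator, own, others, tol=0.1):
    """Estimate self-enhancement bias as the mean score gap between an
    evaluator's own responses and quality-matched responses from others.

    evaluator: callable text -> score (the LLM acting as judge).
    tol: maximum allowed human-quality gap within a matched pair.
    """
    gaps = []
    for own_resp in own:
        candidates = [
            r for r in others
            if abs(r.human_quality - own_resp.human_quality) <= tol
        ]
        if not candidates:
            continue  # no quality-matched partner; skip to avoid confounding
        partner = min(
            candidates,
            key=lambda r: abs(r.human_quality - own_resp.human_quality),
        )
        # With quality held (approximately) fixed, any remaining score gap
        # reflects bias rather than response quality.
        gaps.append(evaluator(own_resp.text) - evaluator(partner.text))
    return sum(gaps) / len(gaps) if gaps else 0.0
```

Because each pair is matched on reference quality, a positive return value suggests genuine self-preference rather than the evaluator correctly rewarding better responses, which is the confound the abstract identifies in prior methods.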
Bio
Roger Chang is a master's student in the Department of CSIE at National Taiwan University, primarily researching applications of large language models (LLMs) and AI agents. His master's thesis focuses on executing complex scientific experimental tasks, exploring how AI agents with LLMs as their decision-making core can better leverage past experiences to improve long-term performance on complex tasks, with the aim of lowering the capability threshold required of LLMs. Roger graduated from the Department of EE at National Cheng Kung University and previously worked as a Technology Strategy Intern at Delta Electronics, focusing on cutting-edge technologies in LLMs, LLM agents, robotics, and related fields.
Hung-Ru Lin is a graduate student in Computer Science at National Taiwan University, specializing in Natural Language Processing (NLP). His research focuses on evaluating large language models (LLMs), with an emphasis on bias in NLG evaluation.