Erhan Xu, PhD student at the London School of Economics and Political Science: Doubly Robust Alignment for Large Language Models
Abstract
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model or the reference policy.
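For context on the abstract above (this sketch is not taken from the talk itself; the notation is the standard RLHF convention, with r the reward model, \pi the policy being fine-tuned, and \pi_{\mathrm{ref}} the reference policy): pairwise human preferences are commonly modeled with the Bradley–Terry model, and the policy is fine-tuned against a KL-regularized objective,

\[
\Pr(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big), \qquad \sigma(t) = \frac{1}{1 + e^{-t}},
\]
\[
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big).
\]

Misspecifying the preference model in the first display or the reference policy \pi_{\mathrm{ref}} in the second is the kind of sensitivity the abstract refers to; in the doubly robust estimation literature, "doubly robust" generally means a procedure that remains consistent when either of the two nuisance models is correctly specified, without requiring both.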