RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models

Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, Chaowei Xiao

Paper Details:

Month: August
Year: 2024
Location: Bangkok, Thailand
Venue: ACL