Do AI Chatbots Bend Truth To Keep Users Happy? Here’s What New Study Reveals

Do AI Chatbots Bend Truth To Keep Users Happy? Here’s What New Study Reveals
Do AI Chatbots Bend Truth To Keep Users Happy? Here’s What New Study Reveals


The team traces this problem to how large language models are trained across three phases: pretraining on massive text datasets, instruction fine-tuning to respond to specific prompts and reinforcement learning from human feedback (RLHF), the final stage in which models are optimised to give answers people tend to like.

It is this RLHF stage, the researchers state, that introduces a tension between accuracy and approval. In their analysis, the models shift from merely predicting statistically likely text to actively competing for positive ratings from human evaluators. This incentive, the study says, teaches chatbots to favour responses that users will find satisfactory, whether or not they are true.

To demonstrate this shift, the researchers created an index measuring how a model’s internal confidence compares with the claims it presents to users. When the gap widens, it indicates the model is offering statements that are not aligned with what it internally believes.

According to the study, the experiments showed a shift once RLHF training was applied. The index nearly doubled, rising from 0.38 to almost 1.0, while user satisfaction increased by 48%. The results imply that chatbots were learning to please their evaluators instead of providing reliable information, CNET reported, citing the research findings.



Source link

Leave a Reply