🚨 Human Feedback for AI training: Not the golden goose we thought?
I've just read a great paper in which Cohere researchers raise significant questions about using human feedback to evaluate AI language models.
Human feedback is often regarded as the gold standard for judging AI performance, but it turns out it might be more like fool's gold: the study reveals that our human judgments are easily swayed by factors that have nothing to do with actual AI performance.
Key insights:
🔧 The study tests several models: Llama-2, Falcon-40B, and Cohere Command (6B and 52B).
🙅‍♂️ Refusing to answer tanks AI ratings more than getting facts wrong: we apparently prefer a wrong answer to no answer!
💪 Confidence is key (even when it shouldn't be): more assertive AI responses are seen as more factual, even when they're not. This could be pushing AI development in the wrong direction through training methods like RLHF.
🎭 The assertiveness trap: as AI responses get more confident-sounding, non-expert annotators become less likely to notice when they're wrong or inconsistent.
And a consequence of the above:
🔁 RLHF might backfire: Using human feedback to train AI (Reinforcement Learning from Human Feedback) could accidentally make AI more overconfident and less accurate.
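To make that failure mode concrete, here is a minimal sketch (my own illustration, not code from the paper; the model, dimensions, and data are toy placeholders) of the standard Bradley-Terry reward-model objective used as the first stage of typical RLHF pipelines. The loss only sees which response annotators preferred, never why, so if the "chosen" labels are driven by assertive wording rather than correctness, the reward model, and the policy later optimized against it, learns to reward that style:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar reward."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise (Bradley-Terry) objective: maximize P(chosen > rejected).
    # Whatever drove the annotator's choice -- correctness OR mere assertiveness --
    # is exactly what the reward model is trained to pay out for.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical batch: pairs where non-expert annotators preferred the
# assertive-but-wrong answer over the hedged-but-correct one.
model = TinyRewardModel()
chosen = torch.randn(8, 768)    # embeddings of assertive responses (labelled "preferred")
rejected = torch.randn(8, 768)  # embeddings of accurate, hedged responses (labelled "rejected")

loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # gradients raise the reward for the assertive style;
                 # the RLHF policy then optimizes toward that biased reward
```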
This paper means we need to think carefully about how we evaluate and train AI systems, to make sure we're rewarding correctness rather than the appearance of it, like confident talk.
⚔️ Chatbot Arena's Elo leaderboard, based on crowdsourced votes from average joes like you and me, might become completely irrelevant as models become smarter and smarter.
Read the paper 👇
Human Feedback is not Gold Standard (2309.16349): https://huggingface.co/papers/2309.16349