Share and discover more about AI with social posts from the community.
🚨 Human feedback for AI training: Not the golden goose we thought?
I've just read a great paper in which Cohere researchers raise significant questions about using human feedback to evaluate AI language models.
Human feedback is often regarded as the gold standard for judging AI performance, but it turns out it might be more like fool's gold: the study reveals that our human judgments are easily swayed by factors that have nothing to do with actual AI performance.
Key insights:
🧪 Tested several models: Llama-2, Falcon-40B, Cohere Command 6B and 52B.
🙅‍♂️ Refusing to answer tanks AI ratings more than getting facts wrong. We apparently prefer a wrong answer to no answer!
💪 Confidence is key (even when it shouldn't be): more assertive AI responses are seen as more factual, even when they're not. Training setups like RLHF could thus be pushing AI development in the wrong direction.
🎭 The assertiveness trap: as AI responses get more confident-sounding, non-expert annotators become less likely to notice when they're wrong or inconsistent.
And a consequence of the above:
📉 RLHF might backfire: using human feedback to train AI (Reinforcement Learning from Human Feedback) could accidentally make AI more overconfident and less accurate.
This paper shows we need to think carefully about how we evaluate and train AI systems, to make sure we reward correctness rather than appearances of it, like confident talk.
⚔️ Chatbot Arena's Elo leaderboard, based on crowdsourced answers from average joes like you and me, might become irrelevant as models get smarter and smarter.
Read the paper 👇
Human Feedback is not Gold Standard (2309.16349): https://huggingface.co/papers/2309.16349
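For context on the arena-style leaderboard mentioned above, here is a minimal sketch of a standard Elo update applied after each crowdsourced head-to-head vote. The K-factor of 32 is a common default, not necessarily Chatbot Arena's exact setting:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a head-to-head comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    # Expected score of A given the rating gap (logistic curve, base 10)
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the vote and gains half of K
print(elo_update(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```

Because updates depend only on which answer voters prefer, any bias in those preferences (like rewarding assertiveness) flows straight into the ratings.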
Hyperfast Contextual Custom LLM with Agents, Multitokens, Explainable AI, and Distillation https://mltblog.com/4dNPSnB
New additions to this ground-breaking system include multi-token distillation when processing prompts, agents to meet user intent, more NLP, and a command prompt menu accepting both standard prompts and various actions.
I also added several illustrations, featuring xLLM in action with a full session and sample commands to fine-tune in real time. All the code, input sources (an anonymized corporate corpus from a Fortune 100 company), and contextual backend tables, including embeddings, are on GitHub. My system has zero weights, no transformer, and no neural network. It relies on explainable AI, does not require training, is fully reproducible, and fits in memory. Yet your prompts can retrieve relevant full-text entities from the corpus with no latency, including URLs, categories, titles, email addresses, and so on, thanks to a well-designed architecture.
Read more, get the code, paper and everything for free, at https://mltblog.com/4dNPSnB
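The post doesn't reproduce xLLM's actual backend tables, but the idea of retrieval via plain lookup tables rather than a neural network can be sketched roughly like this. The corpus entries and field names below are invented for illustration:

```python
# Toy sketch: an inverted index mapping tokens to corpus entities,
# so a prompt retrieves full-text entities by direct table lookup.
corpus = [
    {"title": "Cloud Architecture Overview",
     "url": "https://example.com/arch", "category": "infrastructure"},
    {"title": "Quarterly Security Review",
     "url": "https://example.com/sec", "category": "security"},
]

# Build the lookup table once: token -> set of entity URLs
index = {}
for entity in corpus:
    for token in entity["title"].lower().split():
        index.setdefault(token, set()).add(entity["url"])

def retrieve(prompt):
    """Return corpus entities whose titles share a token with the prompt."""
    hits = [index.get(tok, set()) for tok in prompt.lower().split()]
    urls = set.union(*hits) if hits else set()
    return [e for e in corpus if e["url"] in urls]

print([e["title"] for e in retrieve("security review")])
```

Lookups are dictionary accesses, which is why this style of backend can answer with essentially no latency and no training step.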
Zero-shot VQA evaluation of Docmatix using an LLM - do we need to fine-tune?
While developing Docmatix, we found that fine-tuning Florence-2 on it performed well on the DocVQA task, but still scored low on the benchmark. To improve the benchmark score, we had to further fine-tune the model on the DocVQA dataset so it would learn the grammatical style of the benchmark. Interestingly, human evaluators felt that the additionally fine-tuned model performed worse than the one fine-tuned on Docmatix alone, so we decided to use the additionally fine-tuned model only for ablation experiments and to publicly release the model fine-tuned on Docmatix alone. Although the answers generated by the model are semantically consistent with the reference answers (as shown in Figure 1), the benchmark scores are low. This raises the question: should we fine-tune models to improve performance on existing metrics, or should we develop new metrics that better match human perception?
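Part of the gap between semantic correctness and benchmark score comes from string-based scoring: DocVQA uses ANLS (Average Normalized Levenshtein Similarity), which gives zero credit once the edit distance to the reference is too large. A minimal sketch of the per-answer score:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction, references, tau=0.5):
    """Best 1 - normalized-Levenshtein over references; 0 past threshold tau."""
    best = 0.0
    for ref in references:
        p, r = prediction.lower().strip(), ref.lower().strip()
        nl = levenshtein(p, r) / max(len(p), len(r), 1)
        best = max(best, 1 - nl if nl < tau else 0.0)
    return best

print(round(anls("$12.50", ["12.50"]), 3))          # close match scores high
print(anls("the answer is 12.50", ["12.50"]))       # verbose answer scores 0.0
```

A semantically correct but verbose answer like "the answer is 12.50" lands past the threshold and scores 0.0, which is exactly the mismatch with human perception described above.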
AI Event Scheduler - Streamline event creation with this AI Chrome extension, saving time and reducing manual errors.
Cokeep - Transform bookmarks into collaborative spaces with AI organization, summarization, and team sharing capabilities.
Crayon AI - Unleash creativity with an all-in-one AI image toolbox, with generation, editing, and optimization for all skill levels.
Tailwind Genie - Generate responsive UI designs with AI, streamlining web development using Tailwind CSS.
Video Ai Hug - Transform static photos into personalized hugging videos, bringing cherished moments to life.
Postin - Supercharge your LinkedIn presence with AI-crafted posts, smart management, and engagement-boosting strategies.
Metastory AI v2.2 - Enhance project management with this update, which adds Jira integration, project publishing, and an improved editor for streamlined collaboration.
Beloga - Intelligently capture and seamlessly search across Notion, GDrive, notes, the internet and more simultaneously with a digital brain that's designed to help amplify your knowledge.
Sick of feeling like a broken record, endlessly repeating instructions?
It's time to let AI do the talking. Meet Guidde - your GPT-powered ally that transforms even the most complex tasks into crystal-clear, AI-generated video documentation at lightning speed.
Seamlessly share or embed your guides anywhere, hassle-free
Say goodbye to dry documentation and hello to beautiful guides
Reclaim precious time generating documentation 11x faster with AI
Best of all, it only takes 3 steps:
Install the free guidde Chrome extension
Click โCaptureโ in the extension and โStopโ when done
Sit back and let AI handle the rest, then share your guide
AI Police Cams - Between July and August, AI cameras used in two UK counties detected over 2,000 people not wearing seat belts on three roads, including 109 children. One case involved an unrestrained toddler sitting on a woman's lap in the front passenger seat. Not only are AI-powered cameras being used for seat belts, they're also being used to catch litterers.
Qwen - New updates have been made to Qwen's AI models across multiple modalities. Qwen2-VL is a new vision-language model capable of understanding high-resolution images and 20+ minute videos; Qwen2-Audio is for processing voice inputs; and Qwen-Agent is an approach to expand 8K-context models to handle 1M tokens.
Wyze - A new AI-powered search feature from Wyze allows users to search through their camera footage using keywords and natural language queries. Instead of manually scrolling through recorded events, users can now search for specific objects, people, or activities like "truck," "delivery person," or even more detailed requests like "show me my cat in the backyard."
Celebrating Hugging Face's acquisition of huggingface.com at a high price.
sequelbox
New synthetic general chat dataset! Meet Supernova, a dataset using prompts from UltraFeedback and responses from Llama 3.1 405B Instruct:
sequelbox/Supernova
new model(s) using the Supernova dataset will follow next week, along with Other Things. (One of these will be a newly updated version of Enigma, utilizing the next version of
sequelbox/Tachibana
with approximately 2x the rows!)
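The construction described above (existing prompts, fresh responses from a stronger model) can be sketched as a simple pipeline. Here `generate` is a placeholder for an actual call to Llama 3.1 405B Instruct, and the prompts are invented examples rather than real UltraFeedback rows:

```python
def generate(prompt):
    """Stand-in for a call to the teacher model (e.g. Llama 3.1 405B Instruct)."""
    return f"<response to: {prompt}>"

# Prompts would come from an existing dataset such as UltraFeedback
prompts = ["Explain entropy simply.", "Write a haiku about autumn."]

# Pair each prompt with a newly generated response to form the new dataset
dataset = [{"prompt": p, "response": generate(p)} for p in prompts]
print(len(dataset))  # -> 2
```

The resulting list of prompt/response records is the shape a chat SFT dataset typically takes before being pushed to the Hub.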
Just published a demo for Salesforce's new function-calling model Salesforce/xLAM:
-
Tonic/Salesforce-Xlam-7b-r
-
Tonic/On-Device-Function-Calling
Just try 'em out, and it comes with an on-device version too! Cool!
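The demo itself isn't reproduced here, but the general shape of function calling can be sketched as follows: the model emits a structured payload naming a tool and its arguments, which the host parses and dispatches. The JSON schema and the `get_weather` tool below are illustrative, not xLAM's actual output format:

```python
import json

# Hypothetical registry of locally callable tools
TOOLS = {"get_weather": lambda city: f"Sunny in {city}"}

def dispatch(model_output: str):
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]          # look up the named tool
    return fn(**call["arguments"])    # call it with the model's arguments

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
# -> Sunny in Paris
```

The on-device variant follows the same loop; only the model producing the payload changes.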