Share and discover more about AI with social posts from the community.
Figure 02 is the second-generation humanoid robot from AI robotics startup Figure. Here is an overview of its characteristics and capabilities:
Characteristics and capabilities of Figure 02:
Hardware:
The body adopts an exoskeleton structure, with the power supply and compute wiring integrated inside the body, improving reliability and packaging compactness.
It is equipped with fourth-generation hands with 16 degrees of freedom and human-comparable strength, capable of carrying up to 25 kilograms and flexibly performing a variety of human-like tasks.
It has six RGB cameras (on the head, chest, and back), giving it "superhuman" vision.
The internal battery pack capacity has increased to 2.25 kWh. The founder hopes it can achieve more than 20 hours of effective working time per day, although the official website currently lists a battery life of only 5 hours; the 20-hour figure is likely the inferred limit of a "charging + working" cycle.
It is motor-driven, stands 5 feet 6 inches tall, and weighs 70 kilograms.
Software and intelligence:
It is equipped with an on-board visual language model (VLM), enabling it to perform rapid common-sense visual reasoning.
Compared to the previous generation product, the on-board computing and AI reasoning capabilities have tripled, allowing many real-world AI tasks to be executed completely independently.
It is equipped with a speech-to-speech reasoning model customized by the company's investor, OpenAI. The default UI is speech, and it communicates with humans through the on-board microphone and speaker. #AI
Radxa Launches New Single-Board Computers Featuring Rockchip RK3588S2 and RK3582 Chips, Starting at $30
Radxa has announced the launch of its latest single-board computers (SBCs), the Radxa ROCK 5C and the Radxa ROCK 5C Lite. These credit-card-sized devices are designed to cater to various computing needs, with prices starting at just $30 for the Lite version and $50 for the standard ROCK 5C. Both models are currently available for pre-order and are set to begin shipping on April 10, 2024.
Better Alignment with Instruction Back-and-Forth Translation

Abstract
We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al. (2023a), and rewrite the responses to improve their quality further based on the initial documents. Fine-tuning with the resulting (backtranslated instruction, rewritten response) pairs yields higher win rates on AlpacaEval than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct. We also demonstrate that rewriting the responses with an LLM outperforms direct distillation, and the two generated text distributions exhibit significant distinction in embedding space.

https://arxiv.org/abs/2408.04614
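The recipe is easy to sketch: one LLM call back-translates a web document into the instruction it best answers, and a second call rewrites the document into a cleaner response to that instruction. Below is a minimal, hypothetical outline of those two calls; the model choice, prompts, and `chat` helper are my own assumptions for illustration, not the authors' actual setup.

```python
# Minimal sketch of instruction back-and-forth translation (hypothetical prompts/model).
# Step 1 (backtranslation): ask an LLM to invent the instruction a web document answers.
# Step 2 (rewriting): ask the LLM to rewrite the document into a clean response to it.
from transformers import pipeline

# Any instruction-tuned chat model works for this sketch; the choice is an assumption.
generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta",
                     torch_dtype="auto", device_map="auto")

def chat(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    out = generator(messages, max_new_tokens=512, do_sample=False)
    return out[0]["generated_text"][-1]["content"]  # assistant reply

document = "Sourdough starters need regular feeding with flour and water..."  # web document

# Backtranslate: document -> synthetic instruction
instruction = chat(
    "Write the user instruction that the following document best answers.\n\n" + document
)

# Rewrite: (instruction, document) -> improved response grounded in the document
response = chat(
    f"Instruction: {instruction}\n\nRewrite the document below into a direct, "
    f"high-quality response to the instruction, keeping only relevant facts.\n\n{document}"
)

print({"instruction": instruction, "response": response})  # one synthetic training pair
```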
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
https://arxiv.org/abs/2408.04594

Abstract
High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components.
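As a rough illustration of what such a contrastive sample might look like (the field names below are my own assumption, not the actual Img-Diff schema), each pair couples two near-identical images with questions about both the differing and the matching content:

```python
# Hypothetical example of a contrastive image-difference sample (field names assumed,
# not the real Img-Diff schema): two similar images plus region-level difference Q&A.
sample = {
    "image_a": "kitchen_scene_001.jpg",
    "image_b": "kitchen_scene_001_edited.jpg",   # same scene with one object replaced
    "difference_region": {"x": 312, "y": 188, "w": 96, "h": 120},
    "question": "What object differs between the two images in the highlighted region?",
    "answer": "The ceramic mug in image A is replaced by a glass jar in image B.",
    "matching_question": "Which large objects are identical in both images?",
    "matching_answer": "The refrigerator, the counter, and the window are unchanged.",
}
```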
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Abstract
3D Content Generation is at the heart of many computer graphics applications, including video gaming, film-making, virtual and augmented reality, etc. This paper proposes a novel deep-learning-based approach for automatically generating interactive and playable 3D game scenes, all from the user's casual prompts such as a hand-drawn sketch. Sketch-based input offers a natural and convenient way to convey the user's design intention in the content creation process. To circumvent the data-deficient challenge in learning (i.e., the lack of large-scale training data of 3D scenes), our method leverages a pre-trained 2D denoising diffusion model to generate a 2D image of the scene as the conceptual guidance. In this process, we adopt the isometric projection mode to factor out unknown camera poses while obtaining the scene layout. From the generated isometric image, we use a pre-trained image understanding method to segment the image into meaningful parts, such as off-ground objects, trees, and buildings, and extract the 2D scene layout. These segments and layouts are subsequently fed into a procedural content generation (PCG) engine, such as a 3D video game engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can be seamlessly integrated into a game development environment and is readily playable. Extensive tests demonstrate that our method can efficiently generate high-quality and interactive 3D game scenes with layouts that closely follow the user's intention.
https://arxiv.org/abs/2408.04567
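The first stage described above, using a pretrained 2D diffusion model to render the scene concept in isometric projection, can be sketched in a few lines with the diffusers library; the model ID and prompt wording here are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch of the first Sketch2Scene stage: render the scene concept in isometric
# projection with a pretrained 2D diffusion model (model and prompt are assumptions).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "isometric view of a small medieval village, dirt roads, trees, houses, "
    "game art style, clean top-down layout"
)
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("isometric_concept.png")  # later segmented into a 2D layout for a PCG engine
```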
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focus on a single domain, and lack varying perceptual granularities.
Transformer Explainer: Interactive Learning of Text-Generative Models
Abstract
Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present Transformer Explainer, an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and enabling smooth transitions across abstraction levels of mathematical operations and model structures. It runs a live GPT-2 instance locally in the user's browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. Our tool requires no installation or special hardware, broadening the public's education access to modern generative AI techniques. Our open-sourced tool is available at https://poloclub.github.io/transformer-explainer/. A video demo is available at https://youtu.be/ECR4oAwocjs.
https://arxiv.org/abs/2408.04619
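The computation the tool visualizes, GPT-2 scoring candidate next tokens for a prompt, can be reproduced in a few lines with the transformers library (a minimal sketch, separate from the Transformer Explainer codebase; the example prompt is arbitrary):

```python
# Minimal sketch of the next-token prediction that Transformer Explainer visualizes,
# using the same GPT-2 model via Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Data visualization empowers users to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # [batch, seq_len, vocab]

next_token_logits = logits[0, -1]            # scores for the next position
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)

for p, idx in zip(top.values, top.indices):  # top-5 next-token candidates
    print(f"{tokenizer.decode(idx):>12}  {p.item():.3f}")
```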
Better Alignment with Instruction Back-and-Forth Translation
Authors: Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li
Abstract
We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al. (2023a), and rewrite the responses to improve their quality further based on the initial documents. https://arxiv.org/abs/2408.04614
OpenAI (@OpenAI):
We’re rolling out the ability for ChatGPT Free users to create up to two images per day with DALL·E 3.

Just ask ChatGPT to create an image for a slide deck, personalize a card for a friend, or show you what something looks like.
Whisper Medusa
Whisper is an advanced encoder-decoder model for speech transcription and translation, processing audio through encoding and decoding stages. Given its large size and slow inference speed, various optimization strategies like Faster-Whisper and Speculative Decoding have been proposed to enhance performance. Our Medusa model builds on Whisper by predicting multiple tokens per iteration, which significantly improves speed with only a small degradation in WER. We train and evaluate our model on the LibriSpeech dataset, demonstrating speed improvements.

Training Details
aiola/whisper-medusa-v1 was trained on the LibriSpeech dataset to perform audio translation. The Medusa heads were optimized for English, so for optimal performance and speed improvements, please use English audio only.

Usage
To use whisper-medusa-v1, install the whisper-medusa repo following the README instructions: https://huggingface.co/aiola/whisper-medusa-v1
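A rough usage sketch is below, assuming the WhisperMedusaModel interface documented in the whisper-medusa repo; the class and method names follow the repo's README as of writing and may change, so verify against it before relying on this.

```python
# Rough usage sketch (verify against the whisper-medusa README; the interface below
# follows the repo's documented example but is not guaranteed to be current).
import torch
import torchaudio
from transformers import WhisperProcessor
from whisper_medusa import WhisperMedusaModel   # installed from the whisper-medusa repo

model = WhisperMedusaModel.from_pretrained("aiola/whisper-medusa-v1")
processor = WhisperProcessor.from_pretrained("aiola/whisper-medusa-v1")

waveform, sr = torchaudio.load("sample_en.wav")           # English audio recommended
if sr != 16000:                                           # Whisper expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features, language="en")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```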
Illusion Diffusion - an arborvitae Collection
Illusion Diffusion HQ 🌀
Generate stunning high quality illusion artwork with Stable Diffusion

Illusion Diffusion is back up with a safety checker! Since I have been asked: if you would like to support me, consider using deforum.studio

A Space by AP (follow me on Twitter), with big contributions from multimodalart

This project works by using Monster Labs' QR Control Net. Given a prompt and your pattern, we use a QR-code-conditioned ControlNet to create a stunning illusion! Credit to MrUgleh for discovering the workflow :) https://huggingface.co/spaces/AP123/IllusionDiffusion
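Under the hood this is a standard ControlNet setup: the pattern image conditions generation through Monster Labs' QR-code ControlNet. Here is a minimal diffusers sketch of that idea; the model IDs and settings are illustrative assumptions, not the Space's exact configuration.

```python
# Minimal sketch of pattern-conditioned generation with a QR-code ControlNet
# (model IDs and settings are illustrative, not the exact IllusionDiffusion config).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "monster-labs/control_v1p_sd15_qrcode_monster", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pattern = load_image("spiral_pattern.png")        # your black-and-white pattern or QR code
image = pipe(
    "a cozy medieval village in autumn, highly detailed",
    image=pattern,
    controlnet_conditioning_scale=1.1,            # higher = pattern shows through more
    num_inference_steps=30,
).images[0]
image.save("illusion.png")
```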
The page https://www.chatpdf.com/?via=saasinfopro is likely related to a tool or service that allows users to interact with PDF documents using chat functionality. It probably enables users to ask questions about the content of the PDF, and the tool would provide answers based on the information within the document. This could be useful for quickly extracting information from large or complex PDFs without having to manually search through the entire document. However, without accessing the actual page, this is a general assumption based on the common functionality associated with such a domain name.
Here are the latest models I mentioned earlier:

LLaMA: A large language model developed by Meta, with excellent understanding and generation capabilities.
PaLM: Google's latest language model, with powerful natural language processing capabilities.
Claude: An AI chatbot developed by Anthropic, capable of engaging in natural language conversations.
DALL-E: A text-to-image model that can generate images based on text descriptions.
Stable Diffusion: A text-to-image model that can generate high-quality images.
GPT-4: OpenAI's latest language model, with powerful natural language processing capabilities.
Bard: Google's latest chatbot, capable of engaging in natural language conversations.
These models are all recent developments, with impressive capabilities and potential applications.
Hugging Face acquires XetHub, a storage startup for AI workloads
Hugging Face has announced the acquisition of XetHub, a startup that develops storage for Machine Learning and AI applications, which can support Hugging Face's needs.

Yucheng Low, CEO of XetHub, said that Hugging Face's vision is to make AI models available to everyone, which requires storage capacity for both hosting data and accessing it; XetHub becoming part of Hugging Face helps fulfill this vision for the future of AI. Julien Chaumond, CTO of Hugging Face, said the challenge ahead is ever-larger repository sizes: Hugging Face currently hosts 1.3 million model repos, 450k datasets, 1B requests per day, and 6 PB of Cloudfront bandwidth per day.

Hugging Face did not disclose the value of the deal, but said that all 14 XetHub employees will join Hugging Face.

XetHub was founded in 2021 by Yucheng Low and Rajat Arya, both of whom previously worked at Turi, an AI startup acquired by Apple in 2016. They later left Apple to found XetHub.

Source: GeekWire
Hugging Face buys AI file management platform XetHub
Hugging Face recently acquired XetHub. With this, the hosting platform for LLMs is buying a start-up that helps developers manage the files they need to run AI projects.

The exact amount of the acquisition was not disclosed, but it is said to be Hugging Face’s largest to date. Among other things, this means that the acquisition amount is higher than the $10 million the LLM hosting platform paid for AI training specialist Argilla in June this year.

Code file management platform
With XetHub, Hugging Face gets its hands on a start-up that provides a platform for developers to store the code files and other technical solutions they use for AI projects. The main focus here is machine learning.

In addition, this platform offers several other tools to manage the ML stack. These include version control, asset tracking, making changes in a single place, and more. The platform also offers collaboration and productivity tools that should make it easier for developers to work with these specific files, edit them, improve performance, and make better comparisons with other tooling. https://www.techzine.eu/news/applications/123327/hugging-face-buys-ai-file-management-platform-xethub/
What is the best LLM for RAG systems? 🤔

In a business setting, it will be the one that gives the best performance at a great price! 💼💰

And maybe it should be easy to fine-tune, cheap to fine-tune... FREE to fine-tune? 😲

That's @Google Gemini 1.5 Flash! 🚀🌟

It now supports fine-tuning, and the inference cost is the same as the base model! (coughs: LoRA adapters) 🤭🤖

But is it any good? 🤷‍♂️
On the LLM Hallucination Index, Gemini 1.5 Flash achieved great context adherence scores of 0.94, 1, and 0.92 across short, medium, and long contexts. 📊🎯

Google has finally given us a model that is free to tune and offers an excellent balance between performance and cost. ⚖️👌

Happy tuning... 🎶🔧

Gemini 1.5 Flash: https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/ 🔗

LLM Hallucination Index: https://www.rungalileo.io/hallucinationindex 🔗

So the base model must be expensive? 💸
For the base model, the input price is reduced by 78% to $0.075 per 1 million tokens and the output price by 71% to $0.30 per 1 million tokens. 📉💵
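At those rates, a quick back-of-the-envelope cost check (prices taken from the figures above; the example token counts are made up):

```python
# Back-of-the-envelope cost at the reduced Gemini 1.5 Flash rates quoted above
# ($0.075 per 1M input tokens, $0.30 per 1M output tokens); token counts are made up.
INPUT_PRICE_PER_M = 0.075
OUTPUT_PRICE_PER_M = 0.30

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a RAG query with 8k tokens of retrieved context and a 500-token answer
print(f"${request_cost(8_000, 500):.6f} per request")                         # ≈ $0.000750
print(f"${request_cost(8_000, 500) * 1_000_000:,.2f} per million requests")   # ≈ $750.00
```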
Check Your Redirects and Status Codes
Quickly analyze 301, 302 redirects and other HTTP status codes to optimize your website's performance and SEO.
https://redirect-checker.girff.com/
Redirect Checker - Analyze Website Redirects and HTTP Status Codes
Use our free Redirect Checker tool to analyze your website's redirects, HTTP status codes, and headers. Optimize your SEO and improve user experience.
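The same check is easy to script yourself. A minimal sketch with Python's requests library that prints each hop's status code (the URL is just an example):

```python
# Minimal redirect/status-code check: follow redirects and print each hop.
import requests

def check_redirects(url: str) -> None:
    resp = requests.get(url, allow_redirects=True, timeout=10)
    for hop in resp.history:                       # each intermediate 301/302/etc.
        print(f"{hop.status_code}  {hop.url}  ->  {hop.headers.get('Location')}")
    print(f"{resp.status_code}  {resp.url}  (final)")

check_redirects("http://example.com")              # example URL
```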
A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
MiniCPM-V 2.6
MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

🔥 Leading Performance. MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding.

🖼 Multi Image Understanding and In-context Learning. MiniCPM-V 2.6 can also perform conversation and reasoning over multiple images. It achieves state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.

🎬 Video Understanding. MiniCPM-V 2.6 can also accept video inputs, performing conversation and providing dense captions for spatial-temporal information. It outperforms GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B on Video-MME with/without subtitles.

💪 Strong OCR Capability and Others. MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports multilingual capabilities in English, Chinese, German, French, Italian, Korean, etc.

🚀 Superior Efficiency. In addition to its friendly size, MiniCPM-V 2.6 also shows state-of-the-art token density (i.e., number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support real-time video understanding on end-side devices such as iPad.

💫 Easy Usage. MiniCPM-V 2.6 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with Gradio, and (6) online web demo. https://huggingface.co/openbmb/MiniCPM-V-2_6
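For reference, a rough usage sketch following the model card's remote-code interface; the chat() signature is defined by the repo's custom code, so verify against the card before relying on it (the image path and question are arbitrary examples).

```python
# Rough usage sketch for MiniCPM-V 2.6 via transformers remote code (verify against the
# model card; the chat() interface comes from the repo's custom code and may change).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True,
    attn_implementation="sdpa", torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)

image = Image.open("receipt.jpg").convert("RGB")           # example image
msgs = [{"role": "user", "content": [image, "What is the total amount on this receipt?"]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```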
Kling AI Video is FINALLY Public (All Countries), Free to Use and MIND BLOWING - Full Tutorial > https://youtu.be/zcpqAxYV1_w

You have probably seen those mind-blowing AI-made videos. And the day has arrived: the famous Kling AI is now available worldwide for free. In this tutorial video I will show you how to register for Kling AI for free with just an email and use its mind-blowing text-to-video, image-to-video, text-to-image, and image-to-image capabilities. This video shows non-cherry-picked results so you will know the actual quality and capability of the model, unlike those extremely cherry-picked example demos. Still, #KlingAI is the only #AI model that competes with OpenAI's #SORA, and it is actually available to use.

🔗 Kling AI Official Website ⤵️
▶️ https://www.klingai.com/



🔗 Our GitHub Repository ⤵️
▶️ https://github.com/FurkanGozukara/Stable-Diffusion
I just had a masterclass in open-source collaboration with the release of Llama 3.1 🦙🤗

Meta dropped Llama 3.1, and seeing firsthand the Hugging Face team working to integrate it is nothing short of impressive. Their swift integration, comprehensive documentation, and innovative tools showcase the power of open-source teamwork.

For the curious minds:

📊 Check out independent evaluations:
open-llm-leaderboard/open_llm_leaderboard


🧠 Deep dive into the tech: https://huggingface.co/blog/llama31

👨‍🍳 Try different recipes (including running 8B on free Colab!): https://github.com/huggingface/huggingface-llama-recipes (a minimal loading sketch follows at the end of this post)

📈 Visualize open vs. closed LLM progress:
andrewrreed/closed-vs-open-arena-elo


🤖 Generate synthetic data with distilabel, thanks to the new license allowing the use of outputs to train other LLMs https://huggingface.co/blog/llama31#synthetic-data-generation-with-distilabel

💡 Pro tip: Experience the 405B version for free on HuggingChat, now with tool-calling capabilities! https://huggingface.co/chat/

#OpenSourceAI #AIInnovation
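For the 8B-on-Colab recipe mentioned above, here is a minimal loading sketch with transformers; the repo is gated, so accept the license on the Hub and log in first, and the 4-bit quantization config is an assumption to fit free Colab memory rather than the recipes' exact setup.

```python
# Minimal sketch: load Llama 3.1 8B Instruct with transformers (gated repo; accept the
# license on the Hub and `huggingface-cli login` first). 4-bit loading via bitsandbytes
# is an assumption to keep memory within a free Colab GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)

messages = [{"role": "user", "content": "Give me one fun fact about llamas."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```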