Share and discover more about AI with social posts from the community.
XLSCOUT Unveils ParaEmbed 2.0: a Powerful Embedding Model Tailored for Patents and IP with Expert Support from Hugging Face
XLSCOUT, a Toronto-based leader in applying AI to intellectual property (IP), has developed a powerful proprietary embedding model, ParaEmbed 2.0, born from an ambitious collaboration with Hugging Face's Expert Support Program. The collaboration focuses on applying state-of-the-art AI technologies and open-source models to improve the understanding and analysis of complex patent documents, including patent-specific terminology, context, and relationships. This allows XLSCOUT's products to deliver strong performance for drafting patent applications, running patent invalidation searches, and checking that ideas are novel relative to existing patents and literature.
https://huggingface.co/blog/xlscout-case-study
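ParaEmbed 2.0 itself is proprietary, but the retrieval idea it powers is easy to illustrate. The sketch below is only that: it uses a generic open-source embedding model (BAAI/bge-base-en-v1.5) and toy passages, both assumptions, to show how embedding similarity can rank prior art against a new idea.

```python
# Illustrative sketch only: ParaEmbed 2.0 is not public, so a generic open-source
# embedding model stands in here to show embedding-based prior-art ranking.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# A new idea and a tiny corpus of prior-art passages (toy examples).
idea = "A drone that uses ultrasonic sensors to avoid power lines during autonomous flight."
prior_art = [
    "An unmanned aerial vehicle equipped with ultrasonic obstacle detection for navigation.",
    "A method for brewing coffee using a pressurized water chamber.",
    "A system for routing delivery robots around overhead electrical cables.",
]

# Encode the idea and the corpus, then rank prior art by cosine similarity.
idea_emb = model.encode(idea, convert_to_tensor=True)
corpus_emb = model.encode(prior_art, convert_to_tensor=True)
scores = util.cos_sim(idea_emb, corpus_emb)[0]

for passage, score in sorted(zip(prior_art, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```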
SmolLM - blazingly fast and remarkably powerful
TL;DR
This blog post introduces SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage.

Introduction
There is increasing interest in small language models that can operate on local devices. This trend involves techniques such as distillation or quantization to compress large models, as well as training small models from scratch on large datasets. These approaches enable novel applications while dramatically reducing inference costs and improving user privacy.

Microsoft's Phi series, Alibaba's Qwen2 (under 2B parameters), and Meta's MobileLLM demonstrate that small models can achieve impressive results when designed and trained thoughtfully. However, most of the details about the data curation and training of these models are not publicly available.
https://huggingface.co/blog/smollm
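As a quick taste of the family, here is a minimal sketch of running one of the SmolLM instruct checkpoints with transformers, assuming the HuggingFaceTB/SmolLM-135M-Instruct repo id; the 360M and 1.7B variants follow the same pattern.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-135M-Instruct"  # 360M and 1.7B variants exist too
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Small instruct models still follow the usual chat-template flow.
messages = [{"role": "user", "content": "Explain gravity to a 10-year-old in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=80, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```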
Powerful ASR + diarization + speculative decoding with Hugging Face
Whisper is one of the best open-source speech recognition models and certainly the most widely used. Hugging Face Inference Endpoints make it very easy to deploy any Whisper model out of the box. However, if you'd like to add extra features, such as a diarization pipeline to identify speakers or assisted generation for speculative decoding, things get trickier: you need to combine Whisper with additional models while still exposing a single API endpoint.
https://huggingface.co/blog/asr-diarization
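For reference, here is a rough sketch of the transcription side of such a setup: a Whisper pipeline with a smaller assistant model for speculative decoding. The model ids and generate kwargs follow the usual distil-whisper recipe and are assumptions, not the exact Inference Endpoints handler described in the post.

```python
import torch
from transformers import AutoModelForCausalLM, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# A smaller draft model proposes tokens that the full Whisper model then verifies.
assistant = AutoModelForCausalLM.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=dtype
).to(device)

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant},
)

print(asr("sample.wav")["text"])
# A diarization model (e.g. pyannote.audio) would run separately, and its speaker
# segments would then be merged with these transcripts behind the single endpoint.
```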
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Model card for image captioning pretrained on COCO dataset - base architecture (with ViT large backbone).

[Figure: BLIP demo GIF from the official BLIP repository]
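A minimal captioning sketch with this model via transformers, assuming the Salesforce/blip-image-captioning-large checkpoint (the one matching the ViT-large backbone mentioned above); the image URL is just a common example.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Unconditional captioning: the model generates a description from the image alone.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```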
Open-source Claude Artifacts – built with Llama 3.1 405B
Llama Coder
An open-source take on Claude Artifacts – generate small apps with one prompt. Powered by Llama 3.1 405B and Together AI.

Tech stack
Llama 3.1 405B from Meta for the LLM
Together AI for LLM inference
Sandpack for the code sandbox
Next.js app router with Tailwind
Helicone for observability
Plausible for website analytics
https://github.com/nutlope/llamacoder
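The app itself is a Next.js/TypeScript project, but the core idea (one prompt in, a small app out) boils down to a single chat completion call. A hedged Python sketch using Together's SDK, with an assumed model id, might look like this:

```python
# Not the app's actual code path: just the core "one prompt -> small app" call in Python.
# The model id follows Together's Llama 3.1 405B naming and may need adjusting.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are an expert frontend engineer. Return a single self-contained React component."},
        {"role": "user", "content": "Build a small pomodoro timer app."},
    ],
)
print(resp.choices[0].message.content)  # code that a sandbox like Sandpack could render
```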
App is running again after a quick restart.

sourceoftruthdata/sot_autotrain_dreambooth_v1.1
This is the new SOTA image generation model that has been super popular this week, following AuraFlow! 😮
🌟Quick summary:
1. flux-schnell is a 4-step model and can be used commercially. I think it runs in 16 GB of VRAM (corrections welcome), but is the quality gain really that big (roughly on par with SD3 Medium?)
2. flux-dev is a 20-step model for personal, non-commercial use, with very good quality: no need for tag-style prompts, it understands natural language, and the output is close to Midjourney (about 90%?). The FP16 model needs 24 GB of VRAM; the FP8 model theoretically needs 12 GB, but I haven't tested actual usage or quality, so feedback is welcome.
3. flux-pro is not open source and is available only via API.

black-forest-labs – Replicate

🧐black-forest-labs offers several advanced image generation models with top-tier prompt following, visual quality, image detail, and output diversity.

➡️Release link: https://replicate.com/black-forest-labs
➡️comfyui: https://comfyanonymous.github.io/ComfyUI_examples/flux/

Highlights
🌟 flux-pro: Cutting-edge image generation model with excellent prompt following and image quality.
⚡️ flux-schnell: The fastest image generation model designed for local development and personal use, with 74.6K runs.
🤖 flux-dev: A 12 billion parameter rectified flow transformer that generates images from text descriptions, with 23.5K runs.
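For local experimentation, a hedged sketch of running the open-weight FLUX.1 [schnell] checkpoint with diffusers could look like the following; memory-saving options and exact settings depend on your GPU, and the dev variant would use more steps plus a nonzero guidance scale.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit the model on smaller GPUs

image = pipe(
    "a cozy cabin in a snowy forest at dusk, cinematic lighting",
    num_inference_steps=4,   # schnell is distilled for ~4 steps
    guidance_scale=0.0,      # schnell does not use classifier-free guidance
    max_sequence_length=256,
).images[0]
image.save("flux_schnell.png")
```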
XLabs-AI/flux-controlnet-canny · Hugging Face
This repository provides a trained ControlNet Canny checkpoint for the FLUX.1-dev model by Black Forest Labs.

Training details
The XLabs AI team is happy to publish fine-tuning scripts for Flux, including:

LoRA 🔥
ControlNet 🔥
See our GitHub for the training script and configs.

Training dataset
The training dataset consists of images paired with JSON files containing their text prompts.
https://huggingface.co/XLabs-AI/flux-controlnet-canny
A Flux-based ControlNet model has been released!

XLabs-AI/flux-controlnet-canny · Hugging Face
🧐 This is a ControlNet Canny training checkpoint for the FLUX.1-dev model, supporting image generation tasks.
➡️Link: https://huggingface.co/XLabs-AI/flux-controlnet-canny
Key points
📦 This model is based on Black Forest Labs' FLUX.1-dev, and fine-tuning scripts for LoRA and ControlNet are provided.
📝 The training dataset contains images and corresponding JSON files with text prompts.
🖥 Inference usage examples include command line operations to support image generation based on prompts.
🔒 The model is distributed using the FLUX.1 [dev] non-commercial license.
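The officially documented inference path is the command-line flow in the XLabs GitHub repo. Purely as an illustration of the general diffusers ControlNet pattern for FLUX, a sketch might look like the following; the diffusers-format checkpoint id is an assumption, so check the XLabs organization on the Hub for the exact name.

```python
# Not the official XLabs CLI flow: a sketch of the generic diffusers ControlNet pattern
# for FLUX. The diffusers-format repo id below is an assumption.
import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.utils import load_image

controlnet = FluxControlNetModel.from_pretrained(
    "XLabs-AI/flux-controlnet-canny-diffusers", torch_dtype=torch.bfloat16  # assumed id
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

canny_image = load_image("canny_edges.png")  # a precomputed Canny edge map
image = pipe(
    "a futuristic sports car, studio lighting",
    control_image=canny_image,
    controlnet_conditioning_scale=0.7,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_canny.png")
```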
Figure 02 is the second-generation humanoid robot from the AI robotics startup Figure. Below is an overview of the robot:
Characteristics and capabilities of Figure 02:
Hardware:
The exterior uses an exoskeleton structure, with power and compute cabling routed inside the body to improve reliability and packaging compactness.
It is equipped with fourth-generation hands that have 16 degrees of freedom and human-comparable strength, can carry up to 25 kilograms, and can flexibly perform a variety of human-like tasks.
It has 6 RGB cameras (on the head, chest, and back), giving it "superhuman" vision.
The internal battery pack capacity has increased to 2.25 kWh. The founder hopes it can eventually work effectively for more than 20 hours per day (the official site currently lists only 5 hours of battery life, so the 20 hours is presumably an inferred limit combining charging and working).
It is motor-driven, stands 5 feet 6 inches tall, and weighs 70 kilograms.
Software and intelligence:
It runs an on-board vision-language model (VLM) for fast, common-sense visual reasoning.
On-board compute and AI inference capability have tripled compared with the previous generation, allowing many real-world AI tasks to be carried out fully autonomously.
It uses a custom speech-to-speech reasoning model from the company's investor OpenAI; the default interface is speech, and it communicates with humans through on-board microphones and speakers. #AI
Radxa Launches New Single-Board Computers Featuring Rockchip RK3588S2 and RK3582 Chips, Starting at $30
Radxa has announced the launch of its latest single-board computers (SBCs), the Radxa ROCK 5C and the Radxa ROCK 5C Lite. These credit card-sized devices are designed to cater to various computing needs, with prices starting at just $30 for the Lite version and $50 for the standard ROCK 5C. Both models are currently available for pre-order and are set to begin shipping on April 10, 2024.
Better Alignment with Instruction Back-and-Forth Translation

Abstract
We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al. (2023a), and rewrite the responses to improve their quality further based on the initial documents. Fine-tuning with the resulting (backtranslated instruction, rewritten response) pairs yields higher win rates on AlpacaEval than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct. We also demonstrate that rewriting the responses with an LLM outperforms direct distillation, and the two generated text distributions exhibit significant distinction in embedding space.

https://arxiv.org/abs/2408.04614
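A schematic sketch of the two-step recipe, with invented placeholder prompts (not the paper's) and any instruction-tuned chat model standing in for the LLM:

```python
# Step 1 backtranslates a web document into a plausible instruction; step 2 rewrites
# a response grounded in that document. Prompts below are illustrative placeholders.
from transformers import pipeline

chat = pipeline("text-generation", model="HuggingFaceTB/SmolLM-1.7B-Instruct", max_new_tokens=256)

document = "Sourdough starters are maintained by regular feedings of flour and water..."

# Step 1: backtranslation, inferring an instruction this document could answer.
step1 = [{"role": "user", "content": f"Write one user instruction that the following passage would answer:\n\n{document}"}]
instruction = chat(step1)[0]["generated_text"][-1]["content"]

# Step 2: response rewriting, answering the instruction using the document as grounding.
step2 = [{"role": "user", "content": f"Instruction: {instruction}\n\nUsing only this source, write a high-quality response:\n\n{document}"}]
response = chat(step2)[0]["generated_text"][-1]["content"]

print((instruction, response))  # one (backtranslated instruction, rewritten response) pair
```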
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
https://arxiv.org/abs/2408.04594

Abstract
High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components.
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Abstract
3D Content Generation is at the heart of many computer graphics applications, including video gaming, film-making, virtual and augmented reality, etc. This paper proposes a novel deep-learning-based approach for automatically generating interactive and playable 3D game scenes, all from the user's casual prompts such as a hand-drawn sketch. Sketch-based input offers a natural and convenient way to convey the user's design intention in the content creation process. To circumvent the data-deficient challenge in learning (i.e., the lack of large training data of 3D scenes), our method leverages a pre-trained 2D denoising diffusion model to generate a 2D image of the scene as the conceptual guidance. In this process, we adopt the isometric projection mode to factor out unknown camera poses while obtaining the scene layout. From the generated isometric image, we use a pre-trained image understanding method to segment the image into meaningful parts, such as off-ground objects, trees, and buildings, and extract the 2D scene layout. These segments and layouts are subsequently fed into a procedural content generation (PCG) engine, such as a 3D video game engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can be seamlessly integrated into a game development environment and is readily playable. Extensive tests demonstrate that our method can efficiently generate high-quality and interactive 3D game scenes with layouts that closely follow the user's intention.
https://arxiv.org/abs/2408.04567
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focusing on a single domain, and lacking varying perceptual granularities.
Transformer Explainer: Interactive Learning of Text-Generative Models
Abstract
Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present Transformer Explainer, an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and enabling smooth transitions across abstraction levels of mathematical operations and model structures. It runs a live GPT-2 instance locally in the user's browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. Our tool requires no installation or special hardware, broadening the public's education access to modern generative AI techniques. Our open-sourced tool is available at https://poloclub.github.io/transformer-explainer/. A video demo is available at https://youtu.be/ECR4oAwocjs.
https://arxiv.org/abs/2408.04619
Better Alignment with Instruction Back-and-Forth Translation
Authors: Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li
https://arxiv.org/abs/2408.04614
OpenAI (@OpenAI) posted:
We’re rolling out the ability for ChatGPT Free users to create up to two images per day with DALL·E 3.

Just ask ChatGPT to create an image for a slide deck, personalize a card for a friend, or show you what something looks like.
Whisper Medusa
Whisper is an advanced encoder-decoder model for speech transcription and translation, processing audio through encoding and decoding stages. Given its large size and slow inference speed, various optimization strategies like Faster-Whisper and Speculative Decoding have been proposed to enhance performance. Our Medusa model builds on Whisper by predicting multiple tokens per iteration, which significantly improves speed with small degradation in WER. We train and evaluate our model on the LibriSpeech dataset, demonstrating speed improvements.

Training Details
aiola/whisper-medusa-v1 was trained on the LibriSpeech dataset to perform audio translation. The Medusa heads were optimized for English, so for optimal performance and speed improvements, please use English audio only.

Usage
To use whisper-medusa-v1, install the whisper-medusa repo following the README instructions.
https://huggingface.co/aiola/whisper-medusa-v1
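A rough usage sketch following the pattern in the whisper-medusa README; the WhisperMedusaModel class and generate arguments are taken from that repo and may differ between versions, so defer to the README if anything drifts.

```python
# Sketch only: class and argument names come from the whisper-medusa repo and may change.
import torch
import torchaudio
from transformers import WhisperProcessor
from whisper_medusa import WhisperMedusaModel  # installed from the whisper-medusa repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model = WhisperMedusaModel.from_pretrained("aiola/whisper-medusa-v1").to(device)
processor = WhisperProcessor.from_pretrained("aiola/whisper-medusa-v1")

# English-only audio, resampled to Whisper's expected 16 kHz.
speech, sr = torchaudio.load("sample_en.wav")
if sr != 16000:
    speech = torchaudio.transforms.Resample(sr, 16000)(speech)

features = processor(speech.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_features
ids = model.generate(features.to(device), language="en")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```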