Finetune Stable Diffusion Models with DDPO via TRL
Introduction
Diffusion models (e.g., DALL-E 2, Stable Diffusion) are a class of generative models that are widely successful at generating images, most notably photorealistic ones. However, the images generated by these models may not always match human preference or intent. This is the alignment problem: how does one make sure that a model's outputs are aligned with human preferences such as “quality”, or with intent that is hard to express via prompts? This is where Reinforcement Learning comes into the picture.

In the world of Large Language Models (LLMs), Reinforcement Learning (RL) has proven to be a very effective tool for aligning models with human preferences. It’s one of the main recipes behind the superior performance of systems like ChatGPT. More precisely, RL is the critical ingredient of Reinforcement Learning from Human Feedback (RLHF), which makes ChatGPT chat like a human being.
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU
We are excited to officially release the integration of trl with peft to make Large Language Model (LLM) fine-tuning with Reinforcement Learning more accessible to anyone! In this post, we explain why this is a competitive alternative to existing fine-tuning approaches.

Note that peft is a general tool that can be applied to many ML use cases, but it’s particularly interesting for RLHF, as this method is especially memory-hungry!
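As a rough illustration of why memory is the bottleneck, the back-of-the-envelope sketch below estimates the memory needed just to hold the weights of a 20B-parameter model at different precisions. The bytes-per-parameter values are standard, but the function name is hypothetical and the estimate deliberately ignores activations, gradients, and optimizer state, so treat it as a lower bound rather than a measurement.

```python
# Rough weight-memory estimate. Assumption: only the weights are counted;
# activations, gradients and optimizer state are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the memory needed to store the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(20_000_000_000, dtype)
        print(f"20B parameters in {dtype}: {gb:.0f} GB")
```

Under these assumptions, only the 8-bit weights fit on a 24GB card, and the remaining headroom is small, which is why training a small number of extra parameters instead of the full model is attractive here.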

If you want to dive straight into the code, check out the example scripts on the documentation page of TRL.
How to Install and Use the Hugging Face Unity API
The Hugging Face Unity API is an easy-to-use integration of the Hugging Face Inference API, allowing developers to access and use Hugging Face AI models in their Unity projects. In this blog post, we'll walk through the steps to install and use the Hugging Face Unity API.

Installation
Open your Unity project
Go to Window -> Package Manager
Click + and select Add Package from git URL
Enter https://github.com/huggingface/unity-api.git
Once installed, the Unity API wizard should pop up. If not, go to Window -> Hugging Face API Wizard


Enter your API key. Your API key can be created in your Hugging Face account settings.
Test the API key by clicking Test API key in the API Wizard.
Optionally, change the model endpoints to change which model to use. The model endpoint for any model that supports the inference API can be found by going to the model on the Hugging Face website, clicking Deploy -> Inference API, and copying the url from the API_URL field.
Configure advanced settings if desired. For up-to-date information, visit the project repository at https://github.com/huggingface/unity-api
To see examples of how to use the API, click Install Examples. You can now close the API Wizard.
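To make the notion of a "model endpoint" more concrete, here is a small Python sketch (outside Unity) of what a request to such an endpoint might look like. The URL pattern, the model id, the API key, and the `build_request` helper are all assumptions for illustration; the authoritative value is whatever the API_URL field shows for your model. The request is only built, never sent.

```python
import json
import urllib.request

# Assumption: hosted endpoints follow this URL pattern. Always copy the real
# value from the API_URL field of the model you want to use.
API_BASE = "https://api-inference.huggingface.co/models"


def build_request(model_id: str, api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a model that takes text input."""
    url = f"{API_BASE}/{model_id}"
    payload = json.dumps({"inputs": text}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",  # the API key from your account settings
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


# Hypothetical model id and placeholder key, for illustration only.
req = build_request("some-user/some-model", "hf_xxx", "Hello")
```

Changing the model endpoint in the API Wizard amounts to swapping the URL, while the key and the payload format stay the same for models of the same task.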
AI Speech Recognition in Unity
Open Source AI Game Jam

Introduction
This tutorial guides you through the process of implementing state-of-the-art Speech Recognition in your Unity game using the Hugging Face Unity API. This feature can be used for giving commands, speaking to an NPC, improving accessibility, or any other functionality where converting spoken words to text may be useful.

To try Speech Recognition in Unity for yourself, check out the live demo on itch.io.

Prerequisites
This tutorial assumes basic knowledge of Unity. It also requires you to have installed the Hugging Face Unity API. For instructions on setting up the API, check out our earlier blog post.
How to host a Unity game in a Space
Did you know you can host a Unity game in a Hugging Face Space? No? Well, you can!

Hugging Face Spaces are an easy way to build, host, and share demos. While they are typically used for Machine Learning demos, they can also host playable Unity games. Here are some examples:

Huggy
Farming Game
Unity API Demo
Here's how you can host your own Unity game in a Space.

Step 1: Create a Space using the Static HTML template
First, navigate to Hugging Face Spaces to create a space.



Select the "Static HTML" template, give your Space a name, and create it.



Step 2: Use Git to Clone the Space
Clone your newly created Space to your local machine using Git. You can do this by running the following command in your terminal or command prompt:

git clone https://huggingface.co/spaces/{your-username}/{your-space-name}
Step 3: Open your Unity Project
Open the Unity project you want to host in your Space.



Step 4: Switch the Build Target to WebGL
Navigate to File > Build Settings and switch the Build Target to WebGL.



Step 5: Open Player Settings
In the Build Settings window, click the "Player Settings" button to open the Player Settings panel.



Step 6: Optionally, Download the Hugging Face Unity WebGL Template
You can enhance your game's appearance in a Space by downloading the Hugging Face Unity WebGL template, available here. Just download the repository and drop it in your project files.

Then, in the Player Settings panel, switch the WebGL template to Hugging Face. To do so, in Player Settings, click "Resolution and Presentation", then select the Hugging Face WebGL template.



Step 7: Change the Compression Format to Disabled
In the Player Settings panel, navigate to the "Publishing Settings" section and change the Compression Format to "Disabled".



Step 8: Build your Project
Return to the Build Settings window and click the "Build" button. Choose a location to save your build files, and Unity will build the project for WebGL.



Step 9: Copy the Contents of the Build Folder
After the build process is finished, navigate to the folder containing your build files. Copy the files in the build folder to the repository you cloned in Step 2.



Step 10: Enable Git-LFS for Large File Storage
Navigate to your repository. Use the following commands to track large build files.

git lfs install
git lfs track "Build/*"
Step 11: Push your Changes
Finally, use the following Git commands to push your changes:

git add .
git commit -m "Add Unity WebGL build files"
git push
Done!
Make LLM Fine-tuning 2x faster with Unsloth and 🤗 TRL
Pulling your hair out because LLM fine-tuning is taking forever? In this post, we introduce a lightweight tool developed by the community to make LLM fine-tuning go super fast!

Before diving into Unsloth, it may be helpful to read our QLoRA blog post, or be familiar with LLM fine-tuning using the 🤗 PEFT library.

Unsloth - 2x faster, -40% memory usage, 0% accuracy degradation
Unsloth is a lightweight library for faster LLM fine-tuning which is fully compatible with the Hugging Face ecosystem (Hub, transformers, PEFT, TRL). The library is actively developed by the Unsloth team (Daniel and Michael) and the open source community. The library supports most NVIDIA GPUs, from GTX 1070 all the way up to H100s, and can be used with the entire trainer suite from the TRL library (SFTTrainer, DPOTrainer, PPOTrainer). At the time of writing, Unsloth supports the Llama (CodeLlama, Yi, etc.) and Mistral architectures.

Unsloth works by overwriting some parts of the modeling code with optimized operations. By manually deriving backpropagation steps and rewriting all PyTorch modules into Triton kernels, Unsloth can both reduce memory usage and make fine-tuning faster. Crucially, accuracy degradation is 0% with respect to normal QLoRA, because no approximations are made in the optimized code.
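To give a feel for what "manually deriving backpropagation steps" without approximation means, here is a toy sketch in plain Python. It is not Unsloth's code: the scalar LoRA-style layer, the function names, and the default values are all hypothetical. It derives the gradients by hand with the chain rule and checks them against finite differences.

```python
def loss(a: float, b: float, w: float = 0.5, x: float = 2.0, t: float = 1.0) -> float:
    """Squared error of a scalar LoRA-style layer: y = (w + b * a) * x."""
    y = (w + b * a) * x
    return 0.5 * (y - t) ** 2


def manual_grads(a: float, b: float, w: float = 0.5, x: float = 2.0, t: float = 1.0):
    """Hand-derived gradients of `loss` with respect to the adapter weights a and b."""
    err = (w + b * a) * x - t        # dL/dy
    return err * x * b, err * x * a  # chain rule: (dL/da, dL/db)


def numeric_grads(a: float, b: float, eps: float = 1e-6):
    """Central finite differences, used only to check the hand-derived result."""
    da = (loss(a + eps, b) - loss(a - eps, b)) / (2 * eps)
    db = (loss(a, b + eps) - loss(a, b - eps)) / (2 * eps)
    return da, db


if __name__ == "__main__":
    print("manual :", manual_grads(0.3, -0.7))
    print("numeric:", numeric_grads(0.3, -0.7))
```

The hand-derived gradient is exact, so replacing an automatically differentiated operation with it changes speed and memory use but not the result, which is the property the paragraph above relies on.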
AI Policy @🤗: Comments on U.S. National AI Research Resource Interim Report
In late June 2022, Hugging Face submitted a response to the White House Office of Science and Technology Policy and National Science Foundation’s Request for Information on a roadmap for implementing the National Artificial Intelligence Research Resource (NAIRR) Task Force’s interim report findings. As a platform working to democratize machine learning by empowering all backgrounds to contribute to AI, we strongly support NAIRR’s efforts.
Using Machine Learning to Aid Survivors and Race through Time
On February 6, 2023, earthquakes measuring 7.7 and 7.6 hit South Eastern Turkey, affecting 10 cities and resulting in more than 42,000 deaths and 120,000 injured as of February 21.

A few hours after the earthquake, a group of programmers started a Discord server to roll out an application called afetharita, literally meaning "disaster map". This application would help search & rescue teams and volunteers find survivors and bring them help. The need for such an app arose when survivors posted screenshots of texts with their addresses and what they needed (including rescue) on social media. Some survivors also tweeted what they needed so their relatives knew they were alive and that they needed rescue. Needing to extract information from these tweets, we developed various applications to turn them into structured data and raced against time in developing and deploying these apps.
Deep Dive: Vision Transformers On Hugging Face Optimum Graphcore
This blog post will show how easy it is to fine-tune pre-trained Transformer models for your dataset using the Hugging Face Optimum library on Graphcore Intelligence Processing Units (IPUs). As an example, we will show a step-by-step guide and provide a notebook that takes a large, widely-used chest X-ray dataset and trains a vision transformer (ViT) model.

Introducing vision transformer (ViT) models
In 2017 a group of Google AI researchers published a paper introducing the transformer model architecture. Characterised by a novel self-attention mechanism, transformers were proposed as a new and efficient group of models for language applications. Indeed, in the last five years, transformers have seen explosive popularity and are now accepted as the de facto standard for natural language processing (NLP).
A Dive into Vision-Language Models
Human learning is inherently multi-modal as jointly leveraging multiple senses helps us understand and analyze new information better. Unsurprisingly, recent advances in multi-modal learning take inspiration from the effectiveness of this process to create models that can process and link information using various modalities such as image, video, text, audio, body gestures, facial expressions, and physiological signals.

Since 2021, we’ve seen an increased interest in models that combine vision and language modalities (also called joint vision-language models), such as OpenAI’s CLIP. Joint vision-language models have shown particularly impressive capabilities in very challenging tasks such as image captioning, text-guided image generation and manipulation, and visual question-answering. This field continues to evolve, and so does its effectiveness in improving zero-shot generalization leading to various practical use cases.
Kakao Brain’s Open Source ViT, ALIGN, and the New COYO Text-Image Dataset
Kakao Brain and Hugging Face are excited to release a new open-source image-text dataset, COYO, of 700 million pairs and two new visual language models trained on it, ViT and ALIGN. This is the first time ever the ALIGN model is made public for free and open-source use and the first release of ViT and ALIGN models that come with the training dataset.

Kakao Brain’s ViT and ALIGN models follow the same architecture and hyperparameters as provided in the original respective Google models but are trained on the open source COYO dataset. Google’s ViT and ALIGN models, while trained on huge datasets (ViT trained on 300 million images and ALIGN trained on 1.8 billion image-text pairs respectively), cannot be replicated because the datasets are not public. This contribution is particularly valuable to researchers who want to reproduce visual language modeling with access to the data as well.
What is a Vision Language Model?
Vision language models are broadly defined as multimodal models that can learn from images and text. They are a type of generative model that takes image and text inputs and generates text outputs. Large vision language models have good zero-shot capabilities, generalize well, and can work with many types of images, including documents, web pages, and more. The use cases include chatting about images, image recognition via instructions, visual question answering, document understanding, image captioning, and others. Some vision language models can also capture spatial properties in an image. These models can output bounding boxes or segmentation masks when prompted to detect or segment a particular subject, or they can localize different entities or answer questions about their relative or absolute positions. There’s a lot of diversity within the existing set of large vision language models, the data they were trained on, how they encode images, and, thus, their capabilities.
VQ-Diffusion
Vector Quantized Diffusion (VQ-Diffusion) is a conditional latent diffusion model developed by the University of Science and Technology of China and Microsoft. Unlike most commonly studied diffusion models, VQ-Diffusion's noising and denoising processes operate on a quantized latent space, i.e., the latent space is composed of a discrete set of vectors. Discrete diffusion models are less explored than their continuous counterparts and offer an interesting point of comparison with autoregressive (AR) models.
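To make "a discrete set of vectors" concrete, the sketch below shows plain nearest-neighbour vector quantization in Python. It is only an illustration of the general idea, not VQ-Diffusion's implementation: the tiny two-dimensional codebook, the latents, and the `quantize` helper are hypothetical.

```python
import math


def quantize(latent, codebook):
    """Return (index, vector) of the codebook entry closest to `latent`."""
    index = min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))
    return index, codebook[index]


# Hypothetical 4-entry codebook; real codebooks are learned and much larger.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, 0.2), (0.9, 0.8), (0.7, 0.1)]

# Each continuous latent becomes an integer token: its nearest codebook index.
tokens = [quantize(z, codebook)[0] for z in latents]
```

Once every latent is an integer index like this, noising and denoising can be defined over a finite vocabulary of tokens rather than over continuous values.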

Hugging Face model card
Hugging Face Spaces
Original Implementation
Paper
Demo
🧨 Diffusers lets you run VQ-Diffusion with just a few lines of code.
Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models
Transformer-based encoder-decoder models were proposed in Vaswani et al. (2017) and have recently experienced a surge of interest, e.g. Lewis et al. (2019), Raffel et al. (2019), Zhang et al. (2020), Zaheer et al. (2020), Yan et al. (2020).

Similar to BERT and GPT2, massive pre-trained encoder-decoder models have been shown to significantly boost performance on a variety of sequence-to-sequence tasks Lewis et al. (2019), Raffel et al. (2019). However, due to the enormous computational cost attached to pre-training encoder-decoder models, the development of such models is mainly limited to large companies and institutes.

In Leveraging Pre-trained Checkpoints for Sequence Generation Tasks (2020), Sascha Rothe, Shashi Narayan and Aliaksei Severyn initialize encoder-decoder models with pre-trained encoder and/or decoder-only checkpoints (e.g. BERT, GPT2) to skip the costly pre-training. The authors show that such warm-started encoder-decoder models yield competitive results to large pre-trained encoder-decoder models, such as T5 and Pegasus, on multiple sequence-to-sequence tasks at a fraction of the training cost.
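The idea of warm-starting can be sketched without any framework: copy every weight that exists in the pre-trained checkpoint and randomly initialise whatever is missing. The toy Python example below is an assumption-laden illustration, not the EncoderDecoderModel implementation. The layer names, sizes, and the `warm_start` helper are hypothetical; it assumes an encoder-only checkpoint with no cross-attention weights, which the decoder therefore has to initialise from scratch.

```python
import random


def warm_start(pretrained: dict, target_shapes: dict, seed: int = 0):
    """Copy weights whose names match a pre-trained checkpoint and randomly
    initialise the rest. Returns (weights, names_of_randomly_initialised)."""
    rng = random.Random(seed)
    weights, fresh = {}, []
    for name, size in target_shapes.items():
        if name in pretrained and len(pretrained[name]) == size:
            weights[name] = list(pretrained[name])  # reuse pre-trained values
        else:
            weights[name] = [rng.gauss(0.0, 0.02) for _ in range(size)]
            fresh.append(name)
    return weights, fresh


# Hypothetical encoder-only checkpoint (BERT-like): no cross-attention weights.
checkpoint = {
    "self_attention.weight": [0.1, 0.2, 0.3],
    "feed_forward.weight": [0.4, 0.5],
}
# Hypothetical decoder layout: the same layers plus cross-attention.
decoder_shapes = {
    "self_attention.weight": 3,
    "cross_attention.weight": 3,
    "feed_forward.weight": 2,
}

decoder, fresh = warm_start(checkpoint, decoder_shapes)
```

In this sketch only `cross_attention.weight` ends up in `fresh`, which mirrors why a warm-started model still needs fine-tuning on the downstream task before it is useful.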

In this notebook, we will explain in detail how encoder-decoder models can be warm-started, give practical tips based on Rothe et al. (2020), and finally go over a complete code example showing how to warm-start encoder-decoder models with 🤗Transformers.

This notebook is divided into 4 parts:

Introduction - Short summary of pre-trained language models in NLP and the need for warm-starting encoder-decoder models.
Warm-starting encoder-decoder models (Theory) - Illustrative explanation of how encoder-decoder models are warm-started.
Warm-starting encoder-decoder models (Analysis) - Summary of Leveraging Pre-trained Checkpoints for Sequence Generation Tasks (2020) - Which model combinations are effective for warm-starting encoder-decoder models, and how does this differ from task to task?
Warm-starting encoder-decoder models with 🤗Transformers (Practice) - Complete code example showing in detail how to use the EncoderDecoderModel framework to warm-start transformer-based encoder-decoder models.
It is highly recommended (probably even necessary) to have read this blog post about transformer-based encoder-decoder models.

Let's start by giving some background on warm-starting encoder-decoder models.
AI Watermarking 101: Tools and Techniques
In recent months, we've seen multiple news stories involving ‘deepfakes’, or AI-generated content: from images of Taylor Swift to videos of Tom Hanks and recordings of US President Joe Biden. Whether they are selling products, manipulating images of people without their consent, supporting phishing for private information, or creating misinformation materials intended to mislead voters, deepfakes are increasingly being shared on social media platforms. This enables them to be quickly propagated and have a wider reach and therefore, the potential to cause long-lasting damage.

In this blog post, we will describe approaches to carry out watermarking of AI-generated content, discuss their pros and cons, and present some of the tools available on the Hugging Face Hub for adding/detecting watermarks.
Boosting Wav2Vec2 with n-grams in 🤗 Transformers
Wav2Vec2 is a popular pre-trained model for speech recognition. Released in September 2020 by Meta AI Research, the novel architecture catalyzed progress in self-supervised pretraining for speech recognition, e.g. G. Ng et al., 2021, Chen et al., 2021, Hsu et al., 2021 and Babu et al., 2021. On the Hugging Face Hub, Wav2Vec2's most popular pre-trained checkpoint currently amounts to over 250,000 monthly downloads.

Using Connectionist Temporal Classification (CTC), pre-trained Wav2Vec2-like checkpoints are extremely easy to fine-tune on downstream speech recognition tasks. In a nutshell, fine-tuning pre-trained Wav2Vec2 checkpoints works as follows:

A single randomly initialized linear layer is stacked on top of the pre-trained checkpoint and trained to classify raw audio input to a sequence of letters. It does so by:

extracting audio representations from the raw audio (using CNN layers),
processing the sequence of audio representations with a stack of transformer layers, and,
classifying the processed audio representations into a sequence of output letters.
Previously, audio classification models required an additional language model (LM) and a dictionary to transform the sequence of classified audio frames to a coherent transcription. Wav2Vec2's architecture is based on transformer layers, thus giving each processed audio representation context from all other audio representations. In addition, Wav2Vec2 leverages the CTC algorithm for fine-tuning, which solves the problem of alignment between a varying "input audio length"-to-"output text length" ratio.
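The way CTC bridges the gap between many audio frames and few output letters can be illustrated with greedy CTC decoding: take the most likely label per frame, collapse consecutive repeats, then drop the blank token. The sketch below assumes the per-frame labels are already given and uses "_" as a stand-in for the blank token; both are simplifications for illustration.

```python
from itertools import groupby

BLANK = "_"  # assumption: "_" stands in for the CTC blank token


def ctc_greedy_decode(frame_labels):
    """Collapse repeated frame labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != BLANK)


# 13 audio frames decode to a 5-letter word.
frames = list("hh_e_ll_ll_oo")
text = ctc_greedy_decode(frames)
```

Note the blank between the two runs of "l": without it the repeats would be collapsed into a single letter, which is why the blank token is needed to represent double letters.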
From screenshots to HTML code: Introducing the WebSight dataset
In the world of web development, turning designs into functional websites usually involves a lot of coding and careful testing. What if we could simplify this process, making it possible to convert web designs into working websites more easily and quickly? WebSight is a new dataset that aims at building AI systems capable of transforming screenshots to HTML code.

The challenge
Turning a website design or screenshot into HTML code usually needs an experienced developer. But what if this could be more efficient? Motivated by this question, we investigated how vision-language models (VLMs) could be used in web development to create low-code solutions that improve efficiency.

Today, the main challenge towards that goal is the lack of high-quality datasets tailored for this task. WebSight aims to fill that gap.
Speculative Decoding for 2x Faster Whisper Inference
OpenAI's Whisper is a general purpose speech transcription model that achieves state-of-the-art results across a range of different benchmarks and audio conditions. The latest large-v3 model tops the OpenASR Leaderboard, ranking as the best open-source speech transcription model for English. The model also demonstrates strong multilingual performance, achieving less than 30% word error rate (WER) on 42 of the 58 languages tested in the Common Voice 15 dataset.
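For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Below is a minimal Python sketch; the function name and the example sentences are illustrative, and real evaluations usually normalise text (casing, punctuation) first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the example, one substitution ("sat" to "sit") and one deletion ("the") against six reference words give a WER of 2/6, or about 33%.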

While the transcription accuracy is exceptional, the inference time is very slow. A 1 hour audio clip takes upwards of 6 minutes to transcribe on a 16GB T4 GPU, even after leveraging inference optimisations like flash attention, half-precision, and chunking.

In this blog post, we demonstrate how Speculative Decoding can be employed to reduce the inference time of Whisper by a factor of 2, while mathematically ensuring exactly the same outputs are achieved from the model. As a result, this method provides a perfect drop-in replacement for existing Whisper pipelines, since it provides a free 2x speed-up while maintaining the same accuracy. For a more streamlined version of the blog post with fewer explanations but all the code, see the accompanying Google Colab.
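The reason the outputs stay identical can be seen in a toy version of greedy speculative decoding: a cheap draft model proposes a few tokens, the main model checks them, and only tokens the main model would have produced itself are kept. The sketch below is a simplified illustration, not the Whisper implementation. The two next-token functions are hypothetical stand-ins for real models, decoding is greedy only, and one "verification pass" stands for the single forward pass in which a real model would score all proposed positions at once.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Baseline: one main-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(main_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding. Returns (tokens, verification_passes)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < target_len:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The main model checks every proposed position (counted as one pass).
        passes += 1
        for i in range(k):
            expected = main_next(tokens + draft[:i])
            if draft[i] != expected:
                # First disagreement: keep the main model's token, drop the rest.
                draft = draft[:i] + [expected]
                break
        else:
            # All proposals accepted: the same pass also yields one extra token.
            draft.append(main_next(tokens + draft))
        tokens.extend(draft)
    return tokens[:target_len], passes


# Hypothetical toy "models": the draft agrees with the main model except
# when the sequence length is a multiple of 4.
def main_next(seq):
    return (seq[-1] * 3 + len(seq)) % 11


def draft_next(seq):
    return 0 if len(seq) % 4 == 0 else main_next(seq)


baseline = greedy_decode(main_next, [1], 20)
speculative, passes = speculative_decode(main_next, draft_next, [1], 20)
```

Every kept token is one the main model would have chosen, so `speculative` equals `baseline`, while the number of main-model passes is lower than the 20 sequential calls the baseline needs. How much faster this is in practice depends on how often the draft model agrees and how cheap it is to run.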
How NuminaMath Won the 1st AIMO Progress Prize
This year, Numina and Hugging Face collaborated to compete in the 1st Progress Prize of the AI Math Olympiad (AIMO). This competition involved fine-tuning open LLMs to solve difficult math problems that high school students use to train for the International Math Olympiad. We’re excited to share that our model — NuminaMath 7B TIR — was the winner and managed to solve 29 out of 50 problems on the private test set 🥳!
Leveraging Hugging Face for complex generative AI use cases
In this conversation, Jeff Boudier asks Waseem Alshikh, Co-founder and CTO of Writer, about their journey from a Hugging Face user, to a customer and now an open source model contributor.

why was Writer started?
what are the biggest misconceptions in Generative AI today?
why is Writer now contributing open source models?
what has been the value of the Hugging Face Expert Acceleration Program service for Writer?
how is Writer approaching production on CPU and GPU to serve LLMs at scale?
how important is efficiency and using CPUs for production?
<iframe width="100%" style="aspect-ratio: 16 / 9;" src="https://www.youtube-nocookie.com/embed/t8Ek1aOtaQw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
If you’re interested in Hugging Face Expert Acceleration Program for your company, please contact us here - our team will contact you to discuss your requirements!