HF-hub - Share and discover more about AI with social posts from the community.
Public Policy at Hugging Face
AI Policy at Hugging Face is a multidisciplinary and cross-organizational workstream. Instead of being part of a vertical communications or global affairs organization, our policy work is rooted in the expertise of our many researchers and developers, from Ethics and Society Regulars and the legal team to machine learning engineers working on healthcare, art, and evaluations.

What we work on is informed by the needs and experiences of our Hugging Face community on the Hub. We champion responsible openness, investing heavily in ethics-forward research, transparency mechanisms, and platform safeguards, and translating the lessons we learn into policy.

So what have we shared with policymakers?
Pollen-Vision: Unified interface for Zero-Shot vision models in robotics
Note: This is a guest blog post by the Pollen Robotics team. We are the creators of Reachy, an open-source humanoid robot designed for manipulation in the real world.

In the context of autonomous behaviors, the essence of a robot's usability lies in its ability to understand and interact with its environment. This understanding primarily comes from visual perception, which enables robots to identify objects, recognize people, navigate spaces, and much more.

We're excited to share the initial launch of our open-source pollen-vision library, a first step towards empowering our robots with the autonomy to grasp unknown objects. This library is a carefully curated collection of vision models chosen for their direct applicability to robotics. Pollen-vision is designed for ease of installation and use, and is composed of independent modules that can be combined into a 3D object detection pipeline that returns the position of objects in 3D space (x, y, z).

We focused on selecting zero-shot models, eliminating the need for any training, and making these tools instantly usable right out of the box.

Our initial release is focused on 3D object detection—laying the groundwork for tasks like robotic grasping by providing a reliable estimate of objects' spatial coordinates. Currently limited to positioning within a 3D space (not extending to full 6D pose estimation), this functionality establishes a solid foundation for basic robotic manipulation tasks.
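To make the intended workflow concrete, here is a minimal sketch of such a pipeline: a zero-shot detector produces a 2D box for an object named in plain text, and a depth map turns the box center into (x, y, z). The wrapper import is commented out because the class and module names shown are assumptions, not pollen-vision's confirmed API.

```python
# Hypothetical sketch (names are assumptions, not pollen-vision's confirmed API):
# a zero-shot detector finds a 2D box for an object described in plain text,
# and a depth map back-projects the box center into camera coordinates.
import numpy as np

# from pollen_vision.vision_models.object_detection import OwlVitWrapper  # assumed import
# detector = OwlVitWrapper()
# predictions = detector.infer(rgb_image, candidate_labels=["mug"])       # zero-shot: no training needed

def box_center_to_xyz(box, depth_map, fx, fy, cx, cy):
    """Back-project the center of a 2D box (x0, y0, x1, y1) into 3D camera coordinates."""
    u = (box[0] + box[2]) / 2
    v = (box[1] + box[3]) / 2
    z = float(depth_map[int(v), int(u)])   # depth (in meters) at the box center
    x = (u - cx) * z / fx                  # pinhole camera model
    y = (v - cy) * z / fy
    return x, y, z

if __name__ == "__main__":
    fake_depth = np.full((480, 640), 0.6)  # pretend everything is 60 cm away
    print(box_center_to_xyz((300, 200, 340, 260), fake_depth, fx=600, fy=600, cx=320, cy=240))
```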
Porting fairseq wmt19 translation system to transformers
A guest blog post by Stas Bekman
This article is an attempt to document how the fairseq wmt19 translation system was ported to transformers.

I was looking for an interesting project to work on, and Sam Shleifer suggested I work on porting a high-quality translator.

I read the short paper: Facebook FAIR's WMT19 News Translation Task Submission that describes the original system and decided to give it a try.

Initially, I had no idea how to approach this complex project, and Sam helped me break it down into smaller tasks, which was of great help.

I chose to work with the pre-trained en-ru/ru-en models during porting as I speak both languages. It'd have been much more difficult to work with de-en/en-de pairs as I don't speak German, and being able to evaluate the translation quality by just reading and making sense of the outputs at the advanced stages of the porting process saved me a lot of time.

Also, as I did the initial porting with the en-ru/ru-en models, I was totally unaware that the de-en/en-de models used a merged vocabulary, whereas the former used 2 separate vocabularies of different sizes. So once I did the more complicated work of supporting 2 separate vocabularies, it was trivial to get the merged vocabulary to work.
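The result of the port is the FSMT model class in transformers. A minimal usage example of one of the ported checkpoints (this downloads the facebook/wmt19-en-ru weights on first use):

```python
# Translate English to Russian with one of the ported fairseq WMT19 checkpoints.
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "facebook/wmt19-en-ru"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

batch = tokenizer("Machine learning is great, isn't it?", return_tensors="pt")
generated = model.generate(**batch)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```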
Preference Tuning LLMs with Direct Preference Optimization Methods
Addendum

After consulting with the authors of the IPO paper, we discovered that the implementation of IPO in TRL was incorrect; in particular, the loss over the log-likelihoods of the completions needs to be averaged instead of summed. We have added a fix in this PR and re-run the experiments. The results are now consistent with the paper, with IPO on par with DPO and performing better than KTO in the paired preference setting. We have updated the post to reflect these new results.
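To illustrate the fix, here is a toy version of the computation (not TRL's actual code): the per-token log-probabilities of each completion are averaged over the completion length before entering the squared IPO loss.

```python
# Toy illustration of the IPO correction described above (not TRL's implementation).
import torch

def average_logps(token_logps, completion_mask):
    """Average per-token log-probs over completion tokens -- mean, not sum."""
    return (token_logps * completion_mask).sum(-1) / completion_mask.sum(-1)

def ipo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """IPO loss on length-averaged completion log-probs (each input has shape (batch,))."""
    logits = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    # IPO regresses the log-ratio gap towards 1 / (2 * beta) with a squared penalty.
    return ((logits - 1 / (2 * beta)) ** 2).mean()

# Example: batch of 2 preference pairs with random per-token log-probs.
token_logps = torch.randn(2, 10)
mask = torch.ones(2, 10)
avg = average_logps(token_logps, mask)
print(ipo_loss(avg, avg - 0.5, torch.zeros(2), torch.zeros(2)))
```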

TL;DR

We evaluate three promising methods for aligning language models without reinforcement learning (also known as preference tuning) on a number of models and hyperparameter settings. In particular, we train with different hyperparameters and evaluate:

Direct Preference Optimization (DPO)
Identity Preference Optimisation (IPO)
Kahneman-Tversky Optimisation (KTO)
Full post: https://github.com/huggingface/blog/blob/main/pref-tuning.md
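All three objectives are exposed through TRL's DPOTrainer via its loss_type argument. The sketch below is illustrative rather than definitive: argument placement has shifted across TRL releases (newer versions take beta and loss_type via DPOConfig), so check the TRL docs for your version.

```python
# Hedged sketch: choosing the preference objective in TRL's DPOTrainer.
# "sigmoid" -> DPO, "ipo" -> IPO, "kto_pair" -> the paired KTO variant.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model_name = "gpt2"  # small stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# A minimal paired-preference dataset with the expected column names.
preference_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": [" Paris."],
    "rejected": [" London."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=None,            # TRL creates the frozen reference copy when None
    beta=0.1,
    loss_type="ipo",           # swap for "sigmoid" or "kto_pair" to compare methods
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```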
Experimenting with Automatic PII Detection on the Hub using Presidio
At Hugging Face, we've noticed a concerning trend in machine learning (ML) datasets hosted on our Hub: undocumented private information about individuals. This poses some unique challenges for ML practitioners. In this blog post, we'll explore different types of datasets containing a type of private information known as Personally Identifiable Information (PII), the issues they present, and a new feature we're experimenting with on the Dataset Hub to help address these challenges.

Types of Datasets with PII
We noticed two types of datasets that contain PII:

Annotated PII datasets: Datasets like PII-Masking-300k by Ai4Privacy are specifically designed to train PII Detection Models, which are used to detect and mask PII. For example, these models can help with online content moderation or provide anonymized databases.
Pre-training datasets: These are large-scale datasets, often terabytes in size, that are typically obtained through web crawls. While these datasets are generally filtered to remove certain types of PII, small amounts of sensitive information can still slip through the cracks due to the sheer volume of data and the imperfections of PII Detection Models.
Full post: https://github.com/huggingface/blog/blob/main/presidio-pii-detection.md
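For context on what such a scan looks like, here is a minimal example of the underlying Presidio library (the Hub feature itself runs server-side; this only illustrates the kind of detection involved):

```python
# Detect PII spans in free text with Presidio's analyzer.
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
text = "Contact Jane Doe at jane.doe@example.com or +1-212-555-0123."
results = analyzer.analyze(text=text, language="en")
for r in results:
    # Each result carries an entity type, a character span, and a confidence score.
    print(r.entity_type, repr(text[r.start:r.end]), round(r.score, 2))
```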
Pre-Training BERT with Hugging Face Transformers and Habana Gaudi
In this tutorial, you will learn how to pre-train BERT-base from scratch using a Habana Gaudi-based DL1 instance on AWS to take advantage of the cost-performance benefits of Gaudi. We will use the Hugging Face Transformers, Optimum Habana, and Datasets libraries to pre-train a BERT-base model using masked language modeling, one of the two original BERT pre-training tasks. Before we get started, we need to set up the deep learning environment.

View Code
You will learn how to:

Prepare the dataset
Train a Tokenizer
Preprocess the dataset
Pre-train BERT on Habana Gaudi
Note: Steps 1 to 3 can/should be run on a different instance size since those are CPU intensive tasks.
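As a rough sketch of step 4, the snippet below pre-trains a from-scratch BERT with masked-language modeling through Optimum Habana's GaudiTrainer. The tiny in-memory dataset stands in for the output of steps 1 to 3, and the exact argument names may vary between optimum-habana releases.

```python
# Hedged sketch: masked-language-model pre-training on Gaudi with Optimum Habana.
from datasets import Dataset
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in for the tokenizer trained in step 2
model = AutoModelForMaskedLM.from_config(AutoConfig.from_pretrained("bert-base-uncased"))  # from scratch, no weights

# Tiny in-memory corpus standing in for the tokenized dataset produced by steps 1-3.
texts = ["the quick brown fox jumps over the lazy dog"] * 64
train_dataset = Dataset.from_dict(tokenizer(texts, truncation=True, max_length=128))

training_args = GaudiTrainingArguments(
    output_dir="bert-pretrain-gaudi",
    use_habana=True,                               # run on HPU devices
    use_lazy_mode=True,                            # Gaudi lazy execution mode
    gaudi_config_name="Habana/bert-base-uncased",  # Gaudi mixed-precision/op configuration
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```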
Going multimodal: How Prezi is leveraging the Hub and the Expert Support Program to accelerate their ML roadmap
Everybody knows that a great visual is worth a thousand words. The team at Prezi, a visual communications software company, is putting this insight into practice with Prezi presentations, which combine images and text in a highly dynamic format.

Prezi has joined the Hugging Face Expert Support Program to fully leverage modern machine learning's potential. Over the past months, Hugging Face has supported Prezi in integrating smaller, more efficient open-source models into their ML workflows. This cooperation started at a perfect time, as multimodal models are becoming increasingly capable.

We recently sat down with Máté Börcsök, a backend engineer at Prezi, to talk about their experience in the Expert Support Program. In this short video, Máté walks us through some of their machine learning work and shares their experience collaborating with our team via the Expert Support Program.

Watch the video: https://www.youtube.com/embed/pM6D0tRoIbI
If you'd like to accelerate your machine learning roadmap with the help of our experts, as Máté and his team did, visit hf.co/support to learn more about our Expert Support Program and request a quote.
Introducing our new pricing
As you might have noticed, our pricing page has changed a lot recently.

First of all, we are sunsetting the Paid tier of the Inference API service. The Inference API will still be available for everyone to use for free. But if you're looking for fast, enterprise-grade inference as a service, we recommend checking out our brand new solution: Inference Endpoints.

Along with Inference Endpoints, we've recently introduced hardware upgrades for Spaces, which let you run ML demos on the hardware of your choice. No subscription is required to use these services; you only need to add a credit card to your account from your billing settings. You can also attach a payment method to any of your organizations.

Your billing settings centralize everything about our paid services. From there, you can manage your personal PRO subscription, update your payment method, and visualize your usage for the past three months. Usage for all our paid services and subscriptions will be charged at the start of each month, and a consolidated invoice will be available for your records.

TL;DR: At HF we monetize by providing simple access to compute for AI, with services like AutoTrain, Spaces and Inference Endpoints, directly accessible from the Hub. Read more about our pricing and billing system.

If you have any questions, feel free to reach out. We welcome your feedback 🔥
Introducing Prodigy-HF
Prodigy is an annotation tool made by Explosion, a company well known as the creators of spaCy. It's a fully scriptable product with a large community around it. The product has many features, including tight integration with spaCy and active learning capabilities. But the main feature of the product is that it is programmatically customizable with Python.

To foster this customizability, Explosion has started releasing plugins. These plugins integrate with third-party tools in an open way that encourages users to build bespoke annotation workflows. One customization in particular deserves to be celebrated. Last week, Explosion introduced Prodigy-HF, which offers code recipes that directly integrate with the Hugging Face stack. It's been a much-requested feature on the Prodigy support forum, so we're super excited to have it out there.
Putting RL back in RLHF
We are excited to introduce the RLOO (REINFORCE Leave One-Out) Trainer in TRL. As an alternative to PPO, RLOO is a new online RLHF training algorithm designed to be more accessible and easier to implement. In particular, RLOO requires less GPU memory and takes less wall time to converge. As shown in the figures below:

🤑 RLOO uses approximately 50-70% less vRAM than PPO, depending on the model size
🚀 RLOO runs 2x faster than PPO with 1B models and up to 3x faster than PPO with 6.9B models.
🔥 RLOO performs competitively with PPO in terms of the response win rate (judged by GPT-4) and consistently outperforms popular offline methods like DPO.
With RLOO, we bring Reinforcement Learning back into RLHF, enabling the community to explore online RL methods more easily. This is exciting because more and more studies have shown that online RL is more effective than offline methods such as DPO (https://arxiv.org/abs/2402.04792, https://arxiv.org/abs/2405.08448).
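A hedged sketch of what a minimal RLOO run looks like in TRL is shown below, using a small stand-in model; the configuration field names (such as rloo_k, the number of online completions sampled per prompt) follow the TRL documentation at the time and may shift between releases.

```python
# Hedged sketch: online RLHF with TRL's RLOO trainer on a tiny stand-in setup.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import RLOOConfig, RLOOTrainer

model_id = "gpt2"                                  # small stand-in; the post's experiments use 1B-6.9B models
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(model_id)
ref_policy = AutoModelForCausalLM.from_pretrained(model_id)    # frozen reference for the KL penalty
reward_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# RLOO is online: the dataset only needs prompts; completions are sampled during training.
prompts = ["Write a short poem about the sea.", "Explain RLHF in one sentence."]
train_dataset = Dataset.from_dict(tokenizer(prompts))

trainer = RLOOTrainer(
    config=RLOOConfig(output_dir="rloo-demo", rloo_k=2),  # k online samples per prompt for the leave-one-out baseline
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
)
trainer.train()
```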
From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease
General Overview
This tutorial assumes you have a basic understanding of PyTorch and how to train a simple model. It will showcase training on multiple GPUs through a process called Distributed Data Parallelism (DDP) through three different levels of increasing abstraction:

Native PyTorch DDP through the torch.distributed module
Utilizing 🤗 Accelerate's light wrapper around torch.distributed, which also ensures the code can run on a single GPU or on TPUs with zero changes and requires only minimal changes to the original code (a minimal example follows this list)
Utilizing 🤗 Transformers' high-level Trainer API, which abstracts all the boilerplate code and supports various devices and distributed scenarios
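As a concrete reference for the second option, here is a minimal Accelerate training loop on a toy model; launched with accelerate launch, the same script runs unchanged on a single GPU, on multiple GPUs via DDP, or on TPUs.

```python
# Minimal Accelerate loop: the same script scales from one GPU to DDP or TPUs.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                          # reads the launch configuration

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

# prepare() moves everything to the right device(s) and wraps the model for DDP.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)                       # replaces loss.backward()
    optimizer.step()
```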
Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel
In this post we will look at how we can use the Accelerate library to train large models, which enables users to leverage the latest features of PyTorch FullyShardedDataParallel (FSDP).
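FSDP support in Accelerate is normally configured through accelerate config and accelerate launch; the snippet below mirrors that configuration in code as a hedged sketch, meant to be launched with accelerate launch on multiple GPUs, and the plugin's option names may vary slightly across Accelerate versions.

```python
# Hedged sketch: enabling PyTorch FSDP through Accelerate's plugin interface.
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin()       # defaults to full sharding of params/grads/optimizer state
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() wraps the model in FSDP so its parameters, gradients and optimizer
# state are sharded across the participating GPUs.
model, optimizer = accelerator.prepare(model, optimizer)
```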

Motivation 🤗
With the ever-increasing scale, size, and parameter counts of Machine Learning (ML) models, ML practitioners are finding it difficult to train or even load such large models on their hardware. On one hand, it has been found that large models learn quickly (they are data- and compute-efficient) and are significantly more performant than smaller models [1]; on the other hand, it becomes prohibitive to train such models on most of the available hardware.

Distributed training is the key to enabling training of such large ML models. There have been major recent advances in the field of distributed training at scale. A few of the most notable advances are given below:
Hugging Face on PyTorch / XLA TPUs: Faster and cheaper training
Open In Colab

Training Your Favorite Transformers on Cloud TPUs using PyTorch / XLA
The PyTorch-TPU project originated as a collaborative effort between the Facebook PyTorch and Google TPU teams and officially launched at the 2019 PyTorch Developer Conference. Since then, we've worked with the Hugging Face team to bring first-class support for training on Cloud TPUs using PyTorch / XLA. This new integration enables PyTorch users to run and scale up their models on Cloud TPUs while maintaining the exact same Hugging Face Trainer interface.

This blog post provides an overview of changes made in the Hugging Face library, what the PyTorch / XLA library does, an example to get you started training your favorite transformers on Cloud TPUs, and some performance benchmarks. If you can’t wait to get started with TPUs, please skip ahead to the “Train Your Transformer on Cloud TPUs” section - we handle all the PyTorch / XLA mechanics for you within the Trainer module!
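For readers curious about what PyTorch / XLA looks like below the Trainer, here is a minimal check that runs on a Cloud TPU VM with torch_xla installed; the Trainer handles this device placement and graph stepping for you.

```python
# Minimal PyTorch / XLA smoke test: place a model and a batch on the TPU device
# and run one forward/backward step.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                  # the Cloud TPU core visible to this process
model = torch.nn.Linear(128, 2).to(device)
batch = torch.randn(8, 128).to(device)

loss = model(batch).sum()
loss.backward()
xm.mark_step()                            # materialize the lazily-built XLA graph
print(loss.item())
```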
Block Sparse Matrices for Smaller and Faster Language Models
Saving space and time, one zero at a time
In previous blog posts we introduced sparse matrices and what they could do to improve neural networks.

The basic assumption is that full dense layers are often overkill and can be pruned without a significant loss in precision. In some cases sparse linear layers can even improve precision and/or generalization.

The main issue is that currently available code that supports sparse algebra computation severely lacks efficiency. We are also still waiting for official PyTorch support.

That's why we ran out of patience and took some time this summer to address this "lacuna". Today, we are excited to release the extension pytorch_block_sparse.

By itself, or even better combined with other methods like distillation and quantization, this library enables networks which are both smaller and faster, something Hugging Face considers crucial to let anybody use neural networks in production at low cost, and to improve the experience for the end user.
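Here is a brief sketch of the drop-in usage the library aims for, with argument names taken from the project README; it requires a CUDA GPU, and the API may have evolved since.

```python
# Hedged sketch: a block-sparse linear layer replacing a dense nn.Linear.
# density=0.1 keeps only ~10% of the weight blocks, saving memory and compute.
import torch
from pytorch_block_sparse import BlockSparseLinear

layer = BlockSparseLinear(1024, 256, density=0.1).cuda()
x = torch.randn(8, 1024, device="cuda")
print(layer(x).shape)   # torch.Size([8, 256])
```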
Memory-efficient Diffusion Transformers with Quanto and Diffusers
Over the past few months, we have seen an emergence in the use of Transformer-based diffusion backbones for high-resolution text-to-image (T2I) generation. These models use the transformer architecture as the building block for the diffusion process, instead of the UNet architecture that was prevalent in many of the initial diffusion models. Thanks to the nature of Transformers, these backbones show good scalability, with models ranging from 0.6B to 8B parameters.

As models become larger, memory requirements increase. The problem intensifies because a diffusion pipeline usually consists of several components: a text encoder, a diffusion backbone, and an image decoder. Furthermore, modern diffusion pipelines use multiple text encoders – for example, there are three in the case of Stable Diffusion 3. It takes 18.765 GB of GPU memory to run SD3 inference using FP16 precision.

These high memory requirements can make it difficult to use these models on consumer GPUs, slowing adoption and making experimentation harder. In this post, we show how to improve the memory efficiency of Transformer-based diffusion pipelines by applying Quanto's quantization utilities to pipelines built with the Diffusers library.
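A hedged sketch of this workflow is shown below: quantize the SD3 diffusion transformer to 8-bit weights with Quanto, then run the Diffusers pipeline as usual. The checkpoint ID is the gated Stable Diffusion 3 medium release, and exact function or class names may differ across library versions.

```python
# Hedged sketch: 8-bit weight quantization of the SD3 diffusion backbone with Quanto.
import torch
from diffusers import StableDiffusion3Pipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)

quantize(pipe.transformer, weights=qfloat8)   # mark the diffusion transformer for 8-bit weights
freeze(pipe.transformer)                      # replace float weights with packed low-precision ones

pipe.to("cuda")
image = pipe("a photo of a red vintage bicycle", num_inference_steps=28).images[0]
image.save("bicycle.png")
```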
Quanto: a PyTorch quantization backend for Optimum
Quantization is a technique to reduce the computational and memory costs of evaluating Deep Learning Models by representing their weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory storage, which is crucial for deploying Large Language Models on consumer devices. It also enables specific optimizations for lower bitwidth datatypes, such as int8 or float8 matrix multiplications on CUDA devices.

Many open-source libraries are available to quantize PyTorch deep learning models, each providing very powerful features, yet often restricted to specific model configurations and devices.

Also, although they are based on the same design principles, they are unfortunately often incompatible with one another.

Today, we are excited to introduce quanto, a PyTorch quantization backend for Optimum.
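The typical quanto workflow on a plain PyTorch model looks roughly like the sketch below: quantize marks the modules, a calibration pass records activation ranges, and freeze materializes the integer weights. Depending on the release, the package is importable as quanto or optimum.quanto, so treat the import path as an assumption.

```python
# Hedged sketch of the quanto quantize -> calibrate -> freeze workflow.
import torch
from optimum.quanto import Calibration, freeze, qint8, quantize

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)

quantize(model, weights=qint8, activations=qint8)  # swap in quantization-aware modules

with Calibration():                                # record activation ranges on sample data
    model(torch.randn(32, 64))

freeze(model)                                      # materialize the int8 weights
print(model(torch.randn(1, 64)).shape)
```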
Fine-tuning Llama 2 70B using PyTorch FSDP
Introduction
In this blog post, we will look at how to fine-tune Llama 2 70B using PyTorch FSDP and related best practices. We will be leveraging Hugging Face Transformers, Accelerate and TRL. We will also learn how to use Accelerate with SLURM.

Fully Sharded Data Parallelism (FSDP) is a paradigm in which the optimizer states, gradients and parameters are sharded across devices. During the forward pass, each FSDP unit performs an all-gather operation to get the complete weights, runs its computation, and then discards the shards belonging to other devices. After the forward pass, the loss is computed, followed by the backward pass. During the backward pass, each FSDP unit again performs an all-gather operation to get the complete weights and computes the local gradients. These local gradients are then averaged and sharded across the devices via a reduce-scatter operation, so that each device can update the parameters of its own shard.
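The collective pattern behind this is sketched below in a toy, single-process form: all-gather rebuilds the full weights from shards, and the gradients are averaged and re-sharded. Real FSDP runs this across many GPU ranks and fuses the gradient step into a single reduce-scatter on NCCL; here world_size is 1 on the gloo backend so the script runs anywhere, but the call pattern is the same.

```python
# Toy single-process illustration of the collectives FSDP relies on.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
world = dist.get_world_size()

local_shard = torch.randn(4)                     # this rank's slice of a parameter
gathered = [torch.empty(4) for _ in range(world)]
dist.all_gather(gathered, local_shard)           # forward/backward: rebuild the full weights
full_param = torch.cat(gathered)

local_grad = torch.randn_like(full_param)        # gradient w.r.t. the full parameter
dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
local_grad /= world                              # average across ranks
grad_shard = local_grad.chunk(world)[dist.get_rank()]   # keep only this rank's shard to update

dist.destroy_process_group()
print(full_param.shape, grad_shard.shape)
```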
Retrieval Augmented Generation with Huggingface Transformers and Ray
A guest blog post by Amog Kamsetty from the Anyscale team
Hugging Face Transformers recently added the Retrieval Augmented Generation (RAG) model, a new NLP architecture that leverages external documents (like Wikipedia) to augment its knowledge and achieve state-of-the-art results on knowledge-intensive tasks. In this blog post, we introduce the integration of Ray, a library for building scalable applications, into the RAG contextual document retrieval mechanism. This speeds up retrieval calls by 2x and improves the scalability of RAG distributed fine-tuning.
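For reference, a minimal RAG inference example with the checkpoint available in transformers looks like this; the dummy retrieval index keeps the download small, whereas the full Wikipedia index is far larger.

```python
# Minimal RAG inference: retrieve supporting passages, then generate an answer.
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

model_name = "facebook/rag-sequence-nq"
tokenizer = RagTokenizer.from_pretrained(model_name)
retriever = RagRetriever.from_pretrained(model_name, index_name="exact", use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained(model_name, retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```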

What is Retrieval Augmented Generation (RAG)?
Hyperparameter Search with Transformers and Ray Tune
A guest blog post by Richard Liaw from the Anyscale team
With cutting-edge research implementations and thousands of trained models easily accessible, the Hugging Face Transformers library has become critical to the success and growth of natural language processing today.

For any machine learning model to achieve good performance, users often need to implement some form of hyperparameter tuning. Yet nearly everyone (1, 2) either ends up disregarding hyperparameter tuning or opts for a simplistic grid search over a small search space.

However, simple experiments can show the benefit of using an advanced tuning technique. Below is a recent experiment run on a BERT model from Hugging Face transformers on the RTE dataset. Genetic optimization techniques like PBT can provide large performance improvements compared to standard hyperparameter optimization techniques.
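The Trainer exposes this through its built-in hyperparameter_search method, which can dispatch trials to Ray Tune. The sketch below is a hedged example on the RTE task; trials are scored with the default objective (the evaluation loss), and argument names may differ slightly across transformers and ray versions.

```python
# Hedged sketch: hyperparameter search on RTE with the Ray Tune backend.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
rte = load_dataset("glue", "rte").map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True), batched=True
)

def model_init():
    # A fresh model per trial, as required by hyperparameter_search.
    return AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def ray_hp_space(trial):
    from ray import tune
    return {
        "learning_rate": tune.loguniform(1e-6, 1e-4),
        "per_device_train_batch_size": tune.choice([16, 32]),
        "num_train_epochs": tune.choice([2, 3]),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="rte-hpo", evaluation_strategy="epoch"),
    train_dataset=rte["train"],
    eval_dataset=rte["validation"],
)

best_run = trainer.hyperparameter_search(hp_space=ray_hp_space, backend="ray", n_trials=8)
print(best_run.hyperparameters)
```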
Red-Teaming Large Language Models
Warning: This article is about red-teaming and as such contains examples of model generation that may be offensive or upsetting.

Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, earlier versions of GPT-3 were known to exhibit sexist behaviors (see below) and biases against Muslims.