Share and discover more about AI with social posts from the community.
Pre-Training BERT with Hugging Face Transformers and Habana Gaudi
In this tutorial, you will learn how to pre-train BERT-base from scratch using a Habana Gaudi-based DL1 instance on AWS to take advantage of the cost-performance benefits of Gaudi. We will use the Hugging Face Transformers, Optimum Habana and Datasets libraries to pre-train a BERT-base model using masked-language modeling, one of the two original BERT pre-training tasks. Before we get started, we need to set up the deep learning environment.

You will learn how to:

Prepare the dataset
Train a Tokenizer
Preprocess the dataset
Pre-train BERT on Habana Gaudi
Note: Steps 1 to 3 can, and should, be run on a different instance size, since they are CPU-intensive tasks.
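To make step 4 concrete, here is a minimal sketch of masked-language-modeling pre-training with the 🤗 Trainer. It is a generic sketch, not the tutorial's exact code: the corpus, tokenizer, and hyperparameters are placeholders, and on Gaudi the tutorial relies on Optimum Habana's GaudiTrainer rather than the plain Trainer.

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder corpus and tokenizer; the tutorial trains its own tokenizer on its dataset.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Fresh BERT-base weights: pre-training from scratch, not fine-tuning.
model = BertForMaskedLM(BertConfig())

# Randomly mask 15% of the tokens, as in the original BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-base-mlm",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    num_train_epochs=1,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```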
Going multimodal: How Prezi is leveraging the Hub and the Expert Support Program to accelerate their ML roadmap
Everybody knows that a great visual is worth a thousand words. The team at Prezi, a visual communications software company, is putting this insight into practice by combining images and text in highly dynamic presentations.

Prezi has joined the Hugging Face Expert Support Program to fully leverage modern machine learning's potential. Over the past months, Hugging Face has supported Prezi in integrating smaller, more efficient open-source models into their ML workflows. This cooperation started at a perfect time, as multimodal models are becoming increasingly capable.

We recently sat down with Máté Börcsök, a backend engineer at Prezi, to talk about their experience in the Expert Support Program. In this short video, Máté walks us through some of their machine learning work and shares their experience collaborating with our team via the Expert Support Program.

Video: https://www.youtube.com/embed/pM6D0tRoIbI
If you'd like to accelerate your machine learning roadmap with the help of our experts, as Máté and his team did, visit hf.co/support to learn more about our Expert Support Program and request a quote.
Introducing our new pricing
As you might have noticed, our pricing page has changed a lot recently.

First of all, we are sunsetting the Paid tier of the Inference API service. The Inference API will still be available for everyone to use for free. But if you're looking for a fast, enterprise-grade inference as a service, we recommend checking out our brand new solution for this: Inference Endpoints.

Along with Inference Endpoints, we've recently introduced hardware upgrades for Spaces, which allows running ML demos with the hardware of your choice. No subscription is required to use these services; you only need to add a credit card to your account from your billing settings. You can also attach a payment method to any of your organizations.

Your billing settings centralize everything about our paid services. From there, you can manage your personal PRO subscription, update your payment method, and visualize your usage for the past three months. Usage for all our paid services and subscriptions will be charged at the start of each month, and a consolidated invoice will be available for your records.

TL;DR: At HF we monetize by providing simple access to compute for AI, with services like AutoTrain, Spaces and Inference Endpoints, directly accessible from the Hub. Read more about our pricing and billing system.

If you have any questions, feel free to reach out. We welcome your feedback 🔥
Introducing Prodigy-HF
Prodigy is an annotation tool made by Explosion, a company well known as the creators of spaCy. It's a fully scriptable product with a large community around it. The product has many features, including tight integration with spaCy and active learning capabilities. But the main feature of the product is that it is programmatically customizable with Python.

To foster this customisability, Explosion has started releasing plugins. These plugins integrate with third-party tools in an open way that encourages users to build bespoke annotation workflows. One plugin in particular deserves to be highlighted: last week, Explosion introduced Prodigy-HF, which offers code recipes that directly integrate with the Hugging Face stack. It's been a much-requested feature on the Prodigy support forum, so we're super excited to have it out there.
Putting RL back in RLHF
We are excited to introduce the RLOO (REINFORCE Leave-One-Out) Trainer in TRL. As an alternative to PPO, RLOO is a new online RLHF training algorithm designed to be more accessible and easier to implement. In particular, RLOO requires less GPU memory and takes less wall-clock time to converge. Compared to PPO:

🤑 RLOO uses approximately 50-70% less vRAM than PPO, depending on the model size.
🚀 RLOO runs 2x faster than PPO with 1B models and up to 3x faster than PPO with 6.9B models.
🔥 RLOO performs competitively with PPO in terms of response win rate (judged by GPT-4) and consistently outperforms popular offline methods like DPO.
With RLOO, we bring Reinforcement Learning back into RLHF, enabling the community to explore online RL methods more easily. This is exciting because more and more studies have shown that online RL is more effective than offline methods such as DPO (https://arxiv.org/abs/2402.04792, https://arxiv.org/abs/2405.08448).
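For readers who want to try it, here is a minimal sketch of an RLOO training setup, assuming the RLOOConfig / RLOOTrainer API in TRL; the models and prompt dataset are placeholders, and the exact constructor arguments may differ between TRL versions.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RLOOConfig, RLOOTrainer

base = "EleutherAI/pythia-1b-deduped"  # placeholder policy/reward backbone

tokenizer = AutoTokenizer.from_pretrained(base, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

policy = AutoModelForCausalLM.from_pretrained(base)
ref_policy = AutoModelForCausalLM.from_pretrained(base)  # frozen reference for the KL penalty
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

# Placeholder prompt dataset; RLOO expects tokenized prompts in an "input_ids" column.
prompts = load_dataset("imdb", split="train[:1%]")

def tokenize(example):
    return tokenizer(example["text"][:200], truncation=True)

dataset = prompts.map(tokenize, remove_columns=prompts.column_names)

config = RLOOConfig(output_dir="rloo-sketch", per_device_train_batch_size=4)

trainer = RLOOTrainer(
    config=config,
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=dataset,
    eval_dataset=dataset,
)
trainer.train()
```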
From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease
General Overview
This tutorial assumes you have a basic understanding of PyTorch and how to train a simple model. It showcases training on multiple GPUs through a process called Distributed Data Parallelism (DDP) at three different levels of increasing abstraction:

Native PyTorch DDP through the torch.distributed module
Utilizing 🤗 Accelerate's light wrapper around torch.distributed, which also helps ensure the code can run on a single GPU or on TPUs with zero code changes and minimal changes to the original training loop (see the sketch after this list)
Utilizing 🤗 Transformers' high-level Trainer API, which abstracts away all the boilerplate code and supports various devices and distributed scenarios
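As a flavor of the second option, here is a minimal sketch of a plain PyTorch training loop adapted to 🤗 Accelerate; the model, optimizer, and data are placeholders.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the distributed setup from `accelerate launch`

model = torch.nn.Linear(128, 2)                    # placeholder model
optimizer = torch.optim.AdamW(model.parameters())  # placeholder optimizer
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,))),
    batch_size=32,
)

# Accelerate wraps the objects for DDP (or single GPU / TPU) transparently.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```

Launched with `accelerate launch`, the same script runs on a single GPU, on multiple GPUs via DDP, or on TPUs.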
Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel
In this post we will look at how we can leverage the Accelerate library for training large models, which enables users to leverage the latest features of PyTorch FullyShardedDataParallel (FSDP).

Motivation 🤗
With the ever-increasing scale, size and parameter counts of Machine Learning (ML) models, ML practitioners are finding it difficult to train or even load such large models on their hardware. On the one hand, large models have been found to learn quickly (they are data- and compute-efficient) and to be significantly more performant than smaller models [1]; on the other hand, it becomes prohibitive to train such models on most of the available hardware.

Distributed training is the key to enabling the training of such large ML models. There have been major recent advances in the field of distributed training at scale. A few of the most notable advances are given below:
Hugging Face on PyTorch / XLA TPUs: Faster and cheaper training

Training Your Favorite Transformers on Cloud TPUs using PyTorch / XLA
The PyTorch-TPU project originated as a collaborative effort between the Facebook PyTorch and Google TPU teams and officially launched at the 2019 PyTorch Developer Conference. Since then, we've worked with the Hugging Face team to bring first-class support to training on Cloud TPUs using PyTorch / XLA. This new integration enables PyTorch users to run and scale up their models on Cloud TPUs while keeping the exact same Hugging Face Trainer interface.

This blog post provides an overview of changes made in the Hugging Face library, what the PyTorch / XLA library does, an example to get you started training your favorite transformers on Cloud TPUs, and some performance benchmarks. If you can’t wait to get started with TPUs, please skip ahead to the “Train Your Transformer on Cloud TPUs” section - we handle all the PyTorch / XLA mechanics for you within the Trainer module!
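For orientation, here is a minimal sketch of what raw PyTorch / XLA usage looks like, assuming a TPU environment with the torch_xla package installed; the model and batch are placeholders, and the Trainer handles all of this for you.

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # a TPU core exposed as a torch device

model = torch.nn.Linear(128, 2).to(device)         # placeholder model
optimizer = torch.optim.AdamW(model.parameters())

inputs = torch.randn(32, 128).to(device)           # placeholder batch
targets = torch.randint(0, 2, (32,)).to(device)

loss = torch.nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
xm.optimizer_step(optimizer, barrier=True)  # steps the optimizer and materializes the XLA graph
```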
Block Sparse Matrices for Smaller and Faster Language Models
Saving space and time, one zero at a time
In previous blog posts we introduced sparse matrices and what they could do to improve neural networks.

The basic assumption is that full dense layers are often overkill and can be pruned without a significant loss in precision. In some cases, sparse linear layers can even improve precision and/or generalization.

The main issue is that currently available code for sparse algebra computation is severely lacking in efficiency. We are also still waiting for official PyTorch support.

That's why we ran out of patience and took some time this summer to address this "lacuna". Today, we are excited to release the extension pytorch_block_sparse.

By itself, or better yet combined with other methods like distillation and quantization, this library enables networks that are both smaller and faster, something Hugging Face considers crucial to letting anybody use neural networks in production at low cost and to improving the experience for the end user.
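To illustrate, here is a minimal sketch of swapping a dense linear layer for a block-sparse one, assuming the pytorch_block_sparse package and its BlockSparseLinear layer; the layer sizes and density value are placeholders.

```python
import torch
from pytorch_block_sparse import BlockSparseLinear

class SparseMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Drop-in replacements for torch.nn.Linear, keeping only ~10% of the weight blocks.
        self.fc1 = BlockSparseLinear(1024, 4096, density=0.1)
        self.fc2 = BlockSparseLinear(4096, 1024, density=0.1)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = SparseMLP().cuda()  # the block-sparse CUDA kernels require a GPU
out = model(torch.randn(8, 1024, device="cuda"))
```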
Memory-efficient Diffusion Transformers with Quanto and Diffusers
Over the past few months, we have seen the emergence of Transformer-based diffusion backbones for high-resolution text-to-image (T2I) generation. These models use the transformer architecture as the building block for the diffusion process, instead of the UNet architecture that was prevalent in many of the initial diffusion models. Thanks to the nature of Transformers, these backbones show good scalability, with models ranging from 0.6B to 8B parameters.

As models become larger, memory requirements increase. The problem intensifies because a diffusion pipeline usually consists of several components: a text encoder, a diffusion backbone, and an image decoder. Furthermore, modern diffusion pipelines use multiple text encoders – for example, there are three in the case of Stable Diffusion 3. It takes 18.765 GB of GPU memory to run SD3 inference using FP16 precision.

These high memory requirements can make it difficult to use these models with consumer GPUs, slowing adoption and making experimentation harder. In this post, we show how to improve the memory efficiency of Transformer-based diffusion pipelines by leveraging Quanto's quantization utilities together with the Diffusers library.
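Here is a minimal sketch of the approach, assuming the optimum-quanto package and a transformer-based Diffusers pipeline; the checkpoint and the choice of float8 weights are illustrative placeholders.

```python
import torch
from diffusers import DiffusionPipeline
from optimum.quanto import freeze, qfloat8, quantize

# Placeholder checkpoint; any transformer-based diffusion pipeline follows the same pattern.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)

# Quantize only the diffusion backbone's weights to 8-bit floats, then freeze them.
quantize(pipeline.transformer, weights=qfloat8)
freeze(pipeline.transformer)

pipeline.to("cuda")
image = pipeline("a photo of an astronaut riding a horse", num_inference_steps=28).images[0]
```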
Quanto: a PyTorch quantization backend for Optimum
Quantization is a technique to reduce the computational and memory costs of evaluating Deep Learning Models by representing their weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory storage, which is crucial for deploying Large Language Models on consumer devices. It also enables specific optimizations for lower bitwidth datatypes, such as int8 or float8 matrix multiplications on CUDA devices.

Many open-source libraries are available to quantize PyTorch deep learning models, each providing very powerful features, yet often restricted to specific model configurations and devices.

Also, although they are based on the same design principles, they are unfortunately often incompatible with one another.

Today, we are excited to introduce quanto, a PyTorch quantization backend for Optimum.
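The typical workflow is quantize, optionally calibrate the activations, then freeze the weights. Below is a minimal sketch assuming the optimum-quanto package; the model and calibration batches are placeholders.

```python
import torch
from optimum.quanto import Calibration, freeze, qint8, quantize

model = torch.nn.Sequential(          # placeholder model
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Replace weights and activations with int8 representations.
quantize(model, weights=qint8, activations=qint8)

# Run a few representative batches to record activation ranges.
with Calibration():
    for _ in range(8):
        model(torch.randn(32, 128))

# Freeze to materialize the quantized weights.
freeze(model)
```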
Fine-tuning Llama 2 70B using PyTorch FSDP
Introduction
In this blog post, we will look at how to fine-tune Llama 2 70B using PyTorch FSDP and related best practices. We will be leveraging Hugging Face Transformers, Accelerate and TRL. We will also learn how to use Accelerate with SLURM.

Fully Sharded Data Parallelism (FSDP) is a paradigm in which the optimizer states, gradients and parameters are sharded across devices. During the forward pass, each FSDP unit performs an all-gather operation to obtain the complete weights, runs the computation, and then discards the shards belonging to other devices. After the forward pass, the loss is computed, followed by the backward pass. During the backward pass, each FSDP unit again performs an all-gather operation to obtain the complete weights and computes the local gradients. These local gradients are then averaged and sharded across the devices via a reduce-scatter operation, so that each device can update the parameters of its own shard.
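Here is a minimal sketch of enabling FSDP through 🤗 Accelerate in code, assuming the FullyShardedDataParallelPlugin API and a script launched with `accelerate launch`; the model and optimizer are placeholders, and the post itself configures FSDP through `accelerate config` plus a SLURM launch script.

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Default plugin settings; sharding strategy, wrapping policy, etc. are normally
# set via `accelerate config` or the plugin's arguments.
fsdp_plugin = FullyShardedDataParallelPlugin()
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = torch.nn.Linear(4096, 4096)  # placeholder; in the post this is Llama 2 70B
optimizer = torch.optim.AdamW(model.parameters())

# prepare() wraps the model in FSDP units and shards states across devices.
model, optimizer = accelerator.prepare(model, optimizer)
```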
Retrieval Augmented Generation with Huggingface Transformers and Ray
A guest blog post by Amog Kamsetty from the Anyscale team
Hugging Face Transformers recently added the Retrieval Augmented Generation (RAG) model, a new NLP architecture that leverages external documents (like Wikipedia) to augment its knowledge and achieve state-of-the-art results on knowledge-intensive tasks. In this blog post, we introduce the integration of Ray, a library for building scalable applications, into the RAG contextual document retrieval mechanism. This speeds up retrieval calls by 2x and improves the scalability of RAG distributed fine-tuning.
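To ground the description above, here is a minimal sketch of running RAG in Transformers with a small dummy retrieval index; the Ray integration swaps the retrieval workers behind this same interface, and the dummy-index settings are for illustration only.

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

# Dummy index so the example runs without downloading the full Wikipedia index.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```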

What is Retrieval Augmented Generation (RAG)?
Hyperparameter Search with Transformers and Ray Tune
A guest blog post by Richard Liaw from the Anyscale team
With cutting-edge research implementations and thousands of trained models easily accessible, the Hugging Face transformers library has become critical to the success and growth of natural language processing today.

For any machine learning model to achieve good performance, users often need to implement some form of hyperparameter tuning. Yet, nearly everyone (1, 2) either ends up disregarding hyperparameter tuning or opts for a simplistic grid search with a small search space.

However, simple experiments can show the benefit of using an advanced tuning technique. In a recent experiment run on a BERT model from Hugging Face transformers on the RTE dataset, genetic optimization techniques like PBT provided large performance improvements compared to standard hyperparameter optimization techniques.
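Here is a minimal sketch of what such a search looks like with the Trainer's built-in hyperparameter_search and the Ray Tune backend; the number of trials and training arguments are placeholders, and argument names may vary between library versions.

```python
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("glue", "rte")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, padding="max_length")

encoded = dataset.map(tokenize, batched=True)

def model_init():
    # A fresh model per trial, so trials do not share weights.
    return AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="rte-hpo", evaluation_strategy="epoch"),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)

# Ray Tune drives the search over learning rate, batch size, etc.
best_run = trainer.hyperparameter_search(direction="maximize", backend="ray", n_trials=8)
print(best_run.hyperparameters)
```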
Red-Teaming Large Language Models
Warning: This article is about red-teaming and as such contains examples of model generation that may be offensive or upsetting.

Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, earlier versions of GPT-3 were known to exhibit sexist behaviors and biases against Muslims.
AutoGen from @Microsoft is crazy! 🚀 It's an open-source framework that allows LLM agents to chat with each other to solve your tasks. 🤖💬

They use the Assistant-Agent and User-Proxy-Agent framework! 🛠

As the name suggests, the Assistant-Agent does the work, and the User-Proxy-Agent behaves like a human, guiding the Assistant-Agent and double-checking its work! 🧑‍💻

Both Assistant-Agent and User-Proxy-Agent can be the same or different LLMs. 🤔🔄

AutoGen is an open-source programming framework for building AI agents and facilitating cooperation among multiple agents to solve tasks. 🌟

This is truly amazing for building agentic AI quickly! 🚀

GitHub: https://github.com/microsoft/autogen 🔗
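Here is a minimal sketch of the two-agent pattern described above, assuming the autogen (pyautogen) package and an OpenAI-compatible model; the model name, API key, and task are placeholders.

```python
from autogen import AssistantAgent, UserProxyAgent

# Placeholder model configuration.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# The assistant does the work (reasons about the task, writes code)...
assistant = AssistantAgent("assistant", llm_config=llm_config)

# ...while the user proxy plays the human: it relays feedback and can execute code.
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# The two agents chat back and forth until the task is done.
user_proxy.initiate_chat(assistant, message="Plot NVDA stock price change YTD and save it to a file.")
```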
How To Use On Windows
Just extract the files into a folder such as c:/BiRefNet_v1

Double-click the Windows_Install.bat file and it will generate an isolated virtual environment and install the requirements

It will automatically download the models into your Hugging Face cache (the best model is under 1 GB)

Then start and use the Gradio APP with Windows_Start_App.bat

How To Use On Cloud
Massed Compute and RunPod have instruction txt files. Follow them

Kaggle has all the instructions, step by step

On Kaggle, set the resolution to 1024x1024 or you will get an out-of-memory error
BiRefNet State Of The Art Newest Very Best Background Batch Remover APP

Official repo : https://github.com/ZhengPeng7/BiRefNet

Download APP and installers from : https://www.patreon.com/posts/109913645

Hugging Face Demo :
ZhengPeng7/BiRefNet_demo


I have developed a very advanced Gradio APP for this with proper file saving and batch processing. My version also removes the background and saves the result with a transparent background.
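If you prefer to script it instead of using the APP, here is a minimal background-removal sketch following the usage shown on the BiRefNet model card; the preprocessing (1024x1024 resize, ImageNet normalization) mirrors that card and may differ between model versions.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

# Loads the model code from the Hub repository (trust_remote_code is required).
model = AutoModelForImageSegmentation.from_pretrained("ZhengPeng7/BiRefNet", trust_remote_code=True)
model.eval().to("cuda")

preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("input.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0).to("cuda")

with torch.no_grad():
    # The last output is the predicted foreground mask.
    mask = model(batch)[-1].sigmoid().cpu()[0].squeeze()

alpha = transforms.ToPILImage()(mask).resize(image.size)
image.putalpha(alpha)  # transparent background
image.save("output.png")
```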
How to use SD3 Medium with SwarmUI • Run SD3 Medium Locally With Swarm UI
Download SDXL Union ControlNet (rename it any way you want. I just called it union) https://huggingface.co/xinsir/control...
ComfyUI Manager https://github.com/ltdrdata/ComfyUI-M...
The Reformer - Pushing the limits of language modeling

How the Reformer uses less than 8GB of RAM to train on sequences of half a million tokens
The Reformer model as introduced by Kitaev, Kaiser et al. (2020) is one of the most memory-efficient transformer models for long sequence modeling as of today.

Recently, long sequence modeling has experienced a surge of interest, as can be seen from the many submissions from this year alone - Beltagy et al. (2020), Roy et al. (2020), Tay et al., and Wang et al., to name a few. The motivation behind long sequence modeling is that many NLP tasks, e.g. summarization and question answering, require the model to process longer input sequences than models such as BERT are able to handle. On such tasks, long sequence models do not have to cut the input sequence to avoid memory overflow and have thus been shown to outperform standard "BERT"-like models (cf. Beltagy et al. (2020)).
https://github.com/huggingface/blog/blob/main/reformer.md
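As a quick taste, here is a minimal sketch of generating text with a pretrained Reformer checkpoint from the Hub; the checkpoint and generation settings are illustrative placeholders.

```python
from transformers import ReformerModelWithLMHead, ReformerTokenizer

# A small Reformer checkpoint trained on Crime and Punishment.
tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
model = ReformerModelWithLMHead.from_pretrained("google/reformer-crime-and-punishment")

inputs = tokenizer("A few months later", return_tensors="pt")
outputs = model.generate(inputs["input_ids"], do_sample=True, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```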