Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator
With the Generative AI (GenAI) revolution in full swing, text-generation with open-source transformer models like Llama 2 has become the talk of the town. AI enthusiasts as well as developers are looking to leverage the generative abilities of such models for their own use cases and applications. This article shows how easy it is to generate text with the Llama 2 family of models (7b, 13b and 70b) using Optimum Habana and a custom pipeline class – you'll be able to run the models with just a few lines of code!

This custom pipeline class has been designed to offer great flexibility and ease of use. Moreover, it provides a high level of abstraction and performs end-to-end text generation, which involves pre-processing and post-processing. There are multiple ways to use the pipeline - you can run the run_pipeline.py script from the Optimum Habana repository, add the pipeline class to your own Python scripts, or initialize LangChain classes with it.
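For instance, embedding the pipeline in your own script could look roughly like the sketch below. The import paths, class name, and constructor arguments are illustrative (modelled on the Optimum Habana text-generation example) and may not match the actual code exactly:

```python
# Illustrative sketch only: the import paths and pipeline signature follow the
# spirit of the Optimum Habana text-generation example and may differ in practice.
from run_generation import setup_parser           # hypothetical: the example's argument parser
from pipeline import GaudiTextGenerationPipeline  # hypothetical: the custom pipeline class

args = setup_parser()
args.model_name_or_path = "meta-llama/Llama-2-7b-hf"
args.max_new_tokens = 100

# Pre-processing, generation on Gaudi, and post-processing all happen inside the call.
pipe = GaudiTextGenerationPipeline(args)
print(pipe("Tell me about low-cost text generation on Gaudi 2."))
```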
Faster TensorFlow models in Hugging Face Transformers
In the last few months, the Hugging Face team has been working hard on improving Transformers’ TensorFlow models to make them more robust and faster. The recent improvements are mainly focused on two aspects:

Computational performance: BERT, RoBERTa, ELECTRA and MPNet have been improved to have a much faster computation time. This performance gain is noticeable across the board: graph/eager mode, TF Serving, and CPU/GPU/TPU devices.
TensorFlow Serving: each of these TensorFlow models can be deployed with TensorFlow Serving to benefit from this performance gain for inference.
Faster Text Generation with TensorFlow and XLA
TL;DR: Text generation on 🤗 transformers using TensorFlow can now be compiled with XLA. It is up to 100x faster than before, and even faster than PyTorch -- check the Colab below!
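For reference, enabling XLA is a one-line change: wrap generate in a tf.function with jit_compile=True. A minimal sketch, using GPT-2 as a small example model:

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

# XLA-compiled version of generate; the first call traces and compiles, later calls are fast.
xla_generate = tf.function(model.generate, jit_compile=True)

tokenized_input = tokenizer(["TensorFlow is"], return_tensors="tf")
generated_tokens = xla_generate(**tokenized_input, max_new_tokens=32)
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```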

Text Generation
As the quality of large language models increased, so did our expectations of what those models could do. Especially since the release of OpenAI's GPT-2, models with text generation capabilities have been in the spotlight. And for legitimate reasons -- these models can be used to summarize and translate, and they have even demonstrated zero-shot learning capabilities on some language tasks. This blog post will show how to make the most of this technology with TensorFlow.

The 🤗 transformers library started with NLP models, so it is natural that text generation is of utmost importance to us. It is part of Hugging Face's democratization efforts to ensure it is accessible, easily controllable, and efficient. There is a previous blog post about the different types of text generation. Nevertheless, below there's a quick recap of the core functionality -- feel free to skip it if you're familiar with our generate function and want to jump straight into TensorFlow's specificities.

Let's start with the basics. Text generation can be deterministic or stochastic, depending on the do_sample flag. By default it's set to False, causing the output to be deterministic, which is also known as Greedy Decoding. When it's set to True, also known as Sampling, the output will be stochastic, but you can still obtain reproducible results through the seed argument (with the same format as in stateless TensorFlow random number generation). As a rule of thumb, you want deterministic generation if you wish to obtain factual information from the model and stochastic generation if you're aiming at more creative outputs.
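As a short sketch of the two modes with a TensorFlow model (GPT-2 as a small example; the seed pair is arbitrary):

```python
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(["The Hugging Face Hub is"], return_tensors="tf")

# Deterministic (greedy) decoding: do_sample defaults to False.
greedy_output = model.generate(**inputs, max_new_tokens=20)

# Stochastic sampling, made reproducible with a stateless-RNG-style seed (two integers).
sampled_output = model.generate(**inputs, do_sample=True, seed=[42, 0], max_new_tokens=20)

print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
print(tokenizer.decode(sampled_output[0], skip_special_tokens=True))
```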
Training a language model with 🤗 Transformers using TensorFlow and TPUs
Introduction
TPU training is a useful skill to have: TPU pods are high-performance and extremely scalable, making it easy to train models at any scale from a few tens of millions of parameters up to truly enormous sizes: Google’s PaLM model (over 500 billion parameters!) was trained entirely on TPU pods.

We’ve previously written a tutorial and a Colab example showing small-scale TPU training with TensorFlow and introducing the core concepts you need to understand to get your model working on TPU. This time, we’re going to step that up another level and train a masked language model from scratch using TensorFlow and TPU, including every step from training your tokenizer and preparing your dataset through to the final model training and uploading. This is the kind of task that you’ll probably want a dedicated TPU node (or VM) for, rather than just Colab, and so that’s where we’ll focus.

As in our Colab example, we’re taking advantage of TensorFlow's very clean TPU support via XLA and TPUStrategy. We’ll also be benefiting from the fact that the majority of the TensorFlow models in 🤗 Transformers are fully XLA-compatible, so surprisingly little work is needed to get them to run on TPU.
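As a rough illustration of that setup (assuming a TPU VM or node is already provisioned; the checkpoint and optimizer are placeholders), the core pattern is to create a TPUStrategy and build the model inside its scope:

```python
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM

# Connect to the TPU and create a distribution strategy for it.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Build and compile the model inside the strategy scope so its variables
# are created on (and replicated across) the TPU cores.
with strategy.scope():
    model = TFAutoModelForMaskedLM.from_pretrained("bert-base-cased")
    model.compile(optimizer="adam")

# model.fit(tf_dataset) would then run the MLM training loop on the TPU.
```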

Unlike our Colab example, however, this example is designed to be scalable and much closer to a realistic training run -- although we only use a BERT-sized model by default, the code could be expanded to a much larger model and a much more powerful TPU pod slice by changing a few configuration options.
Benchmarking Text Generation Inference
In this blog we will be exploring Text Generation Inference’s (TGI) little brother, the TGI Benchmarking tool. It will help us profile TGI beyond simple throughput, so we can better understand the tradeoffs and decide how to tune a deployment for our needs. If you have ever felt like LLM deployments cost too much, or if you want to tune your deployment to improve performance, this blog is for you!

I’ll show you how to do this in a convenient Hugging Face Space. You can take the results and use them on an Inference Endpoint or another copy of the same hardware.
From OpenAI to Open LLMs with Messages API on Hugging Face
We are excited to introduce the Messages API to provide OpenAI compatibility with Text Generation Inference (TGI) and Inference Endpoints.

Starting with version 1.4.0, TGI offers an API compatible with the OpenAI Chat Completion API. The new Messages API allows customers and users to transition seamlessly from OpenAI models to open LLMs. The API can be directly used with OpenAI's client libraries or third-party tools, like LangChain or LlamaIndex.
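For example, here is roughly what the switch looks like with the official openai Python client; the base URL and token below are placeholders for your own TGI or Inference Endpoints deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at a TGI deployment or Inference Endpoint.
client = OpenAI(
    base_url="https://YOUR-ENDPOINT.endpoints.huggingface.cloud/v1/",  # placeholder
    api_key="hf_XXXXX",  # placeholder: your Hugging Face token
)

chat_completion = client.chat.completions.create(
    model="tgi",  # TGI serves a single model, so the name is not used for routing
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500,
)

# Stream the generated tokens as they arrive.
for message in chat_completion:
    print(message.choices[0].delta.content, end="")
```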

"The new Messages API with OpenAI compatibility makes it easy for Ryght's real-time GenAI orchestration platform to switch LLM use cases from OpenAI to open models. Our migration from GPT4 to Mixtral/Llama2 on Inference Endpoints is effortless, and now we have a simplified workflow with more control over our AI solutions." - Johnny Crupi, CTO at Ryght
The new Messages API is also now available in Inference Endpoints, on both dedicated and serverless flavors. To get you started quickly, we’ve included detailed examples of how to:

Create an Inference Endpoint
Use Inference Endpoints with OpenAI client libraries
Integrate with LangChain and LlamaIndex
Limitations: The Messages API does not currently support function calling and will only work for LLMs with a chat_template defined in their tokenizer configuration, like in the case of Mixtral 8x7B Instruct.
The Age of Machine Learning As Code Has Arrived
The 2021 edition of the State of AI Report came out last week. So did the Kaggle State of Machine Learning and Data Science Survey. There's much to be learned and discussed in these reports, and a couple of takeaways caught my attention.

"AI is increasingly being applied to mission critical infrastructure like national electric grids and automated supermarket warehousing calculations during pandemics. However, there are questions about whether the maturity of the industry has caught up with the enormity of its growing deployment."

There's no denying that Machine Learning-powered applications are reaching into every corner of IT. But what does that mean for companies and organizations? How do we build rock-solid Machine Learning workflows? Should we all hire 100 Data Scientists? Or 100 DevOps engineers?

"Transformers have emerged as a general purpose architecture for ML. Not just for Natural Language Processing, but also Speech, Computer Vision or even protein structure prediction."

Old timers have learned the hard way that there is no silver bullet in IT. Yet, the Transformer architecture is indeed very efficient on a wide variety of Machine Learning tasks. But how can we all keep up with the frantic pace of innovation in Machine Learning? Do we really need expert skills to leverage these state of the art models? Or is there a shorter path to creating business value in less time?

Well, here's what I think.
The Partnership: Amazon SageMaker and Hugging Face
Today, we announce a strategic partnership between Hugging Face and Amazon to make it easier for companies to leverage State of the Art Machine Learning models, and ship cutting-edge NLP features faster.

Through this partnership, Hugging Face is leveraging Amazon Web Services as its Preferred Cloud Provider to deliver services to its customers.

As a first step to enable our common customers, Hugging Face and Amazon are introducing new Hugging Face Deep Learning Containers (DLCs) to make it easier than ever to train Hugging Face Transformer models in Amazon SageMaker.

To learn how to access and use the new Hugging Face DLCs with the Amazon SageMaker Python SDK, check out the guides and resources below.

On July 8th, 2021 we extended the Amazon SageMaker integration to add easy deployment and inference of Transformers models. If you want to learn how you can deploy Hugging Face models easily with Amazon SageMaker take a look at the new blog post and the documentation.
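As a sketch of what that deployment flow looks like with the SageMaker Python SDK (the model, instance type, and DLC versions below are illustrative; pick a combination supported in your region):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker

# Which Hub model and task the Hugging Face DLC should serve.
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.26",  # illustrative DLC versions
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

print(predictor.predict({"inputs": "I love using Hugging Face on SageMaker!"}))
```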
The N Implementation Details of RLHF with PPO
RLHF / ChatGPT has been a popular research topic these days. In our quest to research more on RLHF, this blog post attempts to reproduce OpenAI’s 2019 original RLHF codebase at openai/lm-human-preferences. Despite its “tensorflow-1.x-ness,” OpenAI’s original codebase is very well-evaluated and benchmarked, making it a good place to study RLHF implementation engineering details.

We aim to:

reproduce OAI’s results in stylistic tasks and match the learning curves of openai/lm-human-preferences;
present a checklist of implementation details, similar in spirit to The 37 Implementation Details of Proximal Policy Optimization and Debugging RL, Without the Agonizing Pain;
provide a simple-to-read and minimal reference implementation of RLHF.
Probabilistic Time Series Forecasting with 🤗 Transformers
Introduction
Time series forecasting is an essential scientific and business problem, and as such has seen a lot of innovation recently with the use of deep learning-based models in addition to the classical methods. An important difference between classical methods like ARIMA and novel deep learning methods is the following.
Google Cloud TPUs made available to Hugging Face users

We're excited to share some great news! AI builders are now able to accelerate their applications with Google Cloud TPUs on Hugging Face Inference Endpoints and Spaces!

For those who might not be familiar, TPUs are custom-made AI hardware designed by Google. They are known for their ability to scale cost-effectively and deliver impressive performance across various AI workloads. This hardware has played a crucial role in some of Google's latest innovations, including the development of the Gemma 2 open models. We are excited to announce that TPUs will now be available for use in Inference Endpoints and Spaces.

This is a big step in our ongoing collaboration to provide you with the best tools and resources for your AI projects. We're really looking forward to seeing what amazing things you'll create with this new capability!
Train your first Decision Transformer
In a previous post, we announced the launch of Decision Transformers in the transformers library. This new technique of using a Transformer as a decision-making model is getting increasingly popular.

So today, you’ll learn to train your first Offline Decision Transformer model from scratch to make a half-cheetah run. We'll train it directly on a Google Colab that you can find here 👉 https://github.com/huggingface/blog/blob/main/notebooks/101_train-decision-transformers.ipynb

*An "expert" Decision Transformers model, learned using offline RL in the Gym HalfCheetah environment.*
Sounds exciting? Let's get started!

What are Decision Transformers?
Training Decision Transformers
Loading the dataset and building the Custom Data Collator
Training the Decision Transformer model with a 🤗 transformers Trainer
Conclusion
What’s next?
References
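As a taste of what the notebook builds up to, here is a minimal sketch of instantiating a Decision Transformer for HalfCheetah with 🤗 transformers and running a dummy batch through it (the data collator and Trainer wiring follow the outline above):

```python
import torch
from transformers import DecisionTransformerConfig, DecisionTransformerModel

# HalfCheetah observations are 17-dimensional and actions are 6-dimensional.
config = DecisionTransformerConfig(state_dim=17, act_dim=6, max_ep_len=1000)
model = DecisionTransformerModel(config)

# Dummy batch shaped like what a custom data collator would produce:
# one trajectory chunk of 20 timesteps.
batch, seq = 1, 20
outputs = model(
    states=torch.randn(batch, seq, config.state_dim),
    actions=torch.randn(batch, seq, config.act_dim),
    rewards=torch.randn(batch, seq, 1),
    returns_to_go=torch.randn(batch, seq, 1),
    timesteps=torch.arange(seq).reshape(batch, seq),
    attention_mask=torch.ones(batch, seq),
)
action_preds = outputs.action_preds  # predicted action at every timestep
```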
Easily Train Models with H100 GPUs on NVIDIA DGX Cloud
Today, we are thrilled to announce the launch of Train on DGX Cloud, a new service on the Hugging Face Hub, available to Enterprise Hub organizations. Train on DGX Cloud makes it easy to use open models with the accelerated compute infrastructure of NVIDIA DGX Cloud. Together, we built Train on DGX Cloud so that Enterprise Hub users can easily access the latest NVIDIA H100 Tensor Core GPUs, to fine-tune popular Generative AI models like Llama, Mistral, and Stable Diffusion, in just a few clicks within the Hugging Face Hub.

Optimizing Stable Diffusion for Intel CPUs with NNCF and 🤗 Optimum
Latent Diffusion models are game changers when it comes to solving text-to-image generation problems. Stable Diffusion is one of the most famous examples that got wide adoption in the community and industry. The idea behind the Stable Diffusion model is simple and compelling: you generate an image from a noise vector in multiple small steps refining the noise to a latent image representation. This approach works very well, but it can take a long time to generate an image if you do not have access to powerful GPUs.

Over the past five years, the OpenVINO Toolkit has encapsulated many features for high-performance inference. Initially designed for Computer Vision models, it still dominates in this domain, showing best-in-class inference performance for many contemporary models, including Stable Diffusion. However, optimizing Stable Diffusion models for resource-constrained applications requires going far beyond just runtime optimizations. And this is where the model optimization capabilities of the OpenVINO Neural Network Compression Framework (NNCF) come into play.

In this blog post, we will outline the problems of optimizing Stable Diffusion models and propose a workflow that substantially reduces the latency of such models when running on resource-constrained hardware such as CPUs. In particular, we achieved 5.1x inference acceleration and a 4x model footprint reduction compared to PyTorch.
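As a starting point, and before any NNCF optimization is applied, a Stable Diffusion model can be exported to OpenVINO and run on CPU through 🤗 Optimum Intel. A minimal sketch (model id and image size are illustrative):

```python
from optimum.intel import OVStableDiffusionPipeline

# Export the PyTorch model to OpenVINO IR on the fly and run inference on CPU.
pipe = OVStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    export=True,
)

# Fixing the input shapes lets OpenVINO apply additional optimizations.
pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)
pipe.compile()

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut.png")
```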
Training and Finetuning Embedding Models with Sentence Transformers v3
Sentence Transformers is a Python library for using and training embedding models for a wide range of applications, such as retrieval augmented generation, semantic search, semantic textual similarity, paraphrase mining, and more. Its v3.0 update is the largest since the project's inception, introducing a new training approach. In this blogpost, I'll show you how to use it to finetune Sentence Transformer models to improve their performance on specific tasks. You can also use this method to train new Sentence Transformer models from scratch.

Finetuning Sentence Transformers now involves several components, including datasets, loss functions, training arguments, evaluators, and the new trainer itself. I'll go through each of these components in detail and provide examples of how to use them to train effective models.
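To give a flavour of the new training approach, here is a minimal sketch that wires these components together (the base model, dataset slice, and hyperparameters are illustrative):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# 1. A base model to finetune into an embedding model.
model = SentenceTransformer("microsoft/mpnet-base")

# 2. A dataset of (anchor, positive, negative) triplets.
train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]")

# 3. A loss function matching the dataset format.
loss = MultipleNegativesRankingLoss(model)

# 4. Training arguments and the new trainer.
args = SentenceTransformerTrainingArguments(
    output_dir="models/mpnet-base-all-nli",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```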
Train your ControlNet with diffusers 🧨
Introduction
ControlNet is a neural network structure that allows fine-grained control of diffusion models by adding extra conditions. The technique debuted with the paper Adding Conditional Control to Text-to-Image Diffusion Models, and quickly took over the open-source diffusion community thanks to the author's release of 8 different conditions to control Stable Diffusion v1-5, including pose estimations, depth maps, canny edges, sketches, and more.
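Before training your own, it helps to see what a trained ControlNet condition does at inference time. A minimal sketch with diffusers, using a publicly released canny-edge ControlNet (the conditioning image is assumed to already be a canny edge map):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Attach a canny-edge ControlNet to Stable Diffusion v1-5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# "edges.png" is assumed to be a pre-computed canny edge map of the reference image.
canny_image = load_image("edges.png")

image = pipe(
    "a portrait of a person, best quality",
    image=canny_image,
    num_inference_steps=20,
).images[0]
image.save("controlled_output.png")
```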
Finetune Stable Diffusion Models with DDPO via TRL
Introduction
Diffusion models (e.g., DALL-E 2, Stable Diffusion) are a class of generative models that are widely successful at generating images, most notably of the photorealistic kind. However, the images generated by these models may not always be on par with human preference or human intention. Thus arises the alignment problem: how does one go about making sure that a model's outputs are aligned with human preferences like “quality”, or with intent that is hard to express via prompts? This is where Reinforcement Learning comes into the picture.

In the world of Large Language Models (LLMs), Reinforcement Learning (RL) has proven to be a very effective tool for aligning said models to human preferences. It’s one of the main recipes behind the superior performance of systems like ChatGPT. More precisely, RL is the critical ingredient of Reinforcement Learning from Human Feedback (RLHF), which makes ChatGPT chat like a human.
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU
We are excited to officially release the integration of trl with peft to make Large Language Model (LLM) fine-tuning with Reinforcement Learning accessible to everyone! In this post, we explain why this is a competitive alternative to existing fine-tuning approaches.

Note that peft is a general tool that can be applied to many ML use cases, but it’s particularly interesting for RLHF, as this method is especially memory-hungry!

If you want to directly deep dive into the code, check out the example scripts directly on the documentation page of TRL.
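A minimal sketch of the trl + peft combination described here, loading a causal LM in 8-bit with a LoRA adapter and a value head ready for PPO (the model id and LoRA hyperparameters are illustrative, not the exact 20B recipe):

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model_id = "EleutherAI/gpt-neox-20b"  # illustrative; any causal LM works

# LoRA config: only small adapter matrices are trained, the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Load the frozen base model in 8-bit and attach a value head for PPO.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
    peft_config=lora_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```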
How to Install and Use the Hugging Face Unity API
The Hugging Face Unity API is an easy-to-use integration of the Hugging Face Inference API, allowing developers to access and use Hugging Face AI models in their Unity projects. In this blog post, we'll walk through the steps to install and use the Hugging Face Unity API.

Installation
Open your Unity project
Go to Window -> Package Manager
Click + and select Add Package from git URL
Enter https://github.com/huggingface/unity-api.git
Once installed, the Unity API wizard should pop up. If not, go to Window -> Hugging Face API Wizard


Enter your API key. Your API key can be created in your Hugging Face account settings.
Test the API key by clicking Test API key in the API Wizard.
Optionally, change the model endpoints to select which model to use. The model endpoint for any model that supports the Inference API can be found by going to the model on the Hugging Face website, clicking Deploy -> Inference API, and copying the URL from the API_URL field.
Configure advanced settings if desired. For up-to-date information, visit the project repository at https://github.com/huggingface/unity-api
To see examples of how to use the API, click Install Examples. You can now close the API Wizard.
AI Speech Recognition in Unity
Open Source AI Game Jam

Introduction
This tutorial guides you through the process of implementing state-of-the-art Speech Recognition in your Unity game using the Hugging Face Unity API. This feature can be used for giving commands, speaking to an NPC, improving accessibility, or any other functionality where converting spoken words to text may be useful.

To try Speech Recognition in Unity for yourself, check out the live demo on itch.io.

Prerequisites
This tutorial assumes basic knowledge of Unity. It also requires you to have installed the Hugging Face Unity API. For instructions on setting up the API, check out our earlier blog post.