Accelerating Stable Diffusion Inference on Intel CPUs
Recently, we introduced the latest generation of Intel Xeon CPUs (code name Sapphire Rapids), its new hardware features for deep learning acceleration, and how to use them to accelerate distributed fine-tuning and inference for natural language processing Transformers.

In this post, we're going to show you different techniques to accelerate Stable Diffusion models on Sapphire Rapids CPUs. A follow-up post will do the same for distributed fine-tuning.

At the time of writing, the simplest way to get your hands on a Sapphire Rapids server is to use the Amazon EC2 R7iz instance family. As it's still in preview, you have to sign up to get access. As in previous posts, I'm using an r7iz.metal-16xl instance (64 vCPU, 512 GB RAM) with an Ubuntu 20.04 AMI (ami-07cd3e6c4915b2d18).

Let's get started! Code samples are available on GitLab.
Stable Diffusion XL on Mac with Advanced Core ML Quantization
Stable Diffusion XL was released yesterday and it’s awesome. It can generate large (1024x1024) high quality images; adherence to prompts has been improved with some new tricks; it can effortlessly produce very dark or very bright images thanks to the latest research on noise schedulers; and it’s open source!

The downside is that the model is much bigger, and therefore slower and more difficult to run on consumer hardware. Using the latest release of the Hugging Face diffusers library, you can run Stable Diffusion XL on CUDA hardware in 16 GB of GPU RAM, making it possible to use it on Colab’s free tier.

The past few months have shown that people are very clearly interested in running ML models locally for a variety of reasons, including privacy, convenience, easier experimentation, or unmetered use. We’ve been working hard at both Apple and Hugging Face to explore this space. We’ve shown how to run Stable Diffusion on Apple Silicon, or how to leverage the latest advancements in Core ML to improve size and performance with 6-bit palettization.

For Stable Diffusion XL we’ve done a few things:

Ported the base model to Core ML so you can use it in your native Swift apps.
Updated Apple’s conversion and inference repo so you can convert the models yourself, including any fine-tunes you’re interested in.
Updated Hugging Face’s demo app to show how to use the new Core ML Stable Diffusion XL models downloaded from the Hub.
Explored mixed-bit palettization, an advanced compression technique that achieves important size reductions while minimizing and controlling the quality loss you incur. You can apply the same technique to your own models too!
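To make the idea of palettization more concrete, here is a minimal pure-Python sketch. It assumes that palettization means replacing every weight with an index into a small lookup table (the palette) of 2^n values, found here with a simple 1-D k-means. The function names `palettize`, `dequantize` and `mean_squared_error` are hypothetical, and this is not the code used by Apple's conversion tools. A mixed-bit scheme would then pick a different number of bits for different layers, which this sketch only hints at by comparing several bit depths on one list of weights.

```python
import random


def nearest_index(value, palette):
    """Index of the palette entry closest to value."""
    return min(range(len(palette)), key=lambda j: abs(value - palette[j]))


def palettize(weights, n_bits, iterations=10, seed=0):
    """Quantize weights to at most 2**n_bits palette values with 1-D k-means.

    Returns (palette, indices); weight i is approximated by palette[indices[i]].
    Illustrative only: real toolchains operate on tensors, not Python lists.
    """
    distinct = sorted(set(weights))
    k = min(2 ** n_bits, len(distinct))
    # Assumption: initialize the palette from randomly chosen weight values.
    palette = sorted(random.Random(seed).sample(distinct, k))
    for _ in range(iterations):
        # Assignment step: map every weight to its nearest palette entry.
        indices = [nearest_index(w, palette) for w in weights]
        # Update step: move each entry to the mean of the weights assigned to it.
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:
                palette[j] = sum(members) / len(members)
    # Final assignment against the updated palette.
    indices = [nearest_index(w, palette) for w in weights]
    return palette, indices


def dequantize(palette, indices):
    """Rebuild approximate weights from the palette and the stored indices."""
    return [palette[i] for i in indices]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one layer's weights (not real model data).
rng = random.Random(42)
weights = [rng.gauss(0.0, 1.0) for _ in range(500)]

errors = {}
for n_bits in (2, 4, 6):
    palette, indices = palettize(weights, n_bits)
    errors[n_bits] = mean_squared_error(weights, dequantize(palette, indices))
    print(f"{n_bits}-bit palette: {len(palette)} entries, MSE {errors[n_bits]:.5f}")
```

Each weight is stored as an n-bit index plus a share of the small palette, so fewer bits mean a smaller model but a larger reconstruction error. That trade-off is what a per-layer choice of bit depth tries to balance.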
Stable Diffusion with 🧨 Diffusers
Stable Diffusion 🎨 ...using 🧨 Diffusers

Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database. LAION-5B is the largest, freely accessible multi-modal dataset that currently exists.

In this post, we want to show how to use Stable Diffusion with the 🧨 Diffusers library, explain how the model works and finally dive a bit deeper into how diffusers allows one to customize the image generation pipeline.

Note: It is highly recommended to have a basic understanding of how diffusion models work. If diffusion models are completely new to you, we recommend reading one of the following blog posts:

The Annotated Diffusion Model
Getting started with 🧨 Diffusers
Now, let's get started by generating some images 🎨.
Stable Diffusion in JAX / Flax !
🤗 Hugging Face Diffusers supports Flax since version 0.5.1! This allows for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform.

This post shows how to run inference using JAX / Flax. If you want more details about how Stable Diffusion works or want to run it on a GPU, please refer to this Colab notebook.

If you want to follow along, click the button above to open this post as a Colab notebook.

First, make sure you are using a TPU backend. If you are running this notebook in Colab, select Runtime in the menu above, then select the option "Change runtime type" and then select TPU under the Hardware accelerator setting.

Note that JAX is not exclusive to TPUs, but it shines on that hardware because each TPU server has 8 TPU accelerators working in parallel.
StackLLaMA: A hands-on guide to train LLaMA with RLHF
Models such as ChatGPT, GPT-4, and Claude are powerful language models that have been fine-tuned using a method called Reinforcement Learning from Human Feedback (RLHF) to be better aligned with how we expect them to behave and would like to use them.

In this blog post, we show all the steps involved in training a LLaMA model to answer questions on Stack Exchange with RLHF through a combination of:

Supervised Fine-tuning (SFT)
Reward / preference modeling (RM)
Reinforcement Learning from Human Feedback (RLHF)
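As a small illustration of the reward modeling step, the sketch below shows a pairwise preference loss in plain Python. It assumes the common formulation in which the reward model scores a preferred ("chosen") and a less preferred ("rejected") answer to the same question and is trained to minimize -log(sigmoid(r_chosen - r_rejected)). The function name and the example scores are hypothetical, and a real training loop would compute this on batches of tensors.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one answer pair."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for three (chosen, rejected) pairs.
pairs = [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]
for chosen, rejected in pairs:
    loss = pairwise_preference_loss(chosen, rejected)
    print(f"chosen={chosen:+.1f} rejected={rejected:+.1f} loss={loss:.4f}")
```

The loss is small when the chosen answer is scored well above the rejected one and grows when the ranking is reversed, which pushes the reward model to agree with the human preferences.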
Creating a Coding Assistant with StarCoder
If you’re a software developer, chances are that you’ve used GitHub Copilot or ChatGPT to solve programming tasks such as translating code from one language to another or generating a full implementation from a natural language query like “Write a Python program to find the Nth Fibonacci number”. Although impressive in their capabilities, these proprietary systems typically come with several drawbacks, including a lack of transparency on the public data used to train them and the inability to adapt them to your domain or codebase.

StarCoder: A State-of-the-Art LLM for Code
Introducing StarCoder
StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, we trained a ~15B parameter model for 1 trillion tokens. We fine-tuned the StarCoderBase model on 35B Python tokens, resulting in a new model that we call StarCoder.

We found that StarCoderBase outperforms existing open Code LLMs on popular programming benchmarks and matches or surpasses closed models such as code-cushman-001 from OpenAI (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of interesting applications. For example, by prompting the StarCoder models with a series of dialogues, we enabled them to act as a technical assistant. In addition, the models can be used to autocomplete code, make modifications to code via instructions, and explain a code snippet in natural language. We take several important steps towards a safe open model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and we make StarCoder publicly available under an improved version of the OpenRAIL license. The updated license simplifies the process for companies to integrate the model into their products. We believe that, with their strong performance, the StarCoder models will serve as a solid foundation for the community to use and adapt to their use cases and products.
StarCoder2 and The Stack v2
BigCode is releasing StarCoder2, the next generation of transparently trained open code LLMs. All StarCoder2 variants were trained on The Stack v2, a new large and high-quality code dataset. We release all models, datasets, and the processing as well as the training code. Check out the paper for details.

What is StarCoder2?
Hosting your Models and Datasets on Hugging Face Spaces using Streamlit
Showcase your Datasets and Models using Streamlit on Hugging Face Spaces
Streamlit allows you to visualize datasets and build demos of Machine Learning models in a neat way. In this blog post we will walk you through hosting models and datasets and serving your Streamlit applications in Hugging Face Spaces.
Summer At Hugging Face 😎
Summer is now officially over and these last few months have been quite busy at Hugging Face. From new features in the Hub to research and Open Source development, our team has been working hard to empower the community through open and collaborative technology.

In this blog post you'll catch up on everything that happened at Hugging Face in June, July and August!
Supercharged Customer Service with Machine Learning
In this blog post, we will simulate a real-world customer service use case and use machine learning tools from the Hugging Face ecosystem to address it.

We strongly recommend using this notebook as a template/example to solve your real-world use case.

Defining Task, Dataset & Model
Releasing Swift Transformers: Run On-Device LLMs in Apple Devices
I have a lot of respect for iOS/Mac developers. I started writing apps for iPhones in 2007, when not even APIs or documentation existed. The new devices adopted some unfamiliar decisions in the constraint space, with a combination of power, screen real estate, UI idioms, network access, persistence, and latency that was different to what we were used to before. Yet, this community soon managed to create top-notch applications that felt at home with the new paradigm.

I believe that ML is a new way to build software, and I know that many Swift developers want to incorporate AI features in their apps. The ML ecosystem has matured a lot, with thousands of models that solve a wide variety of problems. Moreover, LLMs have recently emerged as almost general-purpose tools – they can be adapted to new domains as long as we can model our task to work on text or text-like data. We are witnessing a defining moment in computing history, where LLMs are going out of research labs and becoming computing tools for everybody.

However, using an LLM model such as Llama in an app involves several tasks which many people face and solve alone. We have been exploring this space and would love to continue working on it with the community. We aim to create a set of tools and building blocks that help developers build faster.

Today, we are publishing this guide to go through the steps required to run a model such as Llama 2 on your Mac using Core ML. We are also releasing alpha libraries and tools to support developers in the journey. We are calling all Swift developers interested in ML – is that all Swift developers? – to contribute with PRs, bug reports, or opinions to improve this together.

Synthetic data: save money, time and carbon with open source
Should you fine-tune your own model or use an LLM API? Creating your own model puts you in full control but requires expertise in data collection, training, and deployment. LLM APIs are much easier to use but force you to send your data to a third party and create costly dependencies on LLM providers. This blog post shows how you can combine the convenience of LLMs with the control and efficiency of customized models.

We show how to use an open-source LLM to create synthetic data to train your customized model in a few steps. Our resulting custom RoBERTa model can analyze a large news corpus for around $2.7 compared to $3061 with GPT4; emits around 0.12 kg CO2 compared to very roughly 735 to 1100 kg CO2 with GPT4; with a latency of 0.13 seconds compared to often multiple seconds with GPT4; while performing on par with GPT4 at identifying investor sentiment (both 94% accuracy and 0.94 F1 macro).
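To put the figures above in proportion, the short calculation below turns them into ratios. It only uses the numbers quoted in the previous paragraph; the variable names are ours, and the CO2 range is kept as a low and a high estimate because the source describes it as very rough.

```python
# Figures quoted in the text above.
custom_cost_usd = 2.7
gpt4_cost_usd = 3061.0
custom_co2_kg = 0.12
gpt4_co2_kg_low = 735.0
gpt4_co2_kg_high = 1100.0

# How many times cheaper and cleaner the custom model is on this workload.
cost_ratio = gpt4_cost_usd / custom_cost_usd
co2_ratio_low = gpt4_co2_kg_low / custom_co2_kg
co2_ratio_high = gpt4_co2_kg_high / custom_co2_kg

print(f"Cost: roughly {cost_ratio:.0f}x cheaper")
print(f"CO2: roughly {co2_ratio_low:.0f}x to {co2_ratio_high:.0f}x lower")
```

In other words, the difference is about three orders of magnitude for cost and closer to four for emissions, under the estimates given.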
Efficient Controllable Generation for SDXL with T2I-Adapters

T2I-Adapter is an efficient plug-and-play model that provides extra guidance to pre-trained text-to-image models while freezing the original large text-to-image models. T2I-Adapter aligns internal knowledge in T2I models with external control signals. We can train various adapters according to different conditions and achieve rich control and editing effects.

As a contemporaneous work, ControlNet has a similar function and is widely used. However, it can be computationally expensive to run. This is because, during each denoising step of the reverse diffusion process, both the ControlNet and UNet need to be run. In addition, ControlNet emphasizes the importance of copying the UNet encoder as a control model, resulting in a larger parameter number. Thus, the generation is bottlenecked by the size of the ControlNet (the larger, the slower the process becomes).
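The cost argument can be illustrated with a back-of-the-envelope sketch. It counts forward passes for one generated image under a made-up cost model: the relative costs (1.0, 0.5, 0.1) are arbitrary units chosen for illustration, not measurements. It also assumes that a T2I-Adapter is run only once per generation rather than at every denoising step, which follows from the adapter's design and is not stated in the paragraph above.

```python
def generation_cost(num_steps, unet_cost, control_cost, control_runs_every_step):
    """Total compute for one image, in arbitrary units (hypothetical cost model)."""
    # The control model runs at every denoising step, or just once up front.
    control_passes = num_steps if control_runs_every_step else 1
    return num_steps * unet_cost + control_passes * control_cost


steps = 30

# ControlNet-style: a large control model evaluated alongside the UNet at each step.
controlnet_total = generation_cost(
    steps, unet_cost=1.0, control_cost=0.5, control_runs_every_step=True
)

# Adapter-style (assumption): a small control model evaluated once.
adapter_total = generation_cost(
    steps, unet_cost=1.0, control_cost=0.1, control_runs_every_step=False
)

print(f"ControlNet-style total: {controlnet_total:.1f} units")
print(f"Adapter-style total:    {adapter_total:.1f} units")
```

With these illustrative numbers, the per-step control model adds 50% on top of the UNet cost, while the run-once control model adds well under 1%.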

Efficient Table Pre-training without Real Data: An Introduction to TAPEX
In recent years, language model pre-training has achieved great success via leveraging large-scale textual data. By employing pre-training tasks such as masked language modeling, these models have demonstrated surprising performance on several downstream tasks. However, the dramatic gap between the pre-training task (e.g., language modeling) and the downstream task (e.g., table question answering) makes existing pre-training not efficient enough. In practice, we often need an extremely large amount of pre-training data to obtain promising improvement, even for domain-adaptive pretraining. How might we design a pre-training task to close the gap, and thus accelerate pre-training?
Hugging Face's TensorFlow Philosophy
Despite increasing competition from PyTorch and JAX, TensorFlow remains the most-used deep learning framework. It also differs from those other two libraries in some very important ways. In particular, it’s quite tightly integrated with its high-level API Keras, and its data loading library tf.data.

There is a tendency among PyTorch engineers (picture me staring darkly across the open-plan office here) to see this as a problem to be overcome; their goal is to figure out how to make TensorFlow get out of their way so they can use the low-level training and data-loading code they’re used to. This is entirely the wrong way to approach TensorFlow! Keras is a great high-level API. If you push it out of the way in any project bigger than a couple of modules you’ll end up reproducing most of its functionality yourself when you realize you need it.

As refined, respected and highly attractive TensorFlow engineers, we want to use the incredible power and flexibility of cutting-edge models, but we want to handle them with the tools and API we’re familiar with. This blogpost will be about the choices we make at Hugging Face to enable that, and what to expect from the framework as a TensorFlow programmer.
Hugging Face Text Generation Inference available for AWS Inferentia2
We are excited to announce the general availability of Hugging Face Text Generation Inference (TGI) on AWS Inferentia2 and Amazon SageMaker.

Text Generation Inference (TGI) is a purpose-built solution for deploying and serving Large Language Models (LLMs) for production workloads at scale. TGI enables high-performance text generation using Tensor Parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, and more. Text Generation Inference is used in production by companies such as Grammarly, Uber, Deutsche Telekom, and many more.
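To give an intuition for why continuous batching helps, here is a toy simulation in plain Python. It is not TGI's scheduler: it assumes that every active request produces one token per decoding step, ignores prompt processing and memory limits, and uses made-up output lengths. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decoding steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decoding steps when a finished request frees its slot immediately."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill free slots with waiting requests before each decoding step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding step yields one token per active request; drop finished ones.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
output_lengths = [1, 1, 1, 10, 1, 1, 1, 10]

print("static:    ", static_batching_steps(output_lengths, batch_size=4))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=4))
```

In the static case, short requests wait for the longest request in their batch before new work can start. In the continuous case, their slots are reused as soon as they finish, so the same workload completes in fewer decoding steps.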

The integration of TGI into Amazon SageMaker, in combination with AWS Inferentia2, presents a powerful solution and viable alternative to GPUs for building production LLM applications. The seamless integration ensures easy deployment and maintenance of models, making LLMs more accessible and scalable for a wide range of production use cases.

With the new TGI for AWS Inferentia2 on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly-concurrent, low-latency LLM experiences like HuggingChat, OpenAssistant, and Serverless Endpoints for LLMs on the Hugging Face Hub.
Deploying TensorFlow Vision Models in Hugging Face with TF Serving
In the past few months, the Hugging Face team and external contributors added a variety of vision models in TensorFlow to Transformers. This list keeps growing and already includes state-of-the-art pre-trained models like Vision Transformer, Masked Autoencoders, RegNet, ConvNeXt, and many others!

When it comes to deploying TensorFlow models, you have got a variety of options. Depending on your use case, you may want to expose your model as an endpoint or package it in an application itself. TensorFlow provides tools that cater to each of these different scenarios.

In this post, you'll see how to deploy a Vision Transformer (ViT) model (for image classification) locally using TensorFlow Serving (TF Serving). This will allow developers to expose the model either as a REST or gRPC endpoint. Moreover, TF Serving supports many deployment-specific features off-the-shelf such as model warmup, server-side batching, etc.
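As a rough idea of what calling such a REST endpoint looks like, the sketch below builds the URL and JSON body for a TF Serving predict request using only the standard library. The model name `vit`, the host, and the placeholder input are assumptions for illustration. The actual input format depends on the serving signature the model was exported with, and 8501 is only the port conventionally used for the REST API.

```python
import json


def build_predict_request(host, model_name, instances, port=8501,
                          signature_name="serving_default"):
    """Return (url, body) for a TF Serving REST predict call.

    The request is only built, not sent, so this runs without a server.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"signature_name": signature_name, "instances": instances})
    return url, body


# Placeholder input: one instance; a real model expects whatever its
# serving signature defines (for example an image tensor or an encoded string).
url, body = build_predict_request("localhost", "vit", instances=["<encoded image>"])
print(url)
print(body)
```

The body can then be sent as an HTTP POST with any client, and the server replies with a JSON object containing the predictions.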

To get the complete working code shown throughout this post, refer to the Colab Notebook shown at the beginning.
Text-to-Video: The Task, Challenges and the Current State

Video samples generated with ModelScope.

Text-to-video is next in line in the long list of incredible advances in generative models. As self-descriptive as it is, text-to-video is a fairly new computer vision task that involves generating a sequence of images from text descriptions that are both temporally and spatially consistent. While this task might seem extremely similar to text-to-image, how do text-to-video models differ from text-to-image models, and what kind of performance can we expect from them?

We will start by reviewing the differences between the text-to-video and text-to-image tasks. Then, we will cover the most recent developments in text-to-video models, exploring how these methods work and what they are capable of. Finally, we will talk about what we are working on at Hugging Face to facilitate the integration and use of these models and share some cool demos and resources both on and outside of the Hugging Face Hub.
Making a web app generator with open ML models
As more code generation models become publicly available, it is now possible to do text-to-web and even text-to-app in ways that we couldn't imagine before.

This tutorial presents a direct approach to AI web content generation by streaming and rendering the content all in one go.
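The streaming idea can be sketched in a few lines of Python. This is only an illustration of the pattern: `generate_chunks` is a hypothetical stand-in for a model that emits its output piece by piece, and `render_stream` hands each piece to a callback as soon as it arrives instead of waiting for the full page.

```python
def generate_chunks(text, chunk_size=8):
    """Stand-in for a model that emits its output piece by piece (hypothetical)."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def render_stream(chunks, write):
    """Pass each chunk to `write` as soon as it arrives; return the full document."""
    received = []
    for chunk in chunks:
        write(chunk)  # e.g. forward the chunk to the HTTP response
        received.append(chunk)
    return "".join(received)


page = "<html><body><h1>Hello</h1></body></html>"
sent = []
full_page = render_stream(generate_chunks(page), sent.append)
print(f"streamed {len(sent)} chunks, {len(full_page)} characters in total")
```

Because the consumer writes each chunk immediately, a browser on the other end can start rendering before generation has finished.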

Try the live demo here! → Webapp Factory