Optimizing your LLM in production
Note: This blog post is also available as a documentation page on Transformers.

Large Language Models (LLMs) such as GPT3/4, Falcon, and Llama are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries. Deploying these models in real-world tasks remains challenging, however:

To exhibit near-human text understanding and generation capabilities, LLMs currently need to be composed of billions of parameters (see Kaplan et al., Wei et al.). This consequently amplifies the memory demands for inference.
In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.
The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.
https://github.com/huggingface/blog/blob/main/optimize-llm.md
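To make the memory point concrete, here is a minimal sketch (not taken from the post linked above) of loading a large model in half precision and letting Accelerate place the weights automatically; the checkpoint name is just an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # example checkpoint; any causal LM works

# bfloat16 roughly halves the memory footprint compared to float32 weights.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # let Accelerate spread the weights across the available devices
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Deploying LLMs in production is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```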
Optimizing a Text-To-Speech model using 🤗 Transformers

🤗 Transformers provides many of the latest state-of-the-art (SoTA) models across domains and tasks. To get the best performance from these models, they need to be optimized for inference speed and memory usage.

The 🤗 Hugging Face ecosystem offers precisely such ready and easy-to-use optimization tools that can be applied across the board to all the models in the library. This makes it easy to reduce memory footprint and improve inference with just a few extra lines of code.

In this hands-on tutorial, I'll demonstrate how you can optimize Bark, a Text-To-Speech (TTS) model supported by 🤗 Transformers, based on three simple optimizations. These optimizations rely solely on the Transformers, Optimum and Accelerate libraries from the 🤗 ecosystem.
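To give a sense of what those three optimizations look like in practice, here is a rough sketch (an approximation, not the tutorial's exact snippets) combining half precision, BetterTransformer via Optimum, and CPU offload via Accelerate; it assumes a CUDA GPU and the small Bark checkpoint.

```python
import torch
from transformers import AutoProcessor, BarkModel

device = "cuda"  # the sketch assumes a CUDA GPU is available

processor = AutoProcessor.from_pretrained("suno/bark-small")

# 1) Half precision roughly halves Bark's memory footprint.
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)

# 2) BetterTransformer (via Optimum) swaps in faster attention kernels.
model = model.to_bettertransformer()

# 3) CPU offload (via Accelerate) keeps Bark's idle sub-models off the GPU between generation stages.
model.enable_cpu_offload()

inputs = processor("Hello, my dog is cute").to(device)
speech_output = model.generate(**inputs)
```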

This tutorial also demonstrates how to benchmark a non-optimized model and its optimized variants.

For a more streamlined version of the tutorial with fewer explanations but all the code, see the accompanying Google Colab.

This blog post is organized as follows:
https://github.com/huggingface/blog/blob/main/optimizing-bark.md
Accelerated Inference with Optimum and Transformers Pipelines
Inference has landed in Optimum with support for Hugging Face Transformers pipelines, including text-generation using ONNX Runtime.
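As a sketch of what that pipeline integration looks like (checkpoint chosen for illustration; older Optimum releases used from_transformers=True instead of export=True):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"  # example checkpoint

# export=True converts the PyTorch checkpoint to ONNX on the fly.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The ONNX Runtime model drops straight into the familiar Transformers pipeline API.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("ONNX Runtime makes inference", max_new_tokens=20)[0]["generated_text"])
```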

The adoption of BERT and Transformers continues to grow. Transformer-based models are now not only achieving state-of-the-art performance in Natural Language Processing but also for Computer Vision, Speech, and Time-Series. 💬 🖼 🎤

Companies are now moving from the experimentation and research phase to the production phase in order to use Transformer models for large-scale workloads. But by default BERT and its friends are relatively slow, big, and complex models compared to traditional Machine Learning algorithms.

To solve this challenge, we created Optimum – an extension of Hugging Face Transformers to accelerate the training and inference of Transformer models like BERT.
https://github.com/huggingface/blog/blob/main/optimum-inference.md
Optimum-NVIDIA on Hugging Face enables blazingly fast LLM inference in just 1 line of code
Large Language Models (LLMs) have revolutionized natural language processing and are increasingly deployed to solve complex problems at scale. Achieving optimal performance with these models is notoriously challenging due to their unique and intense computational demands. Optimized performance of LLMs is incredibly valuable for end users looking for a snappy and responsive experience, as well as for scaled deployments where improved throughput translates to dollars saved.

That's where the Optimum-NVIDIA inference library comes in. Available on Hugging Face, Optimum-NVIDIA dramatically accelerates LLM inference on the NVIDIA platform through an extremely simple API. By changing just a single line of code, you can unlock up to 28x faster inference and 1,200 tokens/second on the NVIDIA platform.
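The single-line change is the import: loading the model through optimum.nvidia instead of transformers. A minimal sketch under that assumption (the use_fp8 flag and checkpoint are illustrative):

```python
# from transformers import AutoModelForCausalLM   <-- the original line
from optimum.nvidia import AutoModelForCausalLM    # <-- the one-line change
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# use_fp8 enables the float8 path on Ada Lovelace / Hopper GPUs (illustrative flag).
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer("Optimum-NVIDIA makes LLM inference", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```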

Optimum-NVIDIA is the first Hugging Face inference library to benefit from the new float8 format supported on the NVIDIA Ada Lovelace and Hopper architectures. FP8, in addition to the advanced compilation capabilities of NVIDIA TensorRT-LLM software, dramatically accelerates LLM inference.
https://github.com/huggingface/blog/blob/main/optimum-nvidia.md
Optimum + ONNX Runtime: Easier, Faster training for your Hugging Face models
Introduction
Transformer-based models in language, vision, and speech are getting larger to support complex multi-modal use cases for the end customer. Increasing model sizes directly affect the resources needed to train these models and to scale them as they grow. Hugging Face and Microsoft’s ONNX Runtime teams are working together to advance the fine-tuning of large language, speech, and vision models. Hugging Face’s Optimum library, through its integration with ONNX Runtime for training, provides an open solution to improve training times by 35% or more for many popular Hugging Face models. We present details of both Hugging Face Optimum and the ONNX Runtime Training ecosystem, with performance numbers highlighting the benefits of using the Optimum library.
https://github.com/huggingface/blog/blob/main/optimum-onnxruntime-training.md
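In practice, the integration is designed as a near drop-in replacement for the Trainer API. The sketch below assumes the ORTTrainer and ORTTrainingArguments classes exposed by Optimum, with a tiny IMDB slice standing in for a real training set:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

model_id = "bert-base-uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# A tiny slice of IMDB, tokenized, just to have something to train on.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

# ORTTrainingArguments mirrors TrainingArguments, with ONNX Runtime as the training backend.
args = ORTTrainingArguments(output_dir="ort-finetuned-bert", per_device_train_batch_size=8, num_train_epochs=1)

trainer = ORTTrainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```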
Accelerating over 130,000 Hugging Face models with ONNX Runtime
What is ONNX Runtime?
ONNX Runtime is a cross-platform machine learning tool that can be used to accelerate a wide variety of models, particularly those with ONNX support.
https://github.com/huggingface/blog/blob/main/ort-accelerating-hf-models.md
Open-Source Text Generation & LLM Ecosystem at Hugging Face
[Updated on July 24, 2023: Added Llama 2.]

Text generation and conversational technologies have been around for ages. Earlier challenges in working with these technologies were controlling both the coherence and the diversity of the text through inference parameters and discriminative biases. More coherent outputs were less creative and closer to the original training data and sounded less human. Recent developments overcame these challenges, and user-friendly UIs enabled everyone to try these models out. Services like ChatGPT have recently put the spotlight on powerful models like GPT-4 and caused an explosion of open-source alternatives, with models like Llama going mainstream. We think these technologies will be around for a long time and become more and more integrated into everyday products.

This post is divided into the following sections:

Brief background on text generation
Licensing
Tools in the Hugging Face Ecosystem for LLM Serving
Parameter Efficient Fine Tuning (PEFT)
https://github.com/huggingface/blog/blob/main/os-llms.md
Overview of natively supported quantization schemes in 🤗 Transformers
We aim to give a clear overview of the pros and cons of each quantization scheme supported in transformers to help you decide which one you should go for.

Currently, quantized models are used for two main purposes:

Running inference of a large model on a smaller device
Fine-tuning adapters on top of quantized models
So far, two integration efforts have been made and are natively supported in transformers: bitsandbytes and auto-gptq. Note that some additional quantization schemes are also supported in the 🤗 optimum library, but this is out of scope for this blog post.
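For concreteness, here is a minimal sketch of the bitsandbytes path (4-bit loading); the checkpoint is only an example, and the auto-gptq path similarly goes through from_pretrained with a GPTQ-quantized checkpoint or config:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # example checkpoint

# 4-bit NF4 quantization via bitsandbytes, with computation in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```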

To learn more about each of the supported schemes, have a look at the resources shared below, as well as the appropriate sections of the documentation.

Note also that the details shared below are only valid for PyTorch models; TensorFlow and Flax/JAX models are currently out of scope.
Creating Privacy Preserving AI with Substra
With the recent rise of generative techniques, machine learning is at an incredibly exciting point in its history. The models powering this rise require even more data to produce impactful results, and thus it’s becoming increasingly important to explore new methods of ethically gathering data while ensuring that data privacy and security remain a top priority.

In many domains that deal with sensitive information, such as healthcare, there often isn’t enough high quality data accessible to train these data-hungry models. Datasets are siloed in different academic centers and medical institutions and are difficult to share openly due to privacy concerns about patient and proprietary information. Regulations that protect patient data such as HIPAA are essential to safeguard individuals’ private health information, but they can limit the progress of machine learning research as data scientists can’t access the volume of data required to effectively train their models. Technologies that work alongside existing regulations by proactively protecting patient data will be crucial to unlocking these silos and accelerating the pace of machine learning research and deployment in these domains.

This is where Federated Learning comes in. Check out the space we’ve created with Substra to learn more!
https://github.com/huggingface/blog/blob/main/owkin-substra.md
Welcome PaddlePaddle to the Hugging Face Hub
We are happy to share an open source collaboration between Hugging Face and PaddlePaddle on a shared mission to advance and democratize AI through open source!

First open sourced by Baidu in 2016, PaddlePaddle enables developers of all skill levels to adopt and implement Deep Learning at scale. As of Q4 2022, PaddlePaddle is being used by more than 5.35 million developers and 200,000 enterprises, ranking first in terms of market share among Deep Learning platforms in China. PaddlePaddle features popular open source repositories such as the Paddle Deep Learning Framework, model libraries across different modalities (e.g. PaddleOCR, PaddleDetection, PaddleNLP, PaddleSpeech), PaddleSlim for model compression, FastDeploy for model deployment and many more.
PaliGemma – Google's Cutting-Edge Open Vision Language Model
Updated on 23-05-2024: We have introduced a few changes to the transformers PaliGemma implementation around fine-tuning, which you can find in this notebook.

PaliGemma is a new family of vision language models from Google. PaliGemma can take an image and text as input and output text.
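As a rough illustration of that image-plus-text-in, text-out interface (the checkpoint, prompt, and image URL are chosen for illustration):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # example checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # the classic two-cats test image
image = Image.open(requests.get(url, stream=True).raw)

# The processor packs the image and the text prompt into a single model input.
inputs = processor(text="caption en", images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```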

The team at Google has released three types of models: the pretrained (pt) models, the mix models, and the fine-tuned (ft) models, each with different resolutions and available in multiple precisions for convenience.

All models are released in the Hugging Face Hub model repositories with their model cards and licenses and have transformers integration.
https://github.com/huggingface/blog/blob/main/paligemma.md
Panel on Hugging Face
We are thrilled to announce the collaboration between Panel and Hugging Face! 🎉 We have integrated a Panel template in Hugging Face Spaces to help you get started building Panel apps and deploy them on Hugging Face effortlessly.

What does Panel offer?
Panel is an open-source Python library that lets you easily build powerful tools, dashboards, and complex applications entirely in Python. It has a batteries-included philosophy, putting the PyData ecosystem, powerful data tables, and much more at your fingertips. High-level reactive APIs and lower-level callback-based APIs ensure you can quickly build exploratory applications, and you aren’t limited when you build complex, multi-page apps with rich interactivity. Panel is a member of the HoloViz ecosystem, your gateway into a connected ecosystem of data exploration tools. Panel, like the other HoloViz tools, is a NumFOCUS-sponsored project, with support from Anaconda and Blackstone.
https://github.com/huggingface/blog/blob/main/panel-on-hugging-face.md
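To give a flavor of the reactive API described above, here is a tiny Panel sketch (not from the announcement itself):

```python
import panel as pn

pn.extension()

# A slider and a function reactively bound to its value.
slider = pn.widgets.IntSlider(name="Repetitions", start=1, end=10, value=3)

def repeat(n):
    return "🤗 " * n

# pn.bind re-runs the function whenever the slider changes.
app = pn.Column("# Hello from Panel", slider, pn.bind(repeat, slider))

# Serve with `panel serve app.py`; .servable() marks the layout for serving.
app.servable()
```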
PatchTSMixer in HuggingFace - Getting Started


PatchTSMixer is a lightweight time-series modeling approach based on the MLP-Mixer architecture. It is proposed in TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting by IBM Research authors Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.

To promote open sourcing and broaden adoption, IBM Research has joined hands with the Hugging Face team to release this model in the Transformers library.

In the Hugging Face implementation, we provide PatchTSMixer’s capabilities to effortlessly facilitate lightweight mixing across patches, channels, and hidden features for effective multivariate time-series modeling. It also supports various attention mechanisms, ranging from simple gated attention to more complex self-attention blocks that can be customized accordingly.
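A bare-bones sketch of instantiating a PatchTSMixer forecaster with the Transformers classes (the configuration values are illustrative, not taken from the post):

```python
import torch
from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction

# Illustrative settings: 7 input channels, 512 past time steps, a 96-step forecast horizon.
config = PatchTSMixerConfig(
    num_input_channels=7,
    context_length=512,
    prediction_length=96,
)
model = PatchTSMixerForPrediction(config)

# A dummy batch shaped (batch, context_length, num_input_channels).
past_values = torch.randn(2, 512, 7)
outputs = model(past_values=past_values)  # outputs hold the multivariate forecast for the horizon
```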
Patch Time Series Transformer in Hugging Face - Getting Started
In this blog, we provide examples of how to get started with PatchTST. We first demonstrate the forecasting capability of PatchTST on the Electricity dataset. We will then demonstrate the transfer learning capability of PatchTST by using the previously trained model to do zero-shot forecasting on the electrical transformer (ETTh1) dataset. The zero-shot forecasting performance denotes the test performance of the model in the target domain, without any training on the target domain. Subsequently, we will do linear probing and then fine-tuning of the pretrained model on the train part of the target data, and will validate the forecasting performance on the test part of the target data.
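The linear-probing step described above amounts to freezing the PatchTST backbone and training only the forecasting head; a small sketch, assuming the backbone is exposed as model.model as is conventional in Transformers (the configuration values are illustrative):

```python
from transformers import PatchTSTConfig, PatchTSTForPrediction

# Illustrative configuration; in the post a model pretrained on Electricity is reused instead.
config = PatchTSTConfig(num_input_channels=7, context_length=512, prediction_length=96)
model = PatchTSTForPrediction(config)

# Linear probing: freeze the backbone so only the prediction head remains trainable.
for param in model.model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
```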

The PatchTST model was proposed in A Time Series is Worth 64 Words: Long-term Forecasting with Transformers by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam, and presented at ICLR 2023.
PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware
Motivation
Large Language Models (LLMs) based on the transformer architecture, like GPT, T5, and BERT, have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. They have also started foraying into other domains, such as Computer Vision (CV) (ViT, Stable Diffusion, LayoutLM) and Audio (Whisper, XLS-R). The conventional paradigm is large-scale pretraining on generic web-scale data, followed by fine-tuning on downstream tasks. Fine-tuning these pretrained LLMs on downstream datasets results in huge performance gains when compared to using the pretrained LLMs out-of-the-box (zero-shot inference, for example).

However, as models get larger and larger, full fine-tuning becomes infeasible to train on consumer hardware. In addition, storing and deploying fine-tuned models independently for each downstream task becomes very expensive, because fine-tuned models are the same size as the original pretrained model. Parameter-Efficient Fine-tuning (PEFT) approaches are meant to address both problems!
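At its core, a PEFT method such as LoRA wraps a frozen base model with a small set of trainable parameters; a minimal sketch (the checkpoint and hyperparameters are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")  # example checkpoint

# LoRA injects small trainable low-rank matrices into the attention layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the full model
```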
PEFT welcomes new merging methods
Model merging has quickly become the de-facto standard for pushing the performance limits of large language models. On the Open LLM Leaderboard, we continue to notice merged models topping the charts. Our very own Omar Sanseviero made a little sprint on model merging and discovered interesting findings.

The typical way of model merging, so far, has been to take a set of models and merge them. This post gives a nice primer on this topic. Generally, for merging multiple models, we first download their checkpoints and then perform merging. Depending on the merge algorithm and the sizes of the underlying models, this process can be quite memory-intensive. The mergekit library provides optimized ways for handling this, making the process manageable on limited memory.
https://github.com/huggingface/blog/blob/main/peft_merging.md
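With the new PEFT release, adapters loaded on the same base model can be merged in place instead; a sketch assuming two LoRA adapters trained on the same base (the adapter repository IDs are placeholders), using the add_weighted_adapter API:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # example base model

# Load two adapters trained on top of the same base model (placeholder repository IDs).
model = PeftModel.from_pretrained(base, "your-org/lora-adapter-code", adapter_name="code")
model.load_adapter("your-org/lora-adapter-chat", adapter_name="chat")

# Merge them with one of the supported combination methods (here, TIES).
model.add_weighted_adapter(
    adapters=["code", "chat"],
    weights=[0.7, 0.3],
    adapter_name="code_chat_ties",
    combination_type="ties",
    density=0.5,
)
model.set_adapter("code_chat_ties")
```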
Perceiver IO: a scalable, fully-attentional model that works on any modality
TLDR
We've added Perceiver IO to Transformers, the first Transformer-based neural network that works on all kinds of modalities (text, images, audio, video, point clouds,...) and combinations thereof. Take a look at the following Spaces to view some examples:

predicting optical flow between images
classifying images.
We also provide several notebooks.
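As a quick illustration of the image-classification example above, here is a sketch using the learned-position-embedding checkpoint from the Hub (usage follows the standard Perceiver pattern, where the forward pass takes inputs rather than pixel_values):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, PerceiverForImageClassificationLearned

model_id = "deepmind/vision-perceiver-learned"
processor = AutoImageProcessor.from_pretrained(model_id)
model = PerceiverForImageClassificationLearned.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # standard test image (two cats)
image = Image.open(requests.get(url, stream=True).raw)

pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    logits = model(inputs=pixel_values).logits
print(model.config.id2label[logits.argmax(-1).item()])
```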

Below, you can find a technical explanation of the model.
Personal Copilot: Train Your Own Coding Assistant
In the ever-evolving landscape of programming and software development, the quest for efficiency and productivity has led to remarkable innovations. One such innovation is the emergence of code generation models such as Codex, StarCoder and Code Llama. These models have demonstrated remarkable capabilities in generating human-like code snippets, thereby showing immense potential as coding assistants.

However, while these pre-trained models can perform impressively across a range of tasks, there's an exciting possibility lying just beyond the horizon: the ability to tailor a code generation model to your specific needs. Think of personalized coding assistants which could be leveraged at an enterprise scale.

In this blog post we show how we created HugCoder 🤗, a code LLM fine-tuned on the code contents from the public repositories of the huggingface GitHub organization. We will discuss our data collection workflow.
Let’s begin 🚀
A Chatbot on your Laptop: Phi-2 on Intel Meteor Lake
Because of their impressive abilities, large language models (LLMs) require significant computing power, which is seldom available on personal computers. Consequently, we have no choice but to deploy them on powerful bespoke AI servers hosted on-premises or in the cloud.

Why local LLM inference is desirable
What if we could run state-of-the-art open-source LLMs on a typical personal computer? Wouldn't we enjoy benefits like:

Increased privacy: our data would not be sent to an external API for inference.
Lower latency: we would save network round trips.
Offline work: we could work without network connectivity (a frequent flyer's dream!).
Lower cost: we wouldn't spend any money on API calls or model hosting.
Customizability: each user could find the models that best fit the tasks they work on daily, and they could even fine-tune them or use local Retrieval-Augmented Generation (RAG) to increase relevance.
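Even before any hardware-specific optimization, a plain CPU run with Transformers is already feasible for a ~2.7B-parameter model like Phi-2; a minimal sketch (not the post's optimized Meteor Lake setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/phi-2"

# In bfloat16, the ~2.7B-parameter model needs roughly 5-6 GB of RAM on a laptop CPU.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

chat = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(chat("Explain why running LLMs locally improves privacy.", max_new_tokens=64)[0]["generated_text"])
```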
Building a Playlist Generator with Sentence Transformers
A short while ago I published a playlist generator that I’d built using Sentence Transformers and Gradio, and I followed that up with a reflection on how I try to use my projects as effective learning experiences. But how did I actually build the playlist generator? In this post we’ll break down that project and look at two technical details: how the embeddings were generated, and how the multi-step Gradio demo was built.

As we’ve explored in previous posts on the Hugging Face blog, Sentence Transformers (ST) is a library that gives us tools to generate sentence embeddings, which have a variety of uses. Since I had access to a dataset of song lyrics, I decided to leverage ST’s semantic search functionality to generate playlists from a given text prompt. Specifically, the goal was to create an embedding from the prompt, use that embedding in a semantic search across a set of pre-generated lyrics embeddings, and generate a relevant set of songs. This would all be wrapped up in a Gradio app using the new Blocks API, hosted on Hugging Face Spaces.
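The core of that idea fits in a few lines with Sentence Transformers' semantic search utility; a small sketch with made-up lyrics standing in for the real corpus (the checkpoint is an example, not necessarily the one used in the project):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-v3")  # example checkpoint

# Stand-ins for the real lyrics corpus; in the project these embeddings are precomputed.
lyrics = [
    "Dancing all night under neon lights",
    "Rainy mornings and a cup of coffee",
    "Driving down the highway with the windows down",
]
lyrics_embeddings = model.encode(lyrics, convert_to_tensor=True)

# Embed the playlist prompt and retrieve the closest songs by cosine similarity.
prompt_embedding = model.encode("an upbeat summer road trip", convert_to_tensor=True)
hits = util.semantic_search(prompt_embedding, lyrics_embeddings, top_k=2)[0]

for hit in hits:
    print(lyrics[hit["corpus_id"]], round(hit["score"], 3))
```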

We’ll be looking at a slightly advanced use of Gradio, so if you’re a beginner to the library I recommend reading the Introduction to Blocks before tackling the Gradio-specific parts of this post. Also, note that while I won’t be releasing the lyrics dataset, the lyrics embeddings are available on the Hugging Face Hub for you to play around with. Let’s jump in! 🪂