HF-hub - Share and discover more about AI with social posts from the community.
StackLLaMA: A hands-on guide to train LLaMA with RLHF
Models such as ChatGPT, GPT-4, and Claude are powerful language models that have been fine-tuned using a method called Reinforcement Learning from Human Feedback (RLHF) to be better aligned with how we expect them to behave and would like to use them.

In this blog post, we show all the steps involved in training a LLaMA model to answer questions on Stack Exchange with RLHF through a combination of:

Supervised Fine-tuning (SFT)
Reward / preference modeling (RM)
Reinforcement Learning from Human Feedback (RLHF)
#stackllama
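The post walks through each of these stages with the trl library. Below is a minimal, hedged sketch of the final RLHF stage (PPO) to give a feel for the training loop: it assumes an older trl release (PPOTrainer.step has since been reworked), uses gpt2 as a stand-in for the fine-tuned LLaMA checkpoint, and replaces the reward model's score with a placeholder.

```python
# Hedged sketch of the final RLHF stage (PPO) with trl, assuming an older trl
# release (PPOTrainer.step has since been reworked). "gpt2" stands in for the
# SFT'd LLaMA checkpoint, and the reward below is a placeholder for the RM score.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, create_reference_model

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = create_reference_model(model)
ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

query = tokenizer.encode("Question: How do I reverse a list in Python?\nAnswer:", return_tensors="pt")
generated = model.generate(query, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
response = generated[:, query.shape[1]:]  # keep only the generated continuation

reward = [torch.tensor(1.0)]  # placeholder; in the post this comes from the trained reward model (RM)
stats = ppo_trainer.step([query[0]], [response[0]], reward)
```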
Creating a Coding Assistant with StarCoder
If you’re a software developer, chances are that you’ve used GitHub Copilot or ChatGPT to solve programming tasks such as translating code from one language to another or generating a full implementation from a natural language query like “Write a Python program to find the Nth Fibonacci number”. Although impressive in their capabilities, these proprietary systems typically come with several drawbacks, including a lack of transparency on the public data used to train them and the inability to adapt them to your domain or codebase.

Fortunately, there are now several high-quality open-source alternatives! These include SalesForce’s CodeGen Mono 16B for Python and Replit’s 3B parameter model trained on 20 programming languages. https://github.com/huggingface/blog/blob/main/starchat-alpha.md
StarCoder: A State-of-the-Art LLM for Code
Introducing StarCoder
StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, we trained a ~15B parameter model for 1 trillion tokens. We fine-tuned the StarCoderBase model on 35B Python tokens, resulting in a new model that we call StarCoder.

We found that StarCoderBase outperforms existing open Code LLMs on popular programming benchmarks and matches or surpasses closed models such as code-cushman-001 from OpenAI (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of interesting applications. For example, by prompting the StarCoder models with a series of dialogues, we enabled them to act as a technical assistant. In addition, the models can be used to autocomplete code, make modifications to code via instructions, and explain a code snippet in natural language. We take several important steps towards a safe open model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and we make StarCoder publicly available under an improved version of the OpenRAIL license. The updated license simplifies the process for companies to integrate the model into their products. We believe that with their strong performance, the StarCoder models will serve as a solid foundation for the community to use and adapt to their use cases and products.
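As a quick illustration of the autocomplete use case, here is a hedged transformers snippet. The bigcode/starcoder checkpoint is gated on the Hub, and running a 15B model realistically requires a GPU and/or quantization, which this bare-bones sketch omits.

```python
# Hedged autocomplete sketch with transformers. bigcode/starcoder is gated on the
# Hub (accept the license and authenticate first); in practice, load the model on
# a GPU or use a quantized variant rather than this minimal CPU example.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```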
StarCoder2 and The Stack v2
StarCoder2
BigCode is releasing StarCoder2, the next generation of transparently trained open code LLMs. All StarCoder2 variants were trained on The Stack v2, a new large and high-quality code dataset. We release all models and datasets, as well as the data processing and training code. Check out the paper for details.

What is StarCoder2?
StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective. https://github.com/huggingface/blog/blob/main/starcoder2.md #StarCoder2
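Because the models are trained with Fill-in-the-Middle, they can complete code given both a prefix and a suffix. The sketch below is illustrative only: the FIM special tokens shown are assumptions carried over from the original StarCoder family, so check the StarCoder2 model card for the exact tokens and prompt format.

```python
# Illustrative Fill-in-the-Middle prompt for StarCoder2. The special tokens below
# are an assumption carried over from the original StarCoder; verify the exact FIM
# tokens and format on the StarCoder2 model card before relying on them.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "<fim_prefix>def average(numbers):\n    return <fim_suffix>\n<fim_middle>"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```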
Hosting your Models and Datasets on Hugging Face Spaces using Streamlit
Showcase your Datasets and Models using Streamlit on Hugging Face Spaces
Streamlit allows you to visualize datasets and build demos of Machine Learning models in a neat way. In this blog post we will walk you through hosting models and datasets and serving your Streamlit applications in Hugging Face Spaces. #streamlit-spaces
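For a sense of what such a Space looks like, here is a minimal app.py sketch (not taken from the post itself) that previews a dataset and wraps a model behind a Streamlit widget; the dataset id and the default sentiment pipeline are placeholder choices.

```python
# Minimal app.py sketch for a Streamlit Space (not from the post); the dataset id
# and the default sentiment-analysis pipeline are placeholder choices.
import streamlit as st
from datasets import load_dataset
from transformers import pipeline

st.title("Dataset and model showcase")

# Preview a slice of a public dataset as a table.
dataset = load_dataset("imdb", split="train[:100]")
st.dataframe(dataset.to_pandas().head())

# Wrap a model behind a simple text box.
classifier = pipeline("sentiment-analysis")
text = st.text_input("Enter a sentence to classify")
if text:
    st.json(classifier(text))
```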
Summer At Hugging Face 😎
Summer is now officially over and these last few months have been quite busy at Hugging Face. From new features in the Hub to research and Open Source development, our team has been working hard to empower the community through open and collaborative technology.

In this blog post you'll catch up on everything that happened at Hugging Face in June, July and August!
Supercharged Customer Service with Machine Learning
Open In Colab
In this blog post, we will simulate a real-world customer service use case and use machine learning tools from the Hugging Face ecosystem to address it.

We strongly recommend using this notebook as a template/example to solve your real-world use case.

Defining Task, Dataset & Model
Before jumping into the actual coding part, it's important to have a clear definition of the use case that you would like to automate or partly automate. A clear definition of the use case helps identify the most suitable task, the dataset to use, and the model to apply. https://github.com/huggingface/blog/blob/main/supercharge-customer-service-with-machine-learning.md
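As a purely illustrative example of turning such a definition into code (not necessarily the exact setup used in the post), once you settle on a concrete formulation, e.g. routing tickets by sentiment, a Hub model can be tried in a few lines before committing to a dataset and fine-tuning:

```python
# Purely illustrative (not necessarily the post's exact setup): trying a Hub model
# on a concrete formulation of the use case, e.g. routing tickets by sentiment.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default English sentiment model
tickets = [
    "My order arrived broken and support never answered my emails.",
    "Thanks for the quick refund, great service!",
]
for ticket, result in zip(tickets, classifier(tickets)):
    print(f"{result['label']:>8}  {ticket}")
```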
Releasing Swift Transformers: Run On-Device LLMs in Apple Devices
I have a lot of respect for iOS/Mac developers. I started writing apps for iPhones in 2007, when not even APIs or documentation existed. The new devices adopted some unfamiliar decisions in the constraint space, with a combination of power, screen real estate, UI idioms, network access, persistence, and latency that was different to what we were used to before. Yet, this community soon managed to create top-notch applications that felt at home with the new paradigm.

I believe that ML is a new way to build software, and I know that many Swift developers want to incorporate AI features in their apps. The ML ecosystem has matured a lot, with thousands of models that solve a wide variety of problems. Moreover, LLMs have recently emerged as almost general-purpose tools – they can be adapted to new domains as long as we can model our task to work on text or text-like data. We are witnessing a defining moment in computing history, where LLMs are going out of research labs and becoming computing tools for everybody.

However, using an LLM such as Llama in an app involves several tasks which many people face and solve alone. We have been exploring this space and would love to continue working on it with the community. We aim to create a set of tools and building blocks that help developers build faster.

Today, we are publishing this guide to go through the steps required to run a model such as Llama 2 on your Mac using Core ML. We are also releasing alpha libraries and tools to support developers in the journey. We are calling all Swift developers interested in ML – is that all Swift developers? – to contribute with PRs, bug reports, or opinions to improve this together.

Let's go! https://github.com/huggingface/blog/blob/main/swift-coreml-llm.md
Synthetic data: save money, time and carbon with open source
tl;dr
Should you fine-tune your own model or use an LLM API? Creating your own model puts you in full control but requires expertise in data collection, training, and deployment. LLM APIs are much easier to use but force you to send your data to a third party and create costly dependencies on LLM providers. This blog post shows how you can combine the convenience of LLMs with the control and efficiency of customized models.

We show how to use an open-source LLM to create synthetic data to train your customized model in a few steps. Our resulting custom RoBERTa model can analyze a large news corpus for around $2.7 compared to $3061 with GPT-4; it emits around 0.12 kg CO2 compared to very roughly 735 to 1100 kg CO2 with GPT-4; it has a latency of 0.13 seconds compared to often multiple seconds with GPT-4; and it performs on par with GPT-4 at identifying investor sentiment (both 94% accuracy and 0.94 F1 macro).
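The core idea can be sketched in a few lines: prompt an open LLM to label raw texts, then use those synthetic labels to fine-tune a small model such as RoBERTa. The model id, prompt, and label parsing below are illustrative assumptions, not the exact setup from the post.

```python
# Hedged sketch of the synthetic-data recipe: prompt an open LLM to label raw
# texts, then fine-tune a small classifier (e.g. RoBERTa) on those labels.
# The model id, prompt, and label parsing are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

headlines = [
    "Shares rallied after the company beat earnings expectations.",
    "Investors fled the stock following weak guidance.",
]
synthetic_labels = []
for headline in headlines:
    prompt = (
        "Classify the investor sentiment of this headline as positive or negative.\n"
        f"Headline: {headline}\nSentiment:"
    )
    completion = generator(prompt, max_new_tokens=4, do_sample=False, return_full_text=False)[0]["generated_text"]
    synthetic_labels.append("positive" if "positive" in completion.lower() else "negative")

# `headlines` + `synthetic_labels` would then feed a standard RoBERTa fine-tuning
# run with the Trainer API, replacing expensive per-document calls to a closed API.
print(list(zip(headlines, synthetic_labels)))
```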
Efficient Controllable Generation for SDXL with T2I-Adapters


T2I-Adapter is an efficient plug-and-play model that provides extra guidance to pre-trained text-to-image models while freezing the original large text-to-image models. T2I-Adapter aligns internal knowledge in T2I models with external control signals. We can train various adapters according to different conditions and achieve rich control and editing effects.

As a contemporaneous work, ControlNet has a similar function and is widely used. However, it can be computationally expensive to run. This is because, during each denoising step of the reverse diffusion process, both the ControlNet and UNet need to be run. In addition, ControlNet emphasizes the importance of copying the UNet encoder as a control model, resulting in a larger number of parameters. Thus, the generation is bottlenecked by the size of the ControlNet (the larger it is, the slower the process becomes).

#SDXL
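In diffusers, a T2I-Adapter plugs into SDXL roughly as sketched below; the checkpoint ids and arguments follow the diffusers documentation at the time of writing and may differ in newer releases.

```python
# Hedged diffusers sketch of T2I-Adapter guidance for SDXL; checkpoint ids and
# arguments follow the diffusers docs at the time of writing and may change.
import torch
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("canny_edges.png")  # precomputed control image (placeholder path)
image = pipe("a futuristic city at night, highly detailed", image=canny_image).images[0]
image.save("t2i_adapter_sdxl.png")
```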
Efficient Table Pre-training without Real Data: An Introduction to TAPEX
In recent years, language model pre-training has achieved great success via leveraging large-scale textual data. By employing pre-training tasks such as masked language modeling, these models have demonstrated surprising performance on several downstream tasks. However, the dramatic gap between the pre-training task (e.g., language modeling) and the downstream task (e.g., table question answering) makes existing pre-training not efficient enough. In practice, we often need an extremely large amount of pre-training data to obtain promising improvement, even for domain-adaptive pretraining. How might we design a pre-training task to close the gap, and thus accelerate pre-training?
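For context, this is roughly how a fine-tuned TAPEX checkpoint is used for table question answering with transformers; the checkpoint id and the toy table are illustrative.

```python
# Hedged usage sketch of a fine-tuned TAPEX checkpoint for table question answering
# with transformers; the checkpoint id and the toy table are illustrative.
import pandas as pd
from transformers import BartForConditionalGeneration, TapexTokenizer

checkpoint = "microsoft/tapex-base-finetuned-wtq"
tokenizer = TapexTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

table = pd.DataFrame({"city": ["Paris", "Berlin"], "population": ["2.1 million", "3.6 million"]})
query = "what is the population of Berlin?"

encoding = tokenizer(table=table, query=query, return_tensors="pt")
outputs = model.generate(**encoding)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```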
Hugging Face's TensorFlow Philosophy
Introduction
Despite increasing competition from PyTorch and JAX, TensorFlow remains the most-used deep learning framework. It also differs from those other two libraries in some very important ways. In particular, it’s quite tightly integrated with its high-level API Keras, and its data loading library tf.data.

There is a tendency among PyTorch engineers (picture me staring darkly across the open-plan office here) to see this as a problem to be overcome; their goal is to figure out how to make TensorFlow get out of their way so they can use the low-level training and data-loading code they’re used to. This is entirely the wrong way to approach TensorFlow! Keras is a great high-level API. If you push it out of the way in any project bigger than a couple of modules you’ll end up reproducing most of its functionality yourself when you realize you need it.

As refined, respected and highly attractive TensorFlow engineers, we want to use the incredible power and flexibility of cutting-edge models, but we want to handle them with the tools and API we’re familiar with. This blogpost will be about the choices we make at Hugging Face to enable that, and what to expect from the framework as a TensorFlow programmer.
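To make the "don't fight Keras" point concrete, here is a hedged sketch of the workflow the post advocates: a 🤗 Transformers TensorFlow model compiles and fits like any other Keras model, falling back to its built-in task loss when none is passed to compile(). The tiny in-memory dataset is purely for illustration.

```python
# Hedged sketch of the workflow the post argues for: treat a 🤗 Transformers TF
# model as a regular Keras model. When no loss is passed to compile(), the model
# falls back to its built-in task loss. The tiny in-memory dataset is illustrative.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]
labels = [1, 0]
features = dict(tokenizer(texts, padding=True, return_tensors="np"))
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

model.compile(optimizer=tf.keras.optimizers.Adam(3e-5))  # internal loss is used automatically
model.fit(dataset, epochs=1)
```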
Hugging Face Text Generation Inference available for AWS Inferentia2
We are excited to announce the general availability of Hugging Face Text Generation Inference (TGI) on AWS Inferentia2 and Amazon SageMaker.

Text Generation Inference (TGI) is a purpose-built solution for deploying and serving Large Language Models (LLMs) for production workloads at scale. TGI enables high-performance text generation using Tensor Parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, and more. Text Generation Inference is used in production by companies such as Grammarly, Uber, Deutsche Telekom, and many more.

The integration of TGI into Amazon SageMaker, in combination with AWS Inferentia2, presents a powerful solution and viable alternative to GPUs for building production LLM applications. The seamless integration ensures easy deployment and maintenance of models, making LLMs more accessible and scalable for a wide range of production use cases.

With the new TGI for AWS Inferentia2 on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly-concurrent, low-latency LLM experiences like HuggingChat, OpenAssistant, and Serverless Endpoints for LLMs on the Hugging Face Hub.
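A deployment typically looks something like the hedged sketch below. The backend name, environment keys, and instance type are assumptions to verify against the current sagemaker SDK and TGI Neuron documentation before use.

```python
# Hedged SageMaker deployment sketch for TGI on Inferentia2 (inf2). The backend
# name, environment keys, and instance type are assumptions; check the current
# sagemaker SDK and TGI Neuron documentation for the exact values.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()               # works inside SageMaker; otherwise pass an IAM role ARN
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")  # TGI image built for Inferentia2

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",  # gated model; also requires an HF token
        "HF_NUM_CORES": "2",
        "HF_BATCH_SIZE": "4",
        "HF_SEQUENCE_LENGTH": "4096",
        "HF_AUTO_CAST_TYPE": "fp16",
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.inf2.8xlarge")
print(predictor.predict({"inputs": "What is deep learning?"}))
```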
Deploying TensorFlow Vision Models in Hugging Face with TF Serving
Open In Colab
In the past few months, the Hugging Face team and external contributors added a variety of vision models in TensorFlow to Transformers. This list is growing steadily and already includes state-of-the-art pre-trained models like Vision Transformer, Masked Autoencoders, RegNet, ConvNeXt, and many others!

When it comes to deploying TensorFlow models, you have a variety of options. Depending on your use case, you may want to expose your model as an endpoint or package it in an application itself. TensorFlow provides tools that cater to each of these different scenarios.

In this post, you'll see how to deploy a Vision Transformer (ViT) model (for image classification) locally using TensorFlow Serving (TF Serving). This will allow developers to expose the model either as a REST or gRPC endpoint. Moreover, TF Serving supports many deployment-specific features off-the-shelf such as model warmup, server-side batching, etc.

To get the complete working code shown throughout this post, refer to the Colab Notebook shown at the beginning.
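The gist of the export step is sketched below as a simplified stand-in for the full recipe in the post (which also folds image preprocessing into the serving signature): wrap the model call in a serving function and save it as a SavedModel that TF Serving can load.

```python
# Hedged export sketch: wrap a TF ViT classifier in a serving function and save it
# as a SavedModel for TF Serving. The full post also folds image preprocessing into
# the signature; this simplified version expects preprocessed pixel values.
import tensorflow as tf
from transformers import TFViTForImageClassification

model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

@tf.function(input_signature=[tf.TensorSpec([None, 3, 224, 224], tf.float32, name="pixel_values")])
def serving_fn(pixel_values):
    return {"logits": model(pixel_values=pixel_values).logits}

tf.saved_model.save(model, "vit/1", signatures={"serving_default": serving_fn})
# Serve it with, for example:
#   docker run -p 8501:8501 -v "$PWD/vit:/models/vit" -e MODEL_NAME=vit tensorflow/serving
```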
Text-to-Video: The Task, Challenges and the Current State

Video samples generated with ModelScope.

Text-to-video is next in line in the long list of incredible advances in generative models. As self-descriptive as it is, text-to-video is a fairly new computer vision task that involves generating a sequence of images from text descriptions that are both temporally and spatially consistent. While this task might seem extremely similar to text-to-image, how do text-to-video models differ from their text-to-image counterparts, and what kind of performance can we expect from them?

We will start by reviewing the differences between the text-to-video and text-to-image tasks, then we will cover the most recent developments in text-to-video models, exploring how these methods work and what they are capable of. Finally, we will talk about what we are working on at Hugging Face to facilitate the integration and use of these models and share some cool demos and resources both on and outside of the Hugging Face Hub. #Text-to-Video
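For a taste of the current state, here is a hedged diffusers sketch using the ModelScope text-to-video checkpoint; the model id and the export_to_video helper follow the diffusers docs and may change between versions.

```python
# Hedged diffusers sketch using the ModelScope text-to-video checkpoint; the model
# id and export_to_video helper follow the diffusers docs and may change between versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

result = pipe("a panda surfing a wave", num_inference_steps=25)
frames = result.frames  # in recent diffusers versions this is batched per prompt (use result.frames[0])
print(export_to_video(frames))
```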
Making a web app generator with open ML models
As more code generation models become publicly available, it is now possible to do text-to-web and even text-to-app in ways that we couldn't imagine before.

This tutorial presents a direct approach to AI web content generation by streaming and rendering the content all in one go.

Try the live demo here! → Webapp Factory
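The streaming idea can be sketched with huggingface_hub's InferenceClient: tokens from a hosted code model are written out as they arrive, so the generated page can be rendered incrementally. The model id and prompt are placeholder assumptions, not the exact Webapp Factory setup.

```python
# Hedged sketch of the streaming approach: tokens from a hosted code model are
# written out as they arrive so the generated page can be rendered incrementally.
# The model id and prompt are placeholders, not the exact Webapp Factory setup.
from huggingface_hub import InferenceClient

client = InferenceClient("bigcode/starcoder")
prompt = "<!-- A single-file HTML page with a centered 'Hello, world!' heading -->\n<!DOCTYPE html>"

with open("index.html", "w") as page:
    page.write(prompt)
    for token in client.text_generation(prompt, max_new_tokens=200, stream=True):
        page.write(token)  # a real app would stream this to the browser instead
```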
Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator
With the Generative AI (GenAI) revolution in full swing, text-generation with open-source transformer models like Llama 2 has become the talk of the town. AI enthusiasts as well as developers are looking to leverage the generative abilities of such models for their own use cases and applications. This article shows how easy it is to generate text with the Llama 2 family of models (7b, 13b and 70b) using Optimum Habana and a custom pipeline class – you'll be able to run the models with just a few lines of code!

This custom pipeline class has been designed to offer great flexibility and ease of use. Moreover, it provides a high level of abstraction and performs end-to-end text generation, which involves pre-processing and post-processing. There are multiple ways to use the pipeline: you can run the run_pipeline.py script from the Optimum Habana repository, add the pipeline class to your own Python scripts, or initialize LangChain classes with it.
Faster TensorFlow models in Hugging Face Transformers
Open In Colab
In the last few months, the Hugging Face team has been working hard on improving Transformers’ TensorFlow models to make them more robust and faster. The recent improvements are mainly focused on two aspects:

Computational performance: BERT, RoBERTa, ELECTRA and MPNet have been improved in order to have a much faster computation time. This gain in computational performance is noticeable across all the computational aspects: graph/eager mode, TF Serving, and CPU/GPU/TPU devices.
TensorFlow Serving: each of these TensorFlow models can be deployed with TensorFlow Serving to benefit from this gain in computational performance for inference.
Faster Text Generation with TensorFlow and XLA
TL;DR: Text Generation on 🤗 transformers using TensorFlow can now be compiled with XLA. It is up to 100x faster than before, and even faster than PyTorch -- check the colab below! Open In Colab

Text Generation
As the quality of large language models increased, so did our expectations of what those models could do. Especially since the release of OpenAI's GPT-2, models with text generation capabilities have been in the spotlight. And for legitimate reasons -- these models can be used to summarize and translate, and they have even demonstrated zero-shot learning capabilities on some language tasks. This blog post will show how to make the most of this technology with TensorFlow.

The 🤗 transformers library started with NLP models, so it is natural that text generation is of utmost importance to us. It is part of Hugging Face's democratization efforts to ensure it is accessible, easily controllable, and efficient. There is a previous blog post about the different types of text generation. Nevertheless, below there's a quick recap of the core functionality -- feel free to skip it if you're familiar with our generate function and want to jump straight into TensorFlow's specificities.

Let's start with the basics. Text generation can be deterministic or stochastic, depending on the do_sample flag. By default it's set to False, causing the output to be deterministic, which is also known as Greedy Decoding. When it's set to True, also known as Sampling, the output will be stochastic, but you can still obtain reproducible results through the seed argument (with the same format as in stateless TensorFlow random number generation). As a rule of thumb, you want deterministic generation if you wish to obtain factual information from the model and stochastic generation if you're aiming at more creative outputs.
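Putting the pieces together, XLA-compiled generation follows the pattern sketched below, closely mirroring the recipe in the post: wrap generate in a jit-compiled tf.function and pad inputs to a fixed shape so the compiled function is not retraced for every new input length.

```python
# XLA-compiled generation, closely mirroring the recipe in the post: wrap generate
# in a jit-compiled tf.function and pad inputs to a fixed length so the compiled
# function is not retraced for every new input shape.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

xla_generate = tf.function(model.generate, jit_compile=True)

inputs = tokenizer(["TensorFlow is"], return_tensors="tf", padding="max_length", max_length=8)
greedy = xla_generate(**inputs, max_new_tokens=32)                                  # deterministic (greedy decoding)
sampled = xla_generate(**inputs, do_sample=True, seed=[42, 0], max_new_tokens=32)   # stochastic but reproducible via seed
print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```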
Training a language model with 🤗 Transformers using TensorFlow and TPUs
Introduction
TPU training is a useful skill to have: TPU pods are high-performance and extremely scalable, making it easy to train models at any scale from a few tens of millions of parameters up to truly enormous sizes: Google’s PaLM model (over 500 billion parameters!) was trained entirely on TPU pods.

We’ve previously written a tutorial and a Colab example showing small-scale TPU training with TensorFlow and introducing the core concepts you need to understand to get your model working on TPU. This time, we’re going to step that up another level and train a masked language model from scratch using TensorFlow and TPU, including every step from training your tokenizer and preparing your dataset through to the final model training and uploading. This is the kind of task that you’ll probably want a dedicated TPU node (or VM) for, rather than just Colab, and so that’s where we’ll focus.

As in our Colab example, we’re taking advantage of TensorFlow's very clean TPU support via XLA and TPUStrategy. We’ll also be benefiting from the fact that the majority of the TensorFlow models in 🤗 Transformers are fully XLA-compatible. So, surprisingly little work is needed to get them to run on TPU.

Unlike our Colab example, however, this example is designed to be scalable and much closer to a realistic training run -- although we only use a BERT-sized model by default, the code could be expanded to a much larger model and a much more powerful TPU pod slice by changing a few configuration options.
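For orientation, the TPU boilerplate such a run builds on looks roughly like the hedged sketch below: connect to the TPU, create a TPUStrategy, and build and compile the model inside its scope. The checkpoint id is a placeholder; the real run trains a tokenizer and a masked language model from scratch.

```python
# Hedged sketch of the TPU boilerplate such a run builds on: connect to the TPU,
# create a TPUStrategy, and build/compile the model inside its scope. The checkpoint
# id is a placeholder; the real run trains a tokenizer and MLM from scratch.
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # address depends on your TPU VM/node
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = TFAutoModelForMaskedLM.from_pretrained("bert-base-cased")
    model.compile(optimizer="adam")  # the model's internal MLM loss is used

# model.fit(...) would then consume a tf.data pipeline of tokenized, masked examples.
```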