We Raised $100 Million for Open & Collaborative Machine Learning 🚀
Today we have some exciting news to share! Hugging Face has raised $100 million in Series C funding 🔥🔥🔥 led by Lux Capital, with major participation from Sequoia and Coatue, and support from existing investors Addition, a_capital, SV Angel, Betaworks, AIX Ventures, Kevin Durant and Rich Kleiman from Thirty Five Ventures, Olivier Pomel (co-founder & CEO at Datadog) and more.

We've come a long way since we first open sourced PyTorch BERT in 2018 and are just getting started! 🙌

Machine learning is becoming the default way to build technology. When you think about your average day, machine learning is everywhere: from your Zoom background, to searching on Google, to ordering an Uber or writing an email with auto-complete. It's all machine learning.

Hugging Face is now the fastest-growing community and most used platform for machine learning, with 100,000 pre-trained models and 10,000 datasets hosted on the platform for NLP, computer vision, speech, and more.
SetFitABSA: Few-Shot Aspect Based Sentiment Analysis using SetFit


SetFitABSA is an efficient technique for detecting sentiment towards specific aspects within a text.

Aspect-Based Sentiment Analysis (ABSA) is the task of detecting the sentiment towards specific aspects within the text. For example, in the sentence, "This phone has a great screen, but its battery is too small", the aspect terms are "screen" and "battery" and the sentiment polarities towards them are Positive and Negative, respectively.

ABSA is widely used by organizations for extracting valuable insights by analyzing customer feedback towards aspects of products or services in various domains. However, labeling training data for ABSA is a tedious task because of the fine-grained nature (token level) of manually identifying aspects within the training samples.

Intel Labs and Hugging Face are excited to introduce SetFitABSA, a framework for few-shot training of domain-specific ABSA models.
SetFit: Efficient Few-Shot Learning Without Prompts


SetFit is significantly more sample-efficient and robust to noise than standard fine-tuning.

Few-shot learning with pretrained language models has emerged as a promising solution to every data scientist's nightmare: dealing with data that has few to no labels 😱.

Together with our research partners at Intel Labs and the UKP Lab, Hugging Face is excited to introduce SetFit: an efficient framework for few-shot fine-tuning of Sentence Transformers. SetFit achieves high accuracy with little labeled data - for example, with only 8 labeled examples per class on the Customer Reviews (CR) sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

Compared to other few-shot learning methods, SetFit has several unique features: it requires no prompts or verbalizers, it is fast to train, and it supports multilingual text classification with any Sentence Transformer on the Hub.
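To make the workflow concrete, here is a minimal training sketch following the setfit library's documented SetFitTrainer API; the SetFit/SentEval-CR dataset id and the paraphrase-mpnet-base-v2 checkpoint are illustrative choices, not the only options:

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer, sample_dataset

# Load the Customer Reviews (CR) sentiment dataset and keep 8 examples per class
dataset = load_dataset("SetFit/SentEval-CR")
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
test_dataset = dataset["test"]

# Start from any pretrained Sentence Transformer checkpoint
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # number of contrastive text pairs generated per example
)
trainer.train()
print(trainer.evaluate())  # e.g. {'accuracy': ...}
```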
SmolLM - blazingly fast and remarkably powerful
TL;DR
This blog post introduces SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage.

Introduction
There is increasing interest in small language models that can operate on local devices. This trend involves techniques such as distillation or quantization to compress large models, as well as training small models from scratch on large datasets. These approaches enable novel applications while dramatically reducing inference costs and improving user privacy.

Microsoft's Phi series, Alibaba's Qwen2 (less than 2B), and Meta's MobileLLM demonstrate that small models can achieve impressive results when designed and trained thoughtfully. However, most of the details about the data curation and training of these models are not publicly available.

In this blog post, we're excited to introduce SmolLM, a series of state-of-the-art small language models available in three sizes: 135M, 360M, and 1.7B parameters. These models are built on a meticulously curated high-quality training corpus, which we are releasing as SmolLM-Corpus. The corpus includes:

Cosmopedia v2: A collection of synthetic textbooks and stories generated by Mixtral (28B tokens)
Python-Edu: educational Python samples from The Stack (4B tokens)
FineWeb-Edu (deduplicated): educational web samples from FineWeb (220B tokens)
Our evaluations demonstrate that SmolLM models outperform other models in their size categories across a diverse set of benchmarks, testing common sense reasoning and world knowledge. In this blog post, we will go over the curation of each subset in the training corpus and then discuss the training and evaluation of SmolLM models.
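As a quick usage sketch, the released checkpoints load with the standard transformers generation API (the 135M checkpoint shown here is the smallest of the three):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-135M"  # also available: SmolLM-360M, SmolLM-1.7B
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Machine learning is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```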
Snorkel AI x Hugging Face: unlock foundation models for enterprises
This article is a cross-post of a piece originally published on Snorkel's blog on April 6, 2023, by Friea Berg.

As OpenAI releases GPT-4 and Google debuts Bard in beta, enterprises around the world are excited to leverage the power of foundation models. As that excitement builds, so does the realization that most companies and organizations are not equipped to properly take advantage of foundation models.

Foundation models pose a unique set of challenges for enterprises. Their larger-than-ever size makes them difficult and expensive for companies to host themselves, and using off-the-shelf FMs for production use cases could mean poor performance or substantial governance and compliance risks.

Snorkel AI bridges the gap between foundation models and practical enterprise use cases and has yielded impressive results for AI innovators like Pixability. We're teaming up with Hugging Face, best known for its enormous repository of ready-to-use open-source models, to provide enterprises with even more flexibility and choice as they develop AI applications.
Introducing Snowball Fight ☃️, our First ML-Agents Environment
We're excited to share our first custom Deep Reinforcement Learning environment: Snowball Fight 1vs1 🎉.

Snowball Fight is a game made with Unity ML-Agents, where you shoot snowballs against a Deep Reinforcement Learning agent. The game is hosted on Hugging Face Spaces.

👉 You can play it online here
Banque des Territoires (CDC Group) x Polyconseil x Hugging Face: Enhancing a Major French Environmental Program with a Sovereign Data Solution
Table of contents
Case Study in English - Banque des Territoires (CDC Group) x Polyconseil x Hugging Face: Enhancing a Major French Environmental Program with a Sovereign Data Solution
Executive summary
The power of RAG to meet environmental objectives
Industrializing while ensuring performance and sovereignty
A modular solution to respond to a dynamic sector
Key Success Factors
Case Study in French - Banque des Territoires (Groupe CDC) x Polyconseil x Hugging Face : améliorer un programme environnemental français majeur grâce à une solution data souveraine
Résumé
La puissance du RAG au service d'objectifs environnementaux
Industrialiser en garantissant performance et souveraineté
Une solution modulaire pour répondre au dynamisme du secteur
Facteurs clés de succès
https://github.com/huggingface/blog/blob/main/sovereign-data-solution-case-study.md
Space secrets leak disclosure
Earlier this week our team detected unauthorized access to our Spaces platform, specifically related to Spaces secrets. As a consequence, we suspect that a subset of Spaces' secrets may have been accessed without authorization.

As a first step of remediation, we have revoked a number of HF tokens present in those secrets. Users whose tokens have been revoked have already received an email notice. We recommend that you refresh any keys or tokens and consider switching your HF tokens to fine-grained access tokens, which are the new default.

We are working with outside cybersecurity forensic specialists to investigate the issue and to review our security policies and procedures.

Over the past few days, we have made other significant improvements to the security of the Spaces infrastructure, including completely removing org tokens (resulting in increased traceability and audit capabilities), implementing a key management service (KMS) for Spaces secrets, and hardening and expanding our system's ability to identify leaked tokens and proactively invalidate them. More generally, we are improving our security across the board. We also plan to completely deprecate "classic" read and write tokens in the near future, as soon as fine-grained access tokens reach feature parity. We will continue to investigate any possible related incidents.

Finally, we have also reported this incident to law enforcement agencies and data protection authorities.

We deeply regret any disruption or inconvenience this incident may have caused. We pledge to use this as an opportunity to strengthen the security of our entire infrastructure. For any questions, please contact us at [email protected].
Introducing Spaces Dev Mode for a seamless developer experience
Hugging Face Spaces makes it easy for you to create and deploy AI-powered demos in minutes. Over 500,000 Spaces have been created by the Hugging Face community, and the number keeps growing! As part of Hugging Face Spaces, we recently released support for "Dev Mode" to make your experience of building Spaces even more seamless.

Spaces Dev Mode lets you connect to your Space directly with VS Code or SSH. In one click, you can connect to your Space and start editing your code, removing the need to push your local changes to the Space repository using git. Let's see how to set up this feature in your Space's settings 🔥
Visualize proteins on Hugging Face Spaces
In this post we will look at how we can visualize proteins on Hugging Face Spaces.

Update May 2024

While the method described below still works, you'll likely want to save some time and use the Molecule3D Gradio custom component. This component lets users modify the protein visualization on the fly and makes it easier to set the default visualization. It installs with a single pip command, as sketched below.
https://github.com/huggingface/blog/blob/main/spaces_3dmoljs.md
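A minimal install sketch; the package name below is taken to be the component's PyPI name, so verify it against the component's README:

```bash
pip install gradio_molecule3d
```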
Welcome spaCy to the Hugging Face Hub
spaCy is a popular library for advanced Natural Language Processing used widely across industry. spaCy makes it easy to use and train pipelines for tasks like named entity recognition, text classification, part of speech tagging and more, and lets you build powerful applications to process and analyze large volumes of text.

Hugging Face makes it really easy to share your spaCy pipelines with the community! With a single command, you can upload any pipeline package, with a pretty model card and all required metadata auto-generated for you. The inference API currently supports NER out-of-the-box, and you can try out your pipeline interactively in your browser. You'll also get a live URL for your package that you can pip install from anywhere for a smooth path from prototype all the way to production!
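As a sketch of that single-command flow with the spacy-huggingface-hub plugin (the en_ner_fashion pipeline name and wheel path are hypothetical examples; spacy package builds the wheel from your trained pipeline):

```bash
pip install spacy-huggingface-hub
huggingface-cli login

# package your trained pipeline as a wheel, then push it to the Hub
python -m spacy package ./en_ner_fashion ./output --build wheel
python -m spacy huggingface-hub push ./output/en_ner_fashion-0.0.0/dist/en_ner_fashion-0.0.0-py3-none-any.whl
```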

Finding models
Over 60 canonical models can be found in the spaCy org. These models are from the latest 3.1 release, so you can try the latest released models right now! On top of this, you can find all spaCy models from the community at https://huggingface.co/models?filter=spacy.
Speech Synthesis, Recognition, and More With SpeechT5
We’re happy to announce that SpeechT5 is now available in 🤗 Transformers, an open-source library that offers easy-to-use implementations of state-of-the-art machine learning models.

SpeechT5 was originally described in the paper SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing by Microsoft Research Asia. The official checkpoints published by the paper’s authors are available on the Hugging Face Hub.

If you want to jump right in, here are some demos on Spaces:

Speech Synthesis (TTS)
Voice Conversion
Automatic Speech Recognition
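If you'd rather try it in code, here is a minimal text-to-speech sketch with the official checkpoints; the random speaker embedding is a stand-in for a real 512-dim x-vector, so the generated voice will sound arbitrary:

```python
import soundfile as sf
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, this is SpeechT5 speaking.", return_tensors="pt")

# SpeechT5 conditions generation on a 512-dim speaker x-vector; a random vector
# is used here purely for illustration
speaker_embeddings = torch.randn(1, 512)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)
```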
Accelerating Stable Diffusion Inference on Intel CPUs
Recently, we introduced the latest generation of Intel Xeon CPUs (code name Sapphire Rapids), its new hardware features for deep learning acceleration, and how to use them to accelerate distributed fine-tuning and inference for natural language processing Transformers.

In this post, we're going to show you different techniques to accelerate Stable Diffusion models on Sapphire Rapids CPUs. A follow-up post will do the same for distributed fine-tuning.

At the time of writing, the simplest way to get your hands on a Sapphire Rapids server is to use the Amazon EC2 R7iz instance family. As it's still in preview, you have to sign up to get access. Like in previous posts, I'm using an r7iz.metal-16xl instance (64 vCPU, 512GB RAM) with an Ubuntu 20.04 AMI (ami-07cd3e6c4915b2d18).

Let's get started! Code samples are available on GitLab.
https://github.com/huggingface/blog/blob/main/stable-diffusion-inference-intel.md
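For reference, a plain fp32 CPU baseline with diffusers looks like the sketch below (the checkpoint id is an illustrative choice); the techniques in the post then layer optimizations on top of this starting point:

```python
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative checkpoint
pipe = StableDiffusionPipeline.from_pretrained(model_id)  # fp32 on CPU by default

prompt = "sailing ship in storm by Rembrandt"
image = pipe(prompt, num_inference_steps=20).images[0]
image.save("ship.png")
```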
Stable Diffusion XL on Mac with Advanced Core ML Quantization
Stable Diffusion XL was released yesterday and it’s awesome. It can generate large (1024x1024) high quality images; adherence to prompts has been improved with some new tricks; it can effortlessly produce very dark or very bright images thanks to the latest research on noise schedulers; and it’s open source!

The downside is that the model is much bigger, and therefore slower and more difficult to run on consumer hardware. Using the latest release of the Hugging Face diffusers library, you can run Stable Diffusion XL on CUDA hardware in 16 GB of GPU RAM, making it possible to use it on Colab’s free tier.
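A rough sketch of what that looks like with diffusers; enable_model_cpu_offload is one way to stay within a 16 GB budget, though it is an assumption here rather than the post's exact recipe:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
# move submodules to the GPU only while they run, trading speed for memory
pipe.enable_model_cpu_offload()

image = pipe("an astronaut riding a horse on Mars").images[0]
image.save("sdxl.png")
```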

The past few months have shown that people are very clearly interested in running ML models locally for a variety of reasons, including privacy, convenience, easier experimentation, or unmetered use. We’ve been working hard at both Apple and Hugging Face to explore this space. We’ve shown how to run Stable Diffusion on Apple Silicon, or how to leverage the latest advancements in Core ML to improve size and performance with 6-bit palettization.

For Stable Diffusion XL we’ve done a few things:

Ported the base model to Core ML so you can use it in your native Swift apps.
Updated Apple’s conversion and inference repo so you can convert the models yourself, including any fine-tunes you’re interested in.
Updated Hugging Face’s demo app to show how to use the new Core ML Stable Diffusion XL models downloaded from the Hub.
Explored mixed-bit palettization, an advanced compression technique that achieves significant size reductions while minimizing and controlling the quality loss you incur. You can apply the same technique to your own models too!
Everything is open source and available today. Let's get on with it.
https://github.com/huggingface/blog/blob/main/stable-diffusion-xl-coreml.md
Stable Diffusion with 🧨 Diffusers
Open In Colab
Stable Diffusion 🎨 ...using 🧨 Diffusers

Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database. LAION-5B is the largest, freely accessible multi-modal dataset that currently exists.

In this post, we want to show how to use Stable Diffusion with the 🧨 Diffusers library, explain how the model works and finally dive a bit deeper into how diffusers allows one to customize the image generation pipeline.

Note: It is highly recommended to have a basic understanding of how diffusion models work. If diffusion models are completely new to you, we recommend reading one of the following blog posts:

The Annotated Diffusion Model
Getting started with 🧨 Diffusers
Now, let's get started by generating some images 🎨.
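A minimal generation sketch with the 🧨 Diffusers API (the v1-4 checkpoint matches the release this post covers; half precision on a CUDA GPU is one reasonable setup):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
```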
Stable Diffusion in JAX / Flax !
Open In Colab
🤗 Hugging Face Diffusers has supported Flax since version 0.5.1! This allows for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform.

This post shows how to run inference using JAX / Flax. If you want more details about how Stable Diffusion works or want to run it on GPU, please refer to this Colab notebook.

If you want to follow along, click the button above to open this post as a Colab notebook.

First, make sure you are using a TPU backend. If you are running this notebook in Colab, select Runtime in the menu above, then select the option "Change runtime type" and then select TPU under the Hardware accelerator setting.

Note that JAX is not exclusive to TPUs, but it shines on that hardware because each TPU server has 8 TPU accelerators working in parallel.
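Here is a condensed sketch of parallel inference across those 8 cores, following the Flax pipeline API in diffusers (the bf16 revision of the v1-4 checkpoint is an illustrative choice):

```python
import jax
import jax.numpy as jnp
from diffusers import FlaxStableDiffusionPipeline
from flax.jax_utils import replicate
from flax.training.common_utils import shard

pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jnp.bfloat16
)

num_devices = jax.device_count()  # 8 on a TPU v2/v3 host
prompt_ids = pipeline.prepare_inputs(
    ["a photograph of an astronaut riding a horse"] * num_devices
)

# replicate the weights and shard the inputs so each core generates one image
params = replicate(params)
prompt_ids = shard(prompt_ids)
rng = jax.random.split(jax.random.PRNGKey(0), num_devices)

images = pipeline(prompt_ids, params, rng, jit=True).images
```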
StackLLaMA: A hands-on guide to train LLaMA with RLHF
Models such as ChatGPT, GPT-4, and Claude are powerful language models that have been fine-tuned using a method called Reinforcement Learning from Human Feedback (RLHF) to be better aligned with how we expect them to behave and would like to use them.

In this blog post, we show all the steps involved in training a LLaMA model to answer questions on Stack Exchange with RLHF through a combination of:

Supervised Fine-tuning (SFT)
Reward / preference modeling (RM)
Reinforcement Learning from Human Feedback (RLHF)
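The RL step is driven by the trl library's PPOTrainer. The sketch below runs a single PPO update on a small stand-in model (gpt2) with a hard-coded reward; in the actual pipeline the queries come from Stack Exchange, the policy is the fine-tuned LLaMA, and the reward comes from the trained reward model:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # stand-in; the post uses a fine-tuned LLaMA-7B
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(model_name=model_name, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = tokenizer("How do I sort a list in Python?", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, max_new_tokens=24)[0][query.shape[0]:]

# hard-coded score for illustration; normally produced by the reward model
rewards = [torch.tensor(1.0)]
stats = ppo_trainer.step([query], [response], rewards)
```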
Creating a Coding Assistant with StarCoder
If you’re a software developer, chances are that you’ve used GitHub Copilot or ChatGPT to solve programming tasks such as translating code from one language to another or generating a full implementation from a natural language query like “Write a Python program to find the Nth Fibonacci number”. Although impressive in their capabilities, these proprietary systems typically come with several drawbacks, including a lack of transparency on the public data used to train them and the inability to adapt them to your domain or codebase.

Fortunately, there are now several high-quality open-source alternatives! These include Salesforce's CodeGen Mono 16B for Python and Replit's 3B parameter model trained on 20 programming languages.
https://github.com/huggingface/blog/blob/main/starchat-alpha.md
StarCoder: A State-of-the-Art LLM for Code
Introducing StarCoder
StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, we trained a ~15B parameter model on 1 trillion tokens. We fine-tuned the StarCoderBase model on 35B Python tokens, resulting in a new model that we call StarCoder.

We found that StarCoderBase outperforms existing open Code LLMs on popular programming benchmarks and matches or surpasses closed models such as code-cushman-001 from OpenAI (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of interesting applications. For example, by prompting the StarCoder models with a series of dialogues, we enabled them to act as a technical assistant. In addition, the models can be used to autocomplete code, make modifications to code via instructions, and explain a code snippet in natural language. We take several important steps towards a safe open model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and we make StarCoder publicly available under an improved version of the OpenRAIL license. The updated license simplifies the process for companies to integrate the model into their products. We believe that with its strong performance, the StarCoder models will serve as a solid foundation for the community to use and adapt to their use cases and products.
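Trying the model is a few lines with transformers; note the checkpoint is gated behind the OpenRAIL license, which you accept on the Hub before downloading:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```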
StarCoder2 and The Stack v2
StarCoder2
BigCode is releasing StarCoder2, the next generation of transparently trained open code LLMs. All StarCoder2 variants were trained on The Stack v2, a new large and high-quality code dataset. We release all models and datasets, as well as the processing and training code. Check out the paper for details.

What is StarCoder2?
StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.
https://github.com/huggingface/blog/blob/main/starcoder2.md
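As a sketch of the Fill-in-the-Middle objective in practice, here is FIM-style inference with the 3B variant; the special-token names are assumed to follow the StarCoder family convention, so verify them against the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# the model is asked to fill the span between prefix and suffix
prompt = "<fim_prefix>def print_hello():\n    <fim_suffix>\n\nprint_hello()<fim_middle>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=24)
print(tokenizer.decode(outputs[0]))
```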