HF-hub - Share and discover more about AI with social posts from the community.huggingface/OpenAi
Share and discover more about AI with social posts from the community.huggingface/OpenAi
Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive
Introduction
SD Turbo and SDXL Turbo are two fast generative text-to-image models capable of generating viable images in as little as one step, a significant improvement over the 30+ steps often required with previous Stable Diffusion models. SD Turbo is a distilled version of Stable Diffusion 2.1, and SDXL Turbo is a distilled version of SDXL 1.0. We’ve previously shown how to accelerate Stable Diffusion inference with ONNX Runtime. Not only does ONNX Runtime provide performance benefits when used with SD Turbo and SDXL Turbo, but it also makes the models accessible in languages other than Python, like C# and Java.
Supercharged Searching on the Hugging Face Hub
Open In Colab
The huggingface_hub library is a lightweight interface that provides a programmatic approach to exploring the hosting endpoints Hugging Face provides: models, datasets, and Spaces.

Up until now, searching on the Hub through this interface was tricky to pull off, and there were many aspects of it a user had to "just know" and get accustomed to.

In this article, we will be looking at a few exciting new features added to huggingface_hub to help lower that bar and provide users with a friendly API to search for the models and datasets they want to use without leaving their Jupyter or Python interfaces.

Before we begin, if you do not have the latest version of the huggingface_hub library on your system, please run the following cell:

!pip install huggingface_hub -U
SegMoE: Segmind Mixture of Diffusion Experts
SegMoE is an exciting framework for creating Mixture-of-Experts Diffusion models from scratch! SegMoE is comprehensively integrated within the Hugging Face ecosystem and comes supported with diffusers 🔥!

Among the features and integrations being released today:

Models on the Hub, with their model cards and licenses (Apache 2.0)
Github Repository to create your own MoE-style models.https://github.com/huggingface/blog/blob/main/segmoe.md blog/segmoe.md at main · huggingface/blog
How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap
👋 Hello, friends! We recently sat down with Swaraj Banerjee and Larry Zhang from Sempre Health, a startup that brings behavior-based, dynamic pricing to Healthcare. They are doing some exciting work with machine learning and are leveraging our Expert Acceleration Program to accelerate their ML roadmap.

An example of our collaboration is their new NLP pipeline to automatically classify and respond inbound messages. Since deploying it to production, they have seen more than 20% of incoming messages get automatically handled by this new system 🤯 having a massive impact on their business scalability and team workflow.

In this short video, Swaraj and Larry walk us through some of their machine learning work and share their experience collaborating with our team via the Expert Acceleration Program. Check it out:

<iframe width="100%" style="aspect-ratio: 16 / 9;"src="https://www.youtube.com/embed/QBOTlNJUtdk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
If you'd like to accelerate your machine learning roadmap with the help of our experts, as Swaraj and Larry did, visit hf.co/support to learn more about our Expert Acceleration Program and request a quote.
Sentence Transformers in the Hugging Face Hub
Over the past few weeks, we've built collaborations with many Open Source frameworks in the machine learning ecosystem. One that gets us particularly excited is Sentence Transformers.

Sentence Transformers is a framework for sentence, paragraph and image embeddings. This allows to derive semantically meaningful embeddings (1) which is useful for applications such as semantic search or multi-lingual zero shot classification. As part of Sentence Transformers v2 release, there are a lot of cool new features:

Sharing your models in the Hub easily.
Widgets and Inference API for sentence embeddings and sentence similarity.
Better sentence-embeddings models available (benchmark and models in the Hub).
Sentiment Analysis on Encrypted Data with Homomorphic Encryption
It is well-known that a sentiment analysis model determines whether a text is positive, negative, or neutral. However, this process typically requires access to unencrypted text, which can pose privacy concerns.

Homomorphic encryption is a type of encryption that allows for computation on encrypted data without needing to decrypt it first. This makes it well-suited for applications where user's personal and potentially sensitive data is at risk (e.g. sentiment analysis of private messages).

This blog post uses the Concrete-ML library, allowing data scientists to use machine learning models in fully homomorphic encryption (FHE) settings without any prior knowledge of cryptography. We provide a practical tutorial on how to use the library to build a sentiment analysis model on encrypted data.

The post covers:

transformers
how to use transformers with XGBoost to perform sentiment analysis
how to do the training
how to use Concrete-ML to turn predictions into predictions over encrypted data
how to deploy to the cloud using a client/server protocol
Last but not least, we’ll finish with a complete demo over Hugging Face Spaces to show this functionality in action.https://github.com/huggingface/blog/blob/main/sentiment-analysis-fhe.md blog/sentiment-analysis-fhe.md at main · huggingface/blog
Getting Started with Sentiment Analysis using Python
<script async defer src="https://unpkg.com/medium-zoom-element@0/dist/medium-zoom-element.min.js"></script>
Sentiment analysis is the automated process of tagging data according to their sentiment, such as positive, negative and neutral. Sentiment analysis allows companies to analyze data at scale, detect insights and automate processes.

In the past, sentiment analysis used to be limited to researchers, machine learning engineers or data scientists with experience in natural language processing. However, the AI community has built awesome tools to democratize access to machine learning in recent years. Nowadays, you can use sentiment analysis with a few lines of code and no machine learning experience at all! 🤯

In this guide, you'll learn everything to get started with sentiment analysis using Python, including:

What is sentiment analysis?
How to use pre-trained sentiment analysis models with Python
How to build your own sentiment analysis model
How to analyze tweets with sentiment analysis
Let's get started! 🚀https://github.com/huggingface/blog/blob/main/sentiment-analysis-python.md blog/sentiment-analysis-python.md at main · huggingface/blog
Getting Started with Sentiment Analysis on Twitter
<script async defer src="https://unpkg.com/medium-zoom-element@0/dist/medium-zoom-element.min.js"></script>
Sentiment analysis is the automatic process of classifying text data according to their polarity, such as positive, negative and neutral. Companies leverage sentiment analysis of tweets to get a sense of how customers are talking about their products and services, get insights to drive business decisions, and identify product issues and potential PR crises early on.

In this guide, we will cover everything you need to learn to get started with sentiment analysis on Twitter. We'll share a step-by-step process to do sentiment analysis, for both, coders and non-coders. If you are a coder, you'll learn how to use the Inference API, a plug & play machine learning API for doing sentiment analysis of tweets at scale in just a few lines of code. If you don't know how to code, don't worry! We'll also cover how to do sentiment analysis with Zapier, a no-code tool that will enable you to gather tweets, analyze them with the Inference API, and finally send the results to Google Sheets ⚡️

Read along or jump to the section that sparks 🌟 your interest:

What is sentiment analysis?
How to do Twitter sentiment analysis with code?
How to do Twitter sentiment analysis without coding?
Buckle up and enjoy the ride! 🤗https://github.com/huggingface/blog/blob/main/sentiment-analysis-twitter.md blog/sentiment-analysis-twitter.md at main · huggingface/blog
We Raised $100 Million for Open & Collaborative Machine Learning 🚀
Today we have some exciting news to share! Hugging Face has raised $100 Million in Series C funding 🔥🔥🔥 led by Lux Capital with major participations from Sequoia, Coatue and support of existing investors Addition, a_capital, SV Angel, Betaworks, AIX Ventures, Kevin Durant, Rich Kleiman from Thirty Five Ventures, Olivier Pomel (co-founder & CEO at Datadog) and more.

Series C

We've come a long way since we first open sourced PyTorch BERT in 2018 and are just getting started! 🙌

Machine learning is becoming the default way to build technology. When you think about your average day, machine learning is everywhere: from your Zoom background, to searching on Google, to ordering an Uber or writing an email with auto-complete --it's all machine learning.

Hugging Face is now the fastest growing community & most used platform for machine learning! With 100,000 pre-trained models & 10,000 datasets hosted on the platform for NLP,
SetFitABSA: Few-Shot Aspect Based Sentiment Analysis using SetFit


SetFitABSA is an efficient technique to detect the sentiment towards specific aspects within the text.

Aspect-Based Sentiment Analysis (ABSA) is the task of detecting the sentiment towards specific aspects within the text. For example, in the sentence, "This phone has a great screen, but its battery is too small", the aspect terms are "screen" and "battery" and the sentiment polarities towards them are Positive and Negative, respectively.

ABSA is widely used by organizations for extracting valuable insights by analyzing customer feedback towards aspects of products or services in various domains. However, labeling training data for ABSA is a tedious task because of the fine-grained nature (token level) of manually identifying aspects within the training samples.

Intel Labs and Hugging Face are excited to introduce SetFitABSA, a framework for few-shot training of domain-specific ABSA models;
SetFit: Efficient Few-Shot Learning Without Prompts


SetFit is significantly more sample efficient and robust to noise than standard fine-tuning.

Few-shot learning with pretrained language models has emerged as a promising solution to every data scientist's nightmare: dealing with data that has few to no labels 😱.

Together with our research partners at Intel Labs and the UKP Lab, Hugging Face is excited to introduce SetFit: an efficient framework for few-shot fine-tuning of Sentence Transformers. SetFit achieves high accuracy with little labeled data - for example, with only 8 labeled examples per class on the Customer Reviews (CR) sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

Compared to other few-shot learning methods, SetFit has several unique features:
SmolLM - blazingly fast and remarkably powerful
TL;DR
This blog post introduces SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage.

Introduction
There is increasing interest in small language models that can operate on local devices. This trend involves techniques such as distillation or quantization to compress large models, as well as training small models from scratch on large datasets. These approaches enable novel applications while dramatically reducing inference costs and improving user privacy.

Microsoft's Phi series, Alibaba's Qwen2 (less than 2B), and Meta's MobileLLM demonstrate that small models can achieve impressive results when designed and trained thoughtfully. However, most of the details about the data curation and training of these models are not publicly available.

In this blog post, we're excited to introduce SmolLM, a series of state-of-the-art small language models available in three sizes: 135M, 360M, and 1.7B parameters. These models are built on a meticulously curated high-quality training corpus, which we are releasing as SmolLM-Corpus. Smollm Corpus includes:

Cosmopedia v2: A collection of synthetic textbooks and stories generated by Mixtral (28B tokens)
Python-Edu: educational Python samples from The Stack (4B tokens)
FineWeb-Edu (deduplicated): educational web samples from FineWeb (220B tokens)
Our evaluations demonstrate that SmolLM models outperform other models in their size categories across a diverse set of benchmarks, testing common sense reasoning and world knowledge. In this blog post, we will go over the curation of each subset in the training corpus and then discuss the training and evaluation of SmolLM models.
Snorkel AI x Hugging Face: unlock foundation models for enterprises
This article is a cross-post from an originally published post on April 6, 2023 in Snorkel's blog, by Friea Berg .

As OpenAI releases GPT-4 and Google debuts Bard in beta, enterprises around the world are excited to leverage the power of foundation models. As that excitement builds, so does the realization that most companies and organizations are not equipped to properly take advantage of foundation models.

Foundation models pose a unique set of challenges for enterprises. Their larger-than-ever size makes them difficult and expensive for companies to host themselves, and using off-the-shelf FMs for production use cases could mean poor performance or substantial governance and compliance risks.

Snorkel AI bridges the gap between foundation models and practical enterprise use cases and has yielded impressive results for AI innovators like Pixability. We’re teaming with Hugging Face, best known for its enormous repository of ready-to-use open-source models, to provide enterprises with even more flexibility and choice as they develop AI applications.
Introducing Snowball Fight ☃️, our First ML-Agents Environment
We're excited to share our first custom Deep Reinforcement Learning environment: Snowball Fight 1vs1 🎉. gif

Snowball Fight is a game made with Unity ML-Agents, where you shoot snowballs against a Deep Reinforcement Learning agent. The game is hosted on Hugging Face Spaces.

👉 You can play it online here
Banque des Territoires (CDC Group) x Polyconseil x Hugging Face: Enhancing a Major French Environmental Program with a Sovereign Data Solution
Table of contents
Case Study in English - Banque des Territoires (CDC Group) x Polyconseil x Hugging Face: Enhancing a Major French Environmental Program with a Sovereign Data Solution
Executive summary
The power of RAG to meet environmental objectives
Industrializing while ensuring performance and sovereignty
A modular solution to respond to a dynamic sector
Key Success Factors Success Factors
Case Study in French - Banque des Territoires (Groupe CDC) x Polyconseil x Hugging Face : améliorer un programme environnemental français majeur grâce à une solution data souveraine
Résumé
La puissance du RAG au service d'objectifs environnementaux
Industrialiser en garantissant performance et souveraineté
Une solution modulaire pour répondre au dynamisme du secteur
Facteurs clés de succèshttps://github.com/huggingface/blog/blob/main/sovereign-data-solution-case-study.md
Space secrets leak disclosure
Earlier this week our team detected unauthorized access to our Spaces platform, specifically related to Spaces secrets. As a consequence, we have suspicions that a subset of Spaces’ secrets could have been accessed without authorization.

As a first step of remediation, we have revoked a number of HF tokens present in those secrets. Users whose tokens have been revoked already received an email notice. We recommend you refresh any key or token and consider switching your HF tokens to fine-grained access tokens which are the new default.

We are working with outside cyber security forensic specialists, to investigate the issue as well as review our security policies and procedures.

Over the past few days, we have made other significant improvements to the security of the Spaces infrastructure, including completely removing org tokens (resulting in increased traceability and audit capabilities), implementing key management service (KMS) for Spaces secrets, robustifying and expanding our system’s ability to identify leaked tokens and proactively invalidate them, and more generally improving our security across the board. We also plan on completely deprecating “classic” read and write tokens in the near future, as soon as fine-grained access tokens reach feature parity. We will continue to investigate any possible related incident.

Finally, we have also reported this incident to law enforcement agencies and Data protection authorities.

We deeply regret the disruption this incident may have caused and understand the inconvenience it may have posed to you. We pledge to use this as an opportunity to strengthen the security of our entire infrastructure. For any question, please contact us at [email protected].
Introducing Spaces Dev Mode for a seamless developer experience
Hugging Face Spaces makes it easy for you to create and deploy AI-powered demos in minutes. Over 500,000 Spaces have been created by the Hugging Face community and it keeps growing! As part of Hugging Face Spaces, we recently released support for “Dev Mode”, to make your experience of building Spaces even more seamless.

Spaces Dev Mode lets you connect with VS Code or SSH directly to your Space. In a click, you can connect to your Space, and start editing your code, removing the need to push your local changes to the Space repository using git. Let's see how to setup this feature in your Space’s settings 🔥
Visualize proteins on Hugging Face Spaces
In this post we will look at how we can visualize proteins on Hugging Face Spaces.

Update May 2024

While the method described below still works, you'll likely want to save some time and use the Molecule3D Gradio Custom Component. This component will allow users to modify the protein visualization on the fly and you can more easily set the default visualization. Simply install it using:https://github.com/huggingface/blog/blob/main/spaces_3dmoljs.md blog/spaces_3dmoljs.md at main · huggingface/blog
Welcome spaCy to the Hugging Face Hub
spaCy is a popular library for advanced Natural Language Processing used widely across industry. spaCy makes it easy to use and train pipelines for tasks like named entity recognition, text classification, part of speech tagging and more, and lets you build powerful applications to process and analyze large volumes of text.

Hugging Face makes it really easy to share your spaCy pipelines with the community! With a single command, you can upload any pipeline package, with a pretty model card and all required metadata auto-generated for you. The inference API currently supports NER out-of-the-box, and you can try out your pipeline interactively in your browser. You'll also get a live URL for your package that you can pip install from anywhere for a smooth path from prototype all the way to production!

Finding models
Over 60 canonical models can be found in the spaCy org. These models are from the latest 3.1 release, so you can try the latest realesed models right now! On top of this, you can find all spaCy models from the community here https://huggingface.co/models?filter=spacy. Models - Hugging Face
Speech Synthesis, Recognition, and More With SpeechT5
We’re happy to announce that SpeechT5 is now available in 🤗 Transformers, an open-source library that offers easy-to-use implementations of state-of-the-art machine learning models.

SpeechT5 was originally described in the paper SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing by Microsoft Research Asia. The official checkpoints published by the paper’s authors are available on the Hugging Face Hub.

If you want to jump right in, here are some demos on Spaces:

Speech Synthesis (TTS)
Voice Conversion
Automatic Speech Recognition