Share and discover more about AI with social posts from the community.
XLSCOUT Unveils ParaEmbed 2.0: a Powerful Embedding Model Tailored for Patents and IP with Expert Support from Hugging Face
[!NOTE] This is a guest blog post by the XLSCOUT team.

XLSCOUT, a Toronto-based leader in the use of AI in intellectual property (IP), has developed a powerful proprietary embedding model called ParaEmbed 2.0 stemming from an ambitious collaboration with Hugging Face’s Expert Support Program. The collaboration focuses on applying state-of-the-art AI technologies and open-source models to enhance the understanding and analysis of complex patent documents including patent-specific terminology, context, and relationships. This allows XLSCOUT’s products to offer the best performance for drafting patent applications, patent invalidation searches, and ensuring ideas are novel compared to previously available patents and literature.

By fine-tuning on high-quality, multi-domain patent data curated by human experts, ParaEmbed 2.0 boasts a remarkable 23% increase in accuracy compared to its predecessor, ParaEmbed 1.0, which was released in October 2023. With this advancement, ParaEmbed 2.0 is now able to accurately capture context and map patents against prior art, ideas, products, or standards with even greater precision.
What is Sentence Transformers?
Sentence embeddings? Semantic search? Cosine similarity?!?! 😱 Just a few short weeks ago, these terms were so confusing to me that they made my head spin. I’d heard that Sentence Transformers was a powerful and versatile library for working with language and image data and I was eager to play around with it, but I was worried that I would be out of my depth. As it turns out, I couldn’t have been more wrong!

Sentence Transformers is among the libraries that Hugging Face integrates with, where it's described as follows:

Compute dense vector representations for sentences, paragraphs, and images

In a nutshell, Sentence Transformers answers one question: What if we could treat sentences as points in a multi-dimensional vector space? This means that ST lets you give it an arbitrary string of text (e.g., “I’m so glad I learned to code with Python!”), and it’ll transform it into a vector, such as [0.2, 0.5, 1.3, 0.9]. Another sentence, such as “Python is a great programming language.”, would be transformed into a different vector. These vectors are called “embeddings,” and they play an essential role in Machine Learning. If these two sentences were embedded with the same model, then both would coexist in the same vector space, allowing for many interesting possibilities.

What makes ST particularly useful is that, once you’ve generated some embeddings, you can use the built-in utility functions to compare how similar one sentence is to another, including synonyms! 🤯 One way to do this is with the “Cosine Similarity” function. With ST, you can skip all the pesky math and call the very handy util.cos_sim function to get a score from -1 to 1 that signifies how “similar” the embedded sentences are in the vector space they share – the bigger the score is, the more similar the sentences are!
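To make this concrete, here's a minimal sketch of embedding two sentences and comparing them with util.cos_sim (the checkpoint name is just an example small model):

```python
from sentence_transformers import SentenceTransformer, util

# Any Sentence Transformers checkpoint works; all-MiniLM-L6-v2 is a small example model.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I'm so glad I learned to code with Python!",
    "Python is a great programming language.",
]

# Each sentence becomes a dense vector (an "embedding") in the same vector space.
embeddings = model.encode(sentences)

# Cosine similarity: scores closer to 1 mean the sentences are more similar.
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)
```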
Liftoff! How to get started with your first ML project 🚀
People who are new to the Machine Learning world often run into two recurring stumbling blocks. The first is choosing the right library to learn, which can be daunting when there are so many to pick from. Even once you’ve settled on a library and gone through some tutorials, the next issue is coming up with your first big project and scoping it properly to maximize your learning. If you’ve run into those problems, and if you're looking for a new ML library to add to your toolkit, you're in the right place!

In this post I’ll take you through some tips for going from 0 to 100 with a new library by using Sentence Transformers (ST) as an example. We'll start by understanding the basics of what ST can do, and highlight some things that make it a great library to learn. Then, I'll share my battle-tested strategy for tackling your first self-driven project. We’ll also talk about how I built my first ST-powered project, and what I learned along the way 🥳
Fit More and Train Faster With ZeRO via DeepSpeed and FairScale
A guest blog post by Hugging Face fellow Stas Bekman

As recent Machine Learning models have been growing much faster than the amount of GPU memory added to newly released cards, many users are unable to train, or even just load, some of those huge models onto their hardware. While there is an ongoing effort to distill some of those huge models down to a more manageable size, that effort isn't producing small enough models soon enough.

In the fall of 2019, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He published a paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, which contains a plethora of ingenious new ideas on how to make hardware do much more than was previously thought possible. A short time later DeepSpeed was released, giving the world an open-source implementation of most of the ideas in that paper (a few ideas are still in the works). In parallel, a team from Facebook released FairScale, which also implemented some of the core ideas from the ZeRO paper.

If you use the Hugging Face Trainer, as of transformers v4.2.0 you have experimental support for DeepSpeed's and FairScale's ZeRO features. The new --sharded_ddp and --deepspeed command line Trainer arguments provide FairScale and DeepSpeed integration, respectively. Here is the full documentation.
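For illustration only (the full details are in the documentation linked above), a rough sketch of enabling DeepSpeed ZeRO through the Trainer from Python might look like this; the model, toy dataset, and ds_config_zero2.json config file are placeholders:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Toy dataset, just enough to make the example self-contained.
dataset = Dataset.from_dict({"text": ["I loved this movie.", "A terrible film."], "label": [1, 0]})
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=32)
)

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    deepspeed="ds_config_zero2.json",  # hypothetical ZeRO config; same effect as --deepspeed
)

Trainer(model=model, args=args, train_dataset=dataset).train()
```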

This blog post will describe how you can benefit from ZeRO regardless of whether you own just a single GPU or a whole stack of them.
Very Large Language Models and How to Evaluate Them
Large language models can now be evaluated on zero-shot classification tasks with Evaluation on the Hub!

Zero-shot evaluation is a popular way for researchers to measure the performance of large language models, as they have been shown to learn capabilities during training without explicitly being shown labeled examples. The Inverse Scaling Prize is an example of a recent community effort to conduct large-scale zero-shot evaluation across model sizes and families to discover tasks on which larger models may perform worse than their smaller counterparts.
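As a reminder of what zero-shot evaluation means in practice, here is a rough sketch of scoring label completions with a causal LM (this is the general idea, not the Evaluation on the Hub tool itself; the model and prompt are arbitrary examples):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM from the Hub works as a stand-in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def label_score(prompt: str, label: str) -> float:
    """Return the average log-likelihood of `label` given `prompt`."""
    full = tokenizer(prompt + label, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    # Score only the label tokens (the prediction for token i comes from position i-1).
    label_ids = full["input_ids"][0, prompt_len:]
    log_probs = torch.log_softmax(logits[0, prompt_len - 1 : -1], dim=-1)
    token_scores = log_probs[torch.arange(len(label_ids)), label_ids]
    return token_scores.mean().item()

prompt = "Review: 'A wonderful film.'\nSentiment:"
labels = [" positive", " negative"]
print(max(labels, key=lambda l: label_score(prompt, l)))
```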

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?
While developing Docmatix, we noticed that fine-tuning Florence-2 on it yielded great performance on DocVQA, but resulted in low scores on the benchmark. To enhance performance, we had to fine-tune the model further on DocVQA to learn the syntax required for the benchmark. Interestingly, this additional fine-tuning seemed to perform worse according to human evaluators, which is why we primarily used it for ablation studies and released the model only trained on Docmatix for broader use.

Although the generated answers semantically align with the reference answers, as illustrated in Figure 1, they still receive low scores. This raises the question: should we fine-tune the models to improve these metrics, or should we develop new metrics that better align with human perception?
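The LAVE-style idea referenced in the title is to let an LLM judge semantic equivalence instead of relying on string matching. A heavily simplified sketch of that idea (the judge model and prompt wording below are hypothetical, not the exact setup from the post):

```python
from transformers import pipeline

# Hypothetical judge model; any instruction-tuned chat model could play this role.
judge = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

def lave_style_score(question: str, reference: str, candidate: str) -> str:
    """Ask the LLM judge to rate how well a generated answer matches the reference."""
    prompt = (
        "You are grading answers to a document question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate from 1 (wrong) to 3 (fully correct). Reply with the number only."
    )
    output = judge(prompt, max_new_tokens=5, do_sample=False)
    return output[0]["generated_text"][len(prompt):].strip()

print(lave_style_score("What is the invoice total?", "$1,200", "1200 dollars"))
```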
KoMT-Bench is a benchmark designed to evaluate the capability of language models to follow instructions in Korean. KoMT-Bench is an in-house dataset created by translating the MT-Bench [1] dataset into Korean and modifying some questions to reflect the characteristics and cultural nuances of the Korean language. After the initial translation and modification, we asked expert linguists to conduct a thorough review of our benchmark dataset.

To conduct evaluations on KoMT-Bench, please visit the official KoMT-Bench GitHub repository in which the evaluation scripts are provided.
WikiRAG-TR is a dataset of 6K (5,999) question-answer pairs synthetically created from the introduction sections of Turkish Wikipedia articles. The dataset is intended for Turkish Retrieval-Augmented Generation (RAG) tasks.

Dataset Information
Number of Instances: 5999 (5725 synthetically generated question-answer pairs, 274 augmented negative samples)
Dataset Size: 20.5 MB
Language: Turkish
Dataset License: apache-2.0
Dataset Category: Text2Text Generation
Dataset Domain: STEM and Social Sciences
WikiRAG-TR Pipeline
The creation of the dataset was accomplished in two main phases, each represented by a separate diagram.
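A minimal sketch of loading the dataset for a RAG workflow; the Hub repository id below is a placeholder to be replaced with the actual WikiRAG-TR repo:

```python
from datasets import load_dataset

# Placeholder repo id; substitute the actual WikiRAG-TR repository on the Hub.
dataset = load_dataset("your-username/WikiRAG-TR", split="train")

print(dataset)     # row count and column names
print(dataset[0])  # one synthetic question-answer pair with its Wikipedia context
```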
Dataset Card for MedTrinity-25M
MedTrinity-25M is a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, as well as detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. This dataset can be used for large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.

Homepage: https://github.com/yunfeixie233/MedTrinity-25M
Paper: https://arxiv.org/abs/2408.02900
GitHub Repo: https://github.com/UCSC-VLAA/MedTrinity-25M
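A minimal sketch of streaming a few samples with the datasets library; the repo id is assumed from the project's GitHub organization, and the config/split names may differ from the released version:

```python
from datasets import load_dataset

# Assumed Hub repo id; check the homepage above for the exact dataset location.
ds = load_dataset("UCSC-VLAA/MedTrinity-25M", split="train", streaming=True)

for sample in ds.take(3):
    # Each sample pairs an image with multigranular text annotations (modality,
    # disease/lesion type, ROI descriptions); exact field names depend on the schema.
    print(sample.keys())
```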
This process allows the creation of two distinct datasets within Open-Critic-GPT:

Code-Preference-Pairs Dataset (DPO): Contains pairs of duplicate code examples whose only difference is that the rejected example has the bugged code 'surgically transplanted in', while the accepted example is left unchanged.
Open-Critic-GPT Dataset (SFT): Trains the model to find bugs and produce working code from broken code.
Both datasets span a total of 127 different languages/structures. (Some examples were lost in conversion due to a lack of structured output: the pipeline started with 122k examples and ended with 55k; a fine-tuned model may produce better structured outputs.)
Both datasets contain ~55K examples each (and both are derived from the same parent examples).
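To illustrate the structure (field names below are hypothetical, not necessarily the released schema), a single preference pair keeps the same prompt with an accepted and a bug-injected completion:

```python
# Hypothetical record layout for a single preference pair; the released dataset's
# actual column names may differ.
example = {
    "prompt": "Review this function and fix any bugs:\n\ndef add(a, b):\n    return a - b",
    "chosen": "def add(a, b):\n    return a + b",    # accepted: the original, working code
    "rejected": "def add(a, b):\n    return a - b",  # rejected: bugged code transplanted in
}

# Preference-optimization (DPO) training consumes rows exactly like this: one prompt,
# one chosen completion, one rejected completion.
print(example["chosen"] != example["rejected"])  # True: only the injected bug differs
```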
Dataset Card for LLaVA-OneVision
!!! We are still uploading our dataset. Stay tuned for the final version, or contact [email protected] for more details.

We provide the full details of the LLaVA-OneVision Dataset. In this dataset, we include the data splits used in both the final image stage and the one-vision stage. For more details, please check our paper.

Dataset Sources
Dataset Collection: We include a few subsets from the existing dataset collections Cambrian, Cauldron, and UReader. Since we only used a few subsets from these datasets and applied a cleaning and re-annotation process, we uploaded our processed version of these datasets into our own repository, and we thank the authors for providing the original datasets.
Other Datasets: For the remaining single-source datasets, such as AI2D and OKVQA, we cite and link the original sources in our paper.
https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data
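A minimal sketch of pulling one of the subsets with the datasets library; the config name below is only an example, and the full list is on the dataset page:

```python
from datasets import load_dataset

# The repo hosts many subsets (roughly one per source dataset); "ai2d(gpt4v)" is only
# an example config name, see the dataset viewer for the full list.
subset = load_dataset("lmms-lab/LLaVA-OneVision-Data", "ai2d(gpt4v)", split="train")

print(subset[0].keys())  # typically an image plus a multi-turn conversation
```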
Dataset Card for magpie-ultra-v0.1
This dataset has been created with distilabel.
Dataset Summary
magpie-ultra is a synthetically generated dataset for supervised fine-tuning using the new Llama 3.1 405B-Instruct model, together with other Llama models like Llama-Guard-3-8B and Meta-Llama-3.1-8B-Instruct.

The dataset contains challenging instructions and responses for a wide variety of tasks, such as coding & debugging, math, data analysis, creative writing, advice seeking, or brainstorming.

Explore the dataset in Argilla.
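A minimal sketch of loading it with the datasets library; the repo id is assumed to live under the Argilla organization, and the column names are indicative:

```python
from datasets import load_dataset

# Assumed Hub repo id based on the dataset name; adjust if it lives elsewhere.
ds = load_dataset("argilla/magpie-ultra-v0.1", split="train")

# Each row holds a synthetic instruction and response, plus generation metadata
# (column names here are indicative of the card, not guaranteed).
print(ds[0]["instruction"][:200])
```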
keras-io/wgan-molecular-graphs
Model description
This repo contains the model and notebook for implementing a generative model for graphs (WGAN-GP with R-GCN) and using it to generate novel small molecular graphs.

Full credits go to Alexander Kensert

Reproduced by Vu Minh Chien

Motivation: The development of new drugs (molecules) can be extremely time-consuming and costly. The use of deep learning models can alleviate the search for good candidate drugs, by predicting the properties of known molecules (e.g., solubility, toxicity, affinity to the target protein, etc.). As the number of possible molecules is astronomical, the space in which we search for/explore molecules is just a fraction of the entire space. Therefore, it's arguably desirable to implement generative models that can learn to generate novel molecules (which would otherwise have never been explored).
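A minimal sketch of pulling the saved Keras model from the Hub using huggingface_hub's Keras integration; generating molecules then follows the sampling code in the accompanying notebook:

```python
from huggingface_hub import from_pretrained_keras

# Downloads the saved Keras WGAN-GP model from the repo above.
model = from_pretrained_keras("keras-io/wgan-molecular-graphs")
model.summary()
```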
Butterfly GAN
Model description
Based on paper: Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis

which states: "Notably, the model converges from scratch with just a few hours of training on a single RTX-2080 GPU, and has a consistent performance, even with less than 100 training samples"

also dubbed the Light-GAN model. This model was trained using the script here which is adapted from the lucidrains repo.

Unlike the script above, I used the transforms from the official paper implementation repo, because our training images were already cropped and aligned.
https://huggingface.co/ceyda/butterfly_cropped_uniq1K_512
Generate fauvism still life image using FastGAN
Model description
FastGAN is a Generative Adversarial Network (GAN) trained on a small amount of high-fidelity images with minimal computing cost. Using a skip-layer channel-wise excitation module and a self-supervised discriminator trained as a feature encoder, the model is able to converge after a few hours of training on either 100 high-quality images or a 1000-image dataset.

This model was trained on a dataset of 124 high-quality Fauvism painting images.

How to use: https://huggingface.co/huggan/fastgan-few-shot-fauvism-still-life
gaIA: Italian Landscape GAN Model
gaIA is the first Italian GAN model trained on satellite images of a selection of Italy's main glaciers, forests, lakes, rivers, and coasts that are most affected by climate change. It is usable for scientific and artistic purposes.

Dataset
Images: 12k
Image Format: 1024x1024
Source: Copernicus Sentinel 2A
Reference Years: 2017 – June 2024
Riffusion: Optimized for Mobile Deployment
State-of-the-art generative AI model used to generate spectrogram images given any text input. These spectrograms can be converted into audio clips.
Generates high-resolution spectrogram images from text prompts using a latent diffusion model. This model uses CLIP ViT-L/14 as the text encoder, U-Net-based latent denoising, and a VAE-based decoder to generate the final image.

This model is an implementation of Riffusion found here. This repository provides scripts to run Riffusion on Qualcomm® devices. More details on model performance across various devices can be found here.
Stable-Diffusion: Optimized for Mobile Deployment
State-of-the-art generative AI model used to generate detailed images conditioned on text descriptions
Generates high-resolution images from text prompts using a latent diffusion model. This model uses CLIP ViT-L/14 as the text encoder, U-Net-based latent denoising, and a VAE-based decoder to generate the final image.

This model is an implementation of Stable-Diffusion found here. This repository provides scripts to run Stable-Diffusion on Qualcomm® devices. More details on model performance across various devices can be found here.
Model Details
Model Type: Image generation
Model Stats:
Input: Text prompt to generate image
QNN-SDK: 2.19
Text Encoder Number of parameters: 340M
UNet Number of parameters: 865M
VAE Decoder Number of parameters: 83M
Model size: 1GB
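For reference, the same underlying pipeline (CLIP text encoder, U-Net denoiser, VAE decoder) can be run with the diffusers library on a desktop GPU; this is a generic sketch with an example checkpoint, not the Qualcomm deployment scripts described above:

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint; the Qualcomm export uses the same text encoder / U-Net / VAE
# split listed in the model stats above.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```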
Score-Based Generative Modeling through Stochastic Differential Equations (SDE)
Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a. score) of the perturbed data distribution.
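For reference, the forward (noising) SDE and the reverse-time SDE from the paper can be written as:

```latex
% Forward SDE: data -> noise
dx = f(x, t)\,dt + g(t)\,dw
% Reverse-time SDE: noise -> data, driven by the score of the perturbed data distribution
dx = \left[ f(x, t) - g(t)^{2}\, \nabla_{x} \log p_{t}(x) \right] dt + g(t)\, d\bar{w}
```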
The Keras code example on denoising diffusion implicit models (DDIM).
Model description
The model uses a U-Net with identical input and output dimensions. It progressively downsamples and upsamples its input image, adding skip connections between layers having the same resolution. The architecture is a simplified version of the architecture of DDPM. It consists of convolutional residual blocks and lacks attention layers. The network takes two inputs, the noisy images and the variances of their noise components, which it encodes using sinusoidal embeddings.
https://huggingface.co/keras-io/denoising-diffusion-implicit-models
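To make the sinusoidal embedding mentioned above concrete, here is a small sketch of the idea; the frequency range and embedding size are illustrative, not necessarily the repo's exact values:

```python
import numpy as np

# Sketch of the sinusoidal embedding idea: the scalar noise variance is mapped to a
# vector of sines and cosines at geometrically spaced frequencies.
def sinusoidal_embedding(noise_variance: float, dims: int = 32,
                         min_freq: float = 1.0, max_freq: float = 1000.0) -> np.ndarray:
    frequencies = np.exp(np.linspace(np.log(min_freq), np.log(max_freq), dims // 2))
    angles = 2.0 * np.pi * frequencies * noise_variance
    return np.concatenate([np.sin(angles), np.cos(angles)])

print(sinusoidal_embedding(0.5).shape)  # (32,)
```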