HF-hub - Share and discover more about AI with social posts from the community.
SAM2 Video Predictor
This is a simple demo for video segmentation with SAM2.

Instructions (read them before starting):

1. Upload your video (MP4, 24 fps).
2. With the 'include' point type selected, click on the object to mask on the first frame.
3. Switch to the 'exclude' point type if you want to specify an area to avoid.
4. Hit "Get Mask"!
5. Check the propagation every 15 frames.
6. Add a point on the corresponding frame number if any mask needs to be refined.
7. If the propagation looks good every 15 frames, hit "render" to produce the final masked video!
8. Hit the Reset button if you want to refresh and start again.

For demo purposes, only 10 seconds of the input video are processed :)
https://huggingface.co/spaces/fffiloni/SAM2-Video-Predictor SAM2 Video Predictor - a Hugging Face Space by fffiloni
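For anyone who wants to reproduce this point-then-propagate workflow outside the Space, here is a minimal sketch using Meta's sam2 package; the config/checkpoint names and the example click are placeholders, and this is not the Space's own code:

```python
# Minimal sketch with Meta's sam2 package (not this Space's code); config/checkpoint
# names and the example click are placeholders.
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="my_video.mp4")  # MP4, ~24 fps

    # 'include' click -> label 1, 'exclude' click -> label 0, on the first frame
    predictor.add_new_points(
        state, frame_idx=0, obj_id=1,
        points=[[320, 240]], labels=[1],
    )

    # Propagate the mask through the video, inspecting the result every few frames
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu()  # boolean mask per tracked object
```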
Second day of Composer use
genecyber · 1h
First of all, I’m loving composer, I use it in combination with live preview and can see real time rendering of edits also.

Second of all, we really need some sort of internal versioning; being unable to undo is pretty scary. Also, if I add my own versions, Composer gets confused about which files it's editing.

I had a filename change work tonight, that was cool.

Mostly I'm struggling to add files to the Composer context that stick, but it might be that my Composer disconnects? If Composer gets into a weird state I end up having to restart Cursor completely to recover.

Why is the UI not following standards? When minimized, the + makes a new composer instead of expanding, and the only way to expand the floating window is keys+i? At least I can now find the Composer instances I might accidentally lose by clicking plus.

The direction is amazing, and far surpasses any other experiences I’ve tried with collaborative editing multiple files.

Also love the ability to modularize the code when it gets to be too much for context windows. I'll create a styles.css, add it to the Composer context, and ask it to move the styles into the CSS file, and so on; this is great. Otherwise Cursor suffers like everyone else with attention when it comes to large files.

These are all over the place I know.
https://forum.cursor.com/t/second-day-of-composer-use/7584
Make Google part of your security team. Join Mandiant and Google Cloud experts online for the Google Cloud Security Summit, Thursday, August 22, at 11:30 AM CST, to discover how you can defend your organisation against evolving cyberthreats with intel-driven security operations, a secure-by-design foundation, and AI innovations across the security life cycle.


Register to dive deep into key security topics and new technologies:
https://cloudonair.withgoogle.com/events/summit-apac-security-24?utm_content=invite2&utm_source=cloud_sfdc&utm_medium=email&utm_campaign=FY24-Q3-apac-EXP120-onlineevent-er-dgcsm-security-summit-2024-mc&pref=K&mkt_tok=ODA4LUdKVy0zMTQAAAGU6OHWBoF2Y5Hwwfb2QFOtyxtgmixuo-CGF6NRTRFIkjwshtRL-iLyPcZqVgrMOI8bqjtZOditNJpP6QJl-PDITmFSR8L1dvNKb2vJEg3zPxovd3Vaavw

Start with the opening keynote.

Join Sunil Potti, VP and GM of Google Cloud Security, to explore how AI is enhancing security and helping organisations boost their resilience.


Get to know Gemini for Security.

Check out the latest ways Google AI is transforming cloud security, security operations, and threat intelligence with robust Gemini-powered capabilities.


Gain valuable insights from the 2024 M-Trends report.

Learn about the evolving cyberthreat landscape from Steve D'sa, Regional Leader, Mandiant Consulting, featuring APAC perspectives and best practices you can apply directly to your security program.

Register yourself and your team today; join five or more sessions and receive a Google Cloud collectible digital badge in recognition of your participation.
Metric Card for CharacTER
CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER). It is defined as the minimum number of character edits required to adjust a hypothesis until it completely matches the reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character-level edit distance while performing the shift edit on the word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word, and may be shifted, if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower TER. If this is a text-based metric, make sure to wrap your input in double quotes. Alternatively, you can use a JSON-formatted list as input.
https://huggingface.co/spaces/evaluate-metric/character CharacTER - a Hugging Face Space by evaluate-metric
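For reference, a minimal sketch of computing the metric through the evaluate library; the "character" metric id and the returned key follow the usual evaluate conventions and should be double-checked against the Space's README:

```python
# Minimal sketch via the `evaluate` library; the metric id and output key are
# assumptions -- see the Space's README for the exact interface.
import evaluate

character = evaluate.load("character")
results = character.compute(
    predictions=["the cat sat down on the mat"],
    references=["the cat sat on the mat"],
)
print(results)  # e.g. {"cer_score": ...}
```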
Metric Card for BLEU
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Neither intelligibility nor grammatical correctness are taken into account.

If this is a text-based metric, make sure to wrap your input in double quotes. Alternatively, you can use a JSON-formatted list as input. https://huggingface.co/spaces/evaluate-metric/bleu BLEU - a Hugging Face Space by evaluate-metric
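A minimal usage sketch with the evaluate library, mirroring the standard example from the metric card:

```python
import evaluate

bleu = evaluate.load("bleu")
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],  # multiple references per prediction are allowed
    ["foo bar foobar"],
]
results = bleu.compute(predictions=predictions, references=references)
print(results["bleu"])  # corpus-level BLEU score
```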
Metric Card for BERT Score
Metric description
BERTScore is an automatic evaluation metric for text generation that computes a similarity score for each token in the candidate sentence with each token in the reference sentence. It leverages the pre-trained contextual embeddings from BERT models and matches words in candidate and reference sentences by cosine similarity.

Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

How to use
BERTScore takes 3 mandatory arguments: predictions (a list of strings of candidate sentences), references (a list of strings or list of lists of strings of reference sentences), and either lang (a string of two letters indicating the language of the sentences, in ISO 639-1 format) or model_type (a string specifying which model to use, according to the BERT specification). The default behavior of the metric is to use the suggested model for the target language when one is specified, otherwise to use the model_type indicated. https://huggingface.co/spaces/evaluate-metric/bertscore BERT Score - a Hugging Face Space by evaluate-metric
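A minimal usage sketch following the metric card:

```python
import evaluate

bertscore = evaluate.load("bertscore")
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
# Pass either `lang` (ISO 639-1) or `model_type` to pick the underlying model.
results = bertscore.compute(predictions=predictions, references=references, lang="en")
print(results["precision"], results["recall"], results["f1"])
```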
Supabase Realtime: Broadcast and Presence Authorization

Today we're releasing Authorization for Realtime's Broadcast and Presence.

For context, Supabase includes three useful extensions for building real-time applications.

Broadcast: Send ephemeral, low-latency messages between users.
Presence: Show when users are online and share state between users.
Postgres Changes: Listen to Postgres database changes.
This release introduces authorization for Broadcast and Presence using Row Level Security policies:
https://youtu.be/IXRrU9MpA8Q

https://supabase.com/blog/supabase-realtime-broadcast-and-presence-authorization
New phone, new era. The new #Pixel9 is built for and with Gemini. It has…
- Tools using Gemini to spark creativity
- Pixel Camera features for great photos *and* videos
- AI that improves phone calls
- Smart, elevated design
#MadeByGoogle
TTS Arena: Benchmarking Text-to-Speech Models in the Wild
Automated measurement of the quality of text-to-speech (TTS) models is very difficult. Assessing the naturalness and inflection of a voice is a trivial task for humans, but it is much more difficult for AI. This is why today, we're thrilled to announce the TTS Arena. Inspired by LMSys's Chatbot Arena for LLMs, we developed a tool that allows anyone to easily compare TTS models side-by-side. Just submit some text, listen to two different models speak it out, and vote on which model you think is the best. The results will be organized into a leaderboard that displays the community's highest-rated models. https://huggingface.co/blog/arena-tts TTS Arena: Benchmarking Text-to-Speech Models in the Wild
Yesterday, we released Parler-TTS and Data-Speech, a fully open-source reproduction of work from the paper:
Natural language guidance of high-fidelity text-to-speech with synthetic annotations (2402.01912)


Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.).

https://huggingface.co/collections/parler-tts/parler-tts-fully-open-source-high-quality-tts-models-66164ad285ba03e8ffde214c

Parler-TTS Mini v0.1 is the first iteration of the Parler-TTS model, trained on 10k hours of narrated audiobooks. It generates high-quality speech with features that can be controlled using a simple text prompt (e.g. gender, background noise, speaking rate, pitch and reverberation).

To further improve the prosody and naturalness of the speech, we're scaling the training data up to 50k hours. The v1 release of the model will be trained on this data and will also include inference optimisations such as flash attention and torch compile.

parler-tts/parler_tts_mini_v0.1
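A short generation sketch, roughly following the model card for Parler-TTS Mini v0.1 (the prompt and description strings are just examples):

```python
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice, speaking at a moderate pace in a quiet room."

# The description conditions the voice/style; the prompt is the text to be spoken.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio, model.config.sampling_rate)
```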


Data-Speech can be used for annotating speech characteristics in a large-scale setting.

parler-tts/open-source-speech-datasets-annotated-using-data-speech-661648ffa0d3d76bfa23d534


This work is both scalable and easily modifiable and will hopefully help the TTS research community explore new ways of conditioning speech synthesis.

All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models. Parler-TTS: fully open-source high-quality TTS - a parler-tts Collection
I'm excited to announce that Transformers.js V3 is finally available on NPM! 🔥 State-of-the-art Machine Learning for the web, now with WebGPU support! 🤯⚡️

Install it from NPM with:
npm i @huggingface/transformers

or via CDN, for example: https://v2.scrimba.com/s0lmm0qh1q

Segment Anything demo:
webml-community/segment-anything-webgpu
Imagen 3
Published on Aug 14 · Submitted by akhaliq on Aug 14 · #2 Paper of the day
Authors: Imagen-Team-Google, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Sergio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Peter Igwe, +229 authors
Abstract
We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
Published on Aug 14 · Submitted by akhaliq on Aug 14 · #1 Paper of the day
Authors: Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
Abstract
Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model's effective generation length is inherently bounded by the samples it has seen during supervised fine-tuning (SFT). In other words, the output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT examples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long-context LLMs already possess the potential for a larger output window--all you need is data with extended outputs during model alignment to unlock this capability. Our code & models are at: https://github.com/THUDM/LongWriter. GitHub - THUDM/LongWriter: LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning
Published on Aug 9 · Submitted by akhaliq on Aug 15
Authors: Bo-Wen Zhang, Yan Yan, Lin Li, Guang Liu
Abstract
Recent advancements in Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT) methods have greatly enhanced language models' mathematical reasoning capabilities, facilitating their integration into instruction tuning datasets with LLMs. However, existing methods for large-scale dataset creation require substantial seed data and high computational costs for data synthesis, posing significant challenges for scalability. We introduce InfinityMATH, a scalable instruction tuning dataset for programmatic mathematical reasoning. The construction pipeline emphasizes decoupling numbers from mathematical problems to synthesize number-independent programs, enabling efficient and flexible scaling while minimizing dependency on specific numerical values. Fine-tuning experiments with open-source language and code models, such as Llama2 and CodeLlama, demonstrate the practical benefits of InfinityMATH. These fine-tuned models showed significant relative improvements on both in-domain and out-of-domain benchmarks, ranging from 184.7% to 514.3% on average. Additionally, these models exhibited high robustness on the GSM8K+ and MATH+ benchmarks, which are enhanced versions of the test sets with simple number variations. InfinityMATH ensures that models are more versatile and effective across a broader range of mathematical problems. The data is available at https://huggingface.co/datasets/flagopen/InfinityMATH. flagopen/InfinityMATH · Datasets at Hugging Face
Generative Photomontage
Published on Aug 14 · Submitted by akhaliq on Aug 15
Authors: Sean J. Liu, Nupur Kumari, Ariel Shamir, Jun-Yan Zhu
Abstract
Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a single image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines.
https://huggingface.co/papers/2408.07116 Paper page - Generative Photomontage
LGM Full
This custom pipeline encapsulates the full LGM pipeline, including multi-view diffusion.

It is provided as a resource for the ML for 3D Course.

Original LGM paper: LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. https://huggingface.co/Thever/LGM-Thever Thever/LGM-Thever · Hugging Face
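A rough sketch of how a custom diffusers pipeline like this is typically loaded; the pipeline's call signature and output here are assumptions, so follow the model card for the exact usage:

```python
# Rough sketch of loading a custom diffusers pipeline; the call signature and the
# returned object are assumptions -- check the model card for the exact usage.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipeline = DiffusionPipeline.from_pretrained(
    "Thever/LGM-Thever",
    custom_pipeline="Thever/LGM-Thever",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

image = Image.open("input_view.png")   # single input image
result = pipeline("", image)           # multi-view diffusion + LGM reconstruction (assumed interface)
```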
[ECCV 2024] VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
Project page, Paper link

VFusion3D is a large, feed-forward 3D generative model trained with a small amount of 3D data and a large volume of synthetic multi-view data. It is the first work exploring scalable 3D generative/reconstruction models as a step towards a 3D foundation model.

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
Junlin Han, Filippos Kokkinos, Philip Torr
GenAI, Meta and TVG, University of Oxford
European Conference on Computer Vision (ECCV), 2024

News
[08.08.2024] HF Demo is available; big thanks to Jade Choghari for making it possible.
[25.07.2024] Release weights and inference code for VFusion3D.
Quick Start
Getting started with VFusion3D is super easy! 🤗 Here's how you can use the model with Hugging Face: https://huggingface.co/facebook/vfusion3d facebook/vfusion3d · Hugging Face
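A minimal sketch of what that might look like via transformers remote code; the processor/forward details are assumptions, so follow the model card for the exact API:

```python
# Minimal sketch, assuming the model is exposed through transformers remote code;
# the processor/forward API is an assumption -- see the facebook/vfusion3d model card.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("facebook/vfusion3d", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("facebook/vfusion3d", trust_remote_code=True)

image = Image.open("chair.png")                # single input view
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)                  # feed-forward 3D representation of the object
```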
Let’s see JEPA in action🤖
A simplified image-based implementation, training on a CPU with live preview support - very satisfying to watch :)

I-JEPA is the image-based version of JEPA (Joint-Embedding Predictive Architecture, an alternative to autoregressive LLM architectures) pioneered by Professor Yann LeCun.

At a high level, I-JEPA predicts the representations of image segments (targets) from the representations of other segments within the same image (context). It consists of three key components: a context encoder, a target encoder, and a predictor.
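For intuition, here is a bare-bones PyTorch sketch of one I-JEPA-style training step with toy encoders (a simplification, not the notebook linked below): context patches are encoded, the predictor guesses the target patch's representation, and the target encoder is an EMA copy of the context encoder.

```python
# Toy sketch of the I-JEPA idea: predict target-patch representations from context
# representations; the target encoder is an EMA copy of the context encoder.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

context_encoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
target_encoder = copy.deepcopy(context_encoder)          # updated by EMA only
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

opt = torch.optim.AdamW(list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

patches = torch.rand(32, 4, 256)                         # 32 images, 4 flattened 16x16 patches each
context, target = patches[:, :3], patches[:, 3]          # 3 context patches predict the 4th (target)

ctx_repr = context_encoder(context).mean(dim=1)          # pooled context representation, shape (32, 64)
with torch.no_grad():
    tgt_repr = target_encoder(target)                    # target representation, no gradients

loss = F.mse_loss(predictor(ctx_repr), tgt_repr)         # loss lives in representation space, not pixels
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                                    # EMA update of the target encoder
    for tp, cp in zip(target_encoder.parameters(), context_encoder.parameters()):
        tp.mul_(0.996).add_(cp, alpha=0.004)
```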

Code: https://github.com/Jaykef/ai-algorithms/blob/main/mnist_ijepa.ipynb
Introducing Fineweb-Edu-Fortified: An enhanced Fineweb-Edu dataset. 📚

This dataset is tailored for NLP tasks and helps streamline model training by offering a more refined, deduplicated dataset. It is perfect for startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the large Fineweb dataset and includes:

- Exact-match deduplication across all crawls
- Embeddings for each row using the TaylorAI/bge-micro model
- Count column indicating duplication frequency
- Includes data from 95 Common Crawl crawls (2013-2024)
- Rows have been reduced from 1.279B to 0.324B after deduplication
- It comprises ~375B tokens (down from 1,320B in Fineweb-Edu)

Access the entire Fineweb-Edu-Fortified dataset on Hugging Face →
airtrain-ai/fineweb-edu-fortified
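A minimal sketch of streaming one crawl from the dataset with the datasets library; the config name and column names here are assumptions, so check the dataset card for the available configs and exact schema:

```python
# Minimal sketch; the crawl config name and column names ("text", "count") are
# assumptions -- see the dataset card for the available configs and exact schema.
from datasets import load_dataset

ds = load_dataset(
    "airtrain-ai/fineweb-edu-fortified",
    "CC-MAIN-2024-10",        # one Common Crawl snapshot (example config name)
    split="train",
    streaming=True,           # avoid downloading an entire crawl up front
)

for row in ds.take(3):
    print(row["text"][:200], row["count"])
```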


Try a semantic search demo via this Hugging Face Space →
airtrain-ai/fineweb-edu-fortified-search-demo


Many thanks to the amazing @josh-sematic for his work on this project, the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and also thanks to @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication. 🤗