HF-hub - Share and discover more about AI with social posts from the community.
UCSC-VLAA/MedTrinity-25M A Large-scale Multimodal Dataset
https://huggingface.co/papers/2408.02900
MedTrinity-25M is a large-scale multimodal dataset for the medical domain.

Key Highlights
Dataset size and coverage: Covers more than 25 million images from 10 modalities with multi-granular annotations for more than 65 diseases.
Richness of annotations: Contains global textual information such as disease/lesion type, modality, region-specific descriptions and inter-regional relations, as well as detailed local annotations of regions of interest (ROIs) such as bounding boxes and segmentation masks.

Innovative data generation: Developed the first automated pipeline to scale up multimodal data by generating multi-granular visual and textual annotations (as image-ROI-description triplets) without requiring paired image-text data.
Data collection and processing: Collected and preprocessed data from more than 90 different sources, and identified ROIs associated with abnormal regions using domain-specific expert models.
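If you just want to poke at the data, here is a minimal sketch using the 🤗 datasets library (field names and config layout are not specified above, so check the dataset card before relying on them):

```python
from datasets import load_dataset

# Stream so the full 25M-sample corpus isn't downloaded up front.
# If the repo defines multiple configs, pass the config name listed on the dataset card.
ds = load_dataset("UCSC-VLAA/MedTrinity-25M", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # expect an image plus multi-granular text/ROI annotation fields
```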
SwarmUI startup and creation speed

Perhaps because it is built on ComfyUI, SwarmUI's startup and generation speed is quite fast.
I haven't done a direct comparison with the A1111 SD-webui, but for now, if you want to run SDXL, this is much more comfortable.
FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer that generates images from text descriptions.

It has many features and uses, but also some limitations and prohibited uses.

Highlights
Powerful generation: generates images from text descriptions with cutting-edge output quality and competitive prompt following, producing high-quality images in only 1 to 4 steps.

Training method: trained using latent adversarial diffusion distillation.
License and use: released under Apache 2.0 for personal, scientific and commercial purposes, with a reference implementation, sample code and API endpoints.
Usage: can be run from the official GitHub repository, in ComfyUI, or with Diffusers; code examples are provided (see the sketch after the model link below).

Limitations: it is not intended to provide factual information, may amplify societal biases, may produce output that does not match the prompt, and prompt following is strongly affected by prompting style.
Prohibited uses: it must not be used to violate laws and regulations, harm minors, generate false and harmful information, disseminate personally identifiable information, harass or bully others, create illegal content, make fully automated decisions that affect a person's legal rights, or create false information at scale.
https://huggingface.co/black-forest-labs/FLUX.1-schnell black-forest-labs/FLUX.1-schnell · Hugging Face
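As a rough sketch of the Diffusers path (assumes a recent diffusers release with FluxPipeline and enough GPU memory, or CPU offload):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trades speed for lower VRAM use

# schnell is timestep-distilled: few steps, no classifier-free guidance
image = pipe(
    "a watercolor fox reading a newspaper",
    num_inference_steps=4,
    guidance_scale=0.0,
    max_sequence_length=256,
).images[0]
image.save("fox.png")
```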
FLUX.1 [dev] is a 12 billion parameter rectified flow transformer that can generate images from text descriptions.

Highlights
Powerful generation capabilities: It can generate images from text descriptions, and the output quality is cutting-edge, second only to the FLUX.1 [pro] model.

Excellent performance: It has competitive prompt following capabilities, matching the performance of closed-source alternatives.

Efficient training: trained using guidance distillation, making it more efficient.

Open weight usage: Open weights to promote new scientific research and empower artists to develop innovative workflows.

Multiple usage paths: it provides a reference implementation and sample code, is available through several API providers, can be run locally in ComfyUI, and can be used with the diffusers library (see the sketch after the model link below).

Limitations: it cannot provide factual information, may amplify social biases, may produce output that does not match the prompt, and is strongly affected by prompting style.

Prohibited uses: It clarifies a series of prohibited uses that violate laws and regulations, harm others, etc.

License: released under the FLUX.1 [dev] Non-Commercial License.
https://huggingface.co/black-forest-labs/FLUX.1-dev black-forest-labs/FLUX.1-dev · Hugging Face
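For comparison with the schnell sketch above, a hedged Diffusers example for [dev]; being guidance-distilled, it wants more steps and a nonzero guidance scale:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = pipe(
    "a watercolor fox reading a newspaper",
    num_inference_steps=50,   # dev targets quality rather than 1-4 step speed
    guidance_scale=3.5,
).images[0]
image.save("fox_dev.png")
```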
this is amazing because Flux Schnell is, well, super-schnell (aka fast) and amazing at prompt following, while the IPA SDXL step gives it all the texture and style you would ever need

you can use this in the Glif Chrome Extension (pick the Flux Guided Style Transfer preset) or on glif:

https://chromewebstore.google.com/detail/glif-remix-the-web-with-a/abfbooehhdjcgmbmcpkcebcmpfnlingo

or here directly: https://glif.app/@fab1an/glifs/clzjsqemt00006mh13re8uu3b
FLUX is incredible at prompt following, but can't do Style Transfer yet

here's a neat trick: you can feed FLUX Schnell's gens as a ControlNet input into an SDXL workflow with an IP-Adapter + style ref, and boom, you now get both style and prompt following :)

some examples + link to a glif workflow:
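If you'd rather script it than use the glif/ComfyUI workflow, here is a rough diffusers approximation of the same idea. The model IDs (diffusers/controlnet-canny-sdxl-1.0, h94/IP-Adapter) and Canny conditioning are my assumptions, not necessarily what the glif uses:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, FluxPipeline, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

prompt = "a knight playing chess with a fox in a misty forest"

# 1) Composition pass: FLUX Schnell for strong prompt following.
flux = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
flux.enable_model_cpu_offload()
base = flux(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]

# 2) Turn the FLUX output into a Canny control image.
edges = cv2.Canny(np.array(base), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 3) Style pass: SDXL with a Canny ControlNet + IP-Adapter style reference.
controlnet = ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16)
sdxl = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
)
sdxl.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
sdxl.set_ip_adapter_scale(0.7)
sdxl.enable_model_cpu_offload()

style_ref = load_image("style_reference.jpg")  # your style image
styled = sdxl(
    prompt,
    image=control_image,             # FLUX composition via ControlNet
    ip_adapter_image=style_ref,      # texture/style via IP-Adapter
    controlnet_conditioning_scale=0.6,
).images[0]
styled.save("styled.png")
```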
CatVTON: a simple and highly efficient virtual try-on diffusion model 🤩

Lightweight (only 899.06M parameters), parameter-efficient training (only 49.57M trainable parameters), and inference in under 8 GB of VRAM at 1024×768 resolution
🚀 OV-DINO (Open-Vocabulary Detection with Instance-level Noise Optimization)

A new approach to open-vocabulary object detection. It improves the ability of vision models to detect and identify objects in images, even objects outside the training data.

🤩 SAM2 integration in Demo👇
Is your summer reading list still empty? Curious if an LLM can generate a book blurb you'd enjoy and help build a KTO preference dataset at the same time?

A demo using @huggingface Spaces and @gradio to collect LLM output preferences: https://huggingface.co/spaces/davanstrien/would-you-read-it
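A minimal sketch of what such a Space can look like; the blurb pool and the votes.jsonl file are hypothetical stand-ins for the real Space's LLM generation and dataset push:

```python
import json
import random
import gradio as gr

# Hypothetical pre-generated blurbs; the real Space generates them with an LLM.
BLURBS = [
    "A retired cartographer finds a map that redraws itself every night...",
    "Two rival astronomers are forced to share one telescope and one secret...",
]

def record_vote(blurb: str, liked: bool) -> str:
    # KTO only needs a binary desirable/undesirable label per completion.
    with open("votes.jsonl", "a") as f:
        f.write(json.dumps({"completion": blurb, "label": liked}) + "\n")
    return random.choice(BLURBS)  # serve the next blurb

with gr.Blocks() as demo:
    blurb = gr.Textbox(value=random.choice(BLURBS), label="Book blurb", interactive=False)
    with gr.Row():
        yes = gr.Button("I'd read it 👍")
        no = gr.Button("Not for me 👎")
    yes.click(lambda b: record_vote(b, True), inputs=blurb, outputs=blurb)
    no.click(lambda b: record_vote(b, False), inputs=blurb, outputs=blurb)

demo.launch()
```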
We’re taking OpenAI DevDay on the road! Join us this fall in San Francisco, London, or Singapore for hands-on sessions, demos, and best practices. Meet our engineers and see how developers around the world are building with OpenAI.

openai.com/devday/
We’re starting to roll out advanced Voice Mode to a small group of ChatGPT Plus users. Advanced Voice Mode offers more natural, real-time conversations, allows you to interrupt anytime, and senses and responds to your emotions. https://x.com/i/status/1818353580279316863
The ChatGPT desktop app for macOS is now available for all users.

Get faster access to ChatGPT to chat about email, screenshots, and anything on your screen with the Option + Space shortcut: https://openai.com/chatgpt/mac/

The desktop app for macOS now gives you side-by-side access to ChatGPT. Use Option + Space to open a companion window, which stays in front so you can use it more easily when working with other apps.
CodiumAI PR-Agent is an open-source tool that assists developers in streamlining pull-request creation and review. It automatically analyzes the PR and can provide several types of feedback, including Auto-Description, PR Review, Q&A, Code Suggestion, and more.
some cool things you can build with supabase realtime

http://supabase.com/realtime
Meta releases SAM2 segmentation model

Last week, Meta continued its push in vision and released the Meta Segment Anything Model 2 (SAM2) image segmentation model.
It is used for real-time, promptable image and video object segmentation, achieving a leap in video segmentation experience and enabling seamless use between image and video applications.
SAM2 surpasses previous capabilities in image segmentation accuracy and achieves better video segmentation performance than existing work, while requiring one-third of the interaction time.
SAM2 can also segment any object in any video or image (often described as zero-shot generalization), which means it can be applied to previously unseen visual content without custom adaptation.
Also released is SA-V, the largest video segmentation dataset: it contains an order of magnitude more mask annotations and about 4.5 times as many videos as existing video object segmentation datasets.
The main features of SA-V are: more than 600,000 mask annotations on approximately 51,000 videos; videos showing geographical diversity and real scenes, collected from 47 countries; and annotations covering whole objects, object parts, and challenging situations such as objects being occluded, disappearing, and reappearing.
This demo is outrageous, SAM2 can stably track and segment a person from a very blurry, very detailed aerial video.
Download the model here: https://github.com/facebookresearch/segment-anything-2
Experience SAM2 here: https://sam2.metademolab.com/
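A quick image-prediction sketch following the repo's README (checkpoint and config paths are whatever you downloaded; the video predictor uses a separate API):

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Config/checkpoint names follow the facebookresearch/segment-anything-2 README; adjust to your download.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("frame.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # One positive click prompt; SAM2 returns candidate masks ranked by score.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
```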
Google releases Gemma 2 2B and Gemini 1.5 Pro
Google also stepped up last week, releasing the Gemini 1.5 Pro and Gemma 2 2B models in quick succession.
Among them, Gemini 1.5 Pro 0801 took first place in the overall LLM Arena ranking, surpassing GPT-4o mini. Google said this is an experimental version, not yet an official release, so it is only available in AI Studio.
From testing, however, Gemini 1.5 Pro 0801's multimodal capabilities are very strong, largely surpassing GPT-4o and Claude 3.5, and it supports audio and video. I tried it with a podcast file of more than an hour, and it was summarized in just over ten seconds.
In addition, Google released Gemma 2 2B, a model that can run on-device. It also scores higher in the LLM Arena than a number of LLMs that are much larger than it.
This is the quantized Gemma 2 2B running with MLX on an iPhone 15 Pro.
Google also released ShieldGemma alongside it, a new safety classifier that can effectively detect hate speech, harassment, sexually suggestive content and dangerous content.
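To try the 2B instruction-tuned checkpoint yourself with 🤗 transformers (a sketch; assumes a recent transformers version with Gemma 2 support and that you've accepted the license on the Hub):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain in two sentences why small on-device LLMs are useful."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the newly generated assistant turn
```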
First public release: Zhipu's "Sora" is now open source
Zhipu's CogVideoX, the same model family behind its "Qingying" video product, was open-sourced as this article was being published.
The model has been uploaded to GitHub and Hugging Face.
The model can be run and fine-tuned on a single A6000 card.
Generated videos are 720×480, 6 seconds, 48 frames.
The training data comes from the Internet, and Bilibili also provided technical support.
*Note 1: Inference uses 21.6 GB of VRAM (36 GB peak), and fine-tuning is stable at 46.2 GB, which is within the range of an A6000.

*Note 2: The inference optimization has just been updated; peak VRAM is now 18 GB, so it can run on a single 4090 card.
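A hedged sketch of running it through diffusers (assumes the THUDM/CogVideoX-2b checkpoint and a diffusers version that includes the CogVideoX pipeline; the VRAM figures above may differ once CPU offload is enabled):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # keeps peak VRAM within a single consumer GPU

video = pipe(
    prompt="A panda playing a tiny guitar by a quiet lake at sunset",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```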