HF-hub: Share and discover more about AI with social posts from the community.
Run Snowflake Arctic with an API
Posted April 23, 2024 by @cbh123

Snowflake Arctic is a new open-source language model from Snowflake. Arctic is on-par or better than both Llama 3 8B and Llama 2 70B on all metrics while using less than half of the training compute budget.

It's massive. At 480B parameters, Arctic is the biggest open-source model to date. As you'd expect from a Snowflake model, it's good at SQL and other coding tasks, and it has a liberal Apache 2.0 license.

With Replicate, you can run Arctic in the cloud with one line of code.
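If you use the Replicate Python client, a single call is enough. This is a minimal sketch, assuming your REPLICATE_API_TOKEN environment variable is set and that Arctic is published under the snowflake/snowflake-arctic-instruct model slug:

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# Assumes REPLICATE_API_TOKEN is set and the model slug below is current.
import replicate

output = replicate.run(
    "snowflake/snowflake-arctic-instruct",
    input={"prompt": "Write a SQL query that returns the top 5 customers by revenue."},
)
print("".join(output))  # the client streams the response back as chunks of text
```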
Picking an SD3 version
Stability AI have packaged up SD3 Medium in different ways to make sure it can run on as many devices as possible.

SD3 uses three different text encoders. (The text encoder is the part that takes your prompt and puts it into a format the model can understand). One of these new text encoders is really big – meaning it uses a lot of memory. If you’re looking at the SD3 Hugging Face weights, you’ll see four options with different text encoder configurations. You should choose which one to use based on your available VRAM.

sd3_medium_incl_clips_t5xxlfp8.safetensors
This file contains the model weights, the two CLIP text encoders, and the large T5-XXL encoder in a compressed fp8 format. We recommend these weights for simplicity and the best results.

sd3_medium_incl_clips_t5xxlfp16.safetensors
The same as sd3_medium_incl_clips_t5xxlfp8.safetensors, except the T5 part isn’t compressed as much. By using fp16 instead of fp8, you’ll get a slight improvement in your image quality. This improvement comes at the cost of higher memory usage.

sd3_medium_incl_clips.safetensors
This version does away with the T5 element altogether. It includes the weights with just the two CLIP text encoders. This is a good option if you do not have much VRAM, but your results might be very different from the full version. You might notice that this version doesn’t follow your prompts as closely, and it may also reduce the quality of text in images.

sd3_medium.safetensors
This model is just the base weights without any text encoders. If you use these weights, make sure you’re loading the text encoders separately. Stability AI have provided an example ComfyUI workflow for this.
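If you prefer loading the weights in code rather than ComfyUI, here is a minimal sketch using Hugging Face diffusers. It assumes the stabilityai/stable-diffusion-3-medium-diffusers repository and a GPU with enough VRAM for fp16; passing text_encoder_3=None mirrors the sd3_medium_incl_clips.safetensors configuration that drops T5 entirely:

```python
# Minimal diffusers sketch for SD3 Medium; repo name and settings are
# assumptions, adjust for your own VRAM budget.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,  # skip T5-XXL to save memory; prompt following may degrade
    tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photo of a red fox in the snow", num_inference_steps=28).images[0]
image.save("fox.png")
```

Keeping text_encoder_3 (the default) corresponds to the fp16 variant and gives the closest match to the full model's prompt following.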
How to get the best results from Stable Diffusion 3?
Stability AI recently released the weights for Stable Diffusion 3 Medium, a 2 billion parameter text-to-image model that excels at photorealism, typography, and prompt following.

You can run the official Stable Diffusion 3 model on Replicate, and it is available for commercial use. We have also open-sourced our Diffusers and ComfyUI implementations (read our guide to ComfyUI).

In this blog post we’ll show you how to use Stable Diffusion 3 (SD3) to get the best images, including how to prompt SD3, which is a bit different from previous Stable Diffusion models.

To help you experiment, we’ve created an SD3 explorer model that exposes all of the settings we discuss here.
https://d31rfu1d3w8e4q.cloudfront.net/static/blog/get-the-best-from-stable-diffusion-3/explorer-screenshot.png
What makes FLUX.1 special?
FLUX.1 models have state-of-the-art performance in prompt following, visual quality, image detail, and output diversity. Here are some particular areas where we’ve been impressed:

Text! Unlike older models that often messed up similar-looking letters, Flux can handle tricky words with repeated letters. This makes it great for designs where text needs to be accurate. Check out this Black Forest Flux Schnell gateau:
https://d31rfu1d3w8e4q.cloudfront.net/static/blog/flux/cake-text.png
How to fine-tune: Focus on effective datasets
This is the third blog post in a series about adapting open source large language models (LLMs). In this post, we explore some rules of thumb for curating a good training dataset.

In Part 1, we took a look at prevalent approaches for adapting language models to domain data.
In Part 2, we discussed how to determine if fine-tuning is the right approach for your use case.
Introduction

Fine-tuning LLMs is a mix of art and science, with best practices in the field still emerging. In this blog post, we’ll highlight design variables for fine-tuning and give directional guidance on best practices we’ve seen so far to fine-tune models with resource constraints. We recommend using the information below as a starting point to strategize your fine-tuning experiments.

Full fine-tuning vs. parameter-efficient fine-tuning (PEFT)

Both full fine-tuning and PEFT have shown improvements in downstream performance when applied to new domains in both academic and practical settings. Choosing one boils down to compute available (in GPU hours and GPU memory), performance on tasks other than the target downstream task (the learning-forgetting tradeoff) and human annotation costs.

Full fine-tuning is more prone to suffer from two problems: model collapse and catastrophic forgetting. Model collapse is where the model output converges to a limited set of outputs and the tail of the original content distribution disappears. Catastrophic forgetting, as discussed in Part 1 of this series, leads to the model losing its abilities. Some early empirical studies show that full fine-tuning techniques are more prone to the above mentioned issues as compared to PEFT techniques, though more research needs to be done.

PEFT techniques serve as natural regularizers for fine-tuning by design. PEFT often costs relatively less compute to train a downstream model and is much more accessible for a resource-constrained scenario with limited dataset sizes. In some cases, full fine-tuning has shown better performance at the specific task of interest, often at the cost of forgetting some of the capabilities of the original model. This “learning-forgetting” tradeoff between the specific downstream task performance and performance on other tasks is explored deeply in the comparison of LoRA and full fine-tuning in this paper.

Given resource constraints, PEFT techniques will likely give a better performance boost/cost ratio as compared to full fine-tuning. If downstream performance is of paramount importance with resource constraints, full fine-tuning will be the most effective. In either scenario, the key is to create a high-quality dataset keeping the following key principles in mind.
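As a concrete illustration of how lightweight PEFT can be, here is a minimal LoRA sketch using the Hugging Face peft library. The base model name and hyperparameters are illustrative assumptions, not recommendations from this post:

```python
# Minimal LoRA/PEFT sketch; model and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```

Only the small LoRA matrices are trained, which is why the compute and memory footprint stays far below that of full fine-tuning.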
How NVIDIA is using structured weight pruning and knowledge distillation to build new Llama models
Large language models like Llama can move with impressive speed and precision to handle a variety of challenging tasks, such as generating code, solving math problems, and helping doctors make life-saving medical decisions. Open source models are already leading to incredible breakthroughs across disciplines—however, they’re resource-intensive to deploy. It’s important that we work collaboratively across the industry to make it even easier for people to tap into the game-changing potential of LLMs.

Last month, we announced Llama 3.1, which includes our largest model yet, the 405B, as well as two smaller models with 70 billion and 8 billion parameters, respectively. Smaller models derived from a larger parent are typically cheaper to deploy at scale and perform well across many language tasks. In a new research paper, our partners at NVIDIA explore how large models can be made smaller using structured weight pruning and knowledge distillation, without having to train a new model from scratch. Working with Llama 3.1 8B, the team shares how it created Llama-Minitron 3.1 4B, its first work within the Llama 3.1 open source family.

Learn more about this work, and get the pruning and distillation strategy and additional resources, by reading NVIDIA’s blog post:
https://ai.meta.com/blog/nvidia-llama/
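For intuition, here is a generic knowledge-distillation loss in PyTorch. It illustrates the general technique of training a smaller student to match a larger teacher's softened outputs; it is not NVIDIA's exact Minitron recipe:

```python
# Generic knowledge-distillation sketch, not NVIDIA's specific training recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: KL divergence between temperature-scaled distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # hard targets: standard cross-entropy against the ground-truth tokens
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard
```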
FLUX.1: First Impressions
FLUX.1 is a new AI model (available on Replicate) that makes images from text. Unlike most text-to-image models, which rely on diffusion, FLUX.1 uses an upgraded technique called “flow matching.”

While diffusion models create images by gradually removing noise from a random starting point, flow matching takes a more direct approach, learning the precise transformations needed to map noise onto a realistic image. This difference in methodology leads to a distinct aesthetic and unique advantages in terms of speed and control.
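Conceptually (setting aside the details of FLUX.1's actual sampler, which this post does not describe), a flow-matching model predicts a velocity field and the sampler simply integrates it from noise toward an image. A toy sketch, assuming a hypothetical velocity_model:

```python
# Conceptual flow-matching sampler, not Black Forest Labs' implementation.
import torch

def sample_flow_matching(velocity_model, shape, num_steps=4, device="cuda"):
    x = torch.randn(shape, device=device)              # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = velocity_model(x, t)                        # predicted velocity toward the data
        x = x + v * dt                                  # Euler step along the learned flow
    return x                                            # approximate sample from the data distribution
```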

We were curious to see how this approach impacts the generated images, so we fed it a variety of prompts, many created by other AI models. Here are some observations:

Text: It gets it (mostly)
One of the challenges in text-to-image generation is accurately translating words into visual representations. FLUX.1 handles this surprisingly well, even in complex scenarios like memes.

Prompt:

Photograph of letterpress serif type on thick rough creamy paper saying ‘REPLICATE.COM’

https://d31rfu1d3w8e4q.cloudfront.net/static/blog/flux-first-impressions/letterpress.webp
Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials
We present Meta 3D AssetGen (AssetGen), a significant advancement in text-to-3D generation which produces faithful, high-quality meshes with texture and material control. Compared to works that bake shading into the 3D object's appearance, AssetGen outputs physically-based rendering (PBR) materials, supporting realistic relighting. AssetGen first generates several views of the object with factored shaded and albedo appearance channels, and then reconstructs colours, metalness and roughness in 3D, using a deferred shading loss for efficient supervision. It also uses a signed-distance function to represent 3D shape more reliably and introduces a corresponding loss for direct shape supervision. This is implemented using fused kernels for high memory efficiency. After mesh extraction, a texture refinement transformer operating in UV space significantly improves sharpness and details. AssetGen achieves a 17% improvement in Chamfer Distance and 40% in LPIPS over the best concurrent work for few-view reconstruction, and a human preference of 72% over the best industry competitors of comparable speed, including those that support PBR. Project page with generated assets: https://assetgen.github.io.

Yawar Siddiqui, Tom Monnier, Filippos Kokkinos, Mahendra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny
https://ai.meta.com/research/publications/meta-3d-assetgen-text-to-mesh-generation-with-high-quality-geometry-texture-and-pbr-materials/
The Llama 3 Herd of Models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Llama team https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
Imagine yourself: Tuning-Free Personalized Image Generation
Diffusion models have demonstrated remarkable efficacy across various image-to-image tasks. In this research, we introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. Unlike conventional tuning-based personalization techniques, Imagine yourself operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Moreover, previous work struggled to balance identity preservation, adherence to complex prompts, and visual quality, resulting in models with a strong copy-paste effect on the reference images. As a result, such models can hardly generate images from prompts that require significant changes to the reference image (e.g., changing facial expression or head and body poses), and the diversity of the generated images is low. To address these limitations, our proposed method introduces 1) a new synthetic paired data generation mechanism to encourage image diversity, 2) a fully parallel attention architecture with three text encoders and a fully trainable vision encoder to improve the text faithfulness, and 3) a novel coarse-to-fine multi-stage finetuning methodology that gradually pushes the boundary of visual quality. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment. This model establishes a robust foundation for various personalization applications. Human evaluation results validate the model’s SOTA superiority across all aspects (identity preservation, text faithfulness, and visual appeal) compared to the previous personalization models.

Zecheng He, Bo Sun, Felix Xu, Haoyu Ma, Ankit Ramchandani, Vincent Cheung, Siddharth Shah, Anmol Kalia, Ning Zhang, Peizhao Zhang, Roshan Sumbaly, Peter Vajda, Animesh Sinha
https://ai.meta.com/research/publications/imagine-yourself-tuning-free-personalized-image-generation/
X-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
Learning good representations involves capturing the diverse ways in which data samples relate. Contrastive loss—an objective matching related samples—underlies methods from self-supervised to multimodal learning. Contrastive losses, however, can be viewed more broadly as modifying a similarity graph to indicate how samples should relate in the embedding space. This view reveals a shortcoming in contrastive learning: the similarity graph is binary, as only one sample is the related positive sample. Crucially, similarities across samples are ignored. Based on this observation, we revise the standard contrastive loss to explicitly encode how a sample relates to others. We experiment with this new objective, called X-Sample Contrastive, to train vision models based on similarities in class or text caption descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks. When training on CC12M, we outperform CLIP by 0.6% on both ImageNet and ImageNet Real. Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of 16.8% on ImageNet and 18.1% on ImageNet Real when training with CC3M. Finally, our objective seems to encourage the model to learn representations that separate objects from their attributes and backgrounds, with gains of 3.3-5.6% over CLIP on ImageNet9. We hope the proposed solution takes a small step towards developing richer learning objectives for understanding sample relations in foundation models.

Vlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Pietro Astolfi, Kyunghyun Cho, Yann LeCun
https://ai.meta.com/research/publications/x-sample-contrastive-loss-improving-contrastive-learning-with-sample-similarity-graphs/
An overview of the SAM 2 framework.

SAM 2 uses a transformer architecture with streaming memory for real-time video processing. It builds on the original SAM model, extending its capabilities to video.

For more technical details, check out the Research paper.

Safety
⚠️ Users should be aware of potential ethical implications:
- Ensure you have the right to use input images and videos, especially those featuring identifiable individuals.
- Be responsible about generated content to avoid potential misuse.
- Be cautious about using copyrighted material as inputs without permission.

Support
All credit goes to the Meta AI Research team.
https://raw.githubusercontent.com/facebookresearch/segment-anything-2/main/assets/model_diagram.png
How to Use SAM 2 for Video?
Segment Anything Model 2 (SAM 2) is a unified video and image segmentation model.

Video segmentation presents unique challenges compared to image segmentation. Object motion, deformation, occlusion, lighting changes, and other factors can vary dramatically from frame to frame. Videos are often lower quality than images due to camera motion, blur, and lower resolution, further increasing the difficulty.

SAM 2 demonstrates improved accuracy in video segmentation while requiring 3 times fewer interactions than previous approaches. For image segmentation, it is more accurate and 6 times faster than the original Segment Anything Model (SAM).
https://youtu.be/Dv003fTyO-Y
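In code, the video workflow follows the facebookresearch/segment-anything-2 repository: build a video predictor, initialize state on a video, add point prompts on one frame, and propagate. The sketch below is adapted from the repo's README; the config name, checkpoint path, and click coordinates are placeholders, and exact function names may differ between releases:

```python
# Sketch adapted from the segment-anything-2 README; paths and prompts are placeholders.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # the reference implementation expects a directory of JPEG frames
    state = predictor.init_state(video_path="path/to/video_frames")

    # prompt one object on the first frame with a single positive click
    _, object_ids, masks = predictor.add_new_points(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # propagate the prompt through the video to get per-frame masklets
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        pass  # save or post-process the masks for this frame here
```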
SAM 2: Segment Anything in Images and Videos
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos. We extend SAM to video by considering images as a video with a single frame. The model design is a simple transformer architecture with streaming memory for real-time video processing. We build a model-in-the-loop data engine, which improves model and data via user interaction, to collect our SA-V dataset, the largest video segmentation dataset to date. SAM 2 trained on our data provides strong performance across a wide range of tasks and visual domains.
https://github.com/zsxkib/segment-anything-2/raw/video/assets/sa_v_dataset.jpg?raw=true
Batouresearch / high-resolution-controlnet-tile
Run time and cost
This model costs approximately $0.054 to run on Replicate, or 18 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 75 seconds. The predict time for this model varies significantly based on the inputs.

Readme
High-quality upscaling from Fermat.app. Increase the creativity setting to encourage hallucination.
https://replicate.delivery/pbxt/etT436Z2RrWAOajwhQm6YLBHiT5Y1Oix2aZnDLnIkfM7u4ESA/out-0.png
https://replicate.delivery/pbxt/1rbKAbFss7ZUGNxKnFGmOEHBaeEZ7cI7Sx61eiOo9AyGjQMTA/output.jpg
Restore images
These models restore and improve images by fixing defects like blur, noise, and low resolution. Key capabilities:

Deblurring - Sharpen blurry images by reversing blur effects. Useful for old photos.
Denoising - Remove grain and artifacts by learning noise patterns.
Colorization - Add realistic color to black and white photos.
Face restoration - Improve the image quality of faces in old photos, or unrealistic AI generated faces.
Our Picks
Best restoration model: google-research/maxim
If you need to sharpen a blurry photo, or remove noise or compression artifacts, start with google-research/maxim. It has a total of 11 image restoration models baked in, letting you deblur, denoise, remove raindrops, and more. If you’re not getting the results you’re looking for, try megvii-research/nafnet, which is similar but supports fewer restoration features.

Best colorization model: piddnad/ddcolor
The best model for adding color to black and white photos is piddnad/ddcolor, which was released in 2023. If you are looking for more saturated results try out arielreplicate/deoldify_image.

Best face restoration model: sczhou/codeformer
If you’re looking for a face restoration model, try starting with sczhou/codeformer. It produces more realistic faces than alternatives like tencentarc/gfpgan. If you aren’t getting the exact image improvements you want, we recommend exploring more modern upscaling models like batouresearch/magic-image-refiner.
https://replicate.com/collections/image-restoration
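All of these models can be called through the Replicate API. Here is a hedged sketch with the Python client, using sczhou/codeformer as an example; the version hash is a placeholder, and input names vary per model, so check the model page for the exact schema:

```python
# Hedged sketch; pin a real version hash from the model page before running.
import replicate

output = replicate.run(
    "sczhou/codeformer:<version_id>",              # placeholder version hash
    input={"image": open("old_family_photo.jpg", "rb")},
)
print(output)  # URL of the restored image
```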
Using FLUX.1 Schnell for faster inference
You can use your FLUX.1 Dev LoRA with the smaller FLUX.1 Schnell model to generate images faster and more cheaply. Just change the model parameter from “dev” to “schnell” when you generate, and lower the number of steps to something small, like 4.

Note that outputs will still be under the non-commercial license of FLUX.1 Dev.
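A sketch of what that looks like with the Replicate Python client. The model slug is a placeholder for your own fine-tune, and the input names ("model", "num_inference_steps") should be confirmed on your model's API page:

```python
# Hedged sketch; replace the slug with your own fine-tuned model.
import replicate

output = replicate.run(
    "your-username/your-flux-model",
    input={
        "prompt": "JELLOMOLD photographed on a checkered tablecloth",
        "model": "schnell",          # switch from "dev" to the faster Schnell base
        "num_inference_steps": 4,    # Schnell is designed for very few steps
    },
)
print(output)
```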

Examples and use cases
Check out our examples gallery for inspiration. You can see how others have fine-tuned FLUX.1 to create different styles, characters, a never-ending parade of cute animals, and more.
https://d31rfu1d3w8e4q.cloudfront.net/static/blog/fine-tune-flux/3-base.webp
How to fine-tune FLUX.1
Fine-tuning FLUX.1 on Replicate is a straightforward process that can be done either through the web interface or programmatically via the API. Let’s walk through both methods.

Prepare your training data
To start fine-tuning, you’ll need a collection of images that represent the concept you want to teach the model. These images should be diverse enough to cover different aspects of the concept. For example, if you’re fine-tuning on a specific character, include images in various settings, poses, and lighting. Here are some guidelines:

Use 12-20 images for best results
Use large images if possible
Use JPEG or PNG formats
(Optional) Create a corresponding .txt file for each image with the same name, containing the caption
Once you have your images (and optional captions), zip them up into a single file.
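Packaging the data needs nothing more than the Python standard library. A small sketch that zips a folder of images and their optional matching .txt captions:

```python
# Zip a flat folder of training images (and optional caption .txt files).
import zipfile
from pathlib import Path

data_dir = Path("training_images")          # e.g. hero_01.jpg + hero_01.txt, ...
with zipfile.ZipFile("training_data.zip", "w") as zf:
    for path in sorted(data_dir.iterdir()):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png", ".txt"}:
            zf.write(path, arcname=path.name)  # keep a flat layout inside the zip
```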

Start the training process
https://replicate.com/blog/fine-tune-flux Fine-tune FLUX.1 with your own images
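Programmatically, training is kicked off with the Replicate Python client's trainings API. In this sketch the trainer version hash, input names, and destination model are placeholders; copy the current values from the FLUX.1 Dev LoRA Trainer page and create the destination model beforehand:

```python
# Hedged sketch; the version hash, inputs, and destination are placeholders.
import replicate

training = replicate.trainings.create(
    version="ostris/flux-dev-lora-trainer:<version_id>",
    input={
        "input_images": open("training_data.zip", "rb"),
        "trigger_word": "JELLOMOLD",
        "steps": 1000,
    },
    destination="your-username/your-flux-model",
)
print(training.status)
```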
What is fine-tuning FLUX.1 on Replicate?
These big image generation models, like FLUX.1 and Stable Diffusion, are trained on a bunch of images that have had noise added, and they learn the reverse function of “adding noise.” Amazingly, that turns out to be “creating images.”

How do they know which image to create? They build on transformer models, like CLIP and T5, which are themselves trained on tons of image-caption pairs. These are language-to-image encoders: they learn to map an image and its caption to the same shape in high-dimensional space. When you send them a text prompt, like “squirrel reading a newspaper in the park,” they can map that to patterns of pixels in a grid. To the encoder, the picture and the caption are the same thing.

The image generation process looks like this: take some input pixels, move them a little bit away from noise and toward the pattern created by your text input, and repeat until the correct number of steps is reached.

The fine-tuning process, in turn, takes each image/caption pair from your dataset and updates that internal mapping a little bit. You can teach the model anything this way, as long as it can be represented through image-caption pairs: characters, settings, mediums, styles, genres. In training the model will learn to associate your concept with a particular text string. Include this string in your prompt to activate that association.

For example, say you want to fine-tune the model on your comic book superhero. You’ll collect some images of your character as your dataset. A well-rounded batch: different settings, costumes, lighting, maybe even different art styles. That way the model understands that what it’s learning is this one person, not any of these other incidental details.

Pick a short, uncommon word or phrase as your trigger: something unique that won’t conflict with other concepts or fine-tunes. You might choose something like “bad 70s food” or “JELLOMOLD”. Train your model. Now, when you prompt “Establishing shot of bad 70s food at a party in San Francisco,” your model will call up your specific concept. Easy as that.

Could it be as easy as that? Yes, actually. We realized that we could use the Replicate platform to make fine-tuning as simple as uploading images. We can even do the captioning for you.

If you’re not familiar with Replicate, we make it easy to run AI as an API. You don’t have to go looking for a beefy GPU, you don’t have to deal with environments and containers, you don’t have to worry about scaling. You write normal code, with normal APIs, and pay only for what you use.

You can try this right now! It doesn’t take a lot of images. Check out our examples gallery to see the kinds of styles and characters people are creating.

Grab a few photos of your pet, or your favorite houseplant, and let’s get started.
Fine-tune FLUX.1 with your own images
We have fine-tuning for FLUX.1
Fine-tuning is now available on Replicate for the FLUX.1 [dev] image model. Here’s what that means, and how to do it.

FLUX.1 is a family of text-to-image models released by Black Forest Labs this summer. The FLUX.1 models set a new standard for open-weights image models: they can generate realistic hands and legible text, and they even manage the strangely hard task of “funny memes.” You can now fine-tune your model on Replicate with the FLUX.1 Dev LoRA Trainer.

If you know what all that means and you’re ready to try it with your dataset, you can skip to the code.

Otherwise, here’s what it means and why you should care.
https://replicate.com/blog/fine-tune-flux Fine-tune FLUX.1 with your own images