HF Hub - Share and discover more about AI with social posts from the community.
vit-base-xray-pneumonia
This model is a fine-tuned version of google/vit-base-patch16-224-in21k on the chest-xray-pneumonia dataset. It achieves the following results on the evaluation set:

Loss: 0.3387
Accuracy: 0.9006
Model description
More information needed

Intended uses & limitations
More information needed

Training and evaluation data
More information needed

Training procedure
Training hyperparameters
The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 10

https://huggingface.co/nickmuchi/vit-base-xray-pneumonia
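As a quick way to try nickmuchi/vit-base-xray-pneumonia, here is a minimal inference sketch using the transformers Auto classes and Pillow; the image path is a placeholder:

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "nickmuchi/vit-base-xray-pneumonia"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

# "chest_xray.jpg" is a placeholder path to a local chest X-ray image.
image = Image.open("chest_xray.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])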
SegFormer (b5-sized) encoder pre-trained-only
SegFormer encoder fine-tuned on ImageNet-1k. It was introduced in the paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Xie et al. and first released in this repository.

Disclaimer: The team releasing SegFormer did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description
SegFormer consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head to achieve great results on semantic segmentation benchmarks such as ADE20K and Cityscapes. The hierarchical Transformer is first pre-trained on ImageNet-1k, after which a decode head is added and fine-tuned altogether on a downstream dataset.

This repository only contains the pre-trained hierarchical Transformer, hence it can be used for fine-tuning purposes.

Intended uses & limitations
You can use the model for fine-tuning on semantic segmentation tasks. See the model hub to look for fine-tuned versions on a task that interests you.
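As a rough illustration of that fine-tuning setup, a minimal sketch loads the pre-trained encoder into a segmentation model with a freshly initialized decode head; the two-class label set below is purely hypothetical:

import torch
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

# Hypothetical label set, for illustration only.
id2label = {0: "background", 1: "object"}
label2id = {v: k for k, v in id2label.items()}

processor = SegformerImageProcessor.from_pretrained("nvidia/mit-b5")
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b5",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
# The decode head is randomly initialized here and must be fine-tuned
# on a labeled segmentation dataset before the model is useful.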
https://huggingface.co/nvidia/mit-b5
dit-large-finetuned-rvlcdip
Document Image Transformer (large-sized model)
Document Image Transformer (DiT) model pre-trained on IIT-CDIP (Lewis et al., 2006), a dataset that includes 42 million document images, and fine-tuned on RVL-CDIP, a dataset consisting of 400,000 grayscale images in 16 classes, with 25,000 images per class. It was introduced in the paper DiT: Self-supervised Pre-training for Document Image Transformer by Li et al. and first released in this repository. Note that DiT's architecture is identical to that of BEiT.

Disclaimer: The team releasing DiT did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description
The Document Image Transformer (DiT) is a transformer encoder model (BERT-like) pre-trained on a large collection of images in a self-supervised fashion. The pre-training objective for the model is to predict visual tokens from the encoder of a discrete VAE (dVAE), based on masked patches.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.

By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled document images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder.
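For example, classifying a document image with the fine-tuned checkpoint could look roughly like the following sketch (the file path is a placeholder):

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "microsoft/dit-large-finetuned-rvlcdip"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

# "document.png" is a placeholder path to a scanned document image.
image = Image.open("document.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
# Prints one of the 16 RVL-CDIP classes (letter, form, email, ...).
print(model.config.id2label[logits.argmax(-1).item()])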
https://huggingface.co/microsoft/dit-large-finetuned-rvlcdip
rare-puppers
Autogenerated by HuggingPics🤗🖼

Create your own image classifier for anything by running the demo on Google Colab.

Report any issues with the demo at the github repo.

Example Images
corgi

https://huggingface.co/nateraw/rare-puppers
Turns out if you do a cute little hack, you can make
nateraw/musicgen-songstarter-v0.2
work on vocal inputs. 👀

Now, you can hum an idea for a song and get a music sample generated with AI 🔥🔥

Give it a try: ➡️
nateraw/singing-songstarter
⬅️

It'll take your voice and try to autotune it (because let's be real, you're no Michael Jackson), then pass it along to the model to condition on the melody. It works surprisingly well!

https://huggingface.co/spaces/nateraw/singing-songstarter
baseball-stadium-foods
Autogenerated by HuggingPics🤗🖼

Create your own image classifier for anything by running the demo.

Report any issues with the demo at the github repo.

Example Images
cotton candy
https://huggingface.co/nateraw/baseball-stadium-foods
Google didn't publish the vit-tiny and vit-small model checkpoints on Hugging Face. I converted the weights from the timm repository. This model is used in the same way as ViT-base.

Note that the safetensors model requires a torch 2.0 environment.
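Since it is used the same way as ViT-base, a quick sketch with the transformers pipeline would be (the image path is a placeholder):

from transformers import pipeline

classifier = pipeline("image-classification", model="WinKawaks/vit-small-patch16-224")
# "example.jpg" is a placeholder path to any image.
print(classifier("example.jpg"))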
https://huggingface.co/WinKawaks/vit-small-patch16-224
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
MobileCLIP was introduced in MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training (CVPR 2024), by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.

This repository contains the MobileCLIP-B (LT) checkpoint for timm.
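A minimal zero-shot classification sketch, assuming this checkpoint resolves through OpenCLIP's hf-hub: loader (the image path and text prompts are placeholders):

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:apple/mobileclip_b_lt_timm")
tokenizer = open_clip.get_tokenizer("hf-hub:apple/mobileclip_b_lt_timm")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
text = tokenizer(["a diagram", "a dog", "a cat"])           # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)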

(Figure: MobileCLIP performance comparison.)

Highlights
- Our smallest variant MobileCLIP-S0 obtains zero-shot performance similar to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller.
- MobileCLIP-S2 obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained on 3x fewer seen samples.
- MobileCLIP-B (LT) attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.
https://huggingface.co/apple/mobileclip_b_lt_timm
BEiT (base-sized model, fine-tuned on ImageNet-22k)
BEiT model pre-trained in a self-supervised fashion on ImageNet-22k - also called ImageNet-21k (14 million images, 21,841 classes) at resolution 224x224, and fine-tuned on the same dataset at resolution 224x224. It was introduced in the paper BEIT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong and Furu Wei and first released in this repository.

Disclaimer: The team releasing BEiT did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description
The BEiT model is a Vision Transformer (ViT), which is a transformer encoder model (BERT-like). In contrast to the original ViT model, BEiT is pre-trained on a large collection of images in a self-supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. The pre-training objective for the model is to predict visual tokens from the encoder of OpenAI's DALL-E's VQ-VAE, based on masked patches. Next, the model was fine-tuned in a supervised fashion on the same ImageNet-22k dataset, also at resolution 224x224.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Contrary to the original ViT models, BEiT models do use relative position embeddings (similar to T5) instead of absolute position embeddings, and perform classification of images by mean-pooling the final hidden states of the patches, instead of placing a linear layer on top of the final hidden state of the [CLS] token.

By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. Alternatively, one can mean-pool the final hidden states of the patch embeddings, and place a linear layer on top of that.
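As a sketch of that feature-extraction recipe (mean-pooling the patch tokens and training a linear head on top), something like the following could work; the image path and the 10-class head are hypothetical:

import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitModel

model_id = "microsoft/beit-base-patch16-224-pt22k-ft22k"
processor = BeitImageProcessor.from_pretrained(model_id)
model = BeitModel.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # (1, 1 + num_patches, hidden_size)

features = hidden[:, 1:, :].mean(dim=1)               # mean-pool patch tokens, skipping [CLS]
head = torch.nn.Linear(model.config.hidden_size, 10)  # hypothetical 10-class linear head
logits = head(features)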

Intended uses & limitations
You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.

https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k
📣 Introducing Dataset Viber: your chill repo for data collection, annotation and vibe checks! 🎉

I've cooked up Dataset Viber, a set of cool tools designed to make data preparation for AI models easier, more approachable and enjoyable for standalone AI engineers and enthusiasts.

🔧 What Dataset Viber offers:
- CollectorInterface: Lazily collect model interaction data without human annotation
- AnnotatorInterface: Annotate your data with models in the loop
- BulkInterface: Explore data distribution and annotate in bulk
- Embedder: Efficiently embed data with ONNX-optimized speeds

🎯 Key features:
- Supports various tasks for text, chat, and image modalities
- Runs in .ipynb notebooks
- Logs data to local CSV or directly to Hugging Face Hub
- Easy to install via pip: pip install dataset-viber

It's not designed for team collaboration or production use, but rather as a fun and efficient toolkit for individual projects.

Want to give it a try? Check out the repository link https://github.com/davidberenstein1957/dataset-viber/.

I'm excited to hear your feedback and learn how you vibe with your data. Feel free to open an issue or reach out if you have any questions or suggestions!

Some shoutouts:
- Gradio for the amazing backbone
- Daniel van Strien for some initial presentations I did on vibe checks
- Emily Omier for the workshop on structuring GitHub repo READMEs
- Hamel Husain for repeatedly reminding people to look at their data
- Philipp Schmid for his code for ONNX feature-extractors
- Ben Burtenshaw for the first PR
Blane187/animalese-py


or you can convert your voice to Animalese with:
Blane187/animalese_RVC


I'm just bored, so I made the project, lol
Everchanging Quest is out!

It is an LLM-controlled rogue-like in which the LLM gets a markdown representation of the map and must generate a JSON object describing the objective to fulfill on the map, as well as the necessary objects and their placements.
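To make the idea concrete, here is a purely hypothetical example of the kind of JSON the LLM might return; the field names are illustrative, not the game's actual schema:

# Hypothetical illustration only; the real schema used by the game may differ.
quest = {
    "objective": "Retrieve the lost amulet from the ruined tower",
    "objects": [
        {"name": "amulet", "position": [12, 4]},
        {"name": "tower_key", "position": [3, 9]},
    ],
}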

Come test it on the Space:
Jofthomas/Everchanging-Quest
Some personal and professional news

I'm writing a book on ML metrics.

Together with Wojtek Kuberski, we’re creating the missing piece of every ML university program and online course: a book solely dedicated to Machine Learning metrics!

The book will cover the following types of metrics:
• Regression
• Classification
• Clustering
• Ranking
• Vision
• Text
• GenAI
• Bias and Fairness

👉 check out the book: https://www.nannyml.com/metrics
𝗭𝗲𝗿𝗼-𝗺𝗮𝘁𝗵 𝗶𝗻𝘁𝗿𝗼 𝘁𝗼 𝗔𝗜 𝗵𝗶𝘀𝘁𝗼𝗿𝘆: 𝗳𝗿𝗼𝗺 𝘁𝗵𝗲 𝟭𝟵𝟱𝟬𝘀 𝘁𝗼 𝘁𝗼𝗱𝗮𝘆'𝘀 𝗟𝗟𝗠𝘀 📖

I wanted to structure my thinking about LLMs by going through their history since the 1950s. This history is captivating, with the opposition between Connectionists (Rosenblatt, LeCun) and Symbolists, the first victories of "deep" neural networks, the revolution of attention...

So I might have gone a bit too far! 😅

📝 I've made a long post summarizing the main stages of building LLMs: neural networks, optimization, backpropagation, attention layers...

And I've made sure to keep it 100% horrible-latex-math-free: the technical stuff is conveyed in graphs only, so it should be accessible to really anyone, even your grandfather (I'm sending it to mine right now).

Read it here in English 👉 https://aymeric-roucher.github.io/brief-history-of-ai/
For the French version 👉 https://aymeric-roucher.github.io/breve-histoire-de-l-ia/
Fal/AuraFlow-v0.3
is now here with support for different aspect ratios (width/height up to 1536px!) and much nicer aesthetics! Make sure to install the latest diffusers to get support for it.
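For reference, a minimal text-to-image sketch, assuming a recent diffusers release that includes AuraFlowPipeline and a CUDA GPU (the prompt and resolution are placeholders):

import torch
from diffusers import AuraFlowPipeline

pipe = AuraFlowPipeline.from_pretrained("fal/AuraFlow-v0.3", torch_dtype=torch.float16).to("cuda")

# Placeholder prompt and resolution; v0.3 supports non-square sizes up to 1536px.
image = pipe(
    prompt="a cozy cabin in a snowy forest at dusk",
    height=1536,
    width=1024,
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("auraflow_sample.png")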
As some of you know, I try to convert models to either fp32 or bf16 depending on their size before doing imatrix and quantization.

Today I decided to see if that matters, and the results have me, for lack of a better word, perplexed.

My setup:

Mistral Nemo Instruct 2407
- convert to FP32, calculate imatrix, quantize to Q8_0 and Q4_K_M
- convert to FP16, calculate imatrix, quantize to Q8_0 and Q4_K_M

I calculated the kld base from the FP32 model:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-f32.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld -ngl 35 -fa -sm row

then calculated the divergence itself for each like so:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-Q8_0.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld --kl-divergence -ngl 50 -fa -sm row

Q4_K_M from fp16 and fp32 were similar, trading blows across statistics. Odd, since I expected fp32 to be strictly better, but it's not.

Q8_0 is where things get weird. Despite each file being a slightly different size, and the sha256sums of course being different, they each get *completely identical* scores, down to 6 decimal places of precision on the statistics.

How is this possible? Is there something I don't understand about llama.cpp that makes it always convert to fp16 before it does quantization? Am I wasting time using FP32/BF16??
https://huggingface.co/posts/bartowski/608656345183499
Improved ControlNet!
Now supports dynamic resolution for perfect landscape and portrait outputs. Generate stunning images without distortion—optimized for any aspect ratio!
...
https://huggingface.co/spaces/DamarJati/FLUX.1-DEV-Canny