Share and discover more about AI with social posts from the community.
huggingface/OpenAi
AI Comic Factory
Last release: AI Comic Factory 1.2
The AI Comic Factory will soon have an official website: aicomicfactory.app
For more information about my other projects please check linktr.ee/FLNGR.
Running the project at home
First, I would like to highlight that everything is open-source (see here, here, here, here).
However, the project isn't a monolithic Space that can be duplicated and run immediately: it requires several components for the frontend, backend, LLM, SDXL, etc.
If you duplicate the project and open the .env file, you will see that it requires some variables.
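As a rough aid for that step, here is a minimal sketch that lists which keys in a .env-style text still have empty values. The key names in the sample (EXAMPLE_BACKEND_URL and so on) are hypothetical placeholders, not the project's real variables; check the actual .env for those.

```python
def missing_env_keys(env_text: str) -> list[str]:
    """Return the keys in .env-style text whose value is empty."""
    missing = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat KEY=, KEY="" and KEY='' as unset.
        if not value.strip().strip('"').strip("'"):
            missing.append(key.strip())
    return missing


# Hypothetical sample; the real project's variable names will differ.
SAMPLE_ENV = """
# placeholder configuration
EXAMPLE_BACKEND_URL=
EXAMPLE_API_TOKEN="abc"
EXAMPLE_MODEL_NAME=""
"""

print(missing_env_keys(SAMPLE_ENV))
```

Running it on the sample reports the two keys left empty, which is the list you would need to fill in before starting the components.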
distilabel 1.3.0 is out! This release contains many core improvements and new tasks that help us build argilla/magpie-ultra-v0.1!
Distributed pipeline execution with Ray, new Magpie tasks, reward models, components for dataset diversity based on sentence embeddings, Argilla 2.0 compatibility and many more features!
Check the new release on GitHub: https://github.com/argilla-io/distilabel
JoseRFJunior/TransNAR
https://github.com/JoseRFJuniorLLMs/TransNAR
https://arxiv.org/html/2406.09308v1
TransNAR hybrid architecture. Similar to Alayrac et al., we interleave existing Transformer layers with gated cross-attention layers, which enable information to flow from the NAR to the Transformer. We generate queries from tokens while we obtain keys and values from nodes and edges of the graph. The node and edge embeddings are obtained by running the NAR on the graph version of the reasoning task to be solved. When experimenting with pre-trained Transformers, we initially close the cross-attention gate, in order to fully preserve the language model’s internal knowledge at the beginning of training.
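The gating idea in that description can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: it uses a single attention head, takes keys and values from node embeddings only (edges are left out), and assumes a tanh gate initialised at zero, as in Alayrac et al. All function and parameter names here are made up for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(tokens, nodes, w_q, w_k, w_v, gate):
    """Single-head cross-attention from tokens (queries) to NAR nodes (keys/values).

    tokens: (T, d) token embeddings from the Transformer
    nodes:  (N, d) node embeddings produced by the NAR
    gate:   scalar; tanh(gate) scales the cross-attention contribution
    """
    q = tokens @ w_q                      # queries come from tokens
    k = nodes @ w_k                       # keys come from graph nodes
    v = nodes @ w_v                       # values come from graph nodes
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores) @ v        # (T, d)
    # Residual connection; with gate == 0 the tokens pass through unchanged.
    return tokens + np.tanh(gate) * attended


rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(4, d))
nodes = rng.normal(size=(5, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

closed = gated_cross_attention(tokens, nodes, w_q, w_k, w_v, gate=0.0)
opened = gated_cross_attention(tokens, nodes, w_q, w_k, w_v, gate=1.0)
print(np.allclose(closed, tokens), np.allclose(opened, tokens))
```

With the gate closed, the layer is an identity on the token stream, which is why the pre-trained language model's behaviour is preserved at the start of training; the NAR signal only enters as the gate is learned.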
🔥 A new state-of-the-art model for background removal is out
🤗 You can try the model at ZhengPeng7/BiRefNet
📈 The model shows impressive results, outperforming briaai/RMBG-1.4
🚀 You can try out the model in ZhengPeng7/BiRefNet_demo
📃 Paper: Bilateral Reference for High-Resolution Dichotomous Image Segmentation (2401.03407)
https://cdn-uploads.huggingface.co/production/uploads/6527e89a8808d80ccff88b7a/lMX02zCeSDvLulbFFuT7N.png
Live Portrait Updated to V5
Live animation for animals added
All of the main repo changes and improvements have been added to our modified and improved app
Link: https://patreon.com/posts/107609670
New smol-vision tutorial dropped: QLoRA fine-tuning IDEFICS3-Llama 8B on VQAv2 🐶
Learn how to efficiently fine-tune the latest IDEFICS3-Llama on visual question answering in this notebook 📖
Fine-tuning notebook: https://github.com/merveenoyan/smol-vision/blob/main/Idefics_FT.ipynb
Resulting model: merve/idefics3llama-vqav2
MobileViT (small-sized model)
MobileViT model pre-trained on ImageNet-1k at resolution 256x256. It was introduced in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari, and first released in this repository. The license used is the Apple sample code license.
Disclaimer: The team releasing MobileViT did not write a model card for this model so this model card has been written by the Hugging Face team.
Model description
MobileViT is a light-weight, low-latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers. As with ViT (Vision Transformer), the image data is converted into flattened patches before it is processed by the transformer layers. Afterwards, the patches are "unflattened" back into feature maps. This allows the MobileViT-block to be placed anywhere inside a CNN. MobileViT does not require any positional embeddings.
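The flatten and "unflatten" steps can be illustrated with plain NumPy reshapes. This is only a sketch of the round trip on a small feature map; the function names are invented for the example, and the exact tensor layout used inside the real MobileViT block may differ.

```python
import numpy as np


def flatten_patches(fmap, p):
    """Split a (C, H, W) feature map into non-overlapping p x p patches.

    Returns an array of shape (num_patches, p * p * C), one row per patch.
    """
    c, h, w = fmap.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    x = fmap.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)          # (H/p, W/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)


def unflatten_patches(patches, c, h, w, p):
    """Inverse of flatten_patches: rebuild the (C, H, W) feature map."""
    x = patches.reshape(h // p, w // p, p, p, c)
    x = x.transpose(4, 0, 2, 1, 3)          # (C, H/p, p, W/p, p)
    return x.reshape(c, h, w)


fmap = np.arange(2 * 4 * 4).reshape(2, 4, 4)
patches = flatten_patches(fmap, p=2)
restored = unflatten_patches(patches, c=2, h=4, w=4, p=2)
print(patches.shape, np.array_equal(restored, fmap))
```

Because the round trip returns a tensor with the same shape as the input feature map, a block built this way can sit between ordinary convolutional layers, which is the property the description highlights.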
Intended uses & limitations
You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
https://huggingface.co/apple/mobilevit-small
Vision Transformer (base-sized model)
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
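To make the "16x16 words" in the paper title concrete, here is a small worked calculation of how many patches a 224x224 image yields and how many raw values each patch holds. The helper name is made up for the example, and three colour channels are assumed.

```python
def patch_grid(image_size: int, patch_size: int, channels: int = 3):
    """Return (number of patches, raw values per patch) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


num_patches, patch_dim = patch_grid(224, 16)
print(num_patches, patch_dim)  # 196 768
```

So each image becomes a sequence of 196 patch "words" (14 per side), each with 768 raw pixel values before projection; ViT also prepends a classification token to that sequence.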
Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team.
https://huggingface.co/google/vit-base-patch16-224