HF-hub - Share and discover more about AI with social posts from the community.huggingface/OpenAi
Share and discover more about AI with social posts from the community.huggingface/OpenAi
If you are interested in deep reinforcement learning, find below my ICML paper on how we can detect adversaries in deep reinforcement learning:

Paper: Detecting Adversarial Directions in Deep Reinforcement Learning to Make Robust Decisions
Link: https://proceedings.mlr.press/v202/korkmaz23a.html
𝗔𝗿𝗰𝗲𝗲 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝘀 𝗦𝘂𝗽𝗲𝗿𝗡𝗼𝘃𝗮, 𝗯𝗲𝘁𝘁𝗲𝗿 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗲 𝗼𝗳 𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭-𝟳𝟬𝗕!

2️⃣ versions: 70B and 8B
🧠 Trained by distilling logits from Llama-3.1-405B
🐥 Used a clever compression method to reduce dataset weight from 2.9 Petabytes down to 50GB (may share it in a paper)
⚙️ Not all benchmarks are improved: GPQA and MUSR go down a slight bit
🤗 8B weights are available on HF (not the 70B)

Read their blog post 👉 https://blog.arcee.ai/arcee-supernova-training-pipeline-and-model-composition/
Model weights (8B) 👉
arcee-ai/Llama-3.1-SuperNova-Lite Arcee-SuperNova: Training Pipeline and Model Composition
🚀 Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions and docs changes. Here's the details:

Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
📉 New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
💾 Streaming datasets: You can now train with the datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
🧩 Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
🐛 Bug fixes: Too many to name here, check out the release notes!
📝 Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.

Check out the full release notes here ⭐️: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0

I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas. Release v3.1.0 - Hard Negatives Mining utility; new loss function for symmetric tasks; streaming datasets; custom modules · UKPLab/sentence…
Please check the Open Source AI Network: we mapped the top 500 HF users
based on their followers' profiles.

The map can be found here:
bunkalab/mapping_the_OS_community
Finally tried Kotaemon, an open-source RAG tool for document chat!

With local models, it's free and private. Perfect for journalists and researchers.

I put Kotaemon to the test with EPA's Greenhouse Gas Inventory. Accurately answered questions on CO2 percentage in 2022 emissions and compared 2022 vs 2021 data

🛠 Kotaemon's no-code interface makes it user-friendly.
- Use your own models or APIs from OpenAI or Cohere
- Great documentation & easy installation
- Multimodal capabilities + reranking
- View sources, navigate docs & create graphRAG

🌟 Kotaemon is gaining traction with 11.3k GitHub stars

Try the online demo:
cin-model/kotaemon-demo

GitHub: https://github.com/Cinnamon/kotaemon
Docs: https://cinnamon.github.io/kotaemon/usage/
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:

The spectrogram input uses 128 Mel frequency bins instead of 80
A new language token for Cantonese
The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2 . The model was trained for 2.0 epochs over this mixture dataset.

The large-v3 model shows improved performance over a wide variety of languages, showing 10% to 20% reduction of errors compared to Whisper large-v2 . For more details on the different checkpoints available, refer to the section Model details.

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.
MiniCPM3-4B is the 3rd generation of MiniCPM series. The overall performance of MiniCPM3-4B surpasses Phi-3.5-mini-Instruct and GPT-3.5-Turbo-0125, being comparable with many recent 7B~9B models.

Compared to MiniCPM1.0/MiniCPM2.0, MiniCPM3-4B has a more powerful and versatile skill set to enable more general usage. MiniCPM3-4B supports function call, along with code interpreter. Please refer to Advanced Features for usage guidelines.

MiniCPM3-4B has a 32k context window. Equipped with LLMxMapReduce, MiniCPM3-4B can handle infinite context theoretically, without requiring huge amount of memory.
FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. For more information, please read our blog post.

Key Features
Cutting-edge output quality, second only to our state-of-the-art model FLUX.1 [pro].
Competitive prompt following, matching the performance of closed source alternatives .
Trained using guidance distillation, making FLUX.1 [dev] more efficient.
Open weights to drive new scientific research, and empower artists to develop innovative workflows.
Generated outputs can be used for personal, scientific, and commercial purposes as described in the FLUX.1 [dev] Non-Commercial License.
When the three AI Godfathers join hands to write a paper you know it’s nothing short of classic genius! This was an excellent read, I hope they write one on Generative AI.

Read: https://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf
🎓 Introducing the конспекты-уроков.рф Lesson Plans Dataset -
nyuuzyou/classnotes


Dataset highlights:
- Metadata for 65,068 lesson plans from конспекты-уроков.рф
- 58,433 lesson plans available in original format
- Multilingual content: Primarily Russian, with some Kazakh, Ukrainian, Belarusian, and English
- Each entry includes: URL, title, description, author, publication date, file size, and download link
- Data reflects educational materials accessible through the конспекты-уроков.рф platform
- Licensed under Creative Commons (https://creativecommons.org/licenses/by-nc/3.0/deed.en)

This dataset offers a unique window into online educational resources, particularly in Russian-language contexts. It provides opportunities for analyzing lesson plan trends, topic distributions, and language patterns in educational materials. The dataset is particularly well-suited for tasks such as text classification and text retrieval in multilingual educational settings.
> Article read: Simple guide to LLM inference and to TGI

I've just read article "LLM inference at scale with TGI" by @martinigoyanes . It's really good content, a must-read if you want a good low-level intro to LLM inference with TGI!

My takeaways:

How does inference work?
🧠 Prefill: the input prompt is tokenized on CPU, then transferred to GPU. Then one single forward pass generates the initial token.
🔄 Decode: the model generates ("decodes") tokens one by one, each time appending the new token to the current input of size N to then generate a new token again with this augmented input of length N+1. This loop ends either when a specific token called "End-of-sequence" is generated or when the completion reaches a pre-specified maximum length. Then the sequence is de-tokenized on CPU to yield text again.
This step's speed determines the Time Per Output Token, which directly translates to the key metric: Throughput

🤔 How was the separation between the two steps decided ? Like, why does prefill include this strange generation of only one token at then end?
➡️ The cost of attention scales quadratically with the number of tokens, so it can really explode quickly.
To compensate for that, a really important technique called KV caching was devised: using the fact that when generating token N+1, the Key and Value (K and V) matrices generated inside the Transformers are a simple extension from the K and V from the previous step, the model caches the K and V matrices between steps : thus the separation - the prefill part is the part that prepares this KV cache, while the decoding is the one that leverages it and expands it by one at each step.

TGI-specific takeaways:
⚙️ TGI has many SOTA techniques for decoding: Paged Attention, KV Caching and Flash Attention…
🔀 TGI's router handles generations finishing early because of an EOS token: instead of static batching, it continuously batches requests to the inference engine & filters away finished requests. https://cdn-uploads.huggingface.co/production/uploads/63d10d4e8eaa4831005e92b5/8_CFLfbkMRDWj8QkgTcRh.png
Help me to upgrade my model.

Hi all, so I am a complete beginner in coding, however, with the help of Claude (similar to Matt :P) and GPT 4o have been able to develop this RAG PDF summarizer/Q&A plus a web search tool.

The application is specifically built for summarization task including summarizing a financial document, news article, resume, research document, call transcript, etc.

The space could be found here:
Shreyas094/SearchGPT


The news tool simply use duckduckgo chat to generate the search results using llama 3.1 70bn model.

I want your support to fine tune the retrieval task for handling more unstructured documents.
A lot of coverage of the Apple event! I’ve selected a few unique angles and distinctive takes.

The NYT
- "The iPhone’s limited feature set is emblematic of how Apple is taking a cautious approach to generative A.I."
- "Wall Street is enthusiastic about the artificially intelligent phones, with analysts predicting the features could help Apple sell a record 240 million iPhones next year."

The Guardian
- "Despite the bells and whistles, and being a tech-adopting lot, I bet many of you won’t be lining up to buy it."
- One reason is the simple cost of the iPhone 16, which starts at $799.
- The adoption of AI into the iPhone could be considered a step change in how the iPhone works. But there may not be a huge hankering to use ChatGPT on your phone."

The WSJ
- Apple didn’t say when the AI services would be available in China, its second-largest market after the U.S.
- The delay puts the iPhone maker at a disadvantage against rivals offering AI services
- Huawei held its own announcement in China to release the Mate XT, a three-way foldable smartphone with AI features.
- Apple said that the launch of Apple Intelligence was subject to regulatory approval. In China, any generative AI models that could influence public opinion need government approval.

CNN
- "For an event built around unveiling Apple’s first AI-powered iPhone, there was one striking absence over the two-hour presentation: the words 'artificial intelligence.'"
- "But Apple understands something that often gets lost in the bot-pilled bubble of Silicon Valley: Regular people don’t trust AI."

Links:
https://www.nytimes.com/2024/09/09/technology/apple-event-iphone-16-watch.html
https://www.theguardian.com/technology/article/2024/sep/10/techscape-iphone-16-cost-features
https://www.wsj.com/tech/apples-challenge-in-china-rises-with-new-rival-phones-and-ai-delay-8cf871fb?mod=rss_Technology
https://www.cnn.com/2024/09/10/business/apple-iphone-ai-nightcap/ Apple Unveils New iPhones With Built-In Artificial Intelligence
Almost ready: search for a Hugging Face dataset on the Hub from information in the datasets viewer preview!

Soon, you can find deep-cut datasets even if they don't have a full dataset card (you should still document your datasets!)

You can help improve this project by rating synthetic user search queries for hub datasets.

If you have a Hub login, you can start annotating in Argilla
in < 5 seconds here: https://davanstrien-my-argilla.hf.space/dataset/1100a091-7f3f-4a6e-ad51-4e859abab58f/annotation-mode

I need to do some tidying, but I'll share all the code and in-progress datasets for this soon!https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/4ZfsdtnD8ay-WnkqxjpjX.png
Reflection Llama 3.1 70B (Correct Weights) on ZeroGPU thanks to llama.cpp and unsloth (for quantization)

ZeroGPU space
-
gokaygokay/Reflection-70B-llamacpp


- Working Model
mattshumer/ref_70_e3


- Quantized Models
unsloth/Reflection-Llama-3.1-70B-GGUF
Interested in learning about everything Image?

​With the rise of recent interest in Vision Language Models (VLMs), we decided to make a push to include an ImageField within Argilla! This means any open source developer can now work on better models for vision ML tasks too and we would like to show you how.

​We would love to introduce this new feature to you, so we've prepared a set of notebooks to go over some common image scenarios.
finetune an CLIP retrieval model with sentence transformers
use ColPali+ Qwen VL for RAG and log the results to Argilla
image-generation preference: creating multi-modal preference datasets for free using Hugging Face inference endpoints.

​See you on Thursday!

https://lu.ma/x7id1jqu Everything image: from fine-tuning CLIP models to synthetic image datasets · Luma
The New York Times did a fun quiz to test your ability to detect whether a video is AI-generated or real. They put Runway, Kling, and Sora to the test.

I got 10/10 🤓 —how about you?

https://www.nytimes.com/interactive/2024/09/09/technology/ai-video-deepfake-runway-kling-quiz.html

00:12 A.I. Can Now Create Lifelike Videos. Can You Tell What’s Real?
⚖️ 𝐀𝐈 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐢𝐬 𝐂𝐨𝐩𝐲𝐫𝐢𝐠𝐡𝐭 𝐈𝐧𝐟𝐫𝐢𝐧𝐠𝐞𝐦𝐞𝐧𝐭

This bold claim is not my opinion, but it has been made in a recent "report" of a group, whose stance is recognizable in their name. It is roughly translated as "Authors' Rights Initiative". They published a report which was also presented before the EU Parliament according to the LinkedIn post below.

I am not really interested in politics, but as an EU citizen I am of course somewhat interested in a reasonable and practical version of the EU AI Act. Not saying there should not be rules around data and AI, but this report is obviously very biased towards one side.

While I think the report itself does not deserve attention, I post it in the hope that you find more examples, where they did not address the issue adequately. Feel free to add to my LinkedIn posts (where the original authors will see it) or here.

[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214

LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx