Have you tried the new SQL Console yet?

Would love to know any queries you've tried, or general feedback! If you haven't, go try it out and let us know 🤗

If you have some interesting queries, feel free to share the URLs as well!
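If you want to prototype queries locally first, the Console is DuckDB-based, so something like the sketch below should behave similarly (a minimal sketch; the dataset id and parquet glob are placeholder examples, and hf:// paths need a recent DuckDB version):

```python
# Minimal local sketch of a SQL Console-style query using DuckDB's hf:// paths.
# The dataset id and file pattern are placeholders; adjust to your dataset's layout.
import duckdb

result = duckdb.sql("""
    SELECT label, COUNT(*) AS n
    FROM 'hf://datasets/stanfordnlp/imdb/plain_text/train-*.parquet'
    GROUP BY label
    ORDER BY n DESC
""").df()
print(result)
```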
๐—˜๐˜…๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐—ถ๐—ป๐—ด ๐˜†๐—ผ๐˜‚๐—ฟ ๐—›๐—ง๐— ๐—Ÿ ๐˜„๐—ฒ๐—ฏ๐—ฝ๐—ฎ๐—ด๐—ฒ๐˜€ ๐˜๐—ผ ๐—บ๐—ฎ๐—ฟ๐—ธ๐—ฑ๐—ผ๐˜„๐—ป ๐—ถ๐˜€ ๐—ป๐—ผ๐˜„ ๐—ฝ๐—ผ๐˜€๐˜€๐—ถ๐—ฏ๐—น๐—ฒ ๐—ฒ๐—ป๐—ฑ-๐˜๐—ผ-๐—ฒ๐—ป๐—ฑ ๐˜„๐—ถ๐˜๐—ต ๐—ฎ ๐˜€๐—ถ๐—บ๐—ฝ๐—น๐—ฒ ๐—Ÿ๐—Ÿ๐— ! ๐Ÿ‘

Jina just released Reader-LM, which handles the whole pipeline of extracting markdown from HTML webpages.

A while ago, Jina had released a completely code-based deterministic program to do this extraction, based on some heuristics: e.g., "if the text is in a <p> tag, keep it, but if it's hidden behind another, remove it".

🤔 But they received complaints from readers: some found it too detailed, others not detailed enough, depending on the page.

โžก๏ธ So they decided, ๐—บ๐—ฎ๐˜†๐—ฏ๐—ฒ ๐—ต๐—ฒ๐˜‚๐—ฟ๐—ถ๐˜€๐˜๐—ถ๐—ฐ๐˜€ ๐˜„๐—ฒ๐—ฟ๐—ฒ ๐—ป๐—ผ๐˜ ๐—ฒ๐—ป๐—ผ๐˜‚๐—ด๐—ต: ๐—ถ๐—ป๐˜€๐˜๐—ฒ๐—ฎ๐—ฑ, ๐˜๐—ต๐—ฒ๐˜† ๐˜๐—ฟ๐—ถ๐—ฒ๐—ฑ ๐˜๐—ผ ๐˜๐—ฟ๐—ฎ๐—ถ๐—ป ๐—ฎ ๐—Ÿ๐—Ÿ๐—  ๐˜๐—ผ ๐—ฑ๐—ผ ๐˜๐—ต๐—ฒ ๐—ฐ๐—ผ๐—บ๐—ฝ๐—น๐—ฒ๐˜๐—ฒ ๐—ฒ๐˜…๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐—ถ๐—ผ๐—ป. This LLM does not need to be very strong,but it should handle a very long context: itโ€™s a challenging, โ€œshallow-but-wideโ€ architecture.

๐—ง๐—ฒ๐—ฐ๐—ต๐—ป๐—ถ๐—ฐ๐—ฎ๐—น ๐—ถ๐—ป๐˜€๐—ถ๐—ด๐—ต๐˜๐˜€:
2๏ธโƒฃ models: Reader-LM-0.5B and 1.5B
โš™๏ธ Two stages of training: first, short and simple HTML to get the basics, then ramp up to longer and harder HTML up to 128k tokens
๐Ÿ”Ž Use contrastive search for decoding: this empirically reduces โ€œrepeating outputโ€ issues
โžก๏ธ Their models beat much larger models at HTML extraction ๐Ÿ”ฅ
๐Ÿค— Weights available on HF (sadly cc-by-nc license):
jinaai/reader-lm-1.5b
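A rough usage sketch (untested; check the model card for the exact prompt format — note that contrastive search is what transformers enables via penalty_alpha + top_k):

```python
# Hedged sketch: HTML -> markdown with Reader-LM, decoded with contrastive search.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-1.5b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

html = "<html><body><h1>Title</h1><p>Keep this paragraph.</p></body></html>"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": html}],
    add_generation_prompt=True,
    return_tensors="pt",
)
# penalty_alpha + top_k activates contrastive search in transformers.generate().
outputs = model.generate(inputs, max_new_tokens=1024, penalty_alpha=0.6, top_k=4)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```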
Hugging Face presents FineVideo 🎥! Unlocking the next generation of video understanding 🚀

🤯 3,400 hours of annotated Creative Commons videos with rich character descriptions, scene splits, mood, and content descriptions per scene, as well as QA pairs.
🔥
@mfarre processed over 2M YouTube-CC videos to make this incredibly powerful selection.

Very psyched to fine-tune Idefics on this dataset. ⚡️
Explore the videos:
HuggingFaceFV/FineVideo-Explorer
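If you'd rather peek at the data directly, a streaming load avoids pulling all 3,400 hours at once (a sketch; the dataset id is assumed from the Space name, and the field names may differ):

```python
# Hedged sketch: stream one FineVideo sample instead of downloading everything.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # expect video data plus the rich per-scene annotations
```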
๐Ž๐ฉ๐ž๐ง๐€๐ˆ ๐Ÿ๐ข๐ง๐š๐ฅ๐ฅ๐ฒ ๐ซ๐ž๐ฏ๐ž๐š๐ฅ๐ฌ โ€œ๐Ÿ“โ€: ๐œ๐ซ๐š๐ณ๐ฒ ๐œ๐ก๐š๐ข๐ง-๐จ๐Ÿ-๐ญ๐ก๐จ๐ฎ๐ ๐ก๐ญ-๐ญ๐ฎ๐ง๐ž๐ ๐ฆ๐จ๐๐ž๐ฅ >> ๐†๐๐“-๐Ÿ’๐จ ๐Ÿ’ฅ

OpenAI had hinted at a mysterious "project strawberry" for a long time: they published this new model called "o1" 1 hour ago, and the performance is just mind-blowing.

🤯 Ranks among the top 500 students in the US in a qualifier for the USA Math Olympiad
🤯 Beats human-PhD-level accuracy by 8% on GPQA, a benchmark of hard science problems where the previous best was Claude 3.5 Sonnet at 59.4%
🤯 Scores 78.2% on the vision benchmark MMMU, making it the first model competitive with human experts
🤯 GPT-4o scored 60% on MATH ⇒ o1 scores 95%

How did they pull this off? Sadly, OpenAI keeps improving their performance at "making cryptic AF reports to not reveal any real info", so here are some excerpts:

💬 "o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes."

And of course, they decide to hide the content of this precious Chain-of-Thought. Would it be for maximum profit? Of course not, you awful capitalist, it's to protect users:

💬 "We also do not want to make an unaligned chain of thought directly visible to users."

They're right, it would certainly have hurt my feelings to see the internals of this model tearing apart math problems.

🤔 I suspect it could be not only CoT, but also some agentic behaviour where the model can call a code executor. The kind of score improvements they show certainly look like the ones you see with agents.

This model will be immediately released for ChatGPT and some "trusted API users".

Let's start cooking to release the same thing in 6 months! 🚀
I believe Hugging Face should have something similar to Hacktoberfest. I miss the days when there were events like this every 3 months for audio, deep reinforcement learning, and Gradio themes, but it turns out everything has slowed down. There are no more Hugging Face events.
📢 The Three-hop (💡aspect + 🤔opinion + 🧠reason) Chain-of-Thought concept + LLM represents a decent approach for reasoning about the emotions of participants in textual dialogues.
Delighted to share a tutorial video that makes you aware of:
✅ The proper application of LLMs to implicit IR
✅ Ways of aligning different information types (causes and states) within the same LLM
✅ How to launch your LLM in Google Colab, capable of extracting characters' emotions from dialogues 🧪

🎥: https://www.youtube.com/watch?v=vRVDQa7vfkU

Project: https://github.com/nicolay-r/THOR-ECAC
Paper: https://aclanthology.org/2024.semeval-1.4/
Model card:
nicolay-r/flan-t5-emotion-cause-thor-base
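To poke at the checkpoint directly, a plain seq2seq call works (a sketch; the prompt below is illustrative only — the real three-hop prompts live in the project repo):

```python
# Hedged sketch: query the fine-tuned Flan-T5 checkpoint from the post.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "nicolay-r/flan-t5-emotion-cause-thor-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Illustrative prompt; see the THOR-ECAC repo for the exact three-hop format.
prompt = "Speaker A: I finally passed the exam! Which emotion does Speaker A express?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```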
The Romulus model series has been released on Hugging Face, continually pre-trained on 34,864,949 tokens of French laws and intended to serve as a foundation for fine-tuning on labeled data 🤗

The training code, dataset, and model weights are open and freely available on HF, and training ran on an H100 provided by Microsoft for Startups, using Unsloth AI by @danielhanchen and @shimmyshimmer 🦥

Link to the base model:
louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1


Link to the instruct model:
louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1-Instruct


Link to the dataset:
louisbrulenaudet/Romulus-cpt-fr


Please note that these models have not been aligned for the production of usable texts as they stand, and will certainly need to be refined for the desired tasks in order to produce satisfactory results.
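If you want to build on it, loading the base checkpoint follows the standard transformers pattern (a minimal sketch; adjust dtype and device to your hardware):

```python
# Hedged sketch: load the base Romulus checkpoint as a starting point for fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
```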
> Want to know how much an API LLM call costs you?

I've just made this Space that gets you the API price for any LLM call, for nearly all inference providers out there!

This is based on a comment by @victor under my HF Post a few months back, and leverages BerriAI's data for LLM prices.

Check it out here 👉
m-ric/text_to_dollars
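Since the Space leverages BerriAI's pricing data, you can also query the same numbers programmatically through their litellm package (a sketch; the token counts are made up):

```python
# Hedged sketch: estimate an LLM call's cost from BerriAI's litellm pricing data.
from litellm import cost_per_token

prompt_cost, completion_cost = cost_per_token(
    model="gpt-4o", prompt_tokens=1_000, completion_tokens=500
)
print(f"Estimated cost: ${prompt_cost + completion_cost:.4f}")
```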
Auto-regressive LMs have ruled, but encoder-based architectures like GLiNER are proving to be just as powerful for information extraction while offering better efficiency and interpretability. 🔍✨

Past encoder backbones were limited by small pre-training datasets and old techniques, but with innovations like LLM2Vec, we've transformed decoders into high-performing encoders! 🔄💡

What's New?
🔹 Converted Llama & Qwen decoders to advanced encoders
🔹 Improved the GLiNER architecture to work with rotary positional encoding (RoPE)
🔹 New GLiNER (zero-shot NER) & GLiClass (zero-shot classification) models

🔥 Check it out:

New models:
knowledgator/llm2encoder-66d1c76e3c8270397efc5b5e


GLiNER package: https://github.com/urchade/GLiNER

GLiClass package: https://github.com/Knowledgator/GLiClass

💻 Read our blog for more insights, and stay tuned for what's next!
https://medium.com/@knowledgrator/llm2encoders-e7d90b9f5966
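For reference, zero-shot NER with the GLiNER package looks roughly like this (a sketch; the checkpoint name is a placeholder — pick one from the collection above):

```python
# Hedged sketch: zero-shot NER with GLiNER (pip install gliner).
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")  # placeholder checkpoint
text = "Hugging Face was founded by Clément Delangue in New York."
labels = ["person", "organization", "location"]  # arbitrary label set, no retraining
for entity in model.predict_entities(text, labels):
    print(entity["text"], "->", entity["label"])
```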
Free research tip:
Get used to writing the first draft of your paper in markdown using VS Code's Jupyter notebook extension: it lets you do quick sanity checks with code and maths. An absolute AAA experience :)
Made an image similarity demo to test out the
mistral-community/pixtral-12b-240910
model.

If anyone knows how to generate captions with it, please do let me know x 🚀

Here's the demo:
Tonic/Pixtral


hope you like it 🤗
What if we asked the AI what it thought of our Hugging Face profile? 👹
I've released a new Space capable of doing it... watch out, it hits hard! 🥊

Try it now ➡️
enzostvs/hugger-roaster


Share your roast below 👇
If you are interested in deep reinforcement learning, find below my ICML paper on how we can detect adversaries in deep reinforcement learning:

Paper: Detecting Adversarial Directions in Deep Reinforcement Learning to Make Robust Decisions
Link: https://proceedings.mlr.press/v202/korkmaz23a.html
Arcee releases SuperNova, a better fine-tune of Llama-3.1-70B!

2๏ธโƒฃ versions: 70B and 8B
๐Ÿง  Trained by distilling logits from Llama-3.1-405B
๐Ÿฅ Used a clever compression method to reduce dataset weight from 2.9 Petabytes down to 50GB (may share it in a paper)
โš™๏ธ Not all benchmarks are improved: GPQA and MUSR go down a slight bit
๐Ÿค— 8B weights are available on HF (not the 70B)

Read their blog post 👉 https://blog.arcee.ai/arcee-supernova-training-pipeline-and-model-composition/
Model weights (8B) 👉
arcee-ai/Llama-3.1-SuperNova-Lite
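For intuition on the distillation step: logit distillation typically minimizes the KL divergence between softened teacher and student token distributions. A generic sketch (purely illustrative, not Arcee's actual pipeline):

```python
# Hedged, generic logit-distillation loss: soften both distributions with a
# temperature, then match the student to the teacher via KL divergence.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```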
🚀 Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions, and docs changes. Here are the details:

โ› Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
๐Ÿ“‰ New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
๐Ÿ’พ Streaming datasets: You can now train with the datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
๐Ÿงฉ Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
โœจ New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
๐Ÿ› Bug fixes: Too many to name here, check out the release notes!
๐Ÿ“ Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.

Check out the full release notes here ⭐️: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0

I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
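A sketch of the mining utility mentioned above (argument names may differ slightly; check the release notes for the full signature):

```python
# Hedged sketch: mine hard negatives from (query, answer) pairs with v3.1's utility.
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")
pairs = Dataset.from_dict({
    "query": ["What is the capital of France?", "Who wrote Hamlet?"],
    "answer": ["Paris is the capital of France.", "Hamlet was written by Shakespeare."],
})
# Returns a dataset of (query, answer, negative) triplets suitable for training.
triplets = mine_hard_negatives(pairs, model, num_negatives=2)
```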
Please check the Open Source AI Network: we mapped the top 500 HF users based on their followers' profiles.

The map can be found here:
bunkalab/mapping_the_OS_community
Finally tried Kotaemon, an open-source RAG tool for document chat!

With local models, it's free and private. Perfect for journalists and researchers.

I put Kotaemon to the test with the EPA's Greenhouse Gas Inventory. It accurately answered questions on the CO2 percentage of 2022 emissions and compared 2022 vs. 2021 data.

🛠 Kotaemon's no-code interface makes it user-friendly.
- Use your own models or APIs from OpenAI or Cohere
- Great documentation & easy installation
- Multimodal capabilities + reranking
- View sources, navigate docs & create GraphRAG

🌟 Kotaemon is gaining traction with 11.3k GitHub stars

Try the online demo:
cin-model/kotaemon-demo

GitHub: https://github.com/Cinnamon/kotaemon
Docs: https://cinnamon.github.io/kotaemon/usage/
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:

The spectrogram input uses 128 Mel frequency bins instead of 80
A new language token for Cantonese
The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.

The large-v3 model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2. For more details on the different checkpoints available, refer to the Model details section.

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.
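Typical usage goes through the transformers ASR pipeline (a minimal sketch; "audio.mp3" is a placeholder path):

```python
# Hedged sketch: transcription with Whisper large-v3 through the ASR pipeline.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # use "cpu" if no GPU is available
)
print(asr("audio.mp3")["text"])  # "audio.mp3" is a placeholder file
```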
MiniCPM3-4B is the 3rd generation of the MiniCPM series. The overall performance of MiniCPM3-4B surpasses Phi-3.5-mini-Instruct and GPT-3.5-Turbo-0125, and is comparable with many recent 7B~9B models.

Compared to MiniCPM1.0/MiniCPM2.0, MiniCPM3-4B has a more powerful and versatile skill set to enable more general usage. MiniCPM3-4B supports function calling and a code interpreter. Please refer to Advanced Features for usage guidelines.

MiniCPM3-4B has a 32k context window. Equipped with LLMxMapReduce, MiniCPM3-4B can theoretically handle infinite context without requiring a huge amount of memory.
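A chat sketch with transformers (assumptions: the checkpoint id openbmb/MiniCPM3-4B, and that it needs trust_remote_code, as most MiniCPM releases do — check the model card):

```python
# Hedged sketch: chat with MiniCPM3-4B via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "openbmb/MiniCPM3-4B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize why long context windows matter."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```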