Update daily papers 2025-01-23
github-actions committed Jan 23, 2025
1 parent de4efd6 commit fc3c2b8
Showing 2 changed files with 31 additions and 1 deletion.
17 changes: 16 additions & 1 deletion README.md
@@ -3,7 +3,7 @@
<table style="border: none; border-collapse: collapse;">
<tr style="border: none;">
<td style="border: none;">
-            <img src="https://img.shields.io/badge/Last%20Updated-2025--01--22-brightgreen" alt="Last Updated"> <a href="https://t.me/daily_ai_papers"><img src="https://img.shields.io/badge/Telegram-Join%20Channel-blue?style=flat-square&logo=telegram" alt="Telegram"></a> <a href="https://gabrielchua.me/daily-ai-papers/"><img src="https://img.shields.io/badge/Website-Visit%20Daily%20AI%20Papers-blue?style=flat-square&logo=github" alt="Website"></a> <br><br>
+            <img src="https://img.shields.io/badge/Last%20Updated-2025--01--23-brightgreen" alt="Last Updated"> <a href="https://t.me/daily_ai_papers"><img src="https://img.shields.io/badge/Telegram-Join%20Channel-blue?style=flat-square&logo=telegram" alt="Telegram"></a> <a href="https://gabrielchua.me/daily-ai-papers/"><img src="https://img.shields.io/badge/Website-Visit%20Daily%20AI%20Papers-blue?style=flat-square&logo=github" alt="Website"></a> <br><br>
Summaries auto-generated from <a href="https://huggingface.co/papers">HuggingFace's Daily Papers</a> using Gemini and GitHub Actions. All credits go to the research and HuggingFace communities.<br><br>
@@ -17,6 +17,21 @@ Note: Authors may be listed by their HuggingFace IDs. Additionally, summaries ar
</table>


## Papers for 2025-01-23

| Title | Authors | Summary |
|-------|---------|---------|
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Read more on [arXiv](https://arxiv.org/abs/2501.12948) or [HuggingFace](https://huggingface.co/papers/2501.12948))| AS-7, haha-point, freesky, DejianYang, guoday | DeepSeek-R1 is a series of reasoning models developed using reinforcement learning. **Main research question or objective:** How to enhance the reasoning capabilities of large language models (LLMs) using reinforcement learning (RL) without supervised fine-tuning (SFT). **Key methodology used:** A multi-stage training pipeline involving initial fine-tuning on a small amount of cold-start data, followed by reasoning-oriented RL, rejection sampling with supervised fine-tuning, and finally, reinforcement learning for all scenarios, alongside distillation to smaller models. **Primary results:** DeepSeek-R1 achieved 79.8% Pass@1 on AIME 2024, surpassing OpenAI-o1-1217, and scored 97.3% on MATH-500. **Principal implication for AI practitioners:** The findings suggest that the distillation of reasoning patterns from larger models into smaller models is highly effective, offering a practical approach for enhancing reasoning abilities in resource-constrained applications. |
| FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces (Read more on [arXiv](https://arxiv.org/abs/2501.12909) or [HuggingFace](https://huggingface.co/papers/2501.12909))| Senbao Shi, Li-Zhouyi, PigCatchingExpert, longyuewang, imryanxu | FILMAGENT is an LLM-based multi-agent framework for automated film production in 3D virtual spaces. The main research objective is to automate virtual film production using a collaborative multi-agent approach. The key methodology involves simulating film crew roles (director, screenwriter, actors, cinematographer) with LLM-based agents, using a three-stage workflow (idea development, scriptwriting, cinematography) with Critique-Correct-Verify and Debate-Judge collaboration algorithms. Primary results show that FILMAGENT achieved an average human evaluation score of 3.98 out of 5, outperforming single-agent baselines. The principal implication for AI practitioners is that multi-agent collaboration can significantly enhance the quality of automated film production, offering a viable approach for end-to-end film automation. |
| Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback (Read more on [arXiv](https://arxiv.org/abs/2501.12895) or [HuggingFace](https://huggingface.co/papers/2501.12895))| Yu Cheng, linjieli222, Xiaoye08, huxy912, yaful | Test-time preference optimization (TPO) aligns large language model (LLM) outputs with human preferences during inference without retraining. The research objective was to determine if LLMs could be aligned with human preferences during inference using iterative textual feedback rather than purely numerical rewards. TPO iteratively refines LLM outputs based on textual critiques derived from a reward model's numerical scores. Evaluation across multiple benchmarks showed TPO progressively improved alignment; for example, the unaligned Llama-3.1-70B-SFT model surpassed its aligned counterpart, Llama-3.1-70B-Instruct, on several metrics after only a few iterations. This work demonstrates a practical, lightweight method for test-time preference optimization, enabling rapid adaptation of LLMs to evolving preferences without retraining, offering AI practitioners a computationally efficient alignment technique. (An illustrative sketch of this inference-time loop appears after this table.) |
| VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding (Read more on [arXiv](https://arxiv.org/abs/2501.13106) or [HuggingFace](https://huggingface.co/papers/2501.13106))| Sicong, Guanzheng, Zhiqiang007, ClownRat, CausalLi | VideoLLaMA3 is an advanced multimodal foundation model designed for image and video understanding, emphasizing a vision-centric approach. The main research objective is to develop a more capable model for both image and video understanding by leveraging high-quality image-text data. The key methodology involves a four-stage training paradigm: vision-centric alignment, vision-language pretraining, multi-task fine-tuning, and video-centric fine-tuning, coupled with a vision encoder adapted for dynamic resolution inputs and video token compression. Primary results show that VideoLLaMA3 achieves state-of-the-art performance on several benchmarks, including a 67.1% accuracy on the MathVista testmini dataset. The principal implication for AI practitioners is that focusing on high-quality image-text data and vision-centric training can significantly enhance both image and video understanding capabilities in multimodal models, as demonstrated by VideoLLaMA3's performance improvements. |
| Kimi k1.5: Scaling Reinforcement Learning with LLMs (Read more on [arXiv](https://arxiv.org/abs/2501.12599) or [HuggingFace](https://huggingface.co/papers/2501.12599))| ChonghuaLiao, DuChenZhuang, shelowize, xingbowei, KbsdJames | Kimi k1.5 is a multi-modal large language model trained with reinforcement learning, featuring enhanced reasoning and long-context processing. The main research objective is to explore scaling reinforcement learning (RL) with large language models (LLMs) to improve performance beyond the limitations of traditional supervised fine-tuning. The key methodology involves long-context scaling up to 128k tokens, improved policy optimization via a variant of online mirror descent, a simplistic RL framework, and multi-modal training on text and vision data. A primary result is that the long chain-of-thought (long-CoT) version achieved 96.2 on the MATH-500 benchmark. The principal implication for AI practitioners is that scaling context length in RL with LLMs, combined with refined optimization techniques, can significantly improve model performance on complex reasoning tasks, offering a viable path for continued advancements in AI capabilities. |
| Autonomy-of-Experts Models (Read more on [arXiv](https://arxiv.org/abs/2501.13074) or [HuggingFace](https://huggingface.co/papers/2501.13074))| Yining Qian, kangzhanhui, shwu, Ruobing-Xie, AngLv | This paper introduces Autonomy-of-Experts (AoE), a novel Mixture-of-Experts (MoE) paradigm where experts autonomously select inputs based on their internal activation norms. The main research question is whether allowing experts to autonomously select inputs based on their internal activation norms can improve upon the traditional MoE model's expert selection and training effectiveness. The key methodology involves removing routers and having experts pre-compute internal activations for inputs, ranking them by their activation norms, and only forwarding the top-ranking experts for processing. Primary results show that AoE models outperform traditional MoE models in downstream tasks, with a specific finding that a 4B parameter AoE model achieved an average accuracy of 49.80 across various tasks, compared to 48.06 for a comparable traditional MoE model. For AI practitioners, the principal implication is that AoE offers a more efficient and effective approach to training MoE models by eliminating the need for routers and improving expert specialization, directly enhancing downstream performance. (A rough sketch of this norm-based self-selection appears after this table.) |
| Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament (Read more on [arXiv](https://arxiv.org/abs/2501.13007) or [HuggingFace](https://huggingface.co/papers/2501.13007))| Yixin Cao, Rui Min, Zijun Yao, Yantao Liu, juanli | Pairwise Reward Model (Pairwise RM) is introduced to improve Best-of-N (BoN) sampling for Large Language Models (LLMs) through a knockout tournament framework. The main research question is how to effectively select the best candidate solution from multiple LLM-generated outputs without relying on arbitrary and inconsistent reward scores. The key methodology involves training a Pairwise RM to perform pairwise comparisons of candidate solutions' correctness and using a knockout tournament to iteratively eliminate incorrect solutions. Primary results show that Pairwise RM achieves a 6.7% average improvement on MATH-500 over the strongest baseline. The principal implication for AI practitioners is that Pairwise RM with knockout tournaments offers a more robust mechanism for selecting the best solution in BoN sampling, especially for challenging math problems. (A short sketch of the knockout tournament appears after this table.) |
| O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning (Read more on [arXiv](https://arxiv.org/abs/2501.12570) or [HuggingFace](https://huggingface.co/papers/2501.12570))| Yibo Wang, Haiying He, Li Shen, cxc361461518, iNk233 | O1-Pruner is a fine-tuning method designed to reduce the inference overhead of long-thought reasoning models while maintaining accuracy. The main research question is how to minimize the reasoning overhead of long-thought Large Language Models (LLMs) without compromising their accuracy. The key methodology is Length-Harmonizing Fine-Tuning (O1-Pruner), which uses pre-sampling and RL-style fine-tuning to encourage shorter reasoning processes under accuracy constraints. The primary results show that O1-Pruner reduces solution length by 40.5% while achieving an average accuracy of 76.8% on the Marco-o1-7B model. The principal implication for AI practitioners is that O1-Pruner offers an effective method to optimize long-thought reasoning models, achieving a balance between computational efficiency and high accuracy. |
| IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems (Read more on [arXiv](https://arxiv.org/abs/2501.11067) or [HuggingFace](https://huggingface.co/papers/2501.11067))| Ilankad23, Eladlev | IntellAgent is a multi-agent framework for evaluating conversational AI systems by generating synthetic benchmarks. The main research objective is to develop a scalable, open-source framework that addresses the limitations of manually curated benchmarks for evaluating conversational AI. The key methodology involves a multi-agent pipeline that combines policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. Primary results show a strong correlation (0.98 for Airline, 0.92 for Retail) between model performance on IntellAgent and the τ-bench benchmark, despite IntellAgent using only synthetic data. The principal implication for AI practitioners is that IntellAgent provides a robust and detailed evaluation tool for conversational AI, enabling targeted optimization of models across diverse scenarios and policies. |
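
The sketch below is a minimal, hypothetical rendering of the kind of inference-time loop the TPO summary describes: sample candidates, score them with a reward model, turn the score gap into a textual critique, and ask the model to revise. The callables `generate`, `score_fn`, and `revise`, and the critique wording, are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a test-time preference optimization (TPO)-style loop.
# `generate`, `score_fn`, and `revise` stand in for an LLM sampling call, a
# reward model, and a critique-conditioned rewrite; these names are assumptions.

from typing import Callable, List


def tpo_loop(
    prompt: str,
    generate: Callable[[str, int], List[str]],    # prompt, n -> candidate responses
    score_fn: Callable[[str, str], float],        # prompt, response -> reward score
    revise: Callable[[str, str, str], List[str]], # prompt, chosen, critique -> new candidates
    n_candidates: int = 4,
    n_iters: int = 3,
) -> str:
    """Iteratively refine responses at inference time using textual feedback."""
    candidates = generate(prompt, n_candidates)
    best = max(candidates, key=lambda r: score_fn(prompt, r))
    for _ in range(n_iters):
        scored = sorted(candidates, key=lambda r: score_fn(prompt, r))
        worst, chosen = scored[0], scored[-1]
        # Convert the numerical preference into a textual critique the model can act on.
        critique = (
            f"The preferred answer was:\n{chosen}\n"
            f"The rejected answer was:\n{worst}\n"
            "Explain what the preferred answer does better and rewrite it accordingly."
        )
        candidates = revise(prompt, chosen, critique)
        best = max(candidates + [best], key=lambda r: score_fn(prompt, r))
    return best
```

Note that the loop never updates model weights; everything happens through prompting and scoring at inference time, which is what makes the approach lightweight.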

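Below is a rough sketch, under stated assumptions, of the router-free selection idea summarized for Autonomy-of-Experts: every expert computes a cheap low-rank activation for the token, the activations are ranked by norm, and only the top-k experts finish their forward pass. The weight shapes, the ReLU FFN, and the norm-based output weighting here are illustrative choices, not the authors' exact architecture.

```python
# Illustrative sketch (not the authors' code) of norm-based expert self-selection.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_low, d_ffn, n_experts, top_k = 64, 16, 256, 8, 2

# Per-expert weights: a cheap low-rank "probe" W_down, plus the rest of the FFN.
W_down = rng.standard_normal((n_experts, d_model, d_low)) / np.sqrt(d_model)
W_up = rng.standard_normal((n_experts, d_low, d_ffn)) / np.sqrt(d_low)
W_out = rng.standard_normal((n_experts, d_ffn, d_model)) / np.sqrt(d_ffn)


def aoe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) token representation -> (d_model,) MoE layer output."""
    # 1. Every expert pre-computes its cheap internal activation.
    low = np.einsum("d,edk->ek", x, W_down)      # (n_experts, d_low)
    norms = np.linalg.norm(low, axis=-1)         # activation norm per expert
    # 2. Experts with the largest activation norms "select themselves".
    chosen = np.argsort(norms)[-top_k:]
    # 3. Only the chosen experts complete the full FFN computation
    #    (norm-proportional mixing is an assumption of this sketch).
    out = np.zeros_like(x)
    for e in chosen:
        h = np.maximum(low[e] @ W_up[e], 0.0)    # ReLU FFN hidden state
        out += (norms[e] / norms[chosen].sum()) * (h @ W_out[e])
    return out


print(aoe_forward(rng.standard_normal(d_model)).shape)  # (64,)
```

The point of the sketch is that no separate router parameters exist: the same activations that feed the expert's own computation also decide which experts run.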

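The knockout-tournament selection described for Pairwise RM can be pictured with the short sketch below; `prefer` stands in for the trained pairwise reward model, and the bracket and bye handling are assumptions of this sketch rather than the paper's exact procedure.

```python
# Minimal sketch of Best-of-N selection via a knockout tournament driven by a
# pairwise comparator. `prefer` is a stand-in for a pairwise reward model.
import random
from typing import Callable, List


def knockout_bon(
    question: str,
    candidates: List[str],
    prefer: Callable[[str, str, str], bool],  # True if the first candidate wins
) -> str:
    """Iteratively eliminate candidates with pairwise comparisons until one survives."""
    pool = list(candidates)
    random.shuffle(pool)                      # random initial bracket
    while len(pool) > 1:
        next_round = []
        # Compare candidates two at a time; the loser of each match is dropped.
        for a, b in zip(pool[0::2], pool[1::2]):
            next_round.append(a if prefer(question, a, b) else b)
        if len(pool) % 2 == 1:                # odd candidate out gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy usage with a length-based comparator standing in for the reward model.
winner = knockout_bon("2+2?", ["4", "The answer is 4.", "5"],
                      lambda q, a, b: len(a) >= len(b))
print(winner)
```
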
## Papers for 2025-01-22

| Title | Authors | Summary |
