NLP Paper Digest, October 7, 2025 (English)


Topic 1: Large Language Model Optimization and Control

Topic Overview

The topic of Large Language Model (LLM) Optimization and Control is crucial for advancing the adaptability, efficiency, and interpretability of LLMs across various applications, particularly in multilingual and specialized contexts. As LLMs become increasingly prevalent, there is a growing need to understand and manipulate their underlying mechanisms to ensure they perform effectively in diverse settings without requiring extensive retraining or additional data. This research area encompasses methodologies for fine-tuning, controlling language-specific behaviors, and integrating human-like judgment processes to improve model performance and generalizability.

Individual Paper Contributions

The papers in this topic demonstrate a shift towards more targeted and efficient methods for optimizing and controlling LLMs. Innovations range from training-free language-specific dimension manipulation to token-level data importance assessment, in-context learning for modeling annotator disagreement, and causal representation learning for hate speech detection. There is also a trend towards integrating local linguistic and cultural contexts into model development, as seen in the Sunflower project, which emphasizes the importance of localized data and expertise in improving model performance for underrepresented languages.
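
To make the first of these trends concrete, the sketch below shows one hypothetical form of training-free language-specific dimension steering: rank hidden dimensions by how strongly their mean activation differs between two languages, then nudge only those dimensions toward the target language's statistics at inference time. The ranking criterion, the steering rule, and all function names are illustrative assumptions, not the procedure of the cited paper.

```python
import numpy as np

def find_language_dimensions(acts_lang_a, acts_lang_b, top_k=32):
    """Rank hidden dimensions by how much their mean activation differs
    between two languages.

    acts_lang_a / acts_lang_b: arrays of shape (num_tokens, hidden_size),
    collected from the same layer of a frozen LLM on text in each language.
    Returns the indices of the top_k most language-specific dimensions.
    """
    diff = np.abs(acts_lang_a.mean(axis=0) - acts_lang_b.mean(axis=0))
    return np.argsort(diff)[-top_k:]

def steer_hidden_state(hidden, dims, target_mean, strength=1.0):
    """Nudge only the selected dimensions toward the target language's
    mean activation, leaving the rest of the hidden state untouched."""
    steered = hidden.copy()
    steered[..., dims] += strength * (target_mean[dims] - hidden[..., dims])
    return steered
```

Because only a handful of dimensions are touched and no parameters are updated, an intervention of this shape can be applied purely at inference time, which is the training-free property emphasized above.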

Datasets and Evaluation

The evaluations across these papers highlight the importance of choosing metrics and datasets that reflect the complexity and nuance of the problems being addressed, so that the proposed methods are rigorously tested and validated.


Topic 2: Multimodal Reasoning and Integration

Topic Overview

Multimodal reasoning and integration is a critical area in artificial intelligence that deals with the ability of models to understand and process information from multiple sources or modalities simultaneously. This includes integrating textual, visual, and auditory inputs to achieve coherent and contextually accurate outputs. The importance of this topic stems from the fact that real-world applications often require processing data that comes in various forms, such as in educational assessments, healthcare diagnostics, and autonomous agent interactions. Enhancing the capabilities of models to handle multimodal data can significantly improve their effectiveness and reliability in complex, real-world scenarios.

Individual Paper Contributions

The papers collectively highlight a shift towards integrating complex reasoning and multimodal processing capabilities within large language models (LLMs). Innovations include:

Datasets and Evaluation Metrics

Evaluation metrics include:


Topic 3: Reasoning and Decision Making in LLMs

Topic Overview

Reasoning and decision-making in Large Language Models (LLMs) is a critical area of research that aims to enhance the models’ ability to generate coherent, comprehensive, and logically sound responses. This topic is essential for improving the reliability and trustworthiness of LLMs, particularly in safety-critical domains and complex reasoning tasks. The focus spans from evaluating the comprehensiveness of factual recall in generated texts to developing frameworks that improve reasoning capabilities through innovative methods like knowledge editing, latent reasoning, and modular architectures.

Individual Paper Contributions

The technical approaches in these papers reflect a shift towards more nuanced and targeted methods for improving reasoning and decision-making in LLMs. Innovations include:

Datasets and Evaluation

These datasets and evaluation metrics underscore the diverse nature of the problems addressed, ranging from mathematical reasoning to conversational banking tasks, and highlight the importance of context-specific evaluation in assessing the effectiveness of LLM reasoning enhancements.


Topic 4: Adaptive and Dynamic Learning Techniques

Topic Overview

Adaptive and Dynamic Learning Techniques in the realm of large language models (LLMs) aim to enhance the models’ ability to learn and improve during inference rather than relying solely on extensive offline training. These techniques are vital for increasing the efficiency and performance of LLMs in various agentic tasks such as tool usage, multi-turn conversations, and complex reasoning tasks. They address the issues of redundancy, high computational costs, and inefficiency in traditional fine-tuning paradigms, making LLMs more adaptable and capable in real-world applications.

Individual Paper Contributions

The papers in this collection exhibit several technical trends:

Datasets and Evaluation

These datasets and benchmarks cover a wide range of tasks including tool usage, long-chain reasoning, social dynamics understanding, and audio comprehension, providing a comprehensive evaluation of the proposed methods.


Topic 5: Machine Translation and Cross-Lingual Applications

Topic Overview

Machine Translation and Cross-Lingual Applications is a field focused on developing systems that can effectively translate text between different languages while maintaining semantic and cultural fidelity. The importance of this research is underscored by the need for global communication and information accessibility across diverse linguistic and cultural contexts. Addressing performance gaps and biases in these systems, especially for low-resource languages, is critical for ensuring fairness and robustness in AI technologies, thereby promoting inclusivity and reducing digital divides.

Individual Paper Contributions

The papers collectively highlight the growing trend towards leveraging large language models (LLMs) for enhancing cross-lingual applications and addressing issues related to performance gaps and biases. Techniques such as in-context learning, hybrid bias detection pipelines, and human-in-the-loop curation are emerging as key methodologies to improve the robustness and fairness of machine translation and related tasks. There is also a noticeable shift towards developing frameworks and datasets that cater to low-resource and morphologically complex languages, underscoring the importance of linguistic and cultural sensitivity in AI systems.
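
As a concrete illustration of the in-context learning methodology mentioned above, the sketch below assembles a few-shot translation prompt from demonstration pairs, for example ones retrieved by similarity to the input sentence. The prompt format, default language pair, and function name are hypothetical and not drawn from any specific paper in this group.

```python
def build_icl_translation_prompt(src_sentence, example_pairs,
                                 src_lang="Luxembourgish", tgt_lang="English"):
    """Assemble a few-shot translation prompt from demonstration pairs.

    example_pairs: list of (source, target) sentence pairs, e.g. selected by
    similarity to src_sentence from a curated parallel corpus.
    """
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in example_pairs:
        lines.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {src_sentence}\n{tgt_lang}:")
    return "\n\n".join(lines)
```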

Datasets and Evaluation

The datasets and evaluation setups used across these papers reflect the methodologies each one introduces, contributing to the broader goal of improving machine translation and cross-lingual applications, particularly for low-resource languages.


Topic 6: Reinforcement Learning and Adaptive Systems

Topic Overview

Reinforcement Learning and Adaptive Systems represent a dynamic area of research focused on enhancing the capabilities of artificial intelligence models to adapt to changing environments and improve their performance over time. This topic is particularly pertinent in the context of large language models (LLMs), where the goal is to develop methods that allow weaker models to effectively train stronger ones, optimize resource usage, enable real-time reasoning, and perform sophisticated tasks such as hierarchical text classification and agentic tasks. The advancements in this field are critical for moving towards Artificial General Intelligence (AGI) and ensuring that AI systems can operate reliably and efficiently in diverse applications.

Individual Paper Contributions

The papers in this collection collectively demonstrate a trend towards leveraging reinforcement learning (RL) and adaptive strategies to enhance the functionality and performance of large language models (LLMs). Innovations include the use of contrastive decoding and implicit rewards to improve sample quality in weak-to-strong generalization, hierarchical reward functions to optimize search behavior in agentic RAG systems, and multi-agent frameworks to manage stateful reasoning during inference. Additionally, there is a focus on developing new benchmarks and datasets that reflect real-world complexities and requirements, emphasizing the importance of practical and diverse evaluation methods.
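
For the contrastive-decoding idea, one common formulation amplifies the strong model's distribution in the direction that separates it from the weak supervisor. The sketch below shows this generic rule, which may differ from the exact objective used in the cited weak-to-strong paper.

```python
import torch

def contrastive_logits(logits_strong, logits_weak, beta=1.0):
    """Amplify the strong model's distribution in the direction that
    separates it from the weak model. Shapes: (batch, vocab)."""
    return logits_strong + beta * (logits_strong - logits_weak)

def sample_next_token(logits_strong, logits_weak, beta=1.0, temperature=1.0):
    """Sample the next token from the contrastively adjusted distribution."""
    logits = contrastive_logits(logits_strong, logits_weak, beta) / temperature
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```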

Datasets and Evaluation


Topic 7: Knowledge Representation and Retrieval

Topic Overview

Knowledge Representation and Retrieval is a critical area in artificial intelligence, particularly in natural language processing (NLP), which focuses on how machines can understand, interpret, and utilize human knowledge effectively. This includes developing methods for distilling knowledge from large language models (LLMs) to smaller ones, enhancing multilingual knowledge graph completions, optimizing pretraining strategies for lightweight models, and synthesizing complex linguistic phenomena like sarcasm in speech. The advancements in this field are essential for creating more efficient, versatile, and fair AI systems that can operate effectively in resource-constrained environments and across multiple languages and contexts.

Individual Paper Contributions

The papers in this collection showcase a trend towards more sophisticated and efficient methods for knowledge representation and retrieval, particularly in the context of language models. Innovations include adaptive mechanisms for knowledge distillation, architectural improvements for multilingual knowledge sharing, sub-network extraction and distillation for reducing resource requirements, and deep analysis techniques for uncovering biases and knowledge in LLMs. Additionally, there is a growing interest in applying these models to specialized tasks such as sarcastic speech synthesis, emphasizing the need for models to capture subtle linguistic nuances and contextual information.
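
Most of these distillation variants build on the standard temperature-scaled KL objective between teacher and student distributions. A minimal reference version is shown below; the adaptive switching and sub-network selection logic of the individual papers is deliberately omitted.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student
    next-token distributions. Shapes: (batch, vocab)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # The t*t factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```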

Datasets and Evaluation

The papers utilized a variety of datasets to evaluate their methodologies, including dialogue summarization (SUMM), arithmetic reasoning (GSM, GSM_Plus), the Nemotron-CC dataset, the News Headlines Sarcasm dataset, the HiFi-TTS corpus, and MUStARD++. Evaluation metrics varied according to the task, with common metrics including accuracy, Hits@1, Hits@3, Hits@10, Mean Reciprocal Rank (MRR), and subjective evaluations of speech naturalness and expressivity. These datasets and metrics help validate the effectiveness and efficiency of the proposed methods in different contexts and tasks, highlighting the importance of both quantitative and qualitative assessments in evaluating knowledge representation and retrieval techniques.
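
Hits@k and MRR follow their standard ranking definitions; a compact reference implementation, assuming one gold answer per query, is given below.

```python
def ranking_metrics(ranked_candidates, gold, ks=(1, 3, 10)):
    """Compute Hits@k and MRR over a set of queries.

    ranked_candidates: one ranked list of candidate answers per query (best first).
    gold: the single gold answer for each query.
    """
    hits = {k: 0 for k in ks}
    reciprocal_ranks = []
    for ranking, answer in zip(ranked_candidates, gold):
        rank = ranking.index(answer) + 1 if answer in ranking else None
        for k in ks:
            hits[k] += int(rank is not None and rank <= k)
        reciprocal_ranks.append(1.0 / rank if rank is not None else 0.0)
    n = len(gold)
    metrics = {f"Hits@{k}": hits[k] / n for k in ks}
    metrics["MRR"] = sum(reciprocal_ranks) / n
    return metrics
```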


Topic 8: Evaluation and Assessment Methods

Topic Overview

Evaluation and assessment methods in artificial intelligence (AI) are critical for ensuring that AI systems, especially large language models (LLMs), are not only accurate but also safe and aligned with ethical standards. These methods are essential for addressing challenges such as generating truthful yet safe responses, automating complex evaluations, and deploying AI effectively in low-resource settings. Accurate evaluation metrics and robust assessment strategies are necessary for the responsible advancement of AI technologies, particularly in areas like natural language processing (NLP), where the nuances of human language and cultural contexts pose additional complexities.

Individual Paper Contributions

The papers in this collection highlight evolving trends in AI evaluation and assessment. Key approaches include:

These methodologies demonstrate a shift towards more sophisticated and context-specific evaluation techniques that aim to balance accuracy, safety, and scalability.

Datasets and Evaluation Metrics

These datasets and metrics collectively contribute to a more nuanced and reliable evaluation framework for various AI applications, emphasizing the importance of context-specific and human-aligned assessments.


Topic 9: Cognitive and Social Simulations

Topic Overview

Cognitive and social simulations aim to replicate human cognition and social interaction within artificial systems, such as autonomous agents and multi-agent systems (MAS), to improve their decision-making and collaboration capabilities. This field is critical for developing AI that can operate in complex, dynamic environments and engage in sophisticated forms of teamwork, akin to human groups. Research in this area addresses challenges in optimizing the efficiency of cognitive processes, understanding the dynamics of team structures, and enhancing the social awareness of AI systems. These advancements have broad implications for fields such as autonomous robotics, virtual assistants, and complex problem-solving scenarios.

Individual Paper Contributions

The papers under the topic of Cognitive and Social Simulations reflect a trend towards refining and optimizing cognitive processes in AI, particularly in scenarios involving complex reasoning and decision-making. Zhang et al. focus on quantifying and mitigating overthinking in LLMs through structured analysis, Xu et al. advocate for a shift in planning strategies from actions to schemas to manage cognitive load in autonomous agents, and Muralidharan et al. apply lessons from human team dynamics to enhance the collaboration and social awareness of multi-agent systems. These studies collectively emphasize the importance of reducing unnecessary computational costs, managing cognitive resources effectively, and fostering efficient and meaningful interaction among AI entities.
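
As a toy illustration of how overthinking can be quantified, the sketch below measures the fraction of reasoning steps produced after the chain of thought already determines a correct answer. This is only an illustrative proxy, not the structured analysis used by Zhang et al.

```python
def overthinking_ratio(reasoning_steps, answer_is_determined):
    """Fraction of reasoning steps produced after the chain of thought
    already determines a correct answer.

    reasoning_steps: list of step strings parsed from a chain of thought.
    answer_is_determined: callable that takes the steps seen so far and
    returns True once they already yield the correct final answer.
    """
    for i in range(1, len(reasoning_steps) + 1):
        if answer_is_determined(reasoning_steps[:i]):
            return (len(reasoning_steps) - i) / len(reasoning_steps)
    return 0.0  # the model never reached a correct answer, so no overthinking is counted
```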

Datasets and Evaluation

These datasets and evaluation methods underscore the varied approaches taken to study cognitive and social behaviors in AI, reflecting a commitment to rigorous testing and a holistic understanding of AI’s operational efficiency and collaborative potential.


Topic 10: Specialized Applications and Domains

Topic Overview

The topic of specialized applications and domains in large language models (LLMs) focuses on enhancing LLMs’ performance in specific, often complex, fields such as healthcare, chemistry, and multilingual environments. These specialized applications aim to address the limitations of general-purpose LLMs when dealing with domain-specific knowledge and tasks, ensuring that AI systems can provide accurate and efficient support tailored to particular professions or industries. The research is important for developing more reliable and effective AI solutions that can integrate seamlessly into professional workflows, thereby contributing to advancements in fields such as drug discovery, medical diagnosis, and multilingual communication.

Individual Paper Contributions

The papers in this topic adopt several technical trends, including the use of specialized datasets and benchmarks to train and evaluate LLMs for domain-specific tasks, the application of multi-agent frameworks for iterative code refinement, and the development of new evaluation metrics to better capture domain-specific nuances. There is also a trend towards employing lightweight adaptation techniques, such as prompt engineering and small-scale fine-tuning, to improve LLM performance without requiring extensive computational resources. Furthermore, the papers highlight the importance of incorporating hierarchical verification in tasks like clinical coding, and the necessity of robust benchmarks in fields like chemical reasoning and nursing.

Datasets and Evaluation

Evaluation metrics vary widely, with each paper tailoring its metrics to the specific needs of its domain. For instance, ToolLibGen uses retrieval accuracy, oMeBench uses validity of intermediates and logical coherence, NurseLLM uses accuracy in MCQ tasks, and “Does Local News Stay Local?” relies on log-odds ratios and topic modeling. In “Beyond Monolingual Assumptions,” metrics like exact match and BLEU scores are employed, while “Toward Reliable Clinical Coding” utilizes new metrics such as prefix-n match and prefix overlap ratio to better assess hierarchical misalignments in clinical coding.
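
The prefix-based clinical coding metrics admit a simple reading: compare predicted and gold codes on their leading characters, which correspond to higher levels of the code hierarchy. The sketch below implements one plausible interpretation; the paper's exact definitions may differ.

```python
def prefix_n_match(pred_code, gold_code, n=3):
    """True when the predicted and gold codes agree on their first n characters,
    i.e. on the upper levels of the code hierarchy."""
    return pred_code[:n] == gold_code[:n]

def prefix_overlap_ratio(pred_code, gold_code):
    """Length of the shared leading prefix divided by the gold code's length,
    giving partial credit to predictions that diverge only at deeper levels."""
    shared = 0
    for p, g in zip(pred_code, gold_code):
        if p != g:
            break
        shared += 1
    return shared / len(gold_code) if gold_code else 0.0
```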


Topic 11: misc

Topic Overview

The research topic encompasses a broad array of challenges and innovations in the realm of large language models (LLMs) and their applications. From enhancing reasoning capabilities and managing privacy risks to optimizing multimodal interactions and refining psychometric evaluations, these studies collectively address the need for more robust, efficient, and ethically sound AI systems. Each paper delves into a specific facet of LLM functionality, aiming to bridge the gap between theoretical advancements and practical usability, ultimately pushing the boundaries of what LLMs can achieve in diverse and complex real-world scenarios.

Individual Paper Contributions

The papers collectively highlight several emerging trends in LLM research:

  1. Resource-Efficient Techniques: Multiple studies focus on developing methods that reduce computational costs and improve efficiency, such as LightReasoner, which leverages smaller models for supervision (see the sketch following this list), and RCPU, which compensates for pruning errors to preserve performance with fewer resources.
  2. Multi-Agent Systems: Papers like CompassLLM and MAPRO emphasize the potential of multi-agent systems for enhancing LLM performance in specialized tasks, such as geo-spatial reasoning and prompt optimization.
  3. Bias and Ethical Considerations: Studies like WinoQueer-NLI and PATCH underscore the importance of addressing bias and privacy concerns in LLMs, introducing new metrics and methodologies to mitigate these issues.
  4. Adaptation and Generalization: Works such as TTM and CARPAS explore methods to adapt LLMs to diverse tasks and environments, using techniques like test-time optimization and content-aware refinement to improve generalization and performance.
  5. Comprehensive Evaluation Frameworks: Several papers, including HaystackCraft and Quantifying Data Contamination in Psychometric Evaluations, propose new benchmarks and metrics to provide a more thorough and accurate evaluation of LLMs across different domains and tasks.
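
The sketch below illustrates the smaller-model-as-supervisor idea referenced in the first item: rank token positions by the divergence between a small model's and a large model's next-token distributions, and keep the most divergent positions as candidate supervision targets. This is a hypothetical illustration, not LightReasoner's actual algorithm.

```python
import torch
import torch.nn.functional as F

def select_supervision_tokens(large_logits, small_logits, top_fraction=0.2):
    """Rank token positions by the divergence between a small model's and a
    large model's next-token distributions and keep the most divergent ones
    as candidate supervision targets.

    Shapes: (seq_len, vocab). Returns the indices of the selected positions.
    """
    large_logp = F.log_softmax(large_logits, dim=-1)
    small_probs = F.softmax(small_logits, dim=-1)
    # Per-position KL(small || large): large values mark where the models disagree most.
    kl = (small_probs * (small_probs.clamp_min(1e-9).log() - large_logp)).sum(dim=-1)
    k = max(1, int(top_fraction * kl.numel()))
    return torch.topk(kl, k).indices
```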

Datasets and Evaluation Metrics

These studies collectively advance the field by addressing key challenges in LLM performance, evaluation, and ethical considerations, paving the way for more sophisticated and reliable AI systems in the future.


References


  1. Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models
  2. TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning
  3. Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning
  4. Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge
  5. Causality Guided Representation Learning for Cross-Style Hate Speech Detection
  6. Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models
  7. Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation
  8. LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
  9. Drift No More? Context Equilibria in Multi-Turn LLM Interactions
  10. Multimodal Safety Evaluation in Generative Agent Social Simulations
  11. Comparing human and language models sentence processing difficulties on complex structures
  12. TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription
  13. Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects
  14. Can Speech LLMs Think while Listening?
  15. Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
  16. Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
  17. More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
  18. Banking Done Right: Redefining Retail Banking with Language-Centric AI
  19. SUBQRAG: sub-question driven dynamic graph rag
  20. Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts
  21. Self-Improving LLM Agents at Test-Time
  22. Accelerating Diffusion LLM Inference via Local Determinism Propagation
  23. AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs
  24. OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
  25. OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs
  26. AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
  27. Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning
  28. ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs
  29. Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
  30. Multilingual Generative Retrieval via Cross-lingual Semantic Compression
  31. LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish
  32. LASER: An LLM-based ASR Scoring and Evaluation Rubric
  33. Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data
  34. Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
  35. Contrastive Weak-to-strong Generalization
  36. HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
  37. MemWeaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation
  38. Large Language Models Meet Virtual Cell: A Survey
  39. LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning
  40. Reasoning for Hierarchical Text Classification: The Case of Patents
  41. VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
  42. AdaSwitch: Adaptive Switching Generation for Knowledge Distillation
  43. Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing
  44. Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation
  45. Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge
  46. Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER
  47. Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis
  48. The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
  49. OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
  50. Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages
  51. Populism Meets AI: Advancing Populism Research with LLMs
  52. How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu
  53. Do LLMs Really Need 10+ Thoughts for “Find the Time 1000 Days Later”? Towards Structural Understanding of LLM Overthinking
  54. The Cognitive Bandwidth Bottleneck: Shifting Long-Horizon Agent from Planning with Actions to Planning with Schemas
  55. Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics
  56. ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning
  57. oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning
  58. NurseLLM: The First Specialized Language Model for Nursing
  59. Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations
  60. Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models
  61. Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation
  62. LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
  63. Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
  64. Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
  65. Stress-Testing Model Specs Reveals Character Differences among Language Models
  66. Textual Entailment and Token Probability as Bias Evaluation Metrics
  67. TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
  68. Who Stole Your Data? A Method for Detecting Unauthorized RAG Theft
  69. LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics
  70. CompassLLM: A Multi-Agent Approach toward Geo-Spatial Reasoning for Popular Path Query
  71. PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
  72. Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping
  73. Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible
  74. All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations
  75. Machines in the Crowd? Measuring the Footprint of Machine-Generated Text on Reddit
  76. MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference
  77. Meaningful Pose-Based Sign Language Evaluation
  78. Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
  79. CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
  80. RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model
  81. Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
  82. ConCuR: Conciseness Makes State-of-the-Art Kernel Generation
  83. CARPAS: Towards Content-Aware Refinement of Provided Aspects for Summarization in Large Language Models
  84. Quantifying Data Contamination in Psychometric Evaluations of LLMs
  85. Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models