NLP Paper Digest: October 9, 2025 (English)


Topic 1: Large Language Model Optimization and Fine-Tuning

Topic Overview

Large Language Model (LLM) optimization and fine-tuning have become central themes in the advancement of AI applications across various domains. These models, characterized by their massive scale and versatile capabilities, offer unprecedented opportunities for improving task-specific performance. However, challenges remain in understanding how fine-tuning impacts these models, especially in specialized fields like healthcare, and in enhancing their efficiency, reasoning, and adaptability to different tasks and data formats. Addressing these issues can lead to more efficient, reliable, and trustworthy AI systems, which are essential for applications ranging from healthcare to e-commerce and beyond.

Individual Paper Contributions

The papers collectively highlight several key trends in LLM optimization and fine-tuning:

  1. Domain-Specific Adaptation: Techniques such as 'tuning vectors' and parameter-efficient fine-tuning (PEFT) with LoRA are being developed to better adapt LLMs to specialized domains like healthcare and e-commerce (see the LoRA sketch after this list).
  2. Efficient Learning Methods: Layer-selective tuning and masked fine-tuning paradigms aim to reduce the need for extensive data and resources, making LLMs more efficient learners.
  3. Inference-Time Augmentation: Methods like P-TTS demonstrate the potential of using test-time data augmentation to improve reasoning performance with minimal additional data.
  4. Hybrid Approaches: Combining statistical algorithms with LLM-based methods, as seen in IRIS, opens new avenues for causal discovery and handling unstructured data.
  5. Adaptive Resolution Selection: For Visual Large Language Models (VLLMs), adaptive resolution selection strategies are emerging to optimize performance based on the specific requirements of vision-language tasks.
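
To make the parameter-efficient trend concrete, the sketch below shows LoRA-style low-rank adaptation in PyTorch: the pretrained weight is frozen and only two small matrices are trained. This is a minimal illustration of the general technique under assumed hyperparameters (rank 8, alpha 16), not the exact configuration of any paper summarized here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    h = W x + (alpha / r) * B A x, with only A and B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Wrap e.g. the attention projections of a transformer, then fine-tune
# only the A/B matrices on domain data (clinical notes, product listings, ...).
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 12,288 instead of ~590k
```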

Datasets and Evaluation

Across these papers, evaluation relies on metrics such as accuracy, macro F1, Normalized Hamming Distance (NHD), RMSLE, MALE, SAR, and DAR, along with task-specific benchmarks like spBLEU and xCOMET for translation. Each metric quantifies the performance improvements delivered by the proposed methodologies in their respective application areas; two of the simpler ones are implemented below.
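
For reference, here are minimal NumPy implementations of two of these metrics, RMSLE and Normalized Hamming Distance, following their standard definitions rather than any paper-specific variant.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: penalizes relative (ratio)
    errors, which suits targets spanning orders of magnitude (e.g., prices)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def normalized_hamming_distance(a, b):
    """Fraction of positions where two equal-length label sequences disagree."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))

print(rmsle([100, 20], [110, 18]))                        # ~0.097: small relative errors
print(normalized_hamming_distance([1, 0, 1], [1, 1, 1]))  # 1 of 3 labels differ -> 0.333...
```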


Topic 2: Multimodal and Multilingual Reasoning

Topic Overview

Multimodal and multilingual reasoning is an emerging area of research in natural language processing (NLP) that focuses on enhancing the capabilities of large language models (LLMs) to process and understand information presented in multiple forms (e.g., text, images, videos) and across various languages. This topic is crucial for developing AI systems capable of interacting effectively in global, multicultural environments where users may input data in a variety of formats and languages. Improvements in this domain can lead to more robust, versatile, and inclusive AI technologies, which are increasingly necessary as digital communication becomes more complex and widespread.

Individual Paper Contributions

The papers collectively highlight several evolving trends in multimodal and multilingual reasoning, from processing mixed text, image, and video inputs to operating reliably across diverse languages.

Datasets and Evaluation


Topic 3: Reinforcement Learning and Policy Optimization for LLMs

Topic Overview

Reinforcement Learning (RL) and Policy Optimization for Large Language Models (LLMs) represent a critical area of research aimed at enhancing the adaptability, decision-making, and overall performance of AI agents in complex, long-horizon tasks. These tasks often require the integration of simulation and reasoning capabilities, as well as the ability to handle sparse rewards effectively. By improving RL techniques and policy optimization methods, researchers aim to create more robust, versatile, and human-aligned AI systems capable of navigating diverse and challenging environments.
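
As background for the methods below, this PyTorch sketch shows the generic REINFORCE-style policy-gradient loss: each sampled token's log-probability is scaled by a baseline-subtracted, sequence-level reward. The surveyed papers refine this basic recipe (e.g., with group-level rewards or token-level aggregation); the sketch is only the common starting point, not any one paper's algorithm.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, actions, rewards):
    """REINFORCE-style loss: scale each sampled token's negative
    log-probability by a baseline-subtracted, sequence-level reward.
    logits: (batch, seq, vocab); actions: (batch, seq); rewards: (batch,)"""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    advantage = rewards - rewards.mean()  # crude baseline to reduce variance
    return -(advantage.unsqueeze(-1) * token_logp).mean()

# Toy example: two sampled responses, five tokens each, vocabulary of 100.
logits = torch.randn(2, 5, 100, requires_grad=True)
actions = torch.randint(0, 100, (2, 5))
rewards = torch.tensor([1.0, 0.0])  # sparse, sequence-level reward
policy_gradient_loss(logits, actions, rewards).backward()
```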

Individual Paper Contributions

The papers in this collection showcase a variety of technical trends aimed at advancing RL and policy optimization for LLMs.

Datasets and Evaluation

Evaluation metrics vary according to the specific tasks and datasets, including success rates, AUC (Area Under the Curve), accuracy, F1 scores, BLEU-1 scores, and decoding latency. These metrics collectively measure the performance, stability, and efficiency of the proposed methods in enhancing LLMs’ reasoning and decision-making capabilities.
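
Of these, BLEU-1 reduces to clipped unigram precision with a brevity penalty; a minimal implementation of that standard definition follows, not tied to any particular paper's scoring script.

```python
import math
from collections import Counter

def bleu1(candidate: list[str], reference: list[str]) -> float:
    """BLEU-1: clipped unigram precision times a brevity penalty."""
    cand, ref = Counter(candidate), Counter(reference)
    clipped = sum(min(count, ref[word]) for word, count in cand.items())
    precision = clipped / max(len(candidate), 1)
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / max(len(candidate), 1))
    return brevity_penalty * precision

print(bleu1("the cat sat".split(), "the cat sat down".split()))  # ~0.72
```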


Topic 4: Knowledge Representation and Reasoning

Topic Overview

Knowledge Representation and Reasoning (KRR) is a critical area in artificial intelligence that deals with the representation of knowledge in a way that supports automated reasoning. Advances in KRR can significantly enhance the performance of machine learning models in various applications, including social network analysis, anomaly detection, recommendation systems, commonsense reasoning, nutrition question answering, and multilingual video corpus retrieval. By improving how knowledge is structured and processed, researchers aim to develop more effective, reliable, and inclusive AI systems capable of handling complex tasks with limited labeled data or across diverse linguistic contexts.

Individual Paper Contributions

The papers collectively highlight several key trends in KRR across applications such as commonsense reasoning, nutrition question answering, and multilingual video corpus retrieval.

Datasets and Evaluation

The datasets and evaluation metrics employed reflect the diversity of KRR applications and the importance of contextually appropriate benchmarks for assessing model performance and reasoning capabilities.


Topic 5: Speech and Audio Processing with LLMs

Topic Overview

The research topic of “Speech and Audio Processing with LLMs” is critical for advancing the capabilities of large language models (LLMs) in understanding and generating speech in real-time. Enhancing speech recognition, dialogue state tracking, and real-time reasoning in spoken language models can significantly improve human-computer interaction experiences. This topic is not only about technical improvements but also about making speech technologies more accessible and robust across different accents and dialects, which is essential for global applications. Furthermore, the development of unsupervised lexicon learning from speech contributes to the advancement of zero-resource speech technologies, enabling the creation of speech recognition systems for languages with limited textual resources.

Individual Paper Contributions

The papers in this collection reflect a trend towards more sophisticated and efficient architectures in speech and audio processing. Innovations include dual-brain frameworks for concurrent thinking and speaking, active model selection strategies to optimize resource usage, and advanced spectrogram masking techniques to handle accent variations. Additionally, there is a growing emphasis on context management strategies in spoken dialog systems and the exploration of self-supervised learning (SSL) for unsupervised lexicon learning. These advancements aim to address challenges such as high latency, error propagation, and robustness to linguistic diversity.
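
As an illustration of the masking idea, the sketch below applies generic SpecAugment-style frequency and time masking to a spectrogram in NumPy. The saliency-driven approach referenced above selects regions from attribution maps rather than at random; the random version here only demonstrates the underlying masking operation.

```python
import numpy as np

def mask_spectrogram(spec, num_freq_masks=2, freq_width=8,
                     num_time_masks=2, time_width=20, rng=None):
    """SpecAugment-style masking: zero out random frequency bands and time
    spans of a (freq_bins, time_steps) spectrogram. A saliency-driven
    variant would pick the masked regions from attribution maps instead."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(num_freq_masks):
        f0 = rng.integers(0, max(n_freq - freq_width, 1))
        out[f0:f0 + freq_width, :] = 0.0
    for _ in range(num_time_masks):
        t0 = rng.integers(0, max(n_time - time_width, 1))
        out[:, t0:t0 + time_width] = 0.0
    return out

masked = mask_spectrogram(np.random.rand(80, 300))  # 80 mel bins, 300 frames
```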

Datasets and Evaluation

The papers draw on a variety of datasets and evaluation metrics, each chosen to validate the specific contribution at hand.


Topic 6: Evaluation and Benchmarking of LLMs

Topic Overview

The evaluation and benchmarking of Large Language Models (LLMs) have become increasingly important as these models find applications in a wide range of domains. Accurate and comprehensive benchmarks are essential to measure the performance, reliability, and capability of LLMs in specific tasks, thereby guiding improvements and fostering trust in their use. This report summarizes five papers that contribute to the advancement of LLM evaluation across different scenarios, including academic promotion, statistical reasoning, pre-training dynamics, narrative understanding, and safety guardrails for agentic systems.

Individual Paper Contributions

The papers in this collection showcase a variety of technical approaches and methodological advancements in the evaluation and benchmarking of LLMs, covering academic promotion, statistical reasoning, pre-training dynamics, travel planning, and safety guardrails for agentic systems.

Datasets and Evaluation


Topic 7: Reasoning and Logical Generalization

Topic Overview

Reasoning and logical generalization are fundamental aspects of human intelligence and are increasingly being explored in artificial intelligence (AI) research to enhance the capabilities of large language models (LLMs). These models have shown remarkable proficiency in various reasoning tasks, but they often struggle with maintaining consistency and precision in structured output formats or when dealing with less familiar or encrypted forms of input. Addressing these challenges is crucial for advancing AI applications in areas such as complex decision-making, scientific research, and ensuring AI safety. Research in this domain aims to improve the robustness, reliability, and generalization abilities of LLMs in logical reasoning tasks.

Individual Paper Contributions

The papers in this topic reflect a growing trend towards developing hybrid and modular systems that integrate the strengths of neural and symbolic reasoning. Innovations include lightweight frameworks like DICE for guiding LLM outputs, new evaluation paradigms such as ciphered reasoning, and white-box verification techniques like CRV for understanding and correcting reasoning errors. There is a clear shift towards addressing the limitations of LLMs in structured output adherence, handling unfamiliar text formats, and verifying the correctness of reasoning processes, all of which are critical for advancing AI safety and reliability.
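
The output-guidance idea can be pictured as a generate-validate-retry loop for structured output, sketched below. This is a hypothetical illustration of the general pattern, not the DICE algorithm itself; `call_model` is a stand-in for whatever LLM API is in use.

```python
import json

def call_model(prompt: str) -> str:
    """Stand-in for any LLM API call; hypothetical, for illustration only."""
    raise NotImplementedError

def constrained_generate(task_prompt: str, max_retries: int = 3) -> dict:
    """Generate, validate against the expected structure, and feed parse
    errors back to the model for correction."""
    prompt = task_prompt + "\nRespond with a JSON object only."
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                return parsed
            error = "top-level value must be a JSON object"
        except json.JSONDecodeError as e:
            error = str(e)
        prompt = (f"{task_prompt}\nYour last output was invalid ({error}). "
                  "Output corrected JSON only.")
    raise ValueError("model failed to produce valid structured output")
```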

Datasets and Evaluation


Topic 8: Privacy and Security in LLMs

Topic Overview

Privacy and security in large language models (LLMs) are critical concerns as these models become increasingly integrated into everyday applications. Issues such as data leakage, re-identification attacks, and vulnerabilities to adversarial attacks pose significant risks to the confidentiality and integrity of personal and sensitive information. Ensuring the trustworthiness of these models is paramount, especially in contexts where the reasoning process must be transparent and reliable. This includes enhancing interpretability, faithfulness, and reliability in reasoning models, safeguarding against poisoning and contamination attacks in retrieval-augmented generation (RAG) systems, and addressing vulnerabilities in multimodal models that process both textual and visual data.

Individual Paper Contributions

The papers collectively highlight a trend towards developing comprehensive frameworks and algorithms that enhance the security and trustworthiness of large language models and vision language models. Innovations focus on improving interpretability, reliability, and faithfulness through structured reasoning, semantic filtering, and order-sensitive re-identification. There is also an emphasis on evaluating models across multilingual and multimodal contexts, reflecting the increasing complexity and diversity of real-world applications.
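
A common building block behind such semantic filtering is a similarity gate that drops retrieved passages too dissimilar from the query before they enter the RAG context. The sketch below illustrates that pattern with a toy `embed` stand-in; it is not the actual SeCon-RAG pipeline, whose two-stage filtering is more involved.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in: a hash-seeded unit vector. A real system would call a
    sentence-encoder model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def filter_passages(query: str, passages: list[str],
                    threshold: float = 0.3) -> list[str]:
    """Keep only retrieved passages whose cosine similarity to the query
    clears a threshold: a simple gate against off-topic or injected
    content entering the RAG context."""
    q = embed(query)
    return [p for p in passages if float(q @ embed(p)) >= threshold]
```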

Datasets and Evaluation

The datasets and metrics employed provide a robust foundation for evaluating the effectiveness and security of the various models and frameworks in their respective domains, ensuring that advances in privacy and security are rigorously tested and validated.


Topic 9: Human-like Reasoning and Dialogue

Topic Overview

The research topic of “Human-like Reasoning and Dialogue” focuses on enhancing the capabilities of large language models (LLMs) to interact with humans more naturally and effectively. This involves improving the models’ ability to understand and respond to ambiguous instructions, personalize interactions, detect and mitigate biases, and evaluate the quality of AI-generated content. The importance of this research lies in addressing the limitations of LLMs in mimicking human reasoning and dialogue, which is crucial for their broader adoption in areas such as customer service, education, and mental health support.

Individual Paper Contributions

The papers in this collection share several technical trends and methodological developments, including interactive disambiguation of user goals, personality modeling, bias detection, and LLM-as-judge evaluation.

Datasets and Evaluation

Together, the datasets and metrics used across these papers support a comprehensive evaluation of LLMs in various aspects of human-like reasoning and dialogue, providing a solid foundation for future research and practical applications.


Topic 10: Data and Training Strategies for LLMs

Topic Overview

The research topic of “Data and Training Strategies for LLMs” focuses on advancing the methodologies and datasets employed in training and refining large language models (LLMs). These strategies aim to enhance LLMs’ performance, reliability, and adaptability across various applications, including automated essay scoring, reasoning tasks, prompt dataset analysis, and abstractive summarization. The importance of this topic lies in addressing the inherent limitations of LLMs, such as their tendency to generate inaccurate or unreliable outputs, and in developing methods that allow for more autonomous and efficient model training and evaluation. By doing so, these studies contribute to making LLMs more effective and trustworthy in real-world scenarios.

Individual Paper Contributions

The papers in this collection showcase a variety of technical approaches to improve LLM performance. Harada et al. focus on dynamic rubric refinement for AES, leveraging LLMs’ ability to reflect on their own scoring processes. Wei et al. propose a unified framework for comparing tree search algorithms and reward designs in LLM reasoning, aiming to standardize and enhance these methodologies. Marinas et al. advance the use of Elasticsearch for full-text indexing of large-scale datasets, offering insights into the scalability and efficiency of such systems. Zhang et al. emphasize the importance of analyzing prompt datasets to understand and improve LLM interaction patterns. Lastly, Huang et al. explore fine-tuning methods to enhance summarization faithfulness, constructing novel datasets and evaluation metrics to address the issue of hallucinations.
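
For the full-text indexing thread, the snippet below shows minimal Elasticsearch usage with the official Python client (elasticsearch-py 8.x): create an index with a text mapping, ingest a document, and run a phrase query. The endpoint, index name, and mapping are illustrative assumptions, not the configuration reported by Marinas et al.

```python
from elasticsearch import Elasticsearch

# Assumes a local cluster; endpoint, index name, and mapping are illustrative.
es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="llm-training-corpus",
    mappings={"properties": {"text": {"type": "text"},
                             "source": {"type": "keyword"}}},
)

es.index(index="llm-training-corpus",
         document={"text": "Attention is all you need.", "source": "example"})
es.indices.refresh(index="llm-training-corpus")  # make the document searchable

hits = es.search(index="llm-training-corpus",
                 query={"match_phrase": {"text": "attention is all you need"}})
print(hits["hits"]["total"])
```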

Datasets and Evaluation

Evaluation metrics are tailored to each study, from scoring agreement for essay assessment to faithfulness measures for abstractive summarization.


Topic 11: misc

Topic Overview

This research topic encompasses a variety of studies focused on enhancing the capabilities of large language models (LLMs) and diffusion large language models (DLLMs) in various complex reasoning tasks, as well as addressing challenges related to their deployment, safety, and interpretability. The importance of this topic lies in its potential to advance the practical applicability and efficiency of AI systems, making them more reliable, versatile, and suitable for real-world tasks that require sophisticated reasoning and understanding. Additionally, the topic delves into specialized applications such as medical diagnostics, crisis communication, and software engineering, where the performance and reliability of LLMs are critical for improving outcomes and user experiences.

Individual Paper Contributions

The papers in this collection adopt a variety of technical approaches and methodologies to enhance the capabilities of large language models and diffusion large language models, with recurring attention to deployment efficiency, safety, interpretability, and specialized applications such as medical diagnostics, crisis communication, and software engineering.

Datasets and Evaluation

The datasets and evaluation metrics vary with each application, ranging from general reasoning benchmarks to domain-specific corpora for areas such as medical diagnostics, crisis communication, and software engineering.

These studies collectively advance the field by addressing critical limitations, proposing novel methodologies, and establishing new benchmarks and evaluation frameworks for assessing and improving the capabilities of large language models in diverse applications.


References


  1. Understanding the Effects of Domain Finetuning on LLMs

  2. LLP: LLM-based Product Pricing in E-commerce

  3. IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data

  4. LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning

  5. Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs

  6. Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation

  7. Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives

  8. Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

  9. Multimodal Policy Internalization for Conversational Agents

  10. CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

  11. Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

  12. Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

  13. Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

  14. Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

  15. DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

  16. SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

  17. Preference-Aware Memory Update for Long-Term LLM Agents

  18. Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph

  19. ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

  20. Large Language Models Do NOT Really Know What They Don’t Know

  21. Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

  22. NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering

  23. ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

  24. Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval

  25. Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

  26. Active Model Selection for Large Language Models

  27. Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking

  28. The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

  29. Unsupervised lexicon learning from speech is limited by representations rather than clustering

  30. AutoPR: Let’s Automate Your Academic Promotion!

  31. StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

  32. MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

  33. TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

  34. Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

  35. DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

  36. All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

  37. Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic

  38. Verifying Chain-of-Thought Reasoning via Its Computational Graph

  39. ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

  40. SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG

  41. A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages

  42. Stronger Re-identification Attacks through Reasoning and Aggregation

  43. Target speaker anonymization in multi-speaker recordings

  44. Text Prompt Injection of Vision Language Models

  45. Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation

  46. Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

  47. Abductive Preference Learning

  48. CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

  49. Judge’s Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement

  50. Unpacking Hateful Memes: Presupposed Context and False Claims

  51. Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection

  52. The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

  53. Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise

  54. Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey

  55. Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

  56. Large Language Model Prompt Datasets: An In-depth Analysis and Insights

  57. Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning

  58. Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models

  59. Mitigating Overthinking through Reasoning Shaping

  60. NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models

  61. Don’t Throw Away Your Pretrained Model

  62. FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

  63. Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference

  64. One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

  65. CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

  66. Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation

  67. A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

  68. Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

  69. Layout-Aware Parsing Meets Efficient LLMs: A Unified, Scalable Framework for Resume Information Extraction and Evaluation

  70. ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation

  71. Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation

  72. iBERT: Interpretable Style Embeddings via Sense Decomposition

  73. It’s 2025 – Narrative Learning is the new baseline to beat for explainable machine learning

  74. CrisiText: A dataset of warning messages for LLM training in emergency communication

  75. StreamingVLM: Real-Time Understanding for Infinite Video Streams

  76. HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection

  77. Steering Embedding Models with Geometric Rotation: Mapping Semantic Relationships Across Languages and Models

  78. LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

  79. Can We Reliably Rank Model Performance across Domains without Labeled Data?

  80. DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

  81. Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

  82. Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments

  83. HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness
  83. HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness ↩︎