NLP Paper Roundup, October 5, 2025 (English)


Topic 1: Reasoning and Problem Solving

Topic Overview

Reasoning and problem-solving are fundamental cognitive abilities that underpin a wide range of tasks, from understanding complex narratives to solving intricate mathematical problems. In the realm of artificial intelligence, especially within large language models (LLMs), the ability to reason effectively is critical for ensuring that AI systems can operate ethically, efficiently, and with a deeper understanding of context. As LLMs become increasingly integrated into daily life, addressing their limitations in reasoning and problem-solving becomes paramount. This includes mitigating biases, improving performance on long-context tasks, and enhancing the generation of high-quality outputs such as images and dialogues. The papers reviewed here tackle these challenges through innovative methodologies and frameworks, contributing to the broader goal of making AI systems more reliable and versatile.

Individual Paper Contributions

The reviewed papers collectively illustrate several emerging trends in reasoning and problem solving for AI. One notable trend is the use of multi-agent systems and frameworks, such as MADIAVE and ARM, to strengthen decision-making and reasoning processes. Another is the application of structured protocols such as Chain-of-Thought (CoT) prompting to improve the interpretability and accuracy of AI models, as seen in EvalMORAAL and PDS. Finally, there is a growing emphasis on mitigating context-length issues and improving efficiency in LLMs through strategies such as ‘retrieve-then-reason’ and reinforcement-learning-based approaches like ShortCoTI, indicating a focus on optimizing resource usage and performance on complex tasks.
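
To make the ‘retrieve-then-reason’ strategy mentioned above concrete, the sketch below shows a minimal two-stage prompt pipeline: the model is first asked to extract only the passages relevant to the question, and then reasons over those extracts alone. This is a generic illustration under assumed interfaces (the `generate` callback and the prompt wording are placeholders), not the pipeline of any specific paper listed here.

```python
# Minimal retrieve-then-reason sketch (illustrative only; interfaces are assumed).
from typing import Callable, List

def retrieve_then_reason(question: str,
                         context_chunks: List[str],
                         generate: Callable[[str], str]) -> str:
    # Stage 1: ask the model to pull out only the chunks relevant to the question,
    # shrinking the effective context before any reasoning happens.
    retrieval_prompt = (
        "Question: " + question + "\n\n"
        "Context passages:\n"
        + "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks))
        + "\n\nCopy verbatim only the passages needed to answer the question."
    )
    evidence = generate(retrieval_prompt)

    # Stage 2: reason step by step over the retrieved evidence alone.
    reasoning_prompt = (
        "Evidence:\n" + evidence + "\n\n"
        "Question: " + question + "\n"
        "Think step by step using only the evidence, then state the final answer."
    )
    return generate(reasoning_prompt)
```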

Datasets and Evaluation

These evaluations highlight the importance of diverse datasets in assessing the robustness and generalizability of reasoning methods across different contexts and applications.


Topic 2: Large Language Model Optimization and Calibration

Topic Overview

Large Language Model (LLM) optimization and calibration are essential for advancing the practicality and efficiency of these models in various applications, ranging from natural language processing to multimodal tasks. As LLMs continue to grow in size and complexity, researchers are focusing on methods to reduce their resource demands while maintaining or even enhancing their performance. This involves developing techniques for more efficient parameter utilization, adaptive compression, improved long-term planning, and reliable information extraction. These efforts aim to make LLMs more scalable, interpretable, and aligned with ethical and value-based considerations, thereby broadening their applicability and strengthening their trustworthiness.
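
As one concrete instance of the compression direction described above, the snippet below sketches plain truncated-SVD low-rank factorization of a single weight matrix. It is a baseline illustration only; activation-informed or Pareto-guided rank selection, as studied in the referenced work, adds machinery on top of this.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (out_dim x in_dim) by thin factors B (out_dim x rank) and A (rank x in_dim).

    Replacing a dense product W @ x with B @ (A @ x) stores rank * (out_dim + in_dim)
    parameters instead of out_dim * in_dim, the basic saving behind low-rank compression.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * S[:rank]   # scale the leading columns of U by the top singular values
    A = Vt[:rank, :]
    return B, A

W = np.random.randn(1024, 4096)
B, A = low_rank_factorize(W, rank=64)
rel_error = np.linalg.norm(W - B @ A) / np.linalg.norm(W)
print(f"relative reconstruction error at rank 64: {rel_error:.3f}")
```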

Individual Paper Contributions

The papers in this collection adopt various technical approaches to optimize and calibrate large language models:

Datasets and Evaluation Metrics

Evaluation metrics include:

These metrics collectively provide a comprehensive assessment of the models’ efficiency, performance, and reliability across different tasks and domains.


Topic 3: Multimodal and Cross-Modal Learning

Topic Overview

Multimodal and Cross-Modal Learning is an interdisciplinary field that integrates multiple forms of data (such as text, images, audio, video) into machine learning models to enhance their understanding and interaction with complex information. This area is vital for developing intelligent systems capable of interpreting and generating content across different modalities, which is essential for applications ranging from natural language processing (NLP) to autonomous agents in scientific research. By enabling models to learn from and reason about multimodal inputs, this research can lead to breakthroughs in areas like cross-lingual communication, automated scientific discovery, and cognitive impairment assessment, thereby democratizing access to information and improving decision-making processes in diverse fields.

Individual Paper Contributions

The papers highlight a shift towards more sophisticated multimodal and cross-modal learning approaches, leveraging large language models (LLMs) and integrating specialized encoders for handling diverse data types. Innovations include the creation of new benchmarks and datasets to evaluate the models’ performance under varied conditions, as well as the development of methodologies to enhance context awareness and model adaptability. There is a clear trend toward addressing the limitations of existing models in specific domains, such as low-resource languages, spatial audio understanding, and scientific reasoning, by proposing novel architectures and evaluation frameworks.

Datasets and Evaluation

Each paper employs evaluation metrics tailored to its specific objectives, such as ChrF++, COMET, and BLEU for machine translation, PR-AUC for meme virality prediction, and various correlation measures for CIU extraction, demonstrating the diversity and complexity of the field.
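
For readers less familiar with the translation metrics named above, the snippet below scores a toy corpus with BLEU and chrF++ using the sacrebleu library. The data are invented and the default configurations shown here need not match what the individual papers report.

```python
# Toy BLEU / chrF++ scoring with sacrebleu (pip install sacrebleu).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat", "he read a book yesterday"]
# One reference stream: element j is the reference for hypothesis j.
references = [["the cat is sitting on the mat", "he read a book yesterday"]]

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

print("BLEU:  ", round(bleu.corpus_score(hypotheses, references).score, 2))
print("chrF++:", round(chrf_pp.corpus_score(hypotheses, references).score, 2))
```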


Topic 4: Reinforcement Learning and Policy Optimization

Topic Overview

Reinforcement Learning (RL) and policy optimization are pivotal areas in the advancement of large language models (LLMs), aiming to enhance their performance and adaptability across diverse and specialized tasks. The integration of RL techniques with LLMs has led to significant breakthroughs in areas such as domain-specific summarization, mathematical reasoning, and secure interactions with external tools. However, challenges such as entropy collapse, the need for continuous learning in changing environments, and ensuring safety against adversarial attacks persist. Addressing these issues is crucial for unlocking the full potential of LLMs in practical applications, particularly in scenarios requiring real-time adjustments and interaction with external knowledge sources.
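
The entropy collapse problem mentioned above is commonly counteracted by adding an entropy bonus to the policy-gradient objective so the policy keeps exploring during training. The PyTorch sketch below shows that generic mechanism on token-level logits; it is a textbook-style illustration under assumed tensor shapes, not the specific algorithm of any paper in this group.

```python
import torch
import torch.nn.functional as F

def pg_loss_with_entropy_bonus(logits: torch.Tensor,
                               actions: torch.Tensor,
                               advantages: torch.Tensor,
                               entropy_coef: float = 0.01) -> torch.Tensor:
    """Generic policy-gradient loss with an entropy bonus.

    logits:     (batch, seq_len, vocab) raw scores from the policy model
    actions:    (batch, seq_len) sampled token ids (long)
    advantages: (batch, seq_len) advantage estimates for the sampled tokens
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Log-probability of the tokens that were actually sampled.
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # REINFORCE-style term: increase the log-probability of advantageous tokens.
    pg_term = -(advantages * action_log_probs).mean()

    # Entropy bonus: keeping the token distribution spread out discourages
    # premature collapse onto a few high-probability continuations.
    entropy = -(probs * log_probs).sum(dim=-1).mean()

    return pg_term - entropy_coef * entropy
```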

Individual Paper Contributions

The papers collectively showcase several key trends in the application of RL and policy optimization to LLMs:

Datasets and Evaluation Metrics

These contributions and evaluations collectively advance the field by providing innovative solutions to common challenges and setting a benchmark for future research in RL and policy optimization for LLMs.


Topic 5: Cultural and Linguistic Adaptation in LLMs

Topic Overview

The topic of cultural and linguistic adaptation in Large Language Models (LLMs) is critical for ensuring that AI systems are capable of interacting effectively and respectfully within diverse cultural and linguistic contexts. As LLMs become increasingly ubiquitous in applications such as translation systems, educational tools, search engines, and generative platforms, there is a growing need to develop benchmarks and methodologies that can accurately assess these models’ cultural sensitivity and linguistic proficiency. This not only enhances the quality of AI-generated content but also ensures that such systems respect regional norms, moral frameworks, idiomatic expressions, and socio-political identities.

Individual Paper Contributions

The papers collectively highlight several technical trends:

Datasets and Evaluation

The papers utilized a variety of datasets and evaluation metrics:

Evaluation metrics included:


Topic 6: Data Handling and Processing

Topic Overview

Data handling and processing is a critical area in machine learning and artificial intelligence, particularly in the context of large language models (LLMs) and multimodal models. Efficient and effective data processing techniques are necessary to manage the increasing complexity and size of these models, ensuring that they can be deployed and utilized in various scenarios, from cloud-based services to resource-constrained edge devices. This topic encompasses advancements in optimizing model training, improving inference efficiency, and enhancing the quality of data-driven systems across different domains, including natural language processing (NLP) and machine translation.
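
One widely used lever for the inference-efficiency goals described above is reduced-precision storage of weights or activations. The snippet below sketches plain symmetric per-tensor int8 quantization in NumPy; adaptive mixed-bit schemes such as the one referenced later in this digest are considerably more sophisticated, so treat this as a baseline illustration.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x is approximated by scale * q."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

activations = np.random.randn(4, 768).astype(np.float32)
q, scale = quantize_int8(activations)
recovered = dequantize_int8(q, scale)
print("max abs quantization error:", float(np.abs(activations - recovered).max()))
```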

Individual Paper Contributions

The papers in this collection explore various technical trends in data handling and processing, including:

Datasets and Evaluation Metrics

Evaluation metrics across the papers include:


Topic 7: LLM Validation and Reliability

Topic Overview

The topic of LLM validation and reliability is critical in the field of artificial intelligence, particularly concerning the safe and responsible deployment of large language models (LLMs). As LLMs become increasingly prevalent in various applications—from content moderation and document processing to conversational agents and code generation—it is imperative to ensure that these models operate within ethical and legal boundaries. This involves not only mitigating the risk of generating harmful or misleading content but also understanding the nuances of how different factors (such as context length, position, and type of harmful content) influence model behavior. Additionally, addressing vulnerabilities to specific attacks like prompt injection and jailbreaks, as well as understanding the impact of synthetic data on model performance, are key challenges in enhancing the robustness and reliability of LLMs.
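
A simple way to study the position and context-length effects described above is to splice a probe segment into benign filler text at varying relative positions and record how a guardrail or safety classifier responds. The harness below only constructs such prompts and collects scores; the probe text, filler sentences, and the scoring function are placeholders the experimenter would supply, and the design is an illustrative assumption rather than the protocol of any particular paper.

```python
from typing import Callable, Dict, List

def build_position_sweep(probe: str,
                         filler_sentences: List[str],
                         positions=(0.0, 0.25, 0.5, 0.75, 1.0)) -> Dict[float, str]:
    """Insert `probe` at several relative positions inside benign filler text."""
    prompts = {}
    for rel_pos in positions:
        idx = int(rel_pos * len(filler_sentences))
        spliced = filler_sentences[:idx] + [probe] + filler_sentences[idx:]
        prompts[rel_pos] = " ".join(spliced)
    return prompts

def sweep_scores(probe: str,
                 filler_sentences: List[str],
                 score_fn: Callable[[str], float]) -> Dict[float, float]:
    """Map each relative insertion position to the guardrail score it receives."""
    prompts = build_position_sweep(probe, filler_sentences)
    return {pos: score_fn(text) for pos, text in prompts.items()}
```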

Individual Paper Contributions

The papers collectively demonstrate a trend towards more nuanced and systematic approaches to evaluating and enhancing the reliability and safety of LLMs. Innovations include:

These advancements indicate a move towards more comprehensive and context-aware methods for ensuring LLM reliability, focusing on both performance and interpretability.

Datasets and Evaluation Metrics

Evaluation metrics include:

These datasets and metrics highlight the diverse approaches and considerations in validating and ensuring the reliability of LLMs across different dimensions and applications.


Topic 8: Knowledge Representation and Retrieval

Topic Overview

The topic of Knowledge Representation and Retrieval is central to the advancement of AI systems, especially in domains where precise, structured, and actionable information is essential. These domains include aviation maintenance, collaborative multi-agent systems, historical climate data analysis, literature review generation, and task-oriented dialogue management. The research focuses on overcoming the limitations of large language models (LLMs) in handling domain-specific tasks, improving their reliability, and ensuring that they can effectively retrieve and synthesize knowledge from complex, unstructured data sources. By integrating structured knowledge representations such as knowledge graphs and employing advanced retrieval and reasoning techniques, these studies aim to enhance the applicability of AI in high-stakes environments and improve the overall performance of automated systems in various practical scenarios.

Individual Paper Contributions

The papers in this collection reflect a growing trend towards leveraging structured knowledge representations, such as knowledge graphs, alongside retrieval-augmented generation (RAG) methods to enhance the performance of AI systems in domain-specific and high-stakes environments. There is a clear shift towards integrating multi-agent systems and graph neural networks (GNNs) to enable adaptive collaboration and better handling of complex tasks. Additionally, the use of specialized prompting techniques, including negative examples and few-shot learning, is highlighted as a means to improve the accuracy and reliability of information extraction and synthesis from unstructured data sources. These approaches collectively aim to address the limitations of large language models in specific contexts, enhancing their applicability and trustworthiness.
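
At its core, the retrieval-augmented generation setup these papers build on pairs a similarity search over an indexed corpus with prompt construction. The sketch below shows that skeleton with NumPy and a placeholder `embed` function; the knowledge-graph-guided routing and multi-agent coordination described above would sit on top of something like this, so the code is a generic baseline rather than any paper's system.

```python
import numpy as np
from typing import Callable, List

def retrieve_top_k(query: str,
                   documents: List[str],
                   embed: Callable[[List[str]], np.ndarray],
                   k: int = 3) -> List[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    doc_vecs = embed(documents)                     # (n_docs, dim)
    query_vec = embed([query])[0]                   # (dim,)
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    top = np.argsort(-(doc_vecs @ query_vec))[:k]
    return [documents[i] for i in top]

def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Compose the retrieved passages and the question into a grounded prompt."""
    context = "\n\n".join(passages)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```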

Datasets and Evaluation

The datasets and evaluation metrics vary across the papers, reflecting the diversity of application domains and the specific challenges they address. Key datasets include:

Evaluation metrics include:

These metrics provide a comprehensive view of system performance, ranging from factual consistency and retrieval accuracy to the coherence and usability of generated outputs.


Topic 9: Human-Centric AI and Social Inference

Topic Overview

Human-Centric AI and Social Inference is a burgeoning field focused on enhancing AI systems to better understand and respond to the complexities of human social interactions. This includes areas such as detecting sarcasm, irony, and humor, assessing creativity, communicating uncertainty, and providing emotional support through dialogue. The importance of this research lies in making AI systems more empathetic, reliable, and aligned with human values, thus enabling smoother and more effective human-AI collaborations in everyday life and specialized fields like healthcare and legal services.

Individual Paper Contributions

The papers in this collection reflect evolving trends in human-centric AI research, emphasizing the importance of enhancing AI models with deeper cognitive abilities and aligning them more closely with human values and preferences. Notable trends include the use of fine-tuning techniques to improve specific reasoning capabilities, the integration of reinforcement learning for task refinement, and the application of curiosity-driven mechanisms to personalize subjective evaluations. There is a growing recognition of the need for AI models to not only generate text but also to reason about and communicate uncertainty effectively, particularly in complex and sensitive domains.

Datasets and Evaluation

Evaluation metrics across the papers include Pearson correlation, Cohen’s kappa, F1 scores, BLEU-1/2, ROUGE-L, METEOR, BERTScore, diversity scores (Distinct-1 and Distinct-2), AUC, and ECE. These metrics were selected to comprehensively assess the performance of AI models in terms of their alignment with human judgment, logical consistency, and the quality of generated responses.
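
Of the metrics listed above, expected calibration error (ECE) is the least standardized, so a minimal equal-width-binned implementation is sketched below; individual papers may bin differently or use adaptive bins, so this is one common convention rather than the definitive formula.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Equal-width binned ECE: bin-size-weighted average of |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy example: an overconfident model.
print(expected_calibration_error([0.9, 0.9, 0.6, 0.55], [1, 0, 1, 1]))
```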


Topic 10: Enterprise Applications and Task Discovery

Topic Overview

The topic of enterprise applications and task discovery focuses on leveraging large language models (LLMs) to enhance various business operations, from job recommendations and structural analysis to financial decision support and optimization modeling. These applications are critical for improving the efficiency, accuracy, and scalability of tasks that traditionally require significant human expertise and computational resources. The importance of this research lies in its potential to revolutionize industries by enabling more sophisticated and automated solutions to complex problems, ultimately driving innovation and productivity.

Individual Paper Contributions

The papers collectively highlight a shift towards more sophisticated and context-aware applications of LLMs in enterprise settings. Key trends include:

Datasets and Evaluation

Evaluation metrics vary across the papers, including accuracy, F1-score, and ROUGE scores, alongside domain-specific benchmarks such as SpokenWOZ and USPTO-2M. Each paper emphasizes the need for context-specific metrics to accurately gauge model performance and reliability in its domain.
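
Among the metrics just named, ROUGE-L is the easiest to mis-specify, so a minimal longest-common-subsequence implementation is sketched below. Published results typically rely on the official rouge-score package with stemming, so numbers from this toy version may differ slightly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between whitespace-tokenized candidate and reference strings."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the model predicts job fit well", "the model predicts person job fit"))
```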


Topic 11: Miscellaneous

Topic Overview

The research topic of “miscellaneous” covers a broad spectrum of challenges and innovations in the domain of large language models (LLMs) and their applications. These papers collectively address issues such as long-context processing, multi-step reasoning, memory management, authorship attribution, quantum-inspired music notation, clinical expertise alignment, and the inevitability of hallucination under the open world assumption. Each paper explores unique facets of LLMs, contributing to their efficiency, effectiveness, and adaptability across diverse scenarios. Understanding these challenges is crucial for advancing LLMs and making them more reliable and versatile in real-world applications.

Individual Paper Contributions

The papers in this collection adopt a range of innovative approaches to tackle various challenges faced by LLMs. Techniques such as context denoising, dynamic interactive planning, constructivist agentic memory, directed decoding maps, set-based fine-tuning, disentangled MoE, correctness-first decoding, and novel optimizers are highlighted. The trend towards developing more efficient, scalable, and context-aware methodologies is evident, with an emphasis on integrating domain-specific knowledge and improving the interpretability of model decisions. Additionally, there is a focus on addressing issues related to the generation of diverse yet accurate reasoning paths, and the exploration of alternative formalisms and frameworks to enhance the capabilities of LLMs in specialized domains like music and clinical care.

Datasets and Evaluation Metrics

Evaluation metrics commonly used include F1 scores, ROUGE scores, BLEU scores, NDCG@10, HR@10, AUC, and perplexity. These metrics are tailored to assess different dimensions of model performance, including accuracy, informativeness, and consistency, reflecting the varied goals and challenges addressed by the respective research efforts.
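
Two of the ranking metrics listed above, HR@10 and NDCG@10, are shown below in the common single-relevant-item form used for recommendation evaluation; papers sometimes use graded relevance or different conventions, so treat this as one standard variant.

```python
import math
from typing import List

def hit_rate_at_k(ranked_items: List[str], target: str, k: int = 10) -> float:
    """1.0 if the relevant item appears in the top-k ranked list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items: List[str], target: str, k: int = 10) -> float:
    """With a single relevant item, the ideal DCG is 1, so NDCG@k = 1/log2(rank + 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranking = ["item_7", "item_3", "item_9", "item_1"]
print(hit_rate_at_k(ranking, "item_9"), ndcg_at_k(ranking, "item_9"))
```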


References


  1. EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

  2. MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction

  3. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

  4. MADS: Multi-Agent Dialogue Simulation for Diverse Persuasion Data Generation

  5. ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems

  6. Prototype-Based Dynamic Steering for Large Language Models

  7. Mixture of Neuron Experts

  8. Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

  9. Submodular Context Partitioning and Compression for In-Context Learning

  10. VAL-Bench: Measuring Value Alignment in Language Models

  11. Do Code Models Suffer from the Dunning-Kruger Effect?

  12. Prompt reinforcing for long-term planning of large language models

  13. Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models

  14. WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

  15. The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

  16. Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

  17. Sci-Phi: A Large Language Model Spatial Audio Descriptor

  18. Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs

  19. Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)

  20. Advancing Automated Spatio-Semantic Analysis in Picture Description Using Language Models

  21. DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

  22. EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

  23. DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

  24. Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

  25. Adversarial Reinforcement Learning for Large Language Model Agent Safety

  26. Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens

  27. The fragility of “cultural tendencies” in LLMs

  28. Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer

  29. To model human linguistic prediction, make LLMs less superhuman

  30. Automated Boilerplate: Prevalence and Quality of Contract Generators in the Context of Swiss Privacy Policies

  31. Diversity Is All You Need for Contrastive Learning: Spectral Bounds on Gradient Magnitudes

  32. The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures

  33. SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation

  34. Paying Attention to Hybrid Attention: Untangling the Issues with Conversion Methods

  35. AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning

  36. Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

  37. Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

  38. Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities

  39. Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input

  40. RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

  41. A Single Character can Make or Break Your LLM Evals

  42. Domain-Shift-Aware Conformal Prediction for Large Language Models

  43. Proactive defense against LLM Jailbreak

  44. A novel hallucination classification framework

  45. Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling

  46. Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment

  47. KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance

  48. AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering

  49. WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

  50. Rationale-Augmented Retrieval with Constrained LLM Re-Ranking for Task Discovery

  51. Towards Structured Knowledge: Advancing Triple Extraction from Regional Trade Agreements using Large Language Models

  52. Collaborative and Proactive Management of Task-Oriented Conversations

  53. Generative AI-Driven Hierarchical Multi-Agent Framework for Zero-Touch Optical Networks

  54. SocialNLI: A Dialogue-Centric Social Inference Dataset

  55. Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment

  56. Improving Metacognition and Uncertainty Communication in Language Models

  57. CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

  58. On the Role of Difficult Prompts in Self-Play Preference Optimization

  59. LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation

  60. A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis

  61. Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

  62. MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization

  63. Language Model as Planner and Formalizer under Constraints

  64. Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

  65. Exploring Large Language Models for Financial Applications: Techniques, Performance, and Challenges with FinMA

  66. Revisiting Long-context Modeling from Context Denoising Perspective

  67. Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs

  68. CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

  69. Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs

  70. Training Large Language Models To Reason In Parallel With Global Forking Tokens

  71. Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation

  72. Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs

  73. NorMuon: Making Muon more efficient and scalable

  74. Quantum Concept Music Score from Quantum Picturalism: Musical Incarnation of a Bell-Pair under Measurements

  75. Probing the Difficulty Perception Mechanism of Large Language Models

  76. InforME: Improving Informativeness of Abstractive Text Summarization With Informative Attention Guided by Named Entity Salience

  77. Hallucination is Inevitable for LLMs with the Open World Assumption

  78. Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care