NLP Paper Roundup, October 13, 2025 (English)
- Topic 1: Reasoning and Logical Flow (6 papers)
- Topic 2: Multimodal and Cross-Modal Integration (6 papers)
- Topic 3: Knowledge Retrieval and Augmentation (7 papers)
- Topic 4: Model Efficiency and Optimization (5 papers)
- Topic 5: Language Processing and Generation (6 papers)
- Topic 6: Machine Learning and Deep Learning Techniques (5 papers)
- Topic 7: Temporal and Sequential Data Handling (4 papers)
- Topic 8: Benchmarking and Evaluation (4 papers)
- Topic 9: Security and Privacy in AI (6 papers)
- Topic 10: Cross-Linguistic and Cultural Studies (3 papers)
- Topic 11: misc (15 papers)
Topic 1: Reasoning and Logical Flow
Topic Overview
The research topic of “Reasoning and Logical Flow” centers on enhancing the reasoning capabilities of large language models (LLMs) across scenarios including multi-hop reasoning, scientific reasoning, and theory-of-mind simulation. These studies aim to address the inherent limitations of LLMs when handling extended contexts, complex reasoning tasks, and nuanced social interactions, which are critical for advancing AI systems toward more human-like cognitive functions and practical applications in fields such as customer service, education, and healthcare.
Individual Paper Contributions
- Jia-Chen Gu from the University of California, Los Angeles and colleagues studied the inefficiency and cognitive overload experienced by LLMs in retrieval-augmented generation (RAG) systems when dealing with extensive contexts. They proposed BRIEF-Pro, a lightweight context compressor designed for multi-hop reasoning tasks, which synthesizes short-to-long contexts and includes a user-controllable compression instruction. The main innovation points are the synthetic data pipeline and universal applicability across LLM sizes. The value lies in its ability to improve QA accuracy and reduce inference latency while consuming fewer computational resources. Experiments on the MuSiQue, HotpotQA, 2WikiMultiHopQA, and LongSeal datasets showed significant improvements in QA performance and reductions in TFLOPs consumption compared to LongLLMLingua and other baselines, concluding that BRIEF-Pro is suitable for large-scale long-context processing without compromising accuracy [1].
- Giovanni Monea from Cornell University and colleagues addressed the scalability issue of LLMs for long-context reasoning, which is hindered by the linear growth of the Transformer key-value cache. They introduced Breadcrumbs Reasoning (BR), a method that compresses key-value cache entries into learned beacon tokens and evicts the compressed entries to reduce memory usage. The innovation points include the joint reinforcement learning (RL) and distillation framework for training and the use of an attention mask to simulate KV cache removal (a minimal sketch of such a mask appears after this list). The value lies in maintaining high accuracy with reduced memory usage, making it well suited to real-world applications where memory and computational resources are limited. Experiments on Qwen2.5-1.5B and Phi-4-Mini models using the Countdown, LinSys, and StarGraph benchmarks demonstrated superior memory efficiency and accuracy compared to uncompressed models and training-free compression techniques, concluding that learned compression schemes are essential for complex reasoning tasks [2].
- Xiaoshu Chen from the National University of Defense Technology and colleagues focused on enhancing reasoning capabilities in LLMs through Chain of Thought (CoT) fine-tuning. They provided a structured review of CoT fine-tuning methods using a bi-level taxonomy based on the Six Thinking Hats framework. The innovation lies in the taxonomy’s categorization of reasoning development into six dimensions and its focus on human-like reasoning mechanisms. The value is in offering a comprehensive guide for researchers to understand the development trajectories of different CoT fine-tuning methods and to select appropriate baselines. Although the paper does not detail specific experimental results, it highlights the evolution of CoT fine-tuning toward the Insight Model stage, where LLMs develop advanced reasoning capabilities akin to humans [3].
- Kehua Feng from Zhejiang University and colleagues tackled the challenge of achieving high-quality scientific reasoning through CoT distillation. They developed CoT-Evo, an evolutionary CoT distillation framework that creates and refines reasoning trajectories using domain-specific knowledge and iterative processes driven by a fitness function (a toy version of the loop is sketched after this list). The innovation points include the novelty-driven selection (NDS) strategy and the integration of recombination and mutation modules. The value is in generating a high-quality CoT dataset for scientific reasoning, leading to state-of-the-art performance on BioProBench and ChemCoTBench. The ablation study and scalability analysis underline the importance of NDS and the diminishing returns of larger iteration budgets and population sizes, concluding that CoT-Evo significantly improves scientific reasoning capabilities [4].
- Agnese Lombardi from the University of Pisa and colleagues assessed Theory of Mind (ToM) capabilities in LLMs, focusing on their ability to interpret indirect speech acts and contextual cues. They used Concordia, a Generative Agent-Based Model, to embed utterances and narratives within situational contexts, and evaluated models with a modified False-Belief task. The innovation lies in adapting ToM tasks into a new experimental format and employing a Chain of Thought approach. The value is in identifying the limitations of current LLMs in achieving true ToM-like abilities, emphasizing the need to consider extralinguistic factors. The structured evaluation method, including LLM-as-a-Judge, revealed that GPT-4o-mini often fails to interpret social context appropriately, concluding that LLMs have significant room for improvement in understanding human intentions and beliefs [5].
- Ine Gevers from the University of Antwerp and colleagues examined the abstract and abductive reasoning skills of LLMs through their ability to play the board game ‘Concept’. The game requires players to provide and interpret clues hierarchically, which differs from typical linguistic data. They collected a multilingual dataset of game logs and implemented static and dynamic prompting methods to evaluate LLMs. The innovation lies in using a game-based approach to probe reasoning abilities and in conducting multilingual experiments. The value is in assessing LLMs’ strategic thinking and hypothesis-updating skills, which are critical for complex problem-solving. Experiments showed that LLMs struggle to interpret strategic linguistic signals and update hypotheses, especially in non-English languages, concluding that while LLMs can be efficient once on the right path, they face significant challenges in simulating human strategic intent and handling cultural nuances [6].
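To make the eviction idea in Breadcrumbs Reasoning [2] concrete, below is a minimal sketch of an attention mask that simulates KV-cache removal during training: every `chunk`-th position stands in for a learned beacon token, and queries may attend only to recent raw tokens plus earlier beacons. The chunking scheme, window size, and the use of positions as beacons are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def breadcrumbs_mask(seq_len: int, chunk: int = 8, window: int = 8) -> np.ndarray:
    """Boolean causal mask that simulates KV-cache eviction.

    Positions at the end of each chunk act as 'beacon' tokens. A query may
    attend to (1) the last `window` raw tokens and (2) all earlier beacons;
    raw tokens older than the window are treated as evicted.
    """
    is_beacon = (np.arange(seq_len) + 1) % chunk == 0
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    causal = k <= q
    recent = (q - k) < window
    return causal & (recent | is_beacon[None, :])
```

Applying such a mask during training lets the model learn to pack information into the beacon slots before the raw entries become unreachable, which is what makes eviction at inference time cheap.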
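The evolutionary loop behind CoT-Evo [4] can also be sketched in a few lines. This is a toy rendering under stated assumptions: `fitness`, `recombine`, and `mutate` are caller-supplied stand-ins for the paper's LLM-driven operators, and novelty is crudely approximated by distinctness within the pool rather than the paper's novelty-driven selection.

```python
import random

def evolve_cot(seed_traces, fitness, recombine, mutate,
               pop_size=32, generations=10, novelty_weight=0.3):
    """Toy evolutionary distillation loop in the spirit of CoT-Evo.

    `fitness(trace)` scores correctness/quality; novelty is approximated
    by how many pool members a trace differs from. All operators are
    caller-supplied placeholders.
    """
    population = list(seed_traces)[:pop_size]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            a, b = random.sample(population, 2)   # pick two parents
            offspring.append(mutate(recombine(a, b)))
        pool = population + offspring

        def score(trace):
            novelty = sum(trace != other for other in pool) / len(pool)
            return fitness(trace) + novelty_weight * novelty

        population = sorted(pool, key=score, reverse=True)[:pop_size]
    return population
```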
Technical Trends
The papers in this collection demonstrate several evolving trends in the field of reasoning and logical flow for LLMs:
- Context Compression: Methods like BRIEF-Pro and Breadcrumbs Reasoning focus on compressing long contexts to reduce computational overhead and improve inference speed.
- Chain of Thought (CoT) Techniques: Papers like “Putting on the Thinking Hats” and “CoT-Evo” emphasize the development of CoT strategies, aiming to align LLM reasoning with human cognitive processes.
- Memory Management: Breadcrumbs Reasoning introduces a memory-efficient mechanism for managing the Transformer key-value cache during reasoning tasks.
- Domain-Specific Adaptation: CoT-Evo tailors reasoning to specific scientific domains, enhancing model performance in specialized knowledge areas.
- Evaluation Frameworks: The introduction of new benchmarks and evaluation methods, such as the ‘Concept’ game and the modified False-Belief task, allows for a more nuanced assessment of LLM reasoning capabilities.
Datasets and Evaluation
The papers utilized a variety of datasets and evaluation methods to test the reasoning capabilities of LLMs:
- MuSiQue, HotpotQA, 2WikiMultiHopQA, LongSeal: Used by BRIEF-Pro to evaluate multi-hop reasoning performance.
- Countdown, LinSys, StarGraph: Benchmarks used by Breadcrumbs Reasoning to assess memory-efficiency and reasoning accuracy.
- BioProBench, ChemCoTBench: Scientific reasoning datasets employed by CoT-Evo to evaluate domain-specific reasoning.
- Concept Dataset: A multilingual dataset of game logs used to assess abstract reasoning and hypothesis updating in LLMs.
- False-Belief Task: Adapted to evaluate ToM capabilities in LLMs, specifically their ability to interpret social context.
These datasets and evaluations highlight the multifaceted nature of reasoning and logical flow research, covering a range of tasks from multi-hop question-answering to scientific and social reasoning.
Topic 2: Multimodal and Cross-Modal Integration
Topic Overview
Multimodal and Cross-Modal Integration is a rapidly evolving area in artificial intelligence that focuses on developing models capable of understanding and generating content across multiple modalities such as text, image, audio, and video. These models are essential for creating advanced AI systems that can interpret complex, real-world data and interact with humans more naturally and effectively. The integration of different modalities presents significant challenges, including the need for unified representations, robustness to environmental variations, and efficient training and inference mechanisms. Addressing these issues is crucial for applications ranging from voice assistants and document processing to robotic manipulation and error correction in speech recognition.
Individual Paper Contributions
- Run Luo from the Shenzhen Institutes of Advanced Technology and colleagues studied the limitations of existing multimodal models, particularly those constrained by autoregressive architectures. They proposed NExT-OMNI, an omnimodal foundation model built on discrete flow matching (DFM), to overcome these limitations. The main innovation points of this method are its unified representation with intermediate feature fusion, a three-stage progressive training framework, and a dynamic-length generation strategy. The value lies in the model’s ability to perform any-to-any cross-modal generation and multi-turn interaction efficiently, showcasing architectural advantages over traditional autoregressive methods. Experiments on benchmarks such as OmniBench, WorldSense, AV-Odyssey, and OpenING demonstrated significant improvements in multimodal understanding and generation, as well as in multi-turn multimodal interaction and cross-modal retrieval, compared to existing models [7].
- Santiago Cuervo from the Université de Toulon and colleagues addressed the ’text-speech understanding gap’ in Large Language Models (LLMs), where speech-adapted models often underperform their text-based counterparts when processing spoken inputs. They introduced SALAD, a method combining cross-modal distillation with active learning to improve alignment and minimize forgetting (a minimal distillation-loss sketch appears after this list). The main innovation points are the use of natural speech corpora and synthesized text subsets, along with a sample-efficient approach to adaptation. The value of this method is its scalability and reproducibility, offering a way to mitigate the text-speech gap without requiring extensive speech data. Experiments on benchmarks such as StoryCloze, MMSU, OpenBookQA, HellaSwag, ARC-Challenge, and PIQA showed that SALAD models retain text capabilities more effectively and outperform baselines, even with less speech data [8].
- Weishi Wang from SAP and colleagues provided a comprehensive survey of Document AI (DAI) in the era of Large Language Models (LLMs). The paper categorizes research into five key tasks and highlights new methods such as layout-aware Chain-of-Thought prompting, ‘box tokens’, and visually guided generative text-layout pretraining. The main innovation is the focus on unified representation learning for multimodal and multilingual document processing, addressing challenges in specialized tasks like Document Layout Analysis (DLA) and Key Information Extraction (KIE). The value lies in the systematic categorization of research efforts and the identification of the advances needed to improve DAI tasks. Although the paper does not present experimental conclusions, it provides valuable insights into the current limitations and future directions of LLMs in DAI [9].
- Senyu Fei from Tongji University and colleagues investigated the robustness of Vision-Language-Action (VLA) models under real-world conditions. They introduced LIBERO-Plus, a benchmark designed to evaluate generalization across seven dimensions, and proposed a highly automated pipeline for constructing diverse training datasets. The main innovation is the statistical framework for analyzing the compositional generalization gap, revealing interaction effects between perturbations. The value is in providing a rigorous evaluation practice that goes beyond static, ideal conditions, helping researchers understand the true adaptability of VLA models. Systematic perturbation experiments on the LIBERO-Plus benchmark showed significant improvements in model robustness, especially under camera-view, noise, and layout perturbations, compared to other baselines [10].
- Sungnyun Kim from KAIST and colleagues focused on improving the robustness of Automatic Speech Recognition (ASR) systems in noisy conditions by integrating visual cues into the recognition process. They introduced DualHyp, a paradigm for generative error correction (GER) in audio-visual speech recognition (AVSR) that maintains modality-specific pathways and avoids cross-modal contamination. The main innovation is RelPrompt, a noise-aware guidance mechanism that exploits the reliability of visual cues. The value lies in enhancing ASR performance under challenging environmental conditions, demonstrated through significant improvements in Word Error Rate (WER) across various corruption scenarios. Experiments on the LRS2 benchmark and the multilingual MuAViC dataset showed substantial performance gains, particularly in noisy-audio/clean-video settings, compared to conventional GER methods [11].
- Keyan Zhou from Soochow University and colleagues developed MMLongCite, a benchmark for evaluating the fidelity of Large Vision-Language Models (LVLMs) in handling long multimodal contexts. The benchmark includes 8 tasks spanning 6 context-length intervals and covers diverse modalities. The main innovation is the detailed construction methodology for resizing and manipulating context lengths and content, addressing the shortcomings of existing benchmarks. The value lies in providing a more rigorous evaluation framework that tests models’ ability to ground responses in context, which is crucial for applications requiring accurate and reliable information extraction. Experimental results on datasets such as LongDocURL, HotpotQA, Visual Haystack, and Video-MME revealed significant discrepancies in model performance, particularly on visual grounding tasks, with proprietary models generally outperforming open-source models [12].
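Cross-modal distillation of the kind SALAD [8] builds on typically minimizes a KL term between a frozen text teacher (reading the transcript) and a speech-adapted student (hearing the audio). The sketch below shows that standard term; the temperature, shapes, and pairing setup are assumptions for illustration, not SALAD's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL distillation term for cross-modal alignment.

    `student_logits` come from the speech-adapted model on spoken input,
    `teacher_logits` from the frozen text model on the paired transcript
    (both [batch, seq, vocab]). Assumed setup, not the paper's exact loss.
    """
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is conventional in distillation
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * (t * t)
```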
Technical Trends
The papers collectively highlight several technical trends in multimodal and cross-modal integration:
- Unified Representation Learning: Methods such as NExT-OMNI and the models discussed in the DAI survey emphasize the importance of unified representations to effectively integrate understanding and generation capabilities across different modalities.
- Discrete Flow Matching (DFM): NExT-OMNI employs DFM to achieve more efficient and flexible cross-modal interactions, showcasing its potential as a viable alternative to autoregressive architectures.
- Cross-modal Distillation and Active Learning: SALAD utilizes cross-modal distillation and active learning to bridge the text-speech understanding gap, emphasizing the importance of these techniques in adapting models to new modalities without sacrificing performance in their original domain.
- Robustness and Generalization: LIBERO-Plus underscores the need for rigorous evaluation frameworks that test model robustness under varying conditions, advocating for a broader and more diverse set of training data to improve generalization.
- Modality-Specific Pathways: DualHyp maintains separate pathways for audio and visual modalities to prevent cross-modal contamination, illustrating the benefit of preserving modality-specific features in multimodal processing.
- Benchmark Development: MMLongCite introduces a new benchmark specifically tailored to evaluate long-context vision-language models, emphasizing the importance of task diversity and context length in assessing model fidelity.
Datasets and Evaluation Metrics
- OmniBench, WorldSense, AV-Odyssey, OpenING: Used for evaluating NExT-OMNI’s performance in multimodal understanding and generation.
- StoryCloze, MMSU, OpenBookQA, HellaSwag, ARC-Challenge, PIQA: Benchmarks for assessing SALAD’s effectiveness in closing the text-speech understanding gap.
- LibriHeavy, Emilia-YODAS-EN, FineWeb-Edu: Corpora used for training and testing SALAD.
- LIBERO-Plus: New benchmark introduced for evaluating the robustness of VLA models across various perturbation types.
- LRS2, MuAViC: Datasets used to test the effectiveness of the DualHyp framework in correcting speech errors.
- LongDocURL, HotpotQA, Visual Haystack, Video-MME: Datasets included in MMLongCite to evaluate the fidelity of LVLMs in long-context scenarios.
These datasets and metrics underscore the growing complexity and diversity required in multimodal and cross-modal research, reflecting the increasing sophistication of AI models and the need for thorough and realistic evaluations.
Topic 3: Knowledge Retrieval and Augmentation
Topic Overview
Knowledge retrieval and augmentation in AI systems, particularly in chat assistants and large language models (LLMs), have become increasingly important as these systems are integrated into high-stakes domains such as healthcare, finance, and scientific research. Ensuring that these systems provide accurate, reliable, and trustworthy information is paramount, given the potential for misinformation to cause significant harm. The papers reviewed here address various challenges related to improving the credibility, reliability, and reasoning capabilities of AI systems through innovative methodologies and frameworks.
Individual Paper Contributions
- Ivan Vykopal from Brno University of Technology and colleagues studied the issue of misinformation and the lack of reliable evidence in responses generated by chat assistants with web-search functionality. They proposed a methodology for evaluating the credibility of web sources cited by chat assistants and the groundedness of their responses. The main innovation points include a systematic approach to assessing the reliability of responses and a curated list of 100 claims across five misinformation-prone topics. The value lies in providing the first systematic comparison of fact-checking behaviors among chat assistants, including GPT-4o, GPT-5, Perplexity, and Qwen Chat. Experiments showed that Perplexity achieved the highest rate of source credibility, while GPT-4o cited more non-credible sources, especially on sensitive topics. The conclusion is that the way chat assistants perform web searches and cite sources can significantly affect the reliability of their responses [13].
- Shujun Xia from the City University of Hong Kong and colleagues addressed the problem of outdated or inaccurate information generated by LLMs in medical applications. They introduced a novel benchmark, MedVersa, and a retrieval-based editing framework called MedREK, which includes a shared query-key MLP and an attention-based prompt encoder. The main innovation points are the precise knowledge-retrieval mechanism and the ability to perform batch editing in medical LLMs. The value lies in overcoming inaccurate retrieval and updating medical knowledge efficiently without full retraining. Experiments demonstrated that MedREK outperformed baselines such as MEND, MEMIT, MedLaSA, and RECIPE across various medical benchmarks, showing superior performance on Efficacy, Generality, Locality, and Fluency metrics. The ablation study revealed that both the shared query-key MLP and the attention-based prompt encoder contribute significantly to the final performance [14].
- Zhichao Xu from Amazon AI Fundamental Research and colleagues tackled the lack of faithfulness in intermediate reasoning steps during training for retrieval-augmented generation tasks, particularly in domains requiring complex reasoning such as math and coding. They proposed VERITAS, a framework that integrates fine-grained faithfulness rewards into the reinforcement learning process (an illustrative reward decomposition is sketched after this list). The main innovation points include a multi-faceted reward function and an efficient, distilled reward model. The value lies in fostering faithful reasoning while maintaining or improving task accuracy. Experiments on seven QA benchmarks showed that VERITAS-R1, trained with the VERITAS framework, substantially improved faithfulness metrics, especially Information-Think faithfulness, across all dataset categories, including challenging multi-hop datasets. The conclusion is that process-based faithfulness rewards can significantly enhance the reliability and effectiveness of search agents [15].
- Jiamin Chen from the City University of Hong Kong and colleagues focused on the brittleness and instability of long-context Retrieval-Augmented Generation (RAG) systems. They proposed Contextual Normalization (C-Norm), a lightweight, model-agnostic framework that improves long-context RAG performance by adaptively reformulating the input context. The main innovation points are the Attention Balance Score (ABS) for selecting the most effective context format and the emphasis on the influence of context presentation format. The value lies in addressing a gap in the existing literature, which focuses mostly on retrieval quality and prompting strategies. Experiments on the NQ-Open and LongBench-v2 datasets demonstrated consistent improvements in robustness and reasoning capacity across various LLMs, especially in challenging long-context scenarios. The conclusion is that the presentation format of context significantly affects long-context RAG systems, and C-Norm can effectively enhance performance [16].
- Zhiqi Huang from Capital One and colleagues addressed the issue of generating trustworthy responses in RAG systems, particularly in high-stakes domains like finance and healthcare. They proposed a method for estimating the confidence of LLM-generated responses using raw feed-forward network (FFN) activations as auto-regressive signals (a minimal probe of this kind is sketched after this list). The main innovation points include the use of FFN activations for uncertainty estimation and a Huber loss term for robustness. The value lies in providing a scalable, architecture-aware solution for enhancing the trustworthiness of RAG systems. Experiments on a proprietary financial-industry customer-support dataset showed that their model achieved high precision at a reasonable masking rate, outperforming baselines such as Vectara (HHEM2.1) and a logits-based uncertainty model. The conclusion is that raw activations from specific layers of the LLM can yield high precision and effective calibration, balancing accuracy and system responsiveness [17].
- Xiuyuan Chen and colleagues aimed to evaluate AI clinician systems more rigorously, proposing the GAPS framework, a multidimensional system that evaluates clinical competence along four axes: Grounding, Adequacy, Perturbation, and Safety. The main innovation points are the guideline-anchored pipeline that constructs a GAPS-aligned benchmark and the automatic generation of questions and rubrics across the GAPS dimensions. The value lies in offering a scalable and reproducible method for evaluating AI clinicians, moving beyond simple factual recall to deeper reasoning and safety considerations. Empirical evaluations on the GAPS-NCCN-NSCLC-preview benchmark showed that while models perform well on factual and explanatory tasks, performance declines with increased reasoning depth and complexity, indicating a need for improved inferential reasoning under uncertainty [18].
- Subhendu Khatuya from the Indian Institute of Technology Kharagpur and colleagues studied the difficulty LLMs have with numerical reasoning tasks, especially in the financial domain. They introduced FINDER, a retriever-generator framework that integrates dynamic in-context example selection with generative retrieval. The main innovation points include leveraging clustering techniques to obtain diverse yet representative in-context examples and using a fine-tuned FLAN-T5 model for fact retrieval together with GPT-4 for generating Python code. The value lies in improving the precision of fact retrieval and the flexibility of in-context example usage, leading to higher execution accuracy. Experiments on the FinQA and ConvFinQA datasets showed that FINDER outperformed previous state-of-the-art models, including APOLLO and ENCORE, achieving significant improvements in execution accuracy [19].
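A multi-faceted reward of the kind VERITAS [15] optimizes can be pictured as a weighted sum of a task-outcome term and judge-scored grounding terms. The decomposition and weights below are illustrative assumptions; the paper's actual reward model and facets may differ.

```python
def composite_reward(answer_correct: bool,
                     think_supported: float,
                     answer_supported: float,
                     w_task: float = 1.0,
                     w_think: float = 0.5,
                     w_answer: float = 0.5) -> float:
    """Illustrative multi-faceted RL reward in the spirit of VERITAS.

    `think_supported` / `answer_supported` are scores in [0, 1] from a
    (distilled) judge model estimating how well the reasoning trace and
    the final answer are grounded in retrieved evidence. The weights and
    the exact decomposition are assumptions for illustration.
    """
    task = 1.0 if answer_correct else 0.0
    return w_task * task + w_think * think_supported + w_answer * answer_supported
```

The key design point is that the grounding terms reward faithful intermediate steps even when the final answer is wrong, so the policy is not pushed toward unsupported shortcuts.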
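The FFN-activation confidence idea [17] amounts to training a small probe on hidden activations with a robust regression loss. Below is a minimal sketch under stated assumptions: the layer choice, pooling, head size, and 0/1 correctness labels are all illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Small probe mapping raw FFN activations to a confidence score.

    Reads hidden activations from one chosen layer and regresses a
    per-response confidence, trained with Huber loss for robustness.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, ffn_acts: torch.Tensor) -> torch.Tensor:
        # ffn_acts: [batch, seq, hidden]; mean-pool over tokens, then score
        pooled = ffn_acts.mean(dim=1)
        return torch.sigmoid(self.proj(pooled)).squeeze(-1)

# Training-step sketch: Huber loss against a 0/1 correctness label
head = ConfidenceHead(hidden_dim=1024)
acts = torch.randn(4, 17, 1024)          # stand-in activations
labels = torch.tensor([1., 0., 1., 1.])  # was each response judged correct?
loss = nn.HuberLoss()(head(acts), labels)
loss.backward()
```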
Technical Trends
The papers collectively highlight several key technical trends in knowledge retrieval and augmentation:
- Evaluation Methodologies: There is a growing focus on developing systematic and comprehensive evaluation frameworks to assess the reliability, credibility, and reasoning capabilities of AI systems. For example, the GAPS framework emphasizes a multidimensional evaluation approach, while the methodology proposed by Vykopal et al. distinguishes between different user perspectives.
- Attention Mechanisms and Prompt Engineering: Several papers, such as MedREK and C-Norm, leverage advanced attention mechanisms and prompt engineering to improve the precision and effectiveness of knowledge retrieval. The MedREK framework uses an attention-based prompt encoder to generate high-quality prompts, while C-Norm utilizes the Attention Balance Score for context adaptation.
- Faithfulness and Uncertainty Estimation: Enhancing the faithfulness of reasoning steps and estimating the uncertainty of generated responses are emerging as critical areas. VERITAS incorporates faithfulness rewards into the reinforcement learning process, while Huang et al. propose using FFN activations for uncertainty estimation.
- Automated Benchmark Construction: Automated pipelines for constructing benchmarks, as seen in the GAPS framework, are becoming essential for scalable and reproducible evaluations of AI systems in specialized domains like medicine.
Datasets and Evaluation Metrics
- MedVersa: Used in the MedREK paper for assessing medical LLMs under single and batch-edit scenarios, focusing on metrics like Efficacy, Generality, Locality, and Fluency.
- NQ-Open and LongBench-v2: Utilized in the C-Norm paper to evaluate long-context RAG systems, measuring Overall Averaged Accuracy (OAA) and Optimal Positioned Accuracy (OPA).
- FinQA and ConvFinQA: Employed in the FINDER paper to assess financial numerical reasoning tasks, with improvements measured in terms of execution accuracy.
- GAPS-NCCN-NSCLC-preview: Introduced in the GAPS paper to evaluate AI clinician systems across dimensions like Grounding, Adequacy, Perturbation, and Safety.
These datasets and metrics reflect the evolving landscape of knowledge retrieval and augmentation, emphasizing the importance of domain-specific evaluations and the need for comprehensive assessment beyond simple factual recall.
Topic 4: Model Efficiency and Optimization
Topic Overview
Model efficiency and optimization is a critical area of research in the field of large language models (LLMs) and neural networks. As LLMs grow in size and complexity, the need for efficient resource management, reduced computational overhead, and enhanced performance becomes increasingly important. Research in this domain aims to address bottlenecks related to memory usage, processing speed, and the ability to maintain learned knowledge during fine-tuning. The advancements in this area are vital for scaling LLMs to handle larger datasets and more complex tasks, and for integrating them into real-world applications where performance and resource consumption are key considerations.
Individual Paper Contributions
- Yuxiang Huang from Tsinghua University and colleagues studied the inefficiency of the decoding process in large language models with long contexts, proposing NOSA (Native and Offloadable Sparse Attention) to address the memory-bound bottleneck. The main innovation is the incorporation of an explicit locality constraint into the training process, which enables effective KV cache offloading without degrading task performance. The value lies in maintaining model performance while significantly improving decoding throughput, particularly with large batch sizes and longer input sequences. Experiments on the LongBench and RULER datasets showed a 2.3 times improvement in decoding throughput over vanilla InfLLM-V2, concluding that locality constraints are essential for efficient KV cache offloading [20].
- Jingmin An from Peking University and colleagues addressed the challenge of comparing syntactic-structure processing mechanisms in large language models and the human brain. They introduced the Hierarchical Frequency Tagging Probe (HFTP), a tool that enables this comparison through frequency-domain analysis. The main innovation lies in the unified approach that allows direct comparison between artificial and biological systems, bridging computational linguistics and cognitive neuroscience. The value of HFTP is in providing a deeper understanding of how LLMs and the human brain process syntactic structures differently, which could inform future improvements in model design. Experiments on various LLMs and human sEEG data revealed distinct strategies for syntactic processing and identified brain regions such as A1, STG, MTG, and IFG that align closely with LLM representations, concluding that while LLMs can model syntactic structures, they do so through mechanisms that may not fully replicate those of the human brain [21].
- Pasin Buakhaw from Chulalongkorn University and colleagues tackled the issue of balancing character authenticity with task execution for LLM-based NPCs in gaming environments. They introduced the Deflanderization prompting technique to suppress excessive role-play and improve task fidelity, alongside Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). The main innovation is the technique’s ability to prevent ‘flanderization’ while maintaining character depth and realism. The value is in enhancing player immersion and narrative depth in video games. Experiments in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2 demonstrated significant improvements in CPDCscore(all) with Deflanderization, showing that overly strong role-playing can hurt functional correctness. Combining Deflanderization with few-shot examples further improved function-name and argument matching, leading to the conclusion that simpler prompting strategies are more effective for achieving high performance in both functional reasoning and persona-grounded dialogue [22].
- Chen Zheng and colleagues focused on functional redundancy among experts in Mixture-of-Experts (MoE) models, proposing GatePro, a parameter-free method designed to promote expert-selection diversity. The main innovation is a localized competition mechanism that prevents similar experts from being co-activated, emphasizing diversity over load balancing (a toy selection routine is sketched after this list). The value of GatePro is in its ability to enhance model capacity and efficiency, offering a versatile solution applicable across different training phases. Evaluations on benchmarks such as MMLU-Pro, MMLU, BBH, HellaSwag, GSM8K, and MBPP showed substantial improvements in arithmetic reasoning, complex reasoning, factual knowledge, and code generation, concluding that GatePro significantly enhances expert-selection diversity and accelerates expert activation during training [23].
- Yifeng Xiong and colleagues addressed catastrophic forgetting during parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). They proposed OPLoRA, which constrains LoRA updates to the orthogonal complement of the top-k singular directions of the pre-trained weight matrix (a simplified projection is sketched after this list). The main innovation is a theoretically grounded construction that guarantees preservation of the top-k singular triples of the original weight matrices, thereby reducing forgetting. The value lies in retaining foundational knowledge while acquiring new task-specific skills, which is crucial for real-world model deployment. Experiments across datasets such as Commonsense170k, MetaMathQA, and CodeFeedback showed that OPLoRA outperforms existing methods at preventing catastrophic forgetting while maintaining performance on task-specific benchmarks, concluding that lower alignment with the top-k singular directions correlates with better knowledge preservation [24].
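A similarity-aware gate in the spirit of GatePro [23] can be mocked up for a single token as follows. This is a toy under stated assumptions: the cosine-similarity measure over stand-in expert gate vectors, the hard threshold, and greedy selection are illustrative, not the paper's actual competition mechanism.

```python
import torch

def gatepro_select(router_logits: torch.Tensor,
                   expert_weights: torch.Tensor,
                   top_k: int = 2, sim_thresh: float = 0.9) -> list:
    """Toy similarity-aware expert selection for one token.

    If two experts whose gate vectors are nearly parallel would both be
    activated, only the higher-scoring one is kept and the runner-up slot
    goes to the next-best dissimilar expert.
    """
    sims = torch.nn.functional.cosine_similarity(
        expert_weights[:, None, :], expert_weights[None, :, :], dim=-1)
    chosen = []
    for e in router_logits.argsort(descending=True).tolist():
        if all(sims[e, c] < sim_thresh for c in chosen):
            chosen.append(e)
        if len(chosen) == top_k:
            break
    return chosen

logits = torch.randn(8)       # one token's router scores over 8 experts
weights = torch.randn(8, 64)  # stand-in expert gate vectors
print(gatepro_select(logits, weights))
```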
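The OPLoRA constraint [24] has a clean linear-algebra core. With the pre-trained weight W = U S V^T, an update projected by (I - U_k U_k^T) on the left and (I - V_k V_k^T) on the right has no component in the span of the top-k singular directions, so those triples are preserved. The sketch below applies this projection to a dense update after the fact; the paper instead constrains the low-rank factors during training, so treat this as a simplification.

```python
import torch

def oplora_project(delta: torch.Tensor, W: torch.Tensor, k: int) -> torch.Tensor:
    """Project an update onto the orthogonal complement of W's top-k
    singular directions (simplified post-hoc version of the OPLoRA idea).
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vh[:k, :].T
    left = delta - Uk @ (Uk.T @ delta)   # remove top-k left-singular span
    return left - (left @ Vk) @ Vk.T     # remove top-k right-singular span

W = torch.randn(64, 32)
delta = torch.randn(64, 32) * 1e-2
projected = oplora_project(delta, W, k=4)
```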
Technical Trends
The papers reviewed adopt a range of innovative techniques to optimize model efficiency and reduce computational overhead. These include:
- Sparse Attention Mechanisms: Enhancing KV cache offloading efficiency through locality constraints.
- Unified Probing Methods: Bridging computational linguistics and cognitive neuroscience to understand syntactic processing.
- Prompting Techniques: Using Deflanderization to balance authenticity and task execution in game dialogues.
- Parameter-Free Optimization: Introducing localized competition mechanisms in MoE models to promote expert diversity.
- Orthogonal Projection Methods: Preventing catastrophic forgetting during fine-tuning by ensuring update constraints in the orthogonal complement of pre-trained weights.
Datasets and Evaluation
- LongBench and RULER: Used for evaluating the performance and memory efficiency of LLMs in long-context scenarios.
- Human sEEG Data and Existing Corpora: Employed to probe syntactic structure representations in LLMs and the human brain.
- Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2: Assesses agents in task-oriented dialogue and context-aware dialogue.
- MMLU-Pro, MMLU, BBH, HellaSwag, GSM8K, and MBPP: Benchmarks used to evaluate the performance of MoE models.
- Commonsense170k, MetaMathQA, and CodeFeedback: Datasets utilized to test the effectiveness of OPLoRA in preventing catastrophic forgetting and maintaining task-specific performance.
Topic 5: Language Processing and Generation
Topic Overview
The topic of Language Processing and Generation encompasses the development and analysis of systems that can understand, generate, and manipulate human language. With the rapid advancement of Large Language Models (LLMs), there is increasing interest in understanding their capabilities and limitations, particularly in areas such as text detection, narrative question answering, machine translation, and text-to-speech synthesis. This research is crucial for ensuring the ethical and responsible use of LLMs, enhancing their performance across various tasks, and addressing issues such as memorization and style-content mismatch. As LLMs become more ubiquitous, studies in this area aim to improve their reliability, accuracy, and naturalness, making them more suitable for real-world applications.
Individual Paper Contributions
- Matthieu Dubois from Sorbonne Université and colleagues studied the robustness of automatic text-detection systems to variations in the decoding strategies LLMs use to generate text. They proposed a large-scale benchmark dataset that includes texts generated with six decoding strategies across 37 configurations, allowing a detailed examination of how subtle variations affect detector performance. The main innovation points include the comprehensive range of sampling parameters examined and the theoretical analysis of their impact on detectability. The value lies in providing a more rigorous evaluation protocol for text detectors, revealing significant performance drops under specific sampling conditions and emphasizing the need for detectors that are resilient to diverse generation settings. Experiments on this benchmark showed that AUROC scores dropped drastically under certain configurations, and models trained on a mixture of parameters performed better but still struggled with high temperature settings and repetition penalties. The main conclusion is that current detection systems may overfit to human training data and fail to generalize under varied conditions [25].
- Tommaso Bonomo from the Sapienza University of Rome and colleagues addressed the unreliability of the NarrativeQA benchmark for evaluating long-document narrative question answering (QA) systems. They introduced LiteraryQA, a refined subset of NarrativeQA that focuses exclusively on literary works and undergoes a detailed refinement process to eliminate noisy documents and question-answer pairs. The paper’s main innovation points are the dataset-refinement pipeline and the analysis of various evaluation metrics in the context of narrative QA. The value lies in providing a more robust framework for assessing model performance on narrative QA tasks. Experiments on LiteraryQA showed that traditional n-gram-based metrics correlate poorly with human judgment, whereas METEOR performed better thanks to its stemming and synonym-resolution features. The LLM-as-a-judge paradigm, especially with additional context such as book summaries, correlated most closely with human judgments, indicating its effectiveness in handling narrative nuances. The conclusion is that cleaner datasets and more suitable evaluation metrics can significantly improve the assessment of narrative QA systems [26].
- Hao Wang from Alibaba International Digital Commerce and colleagues tackled the limitations of current preference-learning methods in machine translation (MT), specifically flawed reward signals from Quality Estimation (QE) models and inefficient data utilization. They proposed M2PO: Multi-Pair, Multi-Perspective Preference Optimization, which includes a multi-perspective reward engine and a multi-pair optimization strategy. The reward engine incorporates a hallucination penalty and a dynamic scoring curriculum, while the optimization strategy maximizes data exploitation by constructing preference pairs from the entire pool of translation candidates. The value lies in a more comprehensive and robust MT preference-optimization method that enhances the reliability and quality of translations. Experiments on the WMT21-22 benchmarks demonstrated that M2PO significantly outperformed previous preference models and leading general-purpose LLMs in translation quality and faithfulness, indicating its capability to bridge the gap between open-source and proprietary models [27].
- Kristýna Onderková from Charles University and colleagues focused on the problem of LLMs memorizing common benchmarks and performing inconsistently across domains in table-to-text generation. They introduced FreshTab, a method for generating fresh benchmark datasets from recent Wikidata/Wikipedia entries. FreshTab attaches domain labels and random logical-operation labels to each table, enabling domain-specific evaluation and suggesting the type of insight to generate. The main innovation points are the dynamic generation of up-to-date datasets and the inclusion of domain labels. The value lies in a versatile framework for evaluating models across linguistic and domain boundaries. Experiments on FreshTab datasets revealed that while automatic metrics indicated a performance drop on newer data, human evaluations suggested otherwise, pointing to a potential bias in the metrics. The study concluded that domain-balanced data pose greater challenges than sport-heavy data, and that models tend to produce simpler insights on fresh data, suggesting the need for more nuanced evaluation metrics [28].
- Yizhou Peng and colleagues aimed to resolve style-content mismatch in auto-regressive (AR) Text-to-Speech (TTS) models, which can lead to unnatural emotional expression. They proposed Semantic Mismatch Guided Classifier-Free Guidance (SMG-CFG), an adaptive CFG scheme whose guidance strength adjusts to the semantic mismatch detected by LLMs or NLI models (a minimal guidance step is sketched after this list). The main innovation points include the adaptive adjustment of CFG scales and the use of a random condition instead of an unconditional one. The value lies in offering fine-grained emotional control in TTS systems, enhancing their naturalness and realism. Experiments on the CosyVoice2 model showed that SMG-CFG significantly improved emotion recognition accuracy (ER ACC) while maintaining, or only slightly degrading, other metrics such as word error rate (WER) and mean opinion score (MOS). The ablation study confirmed that the method is robust and generalizable, making it a promising tool for emotional control in TTS models [29].
- Ming Dong from Central China Normal University and colleagues addressed the generation of toxic or harmful content by LLMs, proposing Detoxification with Self-Constrained Decoding (DSCD). DSCD dynamically adjusts the next-token distribution during decoding to improve the safety of generated content without compromising fluency (a contrastive decoding step of this flavor is sketched after this list). The main innovation points are the self-constrained approach and two operational modes (MODE-1 and MODE-2) tailored for precision and efficiency, respectively. The value lies in offering a lightweight, efficient alternative to existing detoxification methods that typically require parameter fine-tuning. Experiments on datasets such as SafeEdit and AlpacaEval showed that DSCD achieves an average improvement of 11.78% in detoxification performance while maintaining generation fluency. MODE-2, while less precise, offered comparable fluency and better efficiency than existing methods such as DINM. The paper concluded that DSCD can effectively prevent the generation of toxic content without harming overall model performance [30].
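The adaptive guidance step in SMG-CFG [29] follows the standard classifier-free-guidance form, with the scale driven by a mismatch score. In the sketch below, `mismatch` in [0, 1] would come from an LLM/NLI judge comparing the target style to the text content; the linear schedule and the constants are assumptions, while the use of a random condition in place of a fully unconditional branch follows the paper's description.

```python
import torch

def adaptive_cfg(cond_logits: torch.Tensor,
                 rand_cond_logits: torch.Tensor,
                 mismatch: float,
                 base_scale: float = 1.5,
                 max_scale: float = 4.0) -> torch.Tensor:
    """CFG step whose scale grows with style-content mismatch.

    guided = rand + s * (cond - rand), with s interpolated between
    `base_scale` (no mismatch) and `max_scale` (full mismatch).
    """
    scale = base_scale + (max_scale - base_scale) * mismatch
    return rand_cond_logits + scale * (cond_logits - rand_cond_logits)
```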
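A decoding-time adjustment loosely in the spirit of DSCD [30] can be written as a contrastive step that down-weights tokens favored by a "toxic-leaning" view of the same model (e.g., obtained with an adversarial prefix). The contrastive form, the source of `toxic_logits`, and `alpha` are illustrative assumptions, not the paper's exact MODE-1/MODE-2 procedures.

```python
import torch

def dscd_step(logits: torch.Tensor, toxic_logits: torch.Tensor,
              alpha: float = 1.0) -> torch.Tensor:
    """Sample one token after penalizing the toxic-leaning distribution."""
    safe = torch.log_softmax(logits, dim=-1)
    toxic = torch.log_softmax(toxic_logits, dim=-1)
    adjusted = safe - alpha * toxic  # Categorical renormalizes the logits
    return torch.distributions.Categorical(logits=adjusted).sample()
```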
Technical Trends
The papers collectively highlight several key trends in language processing and generation:
- Robustness under varied generation settings: Dubois et al.’s study of text detection and Wang et al.’s preference optimization for machine translation both examine how LLM behavior shifts with decoding and training configurations.
- Dataset refinement and freshness: LiteraryQA (Bonomo et al.) and FreshTab (Onderková et al.) keep evaluations relevant and unbiased by cleaning or regenerating benchmark data.
- Adaptive and constrained generation: SMG-CFG (Peng et al.) and DSCD (Dong et al.) improve the naturalness and safety of LLM outputs through adaptive guidance schemes and self-constrained decoding.
- Multi-perspective, dynamic evaluation: several papers adopt evaluation frameworks that capture the complexity of language tasks, enhancing the reliability of LLMs in practical applications.
Datasets and Evaluation Metrics
The papers utilized a variety of datasets and evaluation metrics. Notable datasets include the custom benchmark for detecting machine-written texts [25], the refined LiteraryQA subset of NarrativeQA [26], the WMT21-22 benchmarks for machine translation [27], the FreshTab datasets sourced from recent Wikidata/Wikipedia entries [28], and the SafeEdit and AlpacaEval datasets for LLM detoxification [30]. Evaluation metrics covered a broad spectrum, from traditional measures such as METEOR [26] to more advanced metrics such as COMET22, XCOMET, and Coverage Score for machine translation [27], TAPEX for table-to-text generation [28], and emotion recognition accuracy (ER ACC), word error rate (WER), and mean opinion score (MOS) for TTS models [29]. The inclusion of human evaluations alongside automated metrics in several studies underscores the importance of aligning model performance with human perception and judgment.
Topic 6: Machine Learning and Deep Learning Techniques
Topic Overview
Machine Learning and Deep Learning Techniques have become indispensable in advancing artificial intelligence systems, particularly in areas like natural language processing (NLP) and personalized learning. The focus of this research area is to enhance the interpretability, efficiency, and performance of AI models through innovative methodologies and architectures. By addressing challenges such as opaque reasoning processes, inefficient training methods, and the need for more personalized and adaptive learning systems, these techniques aim to create more reliable and effective AI solutions that can handle complex reasoning tasks and evolving user intents in conversational contexts.
Individual Paper Contributions
- Yang Li from Shanghai Jiao Tong University and colleagues studied the opacity and inefficiency of Large Language Models (LLMs) in their reasoning processes, proposing an approach based on attention dynamics to enable fine-grained policy optimization. The main innovation points are two new metrics, Windowed Average Attention Distance (WAAD) and Future Attention Influence (FAI), which quantify local and global attention patterns of tokens (a simplified WAAD computation is sketched after this list), along with three structure-aware reinforcement learning strategies that dynamically reweight token-level advantages. The value lies in aligning credit assignment with the model’s intrinsic reasoning patterns, a more targeted and effective approach than existing uniform credit distribution methods. Experiments on benchmarks such as Countdown, CrossThink-QA, and mathematical reasoning datasets showed significant performance improvements, with coupled rhythm credit achieving 63.1% on the Countdown puzzle versus 52.6% for the baseline GRPO method [31].
- Simon Lupart from the University of Amsterdam and colleagues addressed the limitations of reinforcement learning (RL)-based reasoning frameworks in multi-turn conversational question answering (CQA). They proposed ChatR1, an RL-based reasoning model that integrates search and reasoning dynamically and introduces an intent-aware reward mechanism to counter sparse and delayed rewards in RL. The main innovation points are the dynamic integration of search and reasoning and the turn-level supervision provided by the intent-aware reward. The value lies in improving the model’s ability to generalize across conversational domains and to refine query formulation and retrieval behavior. In experiments on five diverse CQA datasets, ChatR1-3B and ChatR1-7B achieved substantial gains over baselines, with an average improvement of 2.2 F1 points when the intent-aware reward was included [32].
- Xuxin Cheng from the LongCat Interaction Team and colleagues focused on improving Meituan’s intelligent interaction systems, proposing WOWService, a comprehensive framework with a robust training pipeline spanning Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning (RL). The main innovation points are the hybrid data-knowledge-driven approach and the multi-agent architecture in which agents collaborate to improve service quality. The value lies in enhancing customer satisfaction, reducing operational costs, and supporting the scalability and adaptability of business services. Experiments demonstrated a 75% efficiency improvement in balancing general and domain-specific capabilities during CPT, and a significant improvement in user satisfaction metrics with the SRT framework, including a 27.53% decrease in USM 1 and a 25.51% increase in USM 2. The evaluation framework showed a human-machine agreement rate above 95%, indicating effective and usable answers in complex scenarios [33].
- Jiacheng Guo from Princeton University and colleagues investigated the inefficiency and high cost of collecting human preference data for aligning LLMs with Direct Preference Optimization (DPO). They introduced Preference Variance (PVar), a metric that measures the variability of preference probabilities over response pairs, allowing more informative training examples to be identified (a minimal computation is sketched after this list). The main innovation points include the theoretical justification for offline data selection and the empirical validation of the effectiveness of high-PVar prompts. The value lies in making the alignment process more scalable and cost-effective. Experiments on multiple datasets and models, including UltraFeedback, Chatbot Arena Conversation, HH-RLHF, and WebGPT, showed that models trained on the top 50% of prompts by PVar performed better on the AlpacaEval 2.0 and Arena-Hard benchmarks, converging faster and reaching lower final loss than models trained on lower-PVar or randomly selected prompts [34].
- Joy Jia Yin Lim from Tsinghua University and colleagues tackled personalized learning path planning (PLPP) by integrating reinforcement-based training and an LLM-driven educational architecture into the Pxplore framework. The main innovation points are a goal-driven learner state model that captures cognitive and motivational aspects of learners and an automated reward function that quantifies progress toward educational goals. The value lies in producing adaptive, interpretable learning paths, addressing the need for more sophisticated, dynamic, and goal-aligned personalized learning systems. Experiments showed that Pxplore achieved the highest overall alignment rate of 65.47% with GRPO-based optimization and generated more cohesive, context-aware, and motivating learning experiences, leading to improved test scores and higher user ratings for relevance, personalization, motivation, and satisfaction [35].
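A simplified version of the WAAD metric [31] can be computed directly from an attention matrix: for each query token, average the query-key distance over keys inside a local window, weighted by attention mass. The windowing and normalization below are assumptions; the paper's exact definition may differ.

```python
import torch

def waad(attn: torch.Tensor, window: int = 16) -> torch.Tensor:
    """Windowed Average Attention Distance per query token (sketch).

    `attn` is a [seq, seq] row-stochastic attention matrix. Only causal
    keys within `window` positions of the query are considered.
    """
    seq = attn.size(0)
    q = torch.arange(seq)[:, None]
    k = torch.arange(seq)[None, :]
    dist = (q - k).clamp(min=0).float()
    in_window = (dist <= window) & (k <= q)
    w = attn * in_window
    return (w * dist).sum(-1) / w.sum(-1).clamp(min=1e-9)
```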
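The PVar selection criterion [34] reduces, in sketch form, to scoring each prompt by how much a preference model disagrees with itself across response pairs. The aggregation as a plain variance over unordered pairs is an assumption for illustration; `pref_prob` is a caller-supplied stand-in for the preference model.

```python
import itertools
import torch

def pvar(pref_prob, responses) -> float:
    """Preference Variance for one prompt (sketch of the PVar idea).

    `pref_prob(a, b)` returns the preference model's probability that
    response `a` beats response `b`. High variance across pairs marks
    the prompt as informative for DPO training.
    """
    probs = [pref_prob(a, b) for a, b in itertools.combinations(responses, 2)]
    return float(torch.tensor(probs).var())
```

Ranking prompts by this score and keeping the top half mirrors the paper's top-50%-PVar training setup.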
Technical Trends
The papers collectively highlight the trend towards leveraging reinforcement learning (RL) and attention mechanisms to enhance the performance and interpretability of AI models. They demonstrate advancements in creating more dynamic and context-aware systems, whether for reasoning, conversational understanding, or personalized learning. The integration of human preference data and intent awareness into RL frameworks is another notable trend, aiming to improve the alignment of AI behaviors with human expectations and to make training processes more efficient.
Datasets and Evaluation
The datasets used in these studies range widely, reflecting the diverse application areas:
- Countdown, CrossThink-QA, and Mathematical Reasoning Datasets: Used in Yang Li’s paper to evaluate reasoning strategies.
- TopiOCQA, QReCC, INSCIT, MultiDoc2Dial, FaithDial: Employed in Simon Lupart’s paper for multi-turn conversational question answering.
- Meituan’s proprietary datasets: Utilized in Xuxin Cheng’s paper for evaluating intelligent interaction systems.
- UltraFeedback, Chatbot Arena Conversation, HH-RLHF, WebGPT, AlpacaEval 2.0, Arena-Hard: Involved in Jiacheng Guo’s paper for assessing preference optimization methods.
- Educational datasets: Not explicitly named but used in Joy Jia Yin Lim’s paper for personalized learning path planning.
Evaluation metrics varied according to the application domain:
- Accuracy and Convergence Rates: Used in reasoning and preference optimization papers.
- F1 Scores and Intent Awareness: Applied in conversational question answering.
- User Satisfaction Metrics (USM 1, USM 2), Resolution Rates, Human-Machine Agreement Rates: Employed in the evaluation of intelligent interaction systems.
- Pedagogical Alignment Rate, Reward Computation, Test Score Improvement, User Ratings: Utilized in the assessment of personalized learning systems.
Topic 7: Temporal and Sequential Data Handling
Topic Overview
Temporal and sequential data handling is a critical area in artificial intelligence and machine learning, especially for large language models (LLMs). These models are increasingly being used in a wide range of applications, from natural language understanding to predictive analytics, where the ability to accurately reason about temporal information and sequence dependencies is paramount. However, traditional LLMs often struggle with these tasks due to their static nature and the limitations of current training methodologies. Research in this domain aims to address these challenges by developing frameworks and methods that can enhance the temporal reasoning abilities of LLMs, improve their adaptability to changing environments, and refine their decision-making processes based on sequential data.
Individual Paper Contributions
- Xingyu Tan from the University of New South Wales and colleagues studied the limitations of LLMs in handling time-sensitive or evolving information, proposing MemoTime, a memory-augmented temporal knowledge graph framework, to address temporal reasoning and multi-entity temporal synchronization. The main innovation points of MemoTime are its structured temporal grounding, hierarchical reasoning, dynamic toolkit invocation, and continual memory updating. The value lies in giving LLMs a more reliable and accurate way to handle complex temporal questions, enhancing their performance in applications such as historical research and predictive modeling. Experiments on established datasets like MultiTQ and TimeQuestions showed significant improvements in temporal reasoning accuracy and consistency, with MemoTime achieving 77.9% Hit@1 on MultiTQ and 71.4% accuracy on TimeQuestions, compared to the TempAgent baseline. The conclusion is that MemoTime’s dynamic memory retrieval and hierarchical reasoning strengthen the reasoning capabilities of LLMs, making them more adaptable to evolving contexts [36].
- Donald Shenaj from Samsung R&D Institute UK and colleagues addressed the challenge of managing storage for Low-Rank Adapters (LoRAs) in large language models deployed on mobile and edge devices. They introduced K-Merge and K-Merge++, methods for online continual merging of LoRAs under a fixed storage budget (a toy insertion routine is sketched after this list). The main innovation points are the similarity-based clustering technique and the history-aware merging strategy. The value lies in offering a practical way to keep on-device language models adaptable and performant in resource-constrained environments. Evaluation on benchmarks spanning five problem types and eight languages demonstrated strong performance, with K-Merge++ reaching up to 90% of the performance of single-task LoRAs with just 8 clusters. The paper concluded that the proposed methods are robust, even under worst-case task orderings, and that LoRAs for the same problem type share more similarities than LoRAs for the same language [37].
- Xiaoyu Yan from Northwestern University and colleagues tackled the misalignment in transportation policy making between traveler preferences and conventional model predictions. They proposed a multi-agent voting framework that simulates collective decision-making with LLMs, combining traditional utility-based travel demand modeling with AI-driven deliberation. The innovation lies in using LLMs to assess how voting mechanisms and contextual information shape policy preferences. The value is in enhancing the social viability and effectiveness of transportation policies, ensuring they are both analytically sound and socially acceptable. Experiments using sentiment analysis and varying voting rules and agent types revealed that LLM agents generally select policies close to the Pareto frontier, with preference differences driven by the type of LLM and the urban context. The study concluded that the LLM-based framework can effectively simulate policy preferences, though its alignment with benchmarks varies by context [38].
- Anej Svete from ETH Zürich and colleagues explored the reasoning capabilities of Masked Diffusion Models (MDMs) relative to traditional autoregressive language models. They proposed a formal framework that characterizes the reasoning abilities of MDMs, connecting them with paradigms such as Chain of Thought (CoT), Looping, and Pause Tokens. The main innovation points include establishing the equivalence of MDMs to Padded Looped Transformers (PLTs) and showing that MDMs with logarithmically growing numbers of denoising steps can recognize regular languages. The value lies in providing a rigorous theoretical basis for understanding MDMs, which could help exploit their potential for parallel computation and iterative refinement. The theoretical analysis highlighted the inherent inefficiency of CoT on parallelizable problems and the critical role of positional encodings and unmasked attention patterns in MDMs’ processing capabilities [39].
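The budgeted merging step of K-Merge [37] can be pictured as follows: while slots remain, store each incoming adapter as its own cluster; once the budget is full, fold the new adapter into the most similar cluster with a count-weighted average, which keeps the merge history-aware. Flattening adapters to single parameter vectors and using cosine similarity are simplifying assumptions for illustration.

```python
import torch

def kmerge_insert(clusters: list, new_lora: torch.Tensor, budget: int) -> list:
    """Online LoRA merging under a fixed storage budget (toy sketch).

    `clusters` is a list of (merged_params, count) pairs; `new_lora` is a
    flat parameter vector. Weighting by `count` makes the running average
    history-aware, so early adapters are not drowned out.
    """
    if len(clusters) < budget:
        clusters.append((new_lora.clone(), 1))
        return clusters
    sims = torch.stack([
        torch.nn.functional.cosine_similarity(new_lora, merged, dim=0)
        for merged, _ in clusters])
    i = int(sims.argmax())
    merged, n = clusters[i]
    clusters[i] = ((merged * n + new_lora) / (n + 1), n + 1)
    return clusters
```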
Technical Trends
The papers reviewed here reflect a trend towards developing more sophisticated and adaptive methods for handling temporal and sequential data within the realm of large language models. There is a notable shift towards integrating external memory systems and knowledge graphs to enhance temporal reasoning and maintain consistency over time. Another trend is the exploration of efficient storage management techniques for adapting models to limited-resource environments, emphasizing the importance of dynamic merging strategies and cluster-based approaches. Lastly, there is a focus on applying LLMs to real-world decision-making processes, such as policy formulation, to better align with human preferences and societal norms. The theoretical underpinnings of reasoning capabilities in diffusion models also indicate a growing interest in understanding and optimizing the computational advantages of these models.
Datasets and Evaluation
- MemoTime: Evaluated on MultiTQ and TimeQuestions datasets, focusing on temporal reasoning accuracy and consistency.
- K-Merge/K-Merge++: Tested on benchmarks covering five problem types across eight languages, including Persona-Chat Synthetic, SAMSum, Sound Natural, SQuAD, and Write & Improve, evaluating model performance and storage efficiency.
- Addressing the alignment problem in transportation policy making: Utilized sentiment analysis and regression analysis on synthetic urban contexts to assess the alignment of LLM-generated policy preferences with benchmarks.
- On the Reasoning Abilities of Masked Diffusion Language Models: No datasets were used; the contribution is purely theoretical, with no empirical evaluation.
These evaluations collectively aim to measure the effectiveness of the proposed methods in improving temporal reasoning, adaptability, and decision-making capabilities of LLMs, with each paper adopting tailored metrics to assess their unique contributions.
Topic 8: Benchmarking and Evaluation
Topic Overview
Benchmarking and evaluation are essential components in the development and deployment of artificial intelligence (AI) systems, particularly large language models (LLMs). These processes help researchers and developers understand the strengths and weaknesses of AI models across various domains and tasks, enabling targeted improvements and ensuring the models meet practical needs. With the increasing complexity and applicability of LLMs, there is a growing necessity for benchmarks and evaluation frameworks that can accurately gauge performance in specialized contexts, such as consumer intent understanding and advanced mathematical reasoning, and also in diverse linguistic environments like those involving the Arabic language.
Individual Paper Contributions
-
Xiaozhe Li from Tongji University and colleagues studied the evaluation of LLMs on their ability to understand real-world consumer intent within multifaceted and dynamic discussions, proposing CONSINT-Bench to solve this core problem. The main innovation points of this method are the inclusion of a large-scale, dynamic dataset spanning nine consumer domains and the introduction of three evaluation mechanisms—CONSINT-Tree, CONSINT-RAG, and informativeness evaluation through lexical diversity and semantic richness. The value lies in providing a comprehensive benchmark that simulates the complexity and depth of real-world human discourse, especially in consumer domains, which previous benchmarks have not fully addressed. Experiments on over 200k product-level discussions showed that proprietary models lead in overall depth and breadth scores, while open-source models lag in informativeness, concluding that the balance between noise reduction and retaining semantic richness in real-world data processing is critical 40.
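Informativeness measures of this kind can be grounded in simple corpus statistics. The sketch below computes a type-token ratio and a distinct-n score as crude stand-ins for lexical diversity and semantic richness; the benchmark's actual measures are presumably richer.

```python
def informativeness(text, n=2):
    """Crude informativeness proxies: type-token ratio (lexical diversity)
    and distinct-n (phrasal variety) over whitespace tokens."""
    tokens = text.lower().split()
    ttr = len(set(tokens)) / len(tokens)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    distinct_n = len(set(ngrams)) / max(len(ngrams), 1)
    return {"type_token_ratio": round(ttr, 3), f"distinct_{n}": round(distinct_n, 3)}

print(informativeness("the battery life is great but the battery drains fast"))
```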
-
Shrey Pandit from Salesforce AI Research and colleagues tackled the insufficiency of current step-level verifiers in assessing the mathematical reasoning of LLMs on open-ended, frontier-level problems, proposing Hard2Verify as a solution. The benchmark focuses on the frontier of LLM-based mathematical reasoning and includes 1,860 annotated steps across 200 unique model responses to challenging math problems from competitions like IMO, Putnam, and INMO. The main innovation lies in the strict annotation process and the emphasis on evaluating open-ended questions, which highlights the limitations of current verification technologies. Experiments revealed that GPT-5 demonstrates the highest performance in verifying step-level correctness and identifying errors, whereas weaker verifiers mark almost every step as correct, which inflates TPR at the cost of a sharp drop in TNR 41
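The reported TPR/TNR behavior is easy to make concrete: a lenient verifier that accepts every step attains perfect TPR but zero TNR, so TNR is the discriminating quantity. A minimal sketch:

```python
def step_verifier_rates(gold, predicted):
    """TPR and TNR for step-level verification; `gold` and `predicted`
    are per-step booleans (True = step is / is judged correct)."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    tn = sum(not g and not p for g, p in zip(gold, predicted))
    tpr = tp / max(sum(gold), 1)                 # correct steps accepted
    tnr = tn / max(sum(not g for g in gold), 1)  # erroneous steps caught
    return tpr, tnr

# A verifier that marks every step correct: perfect TPR, zero TNR.
gold = [True, True, False, True, False]
print(step_verifier_rates(gold, [True] * 5))  # -> (1.0, 0.0)
```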
-
Yufei He from National University of Singapore and colleagues focused on the limitation of current AI agents to learn complex skills on the fly at test time, proposing EvoTest, an evolutionary test-time learning framework, to solve this core problem. The main innovation points are the introduction of the Jericho Test-Time Learning (J-TTL) benchmark and the EvoTest framework itself, which evolves the entire agentic system through transcript-level analysis without relying on gradient updates or fine-tuning. The value lies in enhancing the adaptability and reliability of AI agents in dynamic environments, allowing them to improve their strategies based on immediate experiences. Experiments on the J-TTL benchmark across six Jericho games demonstrated a 38% improvement over the strongest prompt-evolution baseline and a 57% improvement over online RL, concluding that prompt evolution is the primary driver of strategic adaptation and that a more capable model can enhance the learning process 42.
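The gradient-free selection loop underlying test-time evolution reduces to propose, evaluate, keep-if-better. The toy below evolves a single numeric agent parameter against a stand-in episode score; EvoTest itself evolves the whole agentic configuration (prompts, memory, tool use) from transcript-level analysis, so this is only the skeleton.

```python
import random

random.seed(0)

def play_episode(config):
    """Stand-in for one Jericho episode with a given agent config;
    here a noisy toy fitness peaking at exploration = 0.7."""
    return -abs(config["exploration"] - 0.7) + random.gauss(0, 0.05)

def mutate(config):
    step = random.gauss(0, 0.1)
    return {"exploration": min(1.0, max(0.0, config["exploration"] + step))}

# Evolution without gradient updates: adopt a mutated config only if it
# scores higher than the best configuration seen so far.
best = {"exploration": 0.2}
best_score = play_episode(best)
for _ in range(30):
    candidate = mutate(best)
    score = play_episode(candidate)
    if score > best_score:
        best, best_score = candidate, score

print(best, round(best_score, 3))
```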
-
Ahmed Alzubaidi from Technology Innovation Institute, Abu Dhabi, UAE and colleagues surveyed the evaluation of Arabic LLMs, addressing the unique challenges associated with the Arabic language, such as data scarcity and the presence of multiple dialects. The paper proposes a taxonomy that categorizes 40+ benchmarks into four major categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. The main innovation is the structured overview of the current landscape of Arabic LLM benchmarks, which was previously lacking. The value lies in identifying gaps in the evaluation process, such as limited temporal evaluation and insufficient multi-turn dialogue assessment, and offering insights into the effectiveness and limitations of different benchmark creation methods 43.
Technical Trends
The papers in this collection showcase a trend towards developing domain-specific benchmarks and evaluation frameworks to address the shortcomings of generic benchmarks. There is a clear shift towards incorporating real-world data and complex problem-solving scenarios to more accurately reflect the challenges faced by LLMs in practical applications. Additionally, the use of evolutionary algorithms and transcript-level analysis in test-time learning represents a novel approach to improving AI adaptability and self-improvement. The emphasis on human validation and iterative refinement, especially in the context of culturally diverse languages like Arabic, underscores the importance of aligning AI capabilities with societal and cultural contexts.
Datasets and Evaluation Metrics
-
CONSINT-Bench: Over 200k product-level discussions from real-world user interactions, evaluated using CONSINT-Tree, CONSINT-RAG, and informativeness measures based on lexical diversity and semantic richness.
-
Hard2Verify: 1,860 annotated steps across 200 unique model responses to challenging math problems from competitions like IMO, Putnam, and INMO, evaluated using step-level and response-level correctness measures, and first error identification.
-
Jericho Test-Time Learning (J-TTL): Six text-based adventure games (Detective, Library, Zork1, Zork3, Balances, and Temple), evaluated through improvements in game completion rates and strategic adaptation.
-
Arabic LLM Benchmarks: Multiple benchmarks categorized into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations, assessed using methodologies like native collection, translation, and synthetic generation, with a focus on identifying gaps in temporal evaluation and multi-turn dialogue assessment.
Topic 9: Security and Privacy in AI
Topic Overview
Security and privacy in AI are critical concerns as the technology becomes increasingly integrated into various aspects of daily life and industry. Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) are pivotal in natural language processing and multimodal reasoning, yet they present unique challenges related to data protection, model robustness, and cultural alignment. Ensuring these models are trustworthy and secure is essential for their ethical deployment and for maintaining user confidence in AI technologies.
Individual Paper Contributions
-
Yuan Feng from University of Science and Technology of China and colleagues studied the fragility of Key-Value (KV) cache eviction strategies in LLM inference, proposing DefensiveKV and Layer-DefensiveKV methods to address the issue of abrupt shifts in importance scores leading to poor generation quality. The main innovation points of this method are the introduction of defensive aggregation techniques focused on worst-case risk management. The value lies in enhancing the robustness and efficiency of LLMs, particularly under reduced cache sizes, which is crucial for optimizing resource usage in large-scale deployments. Experiments on LongBench and Needle-in-a-Haystack benchmarks demonstrated significant improvements in generation quality, with reductions in quality loss from 10.6% to 2.3% for Layer-DefensiveKV compared to baselines like CriticalKV and SnapKV44.
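Why worst-case aggregation matters is easy to demonstrate: a cached token that only one attention head depends on can look unimportant on average. The hand-made scores below contrast mean aggregation with a defensive max-over-heads aggregate; DefensiveKV's actual aggregation rule may differ in detail.

```python
import numpy as np

# Per-head importance for 5 cached tokens across 4 attention heads.
scores = np.array([
    [0.9, 0.8, 0.05, 0.7, 0.6],
    [0.8, 0.7, 0.05, 0.6, 0.5],
    [0.7, 0.9, 0.95, 0.5, 0.4],  # one head relies heavily on token 2
    [0.6, 0.5, 0.05, 0.4, 0.3],
])

def kept_tokens(scores, keep, aggregate):
    """Keep the `keep` tokens with the highest aggregated importance."""
    return set(np.argsort(aggregate(scores, axis=0))[-keep:].tolist())

# Mean aggregation evicts token 2 (mean 0.275) despite its worst-case value;
# a defensive max-over-heads aggregate keeps any token some head depends on.
print(kept_tokens(scores, keep=4, aggregate=np.mean))  # {0, 1, 3, 4}
print(kept_tokens(scores, keep=4, aggregate=np.max))   # {0, 1, 2, 3}
```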
-
Pardis Sadat Zahraei from University of Illinois Urbana-Champaign and colleagues addressed the cultural misalignment and multilingual biases in LLMs concerning the Middle East and North Africa (MENA) region. They proposed the MENAValues Benchmark, a structured dataset of 864 questions derived from large-scale human surveys, aimed at diagnosing cultural alignment issues. The benchmark evaluates models across different language modes and prompting perspectives, employing metrics like the Normalized Value Alignment Score (NVAS) and Consistency Metrics Framework. The value lies in providing a comprehensive method to diagnose and rectify cultural biases, enhancing the inclusivity and trustworthiness of AI applications in diverse sociocultural contexts. Through extensive evaluations, the paper reveals that models struggle with alignment and reasoning-induced degradation, emphasizing the need for more culturally sensitive training and evaluation frameworks45.
-
Ruoyu Sun from University of Alberta and colleagues introduced TrustVis, an automated evaluation framework designed to assess the trustworthiness of LLMs by integrating safety and robustness evaluations. The framework uses an ensemble of safeguard models and an adversarial perturbation generation method (AutoDAN) to detect harmful content and vulnerabilities. The main innovation points include the integration of safety and robustness into a unified assessment tool, supporting custom dataset uploads, and providing an interactive visualization interface. The value lies in offering a transparent, accessible, and comprehensive approach to trustworthiness evaluation, addressing the fragmented nature of existing methodologies. Preliminary evaluations on models like Vicuna-7b and GPT-3.5 revealed significant improvements in detection accuracy and highlighted specific areas of weakness, suggesting that TrustVis can serve as a powerful diagnostic tool for LLM vulnerabilities46.
-
Hamdan Al-Ali and colleagues investigated the leakage of personal attributes from federated learning-based Automatic Speech Recognition (ASR) models. They proposed a non-parametric white-box attack method that infers personal attributes from weight differentials in federated ASR models, revealing vulnerabilities in protecting sensitive information such as gender, age, and accent. The main innovation points involve the use of shadow models and summary statistics to form feature vectors for attribute classification. The value lies in deepening the understanding of federated learning security and the risks associated with attribute leakage, which is crucial for privacy-preserving applications. Experiments on datasets like the Speech Accent Archive and TORGO indicated reliable inference of accents and varying susceptibility of other attributes, underscoring the importance of robust security measures in federated ASR systems47.
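Mechanically, the attack fits a classifier on summary statistics of client weight updates, with shadow models supplying labeled training data. The sketch below fabricates synthetic weight deltas whose statistics correlate with a binary attribute, so it demonstrates the pipeline rather than an actual leak.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def summary_features(delta):
    """Summary statistics over a weight update: the attack's feature vector."""
    return np.array([delta.mean(), delta.std(), np.abs(delta).max(),
                     np.percentile(delta, 25), np.percentile(delta, 75)])

def shadow_delta(attr):
    """Shadow-model weight delta whose scale correlates with the attribute."""
    return rng.normal(loc=0.02 * attr, scale=1.0 + 0.3 * attr, size=4096)

# 400 labeled shadow updates; train the attribute classifier on 300,
# evaluate the attack on the held-out 100.
X = np.stack([summary_features(shadow_delta(a)) for a in [0, 1] * 200])
y = np.array([0, 1] * 200)

clf = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
print("attack accuracy on held-out shadows:", clf.score(X[300:], y[300:]))
```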
-
Karthik Avinash from FutureAGI Inc. and colleagues developed Protect, a multi-modal guardrailing framework for enterprise LLM systems, addressing the need for real-time oversight, multi-modal data handling, and explainability. The framework uses specialized adapters trained via Low-Rank Adaptation (LoRA) on a multi-modal dataset to detect harmful content, bias, and adversarial attacks. The main innovation points are the inclusion of a teacher-assisted annotation pipeline and support for text, image, and audio processing. The value lies in providing a versatile and effective guardrailing system suitable for diverse enterprise applications, enhancing trust and compliance. Experiments showed state-of-the-art performance in detecting prompt injection and privacy violations, with notable improvements in explainability for complex image-based contexts, suggesting the framework’s utility in mission-critical settings48.
-
Juan Ren from Macquarie University and colleagues focused on the vulnerability of LVLMs to adversarial inputs that disguise harmful intentions. They proposed SHIELD, a lightweight, classifier-guided prompting framework designed to mitigate jailbreak and non-following rates. The main innovation points are the introduction of a fine-grained taxonomy of harmful content and explicit safety policy actions. The value lies in enabling a more nuanced approach to moderating LVLM inputs without requiring model retraining, making it cost-effective and adaptable. Experiments across five benchmark datasets and LVLMs showed significant reductions in jailbreak and non-following rates, particularly for models lacking explicit safety alignment, demonstrating the effectiveness of SHIELD in improving model safety49.
Technical Trends
The papers collectively highlight a trend towards developing more sophisticated and nuanced methods for evaluating and enhancing the security and privacy of AI models. Innovations include:
- Defensive Aggregation Strategies: Methods like DefensiveKV aim to improve worst-case risk management in LLMs.
- Cultural and Multilingual Benchmarks: Tools like the MENAValues Benchmark address cultural misalignment and bias in LLMs.
- Unified Evaluation Frameworks: Frameworks such as TrustVis integrate safety and robustness evaluations into a cohesive tool.
- Attribute Inference Attacks: Studies like Personal Attribute Leakage in Federated Speech Models explore the risks associated with federated learning and propose mitigation techniques.
- Multi-Modal Guardrails: Solutions like Protect are being developed to handle multi-modal data and ensure model compliance across different types of input.
- Classifier-Guided Prompting: Techniques like SHIELD are being introduced to guide and moderate model behavior through enhanced prompting mechanisms.
Datasets and Evaluation Metrics
- LongBench and Needle-in-a-Haystack: Used to evaluate KV cache eviction strategies.
- MENAValues Benchmark: Structured dataset for assessing cultural alignment and multilingual biases.
- Vicuna-7b, Llama2-7b, and GPT-3.5: Models used for testing the TrustVis framework.
- Speech Accent Archive (SAA), TORGO, and RAVDESS: Datasets utilized to study personal attribute leakage in federated ASR models.
- Custom Multi-Modal Dataset: Used for training adapters in the Protect framework.
- Five Benchmark Datasets: Employed to test the effectiveness of SHIELD in mitigating adversarial attacks on LVLMs.
These datasets and metrics collectively aim to provide a robust and comprehensive evaluation of AI models’ security and privacy features, contributing to the advancement of safer and more trustworthy AI technologies.
Topic 10: Cross-Linguistic and Cultural Studies
Topic Overview
Cross-linguistic and cultural studies focus on understanding linguistic phenomena across different languages and cultures, aiming to enhance the accessibility and effectiveness of AI technologies in linguistically diverse regions. This research area is vital for addressing digital divides and improving the global applicability of AI tools, particularly in regions where low-resource languages dominate. These studies contribute to the broader field of computational linguistics by providing methodologies and frameworks that can adapt existing technologies to work better with less-studied languages, thereby enriching the overall linguistic landscape and fostering inclusivity in AI.
Individual Paper Contributions
-
Daniil Gurgurov from Saarland University and colleagues studied the uneven performance of large language models (LLMs) across different languages, especially the gap between high-resource and underrepresented languages. They proposed a framework that identifies language-specific neurons using Language Activation Probability Entropy (LAPE) and fine-tunes these neurons’ weights to improve the model’s performance in underrepresented languages. The main innovation points of this method are the parameter-efficient adaptation and the ability to update only up to 1% of the model parameters, which allows for notable improvements in target-language performance without sacrificing general-purpose capabilities. The value lies in offering a cost-effective pathway for adapting state-of-the-art models to underrepresented languages, thus enhancing their accessibility and utility. Experiments on Llama-3.1-8B and Mistral-Nemo-12B across 12 mid- and low-resource languages demonstrated that targeted sparse fine-tuning consistently outperformed other baselines, achieving average gains of up to 5 points on datasets like FLORES, while maintaining or slightly improving performance on general benchmarks such as MMLU and Winogrande50.
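The selection criterion can be sketched as an entropy over per-language activation probabilities: neurons that fire almost exclusively for one language have low entropy and become the targets for sparse fine-tuning. The probabilities below are hand-made, and the paper's exact LAPE formulation may differ in detail.

```python
import numpy as np

# Activation probabilities of 6 neurons (rows) across 4 languages (columns).
act_prob = np.array([
    [0.90, 0.02, 0.03, 0.05],  # fires almost only for language 0
    [0.25, 0.25, 0.25, 0.25],  # fires uniformly -> language-agnostic
    [0.05, 0.85, 0.05, 0.05],
    [0.30, 0.20, 0.30, 0.20],
    [0.02, 0.03, 0.90, 0.05],
    [0.24, 0.26, 0.25, 0.25],
])

def lape(p, eps=1e-12):
    """Entropy of the normalized per-language activation distribution:
    low entropy = language-specific neuron."""
    q = p / p.sum(axis=1, keepdims=True)
    return -(q * np.log(q + eps)).sum(axis=1)

print(np.argsort(lape(act_prob))[:3])  # the three most language-specific neurons
```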
-
Kim Gfeller and colleagues investigated the dynamics of lexical change and colexification patterns across different languages and over time. They introduced a distributional phylogenetic modeling approach that combines a continuous-time Markov process with regression models to infer change rates and long-term preferences for colexification. The main innovation is linking meaning change and colexification patterns, allowing for a deeper understanding of semantic evolution across large time spans and multiple languages. The value of this method lies in advancing theories of language evolution and developing more accurate models of semantic change. The study focused on three language families—Austronesian, Indo-European, and Uralic—and validated the method through simulation-based tests. The experiments revealed that associativity positively influences the stability of colexifications, while frequency negatively impacts their stationary probability but positively affects change speed. Borrowability had a significant negative effect on stationary probability for Indo-European but not for Austronesian or Uralic. The full model outperformed restricted and null models in Leave-One-Out Cross-Validation, demonstrating higher expected log pointwise predictive density (ELPD)51
-
Prawaal Sharma from Infosys and colleagues addressed the scarcity of digital resources for low-resource languages (LRLs), specifically focusing on creating parallel corpora for NLP tasks. They proposed a fully automated and scalable methodology for generating bilingual parallel datasets using image and text analytics. The innovation lies in leveraging newspaper articles where images are reused across different language versions to align articles and map sentences between languages. The value of this method is in its ability to create large-scale, high-quality parallel datasets for LRLs without human annotations, significantly improving NLP applications for these languages. The data augmentation pipeline includes components such as Crawler, Article Extractor, Article Mapper, and Sentence Mapper, using three types of sentence similarity metrics: Language Agnostic Sentence Embedding (LAS), Simple Length-based Heuristics (SLAS), and Lexical Overlap (LO). Experiments on the Konkani-Marathi corpus showed that LAS provided the best results for sentence mapping, achieving an average STS score of 3.7. Machine translation evaluations demonstrated an improvement in BLEU scores by approximately 3 points when using the newly generated parallel corpus, moving from a baseline of 23.5 to 26.4, indicating the effectiveness of the proposed method in creating valuable resources for LRLs52.
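Two of the three similarity signals are simple enough to sketch directly: lexical overlap (LO) and a length-based heuristic in the spirit of SLAS; the embedding-based LAS score would require a multilingual encoder and is omitted. The sentences are English stand-ins for Konkani-Marathi pairs, and the 50/50 weighting is an arbitrary choice.

```python
def lexical_overlap(src, tgt):
    """LO-style score: Jaccard overlap of token sets, informative for
    closely related languages that share vocabulary."""
    a, b = set(src.lower().split()), set(tgt.lower().split())
    return len(a & b) / len(a | b)

def length_heuristic(src, tgt):
    """SLAS-style score: translations of a sentence tend to have
    comparable lengths, so penalize large character-length ratios."""
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

def map_sentences(src_sents, tgt_sents, w_lo=0.5, w_len=0.5):
    """Greedy 1-best alignment combining both similarity signals."""
    return [max(tgt_sents, key=lambda t: w_lo * lexical_overlap(s, t)
                + w_len * length_heuristic(s, t)) for s in src_sents]

src = ["the minister visited goa today", "heavy rain is expected tomorrow"]
tgt = ["heavy rain expected tomorrow", "minister visited goa today"]
print(map_sentences(src, tgt))  # each source maps to its counterpart
```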
Technical Trends
The papers under this topic exhibit a trend towards leveraging innovative methods to address the challenges faced by underrepresented languages in AI and computational linguistics. Techniques such as sparse subnetwork enhancement, distributional phylogenetic modeling, and fully automated data augmentation represent advancements in adapting and expanding the capabilities of AI models to support a wider range of languages. These methods aim to reduce the resource burden typically associated with model adaptation, either through parameter-efficient fine-tuning or by automating the creation of necessary datasets.
Datasets and Evaluation
- Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models: Used Llama-3.1-8B and Mistral-Nemo-12B models across 12 mid- and low-resource languages, evaluated on datasets like FLORES, MMLU, and Winogrande.
- Investigating Lexical Change through Cross-Linguistic Colexification Patterns: Focused on three language families—Austronesian, Indo-European, and Uralic—using simulation-based tests and comparing against a negative binomial regression model.
- A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics: Created a new Konkani-Marathi parallel corpus, evaluated using Sentence Translation Similarity (STS) scores and BLEU scores in machine translation tasks.
Topic 11: misc
Topic Overview
The research topic encompasses a range of studies focused on enhancing the capabilities of large language models (LLMs) and vision-language models (VLMs) in various dimensions. These studies aim to improve the models’ performance in tasks such as symbol grounding, dialogue consistency, authorship attribution, anomaly detection, speech-to-speech translation, continual learning, uncertainty quantification, and multimodal audio generation. Each paper contributes unique insights and methodologies that push the boundaries of what these models can achieve, particularly in terms of understanding and interacting with the physical world, maintaining logical consistency, and generating high-quality outputs while preserving specific attributes like emphasis and coherence.
Individual Paper Contributions
-
Shuyu Wu from University of Michigan and colleagues studied the emergent capability of symbol grounding in large-scale autoregressive LMs and VLMs, proposing a novel framework that uses annotations from the CHILDES corpora to construct a minimal testbed. The main innovation points involve representing words as both environmental and linguistic tokens, ensuring that any learned correspondence must be derived from training. The value lies in providing a deeper understanding of how abstract symbols acquire meaning by connecting to real-world experiences. Experiments on Child-Directed Speech, Caption-Grounded Dialogue, and Image-Grounded Dialogue datasets showed that models like Transformers and Mamba-2 outperform LSTMs in grounding information gain and leveraging environmental context for linguistic prediction, concluding that certain architectural features are necessary for the emergence of grounding.53
-
Xiang Lei from Apple and colleagues addressed the challenge of maintaining logical and factual consistency in extended, multi-turn dialogues using LLMs. They introduced D-SMART, a model-agnostic framework incorporating a Dynamic Structured Memory (DSM) and a Reasoning Tree (RT). The main innovation is the construction of an OWL-compliant knowledge graph and the ability to perform multi-step reasoning. The value lies in enhancing dialogue consistency in conversational AI applications. Experiments on the MT-Bench-101 benchmark demonstrated that D-SMART significantly outperforms state-of-the-art baselines, improving dialogue consistency scores by over 48%, and highlighting its ability to maintain long-term logical coherence.54
-
Ye Yuan from McGill University and colleagues tackled the computational inefficiency and high memory cost of using dense embeddings from large language models for text anomaly detection. They introduced the Simplified Isolation Kernel (SIK), a method that reduces embedding dimensionality to lower-dimensional sparse representations, achieving linear time complexity and significantly reduced space complexity. The value lies in making anomaly detection more scalable and practical. Empirical evaluations across seven benchmark datasets showed that SIK outperforms eleven state-of-the-art anomaly detection algorithms, demonstrating superior performance and computational efficiency.55
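An isolation-kernel-style feature map is compact enough to sketch: repeatedly partition the data with randomly sampled anchor points and one-hot encode each point's nearest anchor, yielding sparse binary vectors in which anomalies land in sparsely populated cells. This is a generic construction under that idea, not necessarily SIK's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ik_features(X, t=50, psi=8):
    """For each of t random Voronoi partitions (psi anchors sampled from
    the data), one-hot encode each point's nearest anchor; the result is
    a sparse t*psi-dimensional binary representation."""
    n = len(X)
    feats = np.zeros((n, t * psi))
    for i in range(t):
        anchors = X[rng.choice(n, size=psi, replace=False)]
        d = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
        feats[np.arange(n), i * psi + d.argmin(axis=1)] = 1.0
    return feats

# Dense "embeddings": inliers around the origin plus five far-away outliers.
X = np.vstack([rng.normal(0, 1, (200, 16)), rng.normal(6, 1, (5, 16))])
F = ik_features(X)
# Anomaly score: low average cell occupancy, i.e. low similarity to the data.
scores = F @ F.mean(axis=0)
print(np.argsort(scores)[:5])  # indices 200-204 should rank most anomalous
```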
-
Pavan Kalyan from Microsoft Research and colleagues focused on continual learning (CL) in language models, proposing CurLL, a comprehensive developmental framework grounded in human developmental trajectories. The main innovation is the detailed skill graph that models knowledge dependencies and skill progression. The value lies in enabling a more nuanced analysis of skill acquisition and forgetting in LMs. Preliminary experiments revealed that models trained sequentially show performance degradation on previously learned skills, indicating a form of catastrophic forgetting, and highlighted the importance of fine-grained evaluation in CL.56
-
Xinchen Zhang from Tsinghua University and colleagues addressed the poor performance of VLMs and unified multimodal models (UMMs) in verifying visual outcomes. They introduced OmniVerifier-7B, the first generative verifier for universal visual verification, along with OmniVerifier-TTS, a sequential test-time scaling paradigm. The value lies in enhancing the reliability of multimodal reasoning and generation tasks. Experiments on ViVerBench showed an 8.3% overall improvement over the base model, and OmniVerifier-TTS outperformed parallel test-time scaling methods on T2I-ReasonBench and GenEval++.57
-
Buwei He and colleagues explored the limitations of LLMs in performing strategic persuasion tasks without pre-commitment to a signaling schema. They introduced a framework for grounding Bayesian Persuasion (BP) in real-world dialogues, proposing SFNL and FNL verbalization approaches. The value lies in improving the persuasiveness of AI-driven communications. Experiments showed that BP strategies significantly outperform non-Bayesian persuasion methods, with FNL achieving notably higher success rates and better human perception.58
-
Arthur Vogels and colleagues aimed to balance control and coherence in LLM generation through In-Distribution Steering (IDS), a novel activation steering method. The main innovation is the dynamic adjustment of steering strength based on the input data distribution. The value lies in ensuring the reliability, safety, and controllability of LLMs. IDS demonstrated superior Steering Performance Impact (SPI) across various models and datasets, outperforming MERA and CAA in steering effectiveness while maintaining text plausibility.59
-
Xi Chen and colleagues worked on preserving word-level emphasis in speech-to-speech translation (S2ST). They developed EmphST-Instruct, an automated pipeline for generating emphasis-aligned parallel corpora, and EmphST-Bench, a benchmark for evaluating emphasis preservation. The value lies in enhancing the expressive power of S2ST systems. Experiments showed that the StressTransfer system outperformed other methods in preserving expressive stress, achieving a Sentence Stress Reasoning Accuracy (SSR) of 78.0%.60
-
Junichiro Niimi from Meijo University and RIKEN AIP focused on the variability and sensitivity of one-shot LLM predictions in sentiment analysis. The paper proposed a method to select representative examples for LLM ensembles using centroid-based selection and K-Means clustering, along with controlled diversity through sampling temperature. The value lies in improving the robustness and reliability of LLM-based sentiment analysis. Experiments using the Yelp Open Dataset showed that the combination of representative example selection and high sampling temperature significantly improved ensemble performance.61
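Centroid-based representative selection is a standard construction: cluster the embeddings of the candidate examples, then pick the real example nearest each centroid as a one-shot demonstration for one ensemble member. The sketch below substitutes random vectors for sentence embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
# Stand-ins for sentence embeddings of 500 labeled candidate reviews.
embeddings = rng.normal(size=(500, 64))

K = 8  # one representative demonstration per cluster / ensemble member
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(embeddings)

# For each cluster, select the real example closest to its centroid.
rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
print(sorted(rep_idx.tolist()))  # indices of the representative examples
```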
-
Mingda Li from Harbin Institute of Technology and colleagues addressed the quantification of epistemic uncertainty in LLMs to enhance their reliability and applicability. They introduced ESI, a grey-box uncertainty quantification method that applies semantic-preserving interventions to prompts. The value lies in avoiding the generation of untruthful content. Experiments on CoQA, SciQ, TriviaQA, AmbigQA, and TruthfulQA datasets revealed that ESI outperforms state-of-the-art methods in computational efficiency and effectiveness.62
-
Liesbeth Allein and colleagues investigated the discovery of implicit causal chains in climate discourse. They proposed a zero-shot causal chain generation approach and a diagnostic evaluation framework. The value lies in assessing and refining causal explanations in argumentative settings. Experiments suggested that while LLMs can generate self-consistent causal chains, their reasoning often relies on associative pattern matching rather than genuine causal reasoning.63
-
Zhenyu Liu from Harbin Institute of Technology and colleagues worked on developing a unified model for speech and music generation. They introduced UniMoE-Audio, a model based on a Dynamic-Capacity MoE framework, with innovations including dynamic-capacity routing and hybrid expert design. The value lies in overcoming task conflicts and data imbalances. Experiments on SeedTTS-EN and T2M/V2M tasks showed that UniMoE-Audio achieves state-of-the-art perceptual quality and aesthetic scores, demonstrating better performance than baselines across speech and music domains.64
Technical Trends
The papers collectively demonstrate a trend towards developing innovative frameworks and methods to enhance the robustness, reliability, and contextual understanding of LLMs and VLMs. Key methodologies include:
- Dynamic Adjustment and Contextual Understanding: Techniques such as IDS and D-SMART adjust model behavior dynamically based on input or context, enhancing control and consistency.
- Multimodal Integration and Verification: Approaches like OmniVerifier and StressTransfer focus on integrating and verifying visual and auditory information, respectively, to improve the reliability and expressiveness of multimodal models.
- Continual Learning and Data Efficiency: CurLL and UniMoE-Audio address the challenges of continual learning and data imbalance, proposing frameworks that adaptively manage resource allocation and knowledge integration.
- Uncertainty Quantification and Bias Mitigation: ESI and Stable LLM Ensemble tackle the issue of model uncertainty and variability in predictions, proposing methods that improve model reliability and robustness.
Datasets and Evaluation
- CHILDES Corpora: Used for exploring symbol grounding in language models.
- MT-Bench-101 Benchmark: Utilized for evaluating dialogue consistency.
- ViVerBench: A comprehensive benchmark for visual verification tasks.
- EmphST-Bench: Designed for evaluating emphasis preservation in speech translation.
- Yelp Open Dataset: Employed for sentiment analysis and ensemble evaluation.
- CoQA, SciQ, TriviaQA, AmbigQA, TruthfulQA: Used for uncertainty quantification.
- SeedTTS-EN, T2M, V2M Tasks: Benchmarks for evaluating speech and music generation.
Evaluation metrics varied widely across the papers, including surprisal, Consistency Score (CS), Dialogue Entailment Rate (DER), Steering Performance Impact (SPI), Sentence Stress Reasoning Accuracy (SSR), macro-F1, RMSE, UTMOS, WER, and various domain-specific scores like CLAP and CLaMP3. These metrics help in assessing the effectiveness, efficiency, and robustness of the proposed methods in different tasks and scenarios.
References
- BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning
- Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
- Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism
- CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning
- Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models
- Do You Get the Hint? Benchmarking LLMs on the Board Game Concept
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
- Closing the Gap Between Text and Speech Understanding in LLMs
- Document Intelligence in the Era of Large Language Models: A Survey
- LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
- Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses
- MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
- Assessing Web Search Credibility and Response Groundedness in Chat Assistants
- MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
- Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
- Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation
- Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation
- GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians
- Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval
- Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain
- Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
- GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models
- OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning
- How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study
- LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
- Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation
- FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation
- Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models
- DSCD: Large Language Model Detoxification with Self-Constrained Decoding
- Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
- ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
- Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan’s Intelligent Interaction Systems
- On the Role of Preference Variance in Preference Optimization
- Personalized Learning Path Planning with Goal-Driven Learner State Modeling
- MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
- K-Merge: Online Continual Merging of Adapters for On-device Large Language Models
- Addressing the alignment problem in transportation policy making: an LLM approach
- On the Reasoning Abilities of Masked Diffusion Language Models
- ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding
- Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
- EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
- Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps
- Taming the Fragility of KV Cache Eviction in LLM Inference
- I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs
- TRUSTVIS: A Multi-Dimensional Trustworthiness Evaluation Framework for Large Language Models
- Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems
- SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs
- Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models
- Investigating Lexical Change through Cross-Linguistic Colexification Patterns
- A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
- The Mechanistic Emergence of Symbol Grounding in Language Models
- D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree
- CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models
- Generative Universal Verifier as Multimodal Meta-Reasoner
- Make an Offer They Can’t Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment
- In-Distribution Steering: Balancing Control and Coherence in Language Model Generation
- StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation
- Stable LLM Ensemble: Interaction between Example Representativeness and Diversity
- ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models
- Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse
- UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE