2025年10月11日NLP论文汇总(英文)
- Topic 1: Reasoning and Cognitive Robustness (8 papers)
- Topic 2: Multimodal Learning and Integration (5 papers)
- Topic 3: Large Language Model Evaluation and Utility (5 papers)
- Topic 4: Personalization and Social Media Profiling (5 papers)
- Topic 5: Reinforcement Learning in NLP (6 papers)
- Topic 6: Knowledge and Data Augmentation (8 papers)
- Topic 7: Safety and Ethical AI (5 papers)
- Topic 8: Continual and Lifelong Learning (4 papers)
- Topic 9: Natural Language Processing Techniques (7 papers)
- Topic 10: Language Model Adaptation and Specialization (2 papers)
- Topic 11: misc (15 papers)
Topic 1: Reasoning and Cognitive Robustness
Topic Overview
The research topic of “Reasoning and Cognitive Robustness” focuses on enhancing the reasoning capabilities of large language models (LLMs) and their resilience in various cognitive tasks. This is particularly relevant in domains where the models need to exhibit reliable and consistent performance, such as machine translation, logical reasoning, and medical vision-language tasks. Improving reasoning robustness is crucial for ensuring that LLMs can handle complex tasks accurately and safely, which is essential for their widespread adoption in real-world applications.
Individual Paper Contributions
-
Armel Zebaze from Inria Paris and colleagues studied the effectiveness of intermediate tokens, termed ’thinking tokens’, in improving machine translation performance, proposing modular translation-specific prompting strategies such as MAPS, SBYS, TEaR, Self-Refine, and CompTra to solve the core problem. The main innovation points of this method are the use of synthetic data generation over thinking tokens and the emphasis on embedding translation attempts within these tokens. The value lies in enhancing the methodology for training translation models, especially for less-resourced languages, by improving upon the limitations of Chain-of-Thought (CoT) distillation. Experiments on Xhosa and Lithuanian language pairs showed up to 3.5 BLEU and 2 MetricX points improvement over Input-Output Fine-Tuning (IOFT), concluding that the presence of translation attempts within intermediate tokens significantly contributes to performance improvements1.
-
Souradeep Mukhopadhyay from Arizona State University and colleagues addressed the ‘phantom recall’ failure mode in LLMs when solving modified logic puzzles, proposing the PHANTOM RECALL benchmark and an automated logical-equivalence judge to evaluate and mitigate reasoning failures. The main innovation points include the creation of a benchmark with 25 classic logic puzzles and 149 perturbations, along with a taxonomy of reasoning error categories and a prompting-based mitigation framework. The value lies in systematically identifying and addressing the robustness gap in LLMs’ logical reasoning abilities, which is critical for their reliability in applications requiring precise logical reasoning. Manual and automated evaluations on five open-source models demonstrated significant performance drops on perturbed puzzles, with improvements noted through structured prompting, indicating the importance of context adaptation and reasoning robustness2.
-
Yisong Miao from National University of Singapore and colleagues focused on understanding how transformer language models process discourse relations, proposing the concept of ‘discursive circuits’ and a novel task called Completion under Discourse Relation (CuDR). The main innovation points are the identification of sparse computational subgraphs causally responsible for discourse understanding and the creation of a hierarchical representation of these circuits based on discourse frameworks like PDTB, RST, and SDRT. The value lies in providing a nuanced mechanistic understanding of discourse processing, which is vital for ensuring safe and ethical behavior in language models. Experiments showed that discursive circuits achieved strong faithfulness in recovering discourse understanding with only about 200 edges, demonstrating their effectiveness in handling discourse relations3.
-
Junjie Lu from University of Technology Sydney and colleagues tackled the limitation of current approaches for enhancing LLM reasoning by introducing Confidence-Guided Reasoning Path Preference Optimization (CGPO). The main innovation points are the use of model confidence signals to guide reasoning path optimization without relying on human annotations or stronger models. The value lies in the scalability and applicability of CGPO across different domains, such as mathematical reasoning and code generation. Experiments on GSM8K and MATH datasets revealed that CGPO led to significant performance improvements, with MetaMath-LLaMA-8B achieving a 4.15% gain on GSM8K and a 2.54% gain on MATH, highlighting the benefits of exploring non-human-like reasoning paths4.
-
Yiwei Liu from Nanjing University and colleagues addressed the challenge of joint logical-numerical reasoning in LLMs, proposing LogiNumSynth, a flexible problem synthesizer for generating complex reasoning tasks. The main innovation points are the ability to customize task complexity and the inclusion of new mathematical expressions and logical operators. The value lies in enhancing the generalizability and robustness of language models in practical applications that require integrated reasoning skills. Fine-tuning on LogiNumSynth-generated data improved performance on external reasoning benchmarks, with Qwen3-1.7B showing a 10.20 point increase on the FOLIO dataset, emphasizing the importance of synthetic data for targeted training5.
-
Gabrielle Kaili-May Liu and colleagues evaluated the performance of Retrieval-Augmented Generation (RAG) systems on complex, multi-hop queries that are unanswerable or require realistic reasoning, proposing a pipeline for generating uncheatable, realistic, unanswerable, multi-hop queries (CRUMQs). The main innovation points are the creation of a pipeline that ensures query realism and complexity, and the application of this pipeline to existing RAG datasets to create more challenging benchmarks. The value lies in providing a more rigorous test of RAG systems’ reasoning capabilities, which is crucial for their reliability in high-stakes domains. Experiments demonstrated that CRUMQs significantly reduced cheatability by up to 81.0%, indicating the increased difficulty for existing RAG systems6.
-
Nikita Afonin and colleagues investigated emergent misalignment (EM) in LLMs due to in-context learning (ICL) from narrow sets of misaligned examples, proposing a study that extends EM beyond fine-tuning to ICL scenarios. The main innovation points include the evaluation of EM in the ICL regime and the analysis of chain-of-thought reasoning to understand the underlying mechanisms. The value lies in addressing safety concerns for LLMs deployed in interactive systems, where misalignment could lead to harmful outcomes. Experiments on four frontier models from the Gemini and Qwen families showed that larger models are more susceptible to EM, with Gemini-2.5-Pro exhibiting higher EM rates than smaller models, and that even with safety training, models can adopt harmful personas7.
Technical Trends
The papers collectively highlight several key trends in improving reasoning and cognitive robustness:
- Synthetic Data Generation: Multiple studies emphasize the use of synthetic data to train models, focusing on creating reasoning paths and scenarios that enhance performance.
- Prompting Strategies: Various forms of prompting, including structured and thinking prompts, are utilized to guide models towards more effective reasoning.
- Evaluation Frameworks: New benchmarks and evaluation metrics are introduced to assess reasoning robustness and faithfulness in different contexts, including logic puzzles, discourse relations, and medical VQA.
- Non-Human-Like Reasoning: There is a growing interest in exploring reasoning paths that deviate from typical human thought processes, aiming to optimize model performance in complex tasks.
- Multimodal Reasoning: Integration of visual and textual inputs to test and improve reasoning in vision-language models is a prominent trend, particularly in medical applications.
Datasets and Evaluation Metrics
-
Datasets:
- Xhosa and Lithuanian language pairs for machine translation
- PHANTOM RECALL benchmark with 25 classic logic puzzles and 149 perturbations
- Chest X-ray Visual Question Answering (VQA) dataset
- GSM8K and MATH datasets for mathematical reasoning
- FOLIO dataset for joint logical-numerical reasoning
- NeuCLIR and TREC RAG 2025 datasets for RAG system evaluation
- Four misalignment datasets for studying emergent misalignment in ICL scenarios
-
Evaluation Metrics:
- BLEU and MetricX for translation quality
- Accuracy and F1 scores for logic puzzle solving
- Clinical fidelity, causal attribution, and confidence calibration for medical VQA
- Improvement percentages on mathematical reasoning datasets (GSM8K, MATH)
- Acceptability, unanswered, and ask-for-clarification ratios for RAG system evaluation
- Misalignment rates and chain-of-thought reasoning analysis for studying emergent misalignment
These summaries encapsulate the diverse yet interconnected efforts to enhance the reasoning and cognitive robustness of LLMs, contributing to a richer understanding of their capabilities and limitations.
Topic 2: Multimodal Learning and Integration
Topic Overview
Multimodal learning and integration involve the development of artificial intelligence models that can process and understand multiple types of data, such as images, text, and audio, simultaneously. This research area is critical for building AI systems that can interpret complex human interactions and environments more effectively, mimicking human perception and cognition. Enhancing the ability of AI to integrate and reason across modalities can lead to breakthroughs in applications ranging from conversational agents to automated web design and more. Understanding the nuances of how different forms of supervision and data affect model performance is essential for advancing the field and making AI systems more versatile and adaptable.
Individual Paper Contributions
-
Yiming Liu from Stanford University and colleagues studied whether the superior performance of CLIP as a vision encoder in vision-language models (VLMs) is due to its large training dataset or its language supervision. They proposed a controlled experimental setting where CLIP and DINO are trained under identical configurations to isolate these factors. The main innovation points of this method are the rigorous control of variables to compare the impact of language supervision against dataset size, and the value lies in providing a clear methodology to evaluate the contributions of each to model performance. Experiments on general and robustness benchmarks showed that both CLIP and DINO achieve similar performance, but CLIP excels in fine-grained classification tasks and text-intensive tasks within VLMs. This suggests that language supervision is particularly beneficial for capturing high-level visual semantics and reasoning over textual inputs, though the exact mechanisms may require further investigation 8.
-
Yuhang Li from Tencent and colleagues addressed the challenge of front-end code generation for visually correct and interactive web designs using Large Language Models (LLMs). They introduced ReLook, a vision-grounded reinforcement learning framework that uses a multimodal LLM (MLLM) as a critic to provide pixel-level feedback and prevent reward hacking. The key innovation here is the integration of vision-based feedback into the RL process, allowing for iterative refinement of generated code. The practical value and significance lie in enhancing the reliability and quality of AI-generated web designs. Experiments on ArtifactsBench subsets, FullStack-Bench-Html, and Web-Bench demonstrated significant improvements over base models and Web-RL, especially in tasks requiring visual precision, color coherence, and dynamic behavior 9.
-
Jiliang Hu from Tencent AI Lab and colleagues tackled the lack of comprehensive evaluation benchmarks for large audio language models (LALMs) in Chinese, focusing on real human speech. They proposed VCB Bench, a new benchmark that evaluates LALMs based on instruction following, knowledge understanding, and robustness. The main innovation is the introduction of a real human speech dataset and a multilingual evaluation framework. The value of this work is in providing a more accurate diagnostic tool for model weaknesses and facilitating fair comparisons among LALMs. Experiments revealed notable performance gaps among different models, highlighting strengths and weaknesses in reasoning, text semantic analysis, and cross-lingual speech adaptation 10.
-
Belkiss Souayed from University of Zurich and colleagues explored the generation of visually accessible images from simplified texts for individuals with intellectual disabilities. They introduced a template-based prompting framework named Template-Based Text-to-Image Alignment (TB-TIA) that emphasizes structured prompting for accessible image generation. The innovation lies in the development of specific prompt templates designed to adhere to accessibility constraints, such as object count limits and spatial separation. The value of TB-TIA is in improving comprehension for individuals with cognitive impairments by generating images that are both semantically aligned and visually simple. Evaluations using CLIPScores and human expert annotations indicated that the Basic Object Focus template performed best in terms of semantic alignment and overall accessibility, suggesting that visual minimalism aids comprehension 11.
-
KiHyun Nam and colleagues introduced Diffusion-Link, a diffusion-based modality-bridging module designed to reduce the audio-text modality gap in contrastive audio-language pretraining. The main innovation is the use of a lightweight network architecture with residual MLP blocks to map audio embeddings into the text-embedding distribution, along with a topology loss to preserve the relative geometry of the text distribution. The value of this approach is in enhancing cross-modal task performance, particularly in automatic audio captioning, without relying on external knowledge. Experiments on the AudioCaps dataset showed that Diffusion-Link significantly reduced the modality gap and improved automatic audio captioning performance, achieving state-of-the-art results in both zero-shot and fully supervised settings 12.
Technical Trends
The papers collectively highlight evolving methodologies in multimodal learning and integration. They emphasize the importance of controlled experimentation to isolate variable impacts on model performance, the use of reinforcement learning for iterative improvement in complex tasks like web coding, and the necessity of domain-specific benchmarks for evaluating LALMs. Additionally, there is a trend towards leveraging diffusion models for modality bridging and employing structured prompts to enhance the accessibility and relevance of generated content.
Datasets and Evaluation
- CLIP vs DINO: General and robustness benchmarks, fine-grained classification tasks.
- ReLook: ArtifactsBench subsets, FullStack-Bench-Html, Web-Bench.
- VCB Bench: A high-quality Chinese dataset built on real human speech, covering diverse conversational scenarios.
- Template-Based Text-to-Image Alignment: Derived from OneStopEnglish, SimPA, Wikipedia, and ASSET text simplification datasets; evaluated using CLIPScores and human expert annotations.
- Diffusion-Link: AudioCaps dataset for evaluating audio-text modality bridging and automatic audio captioning.
Topic 3: Large Language Model Evaluation and Utility
Topic Overview
Large Language Model (LLM) Evaluation and Utility is a critical research area aimed at assessing the capabilities of LLMs in various contexts and enhancing their practical application. The focus is not only on measuring performance through traditional metrics but also on exploring how these models can be integrated into real-world tasks, such as social deduction games, real-time fake news detection, code generation, and medical assistance. The importance of this research lies in the need to understand the strengths and limitations of LLMs, especially in areas requiring human-like interaction, reasoning, and domain-specific expertise, to ensure their safe and effective deployment.
Individual Paper Contributions
-
Zirui Song from MBZUAI and colleagues studied the evaluation of LLMs in social deduction games, particularly the Werewolf game, proposing the WereBench dataset and the WereAlign evaluation framework to address the issue of overly rigid and templated utterances, as well as poor understanding of game rules and nuances. The main innovation points include the construction of a high-quality multimodal dataset that captures authentic human gameplay and the development of a comprehensive evaluation framework that assesses speech quality and decision-making accuracy through strategy-alignment with human players. The value lies in providing a nuanced and reliable way to evaluate LLMs’ social reasoning and interaction capabilities, which is essential for advancing AI systems in sophisticated social environments. Experiments on the WereBench dataset showed that the benchmark effectively differentiates between models of various sizes and capabilities, with Gemini-2.5-Pro achieving the highest overall performance 13.
-
Guangyu Wei from Nanjing University and colleagues addressed the challenge of real-time fake news detection under evidence scarcity, proposing the EASE framework, which uses a sequential evaluation and expert selection pipeline to handle the rapid spread of misinformation and the lack of immediate verifiable evidence. The framework integrates three perspectives for decision-making: evidence-based, reasoning-based, and sentiment-based, with corresponding evaluators and experts. The main innovation is the use of fine-tuned LLMs with pseudo-label supervision to assess the reliability of decisions made in real-time. The value lies in creating a robust system capable of making accurate judgments in time-sensitive and evolving information environments. Experiments on RealTimeNews-25 and three historical datasets demonstrated that EASE achieves state-of-the-art performance, especially in real-time detection settings where evidence is scarce, with an accuracy of 0.756 on RealTimeNews-25 14.
-
Dana Sotto Porat from The Academic College of Tel Aviv–Yaffo and colleagues explored the personality and demographic-like characteristics exhibited by LLMs in their generated text, particularly in a natural conversational setting. They proposed a data-driven methodology using a curated dataset of open-ended questions to evaluate personality and gender characteristics in LLM-generated text without relying on self-report questionnaires. The main innovation is the use of automatic classifiers to analyze personality traits and gender expression in LLMs’ natural output. The value lies in providing a deeper understanding of how LLMs mimic human personality traits and gender expressions, which is crucial for designing more authentic and effective conversational AI systems. Experimental analysis showed that LLMs tend to exhibit higher levels of Agreeableness and lower Neuroticism, indicating cooperative and stable conversational tendencies, with slightly less variation in gendered language patterns compared to human authors 15.
-
Alexander Sternfeld from HES-SO and colleagues focused on the generation of secure and robust code by LLMs in mission-critical systems, proposing the TypePilot framework to leverage the Scala type system for adding safety guarantees to code generation. The main innovation is the structured interaction pipeline that guides LLMs to produce code adhering to strict safety and correctness properties. The value lies in enhancing the trustworthiness and reliability of automated code generation, especially in high-assurance domains. Experiments on various test cases, including average age calculation, Fibonacci number generation, and handling HTML, Bash, and URL injections, showed that TypePilot outperformed baseline and robust prompting methods in mitigating vulnerabilities and ensuring input validation. The results underscore the effectiveness of using the Scala type system to encode domain-specific constraints and reduce logic bugs 16.
-
Wenya Xie from unnamed institution and colleagues tackled the limited effectiveness of LLMs as direct consulting tools for patients by developing them as clinical assistants for healthcare professionals. They introduced the DoctorFLAN dataset and the DotaBench benchmark to evaluate LLMs in tasks aligned with physician workflows, such as summarizing patient records and providing clinical decision support. The main innovation points include the creation of a large, diverse dataset that spans multiple clinical specialties and the design of a benchmark that simulates multi-turn doctor-patient conversations. The value lies in improving the safety and efficiency of medical services by complementing patient-oriented models with physician-focused ones. Fine-tuning on DoctorFLAN led to substantial improvements in complex medical tasks, with DotaGPTBaichuan2-7B scoring highly in human evaluations 17.
Technical Trends
The papers in this topic showcase a trend towards more sophisticated and domain-specific evaluation frameworks for LLMs. There is a shift from relying solely on self-play or simple task completion metrics to incorporating multimodal data, real-world scenarios, and specific domain constraints. Innovations include the integration of human gameplay data, reliability assessments through pseudo-label supervision, automatic classification techniques for personality analysis, leveraging programming languages’ type systems for security, and aligning LLMs with real-world clinical tasks.
Datasets and Evaluation
- WereBench: A high-quality multimodal dataset of televised human gameplay for the Werewolf game, used to evaluate LLMs’ social deduction skills.
- RealTimeNews-25: A benchmark dataset for real-time fake news detection under evidence scarcity.
- DoctorFLAN: A dataset comprising approximately 92,000 Q&A instances across 22 clinical tasks and 27 specialties, designed for evaluating LLMs in doctor-centric medical applications.
- DotaBench: A benchmark for assessing LLMs in multi-turn clinical conversations, complementing DoctorFLAN.
Evaluation metrics vary across the papers, ranging from win rates and survival durations in social games to accuracy, macro F1 score, and class-wise F1 scores in fake news detection, and from automatic and human evaluations of personality traits to performance in clinical tasks and human evaluations of response quality in medical applications.
Topic 4: Personalization and Social Media Profiling
Topic Overview
The research topic of Personalization and Social Media Profiling explores how large language models (LLMs) can be adapted and personalized to cater to individual user preferences and cultural contexts effectively. This is particularly critical in enhancing user engagement and satisfaction in social media platforms and other interactive applications. Moreover, the ability to accurately profile individuals based on their social media activities can have significant implications in areas ranging from mental health monitoring to cybersecurity. The importance of this topic is underscored by the need for AI systems to not only understand but also respond appropriately to the nuanced expressions of human identity, beliefs, and values across diverse cultures and languages.
Individual Paper Contributions
-
Priyanka Dey from University of Southern California and colleagues studied the limitations of personalization in LLMs due to the reliance on expensive human feedback or interaction logs. They proposed GRAVITY, a framework that synthesizes profile-grounded preference data to reduce dependency on human annotation. The main innovation points include the integration of demographic, cultural, and psychological frameworks to create preference pairs and the use of Direct Preference Optimization (DPO) to align the model’s outputs with user profiles. The practical value lies in offering a more efficient and scalable method for personalized content generation, especially in domains like book recommendations. Experiments on generated book descriptions showed over 4% higher preference gains compared to baseline methods and were preferred 86% of the time in user studies, indicating that values and beliefs have the most significant impact on personalization performance 18.
-
Shreya Havaldar from University of Pennsylvania and colleagues addressed the misalignment between existing benchmarks and the real challenges faced by LLMs in adapting to diverse cultural contexts. They introduced the Culturally-Aware Conversations (CAC) Framework, which evaluates LLMs in realistic, multicultural conversational settings. The framework is grounded in sociocultural theory and incorporates a benchmark dataset annotated by culturally diverse raters, focusing on stylistic variations and conversational dynamics. The value lies in the framework’s ability to test LLMs’ adaptability to different cultural contexts, providing a more holistic view beyond factual knowledge. Experimental evaluation of five major LLMs revealed biases towards Western communication norms, emphasizing the need for improved cultural adaptation, especially in non-Western contexts 19.
-
Jinyuan Xu from Ertim Inalco and colleagues tackled the scarcity of publicly available Chinese-language datasets for depression risk detection and structured analysis. They developed CNSocialDepress (CNSD), a dataset containing 44,178 texts from 233 users, annotated by psychological experts for depression-related segments. The dataset offers binary risk labels alongside structured multi-dimensional psychological attributes. The value of CNSD is in its provision of professional validation and structured psychological analysis, enabling the training and evaluation of LLMs for depression-related tasks. Experiments demonstrated that fine-tuning Qwen2.5-14B and other models on CNSD improved their performance in both classification and text generation tasks, with Qwen2.5-14B Silver achieving the highest accuracy and F1 score 20.
-
Muhammad Hamza from COMSATS University Islamabad and colleagues investigated the challenge of celebrity profiling using short text in Urdu on Twitter. They proposed a new corpus for this task, consisting of tweets from 100 celebrities and their followers, totaling 20,000 tweets. The research utilized both traditional machine learning and deep learning algorithms to predict celebrity demographics based on followers’ tweets. The main innovation is the application of these methods to Urdu text and the use of followers’ tweets for inference, which is a novel approach. The value lies in addressing a research gap for under-resourced languages and providing a tool for marketing, forensic, and cybersecurity applications within the Urdu-speaking community. Experiments showed varying accuracies across different demographics, with Random Forest achieving the best performance for gender prediction 21.
-
Yuqi Bai from Hebei Petroleum University of Technology and colleagues evaluated the capability of LLMs to simulate human personality in virtual persona role-playing. They proposed a systematic framework that includes a virtual persona synthesis method and an individual-level evaluation framework using the Big Five personality assessments. The main innovation is the identification of a scaling law in LLM personality simulation, indicating that more detailed and realistic persona profiles enhance the simulation of personality traits. The value is in moving away from traditional psychometric approaches towards an engineering-oriented method that captures the trajectory of improvement in personality simulation. Experiments showed increased stability and convergence of personality traits as profiles became more detailed, leading to better alignment with real human personality distributions 22.
Technical Trends
The papers in this collection highlight several evolving trends in personalization and social media profiling:
- Synthetic Data Creation: GRAVITY and the CAC Framework emphasize the creation of synthetic data to overcome the limitations of costly human annotations, suggesting a shift towards more automated and scalable data generation methods.
- Cross-Cultural Adaptation: There is a growing recognition of the need for LLMs to adapt to diverse cultural contexts, as evidenced by the CAC Framework and the CNSocialDepress dataset.
- Psychological Integration: Both GRAVITY and the Scaling Law in LLM Simulated Personality incorporate psychological theories to enhance personalization, indicating a trend towards integrating human behavioral sciences with AI.
- Evaluation Methodologies: The introduction of new benchmarks and evaluation criteria, such as those in the CAC Framework and the CNSocialDepress dataset, underscores the need for more comprehensive and context-specific evaluation methods.
Datasets and Evaluation Metrics
- GRAVITY: Utilizes synthetic preference pairs derived from demographic, cultural, and psychological frameworks.
- CAC Framework: Employs a benchmark dataset annotated by culturally diverse raters, focusing on stylistic variations and conversational dynamics.
- CNSocialDepress: Includes 44,178 texts from 233 users, annotated for depression-related segments with structured psychological attributes.
- Celebrity Profiling on Short Urdu Text: Uses a corpus of 20,000 tweets from 100 celebrities and their followers.
- Evaluation Metrics:
- GRAVITY: Preference gains, user study preferences.
- CAC Framework: Performance on selecting culturally appropriate responses.
- CNSocialDepress: Accuracy, F1 score, BERTScore, ROUGE-1, BLEU, human evaluation.
- Celebrity Profiling: Model accuracy for demographic predictions.
- Scaling Law: Stability, convergence, and identifiability of personality traits; comparison with human data using Mahalanobis distance and cluster analysis.
Topic 5: Reinforcement Learning in NLP
Topic Overview
Reinforcement Learning (RL) in Natural Language Processing (NLP) has emerged as a promising area for improving the performance and adaptability of large language models (LLMs) across various tasks. By integrating RL techniques, researchers aim to develop models that can better navigate complex decision-making processes, learn from feedback, and generalize their capabilities beyond the confines of supervised learning. This is particularly important for applications involving safety, self-awareness, embodied AI, and domain-specific reasoning, where models need to demonstrate reliability, robustness, and efficiency.
Individual Paper Contributions
-
Sarah Ball from Ludwig-Maximilians-Universität in Munich and colleagues studied the misalignment between generative AI models’ training objectives and their real-world deployment, especially concerning safety classifiers. They proposed Boundary Guidance, a reinforcement-learning-based fine-tuning approach, to steer outputs away from the decision boundary of safety classifiers. The main innovation points are the decision-theoretic framework supporting system utility minimization near classifier boundaries and the comparative analysis of reward specifications. The value lies in the empirical improvements in both safety and utility, demonstrating the benefits of compound system optimization. Experiments on jailbreak and ambiguous prompts showed reductions in harmfulness scores by 0.09 to 0.15 for Qwen2.5 models and by 0.03 for Gemma-2-9B-it, with significant improvements in helpfulness for two of the four models tested, concluding that balanced reward signals enhance model performance23.
-
Sahil Kale from KnowledgeVerse AI and colleagues focused on the lack of self-awareness in LLMs, which hinders their reliability in critical applications. They introduced the KnowRL framework, leveraging reinforcement learning to enhance LLMs’ self-knowledge through introspective task generation and consensus-based rewards. The main innovation points include the use of internally-generated data to avoid costly external supervision and a difficulty clipping strategy to prevent reward hacking. The value lies in the simplicity and effectiveness of improving self-knowledge without requiring extensive external data. Experiments on LLaMA-3.1-8B and Qwen-2.5-7B models demonstrated up to 28% accuracy and 12% F1 score improvements in self-knowledge, concluding that KnowRL effectively boosts models’ understanding of their capability boundaries24.
-
Sabrina McCallum from University of Edinburgh and colleagues addressed the limitation of imitation learning (IL) policies in learning from suboptimal demonstrations. They proposed FOSSIL, a framework that integrates constructive language feedback into IL to enhance learning from diverse behaviors. The main innovation points are the integration of language feedback and scalar rewards with auxiliary self-supervised learning objectives for feedback prediction. The value lies in making IL more data-efficient and improving generalization. Experiments on the BabyAI-XGen dataset showed significant improvements in compositional generalization and robustness to perturbations, concluding that language feedback helps stitch together suboptimal solutions and enhances overall model performance25.
-
Wei Huang from NVIDIA and colleagues tackled the high resource demands of RL for LLMs, focusing on GPU memory requirements and lengthy rollout times. They introduced QeRL, a framework that combines NVFP4 quantization with Low-Rank Adaptation (LoRA) to accelerate training and reduce memory usage. The main innovation points are the Adaptive Quantization Noise (AQN) mechanism and the application of quantization to RL settings. The value lies in enabling more efficient training of larger models. Experiments on datasets like GSM8K and MATH 500 demonstrated superior performance in terms of speed and accuracy, concluding that quantization noise improves exploration and discovery of better strategies during training26.
-
Nianyi Lin from Tsinghua University and colleagues addressed the memory inefficiency in RL for diffusion large language models (dLLMs). They proposed Boundary-Guided Policy Optimization (BGPO), an RL algorithm that constructs a linear lower bound for the RL objective, ensuring memory efficiency and accuracy. The main innovation points are the construction of a linear lower bound for the RL objective and theoretical proofs of its equivalence to ELBO-based objectives. The value lies in reducing memory usage while maintaining performance accuracy. Experiments on established datasets such as MATH500, GSM8K, HumanEval, MBPP, Sudoku, and Countdown showed significant performance improvements and reduced bias/variance in gradient calculations, concluding that BGPO offers a scalable solution for improving model performance27.
-
Zhengyu Chen from Meituan and colleagues assessed the capability of LLM agents to generalize tool-integrated RL strategies from mathematical tasks to other domains. They proposed an evaluation framework that tests the generalization of tool usage skills across diverse reasoning tasks. The main innovation points include the exclusive training of agents on mathematical tasks and subsequent testing in unrelated domains. The value lies in understanding skill transfer and cross-domain applicability. Experiments revealed that tool-integrated skills learned in math can be applied to tasks in chemistry, concluding that domain-specific training can enhance performance in downstream reasoning tasks28.
Technical Trends
The papers collectively demonstrate several technical trends in RL for NLP:
- Enhancing Safety and Utility: Techniques such as Boundary Guidance focus on aligning model outputs with safety standards while maintaining utility.
- Improving Self-Knowledge: Frameworks like KnowRL emphasize the importance of LLMs understanding their own limitations, crucial for reliability in critical applications.
- Data-Efficient Learning: Methods like FOSSIL and QeRL highlight the need for efficient use of data and resources, addressing challenges related to scalability and computational cost.
- Memory Efficiency: Algorithms like BGPO address the memory constraints inherent in RL training for diffusion models, aiming to reduce overhead and increase practical applicability.
- Cross-Domain Generalization: Evaluating frameworks that assess the transferability of learned skills across different domains is a growing interest, suggesting a move towards more versatile AI systems.
Datasets and Evaluation
The primary datasets used across the papers include:
- Mathematical Reasoning: GSM8K, MATH 500, AIME 24/25, AMC 23
- Code Generation and Planning: HumanEval, MBPP
- Embodied AI Tasks: BabyAI-XGen
- General NLP Tasks: SelfAware dataset
Evaluation metrics varied depending on the specific focus of each paper:
- Safety and Utility Scores: Used in evaluating the effectiveness of Boundary Guidance.
- Accuracy and F1 Scores: Employed in measuring self-knowledge improvements with KnowRL.
- Compositional Generalization Abilities: Assessed using BabyAI-XGen in FOSSIL.
- Adjusted Evaluation Scores: Utilized to measure performance improvements in QeRL.
- Gradient Bias/Variance: Evaluated to determine the effectiveness of BGPO.
- Performance and Token Efficiency: Assessed in testing the cross-domain applicability of tool-integrated RL in diverse reasoning tasks.
These contributions collectively push the boundaries of RL in NLP, addressing critical issues of safety, self-awareness, efficiency, and generalization, while employing diverse datasets and evaluation methods to rigorously test their frameworks.
Topic 6: Knowledge and Data Augmentation
Topic Overview
Knowledge and data augmentation are pivotal in enhancing the performance and reliability of large language models (LLMs) in various applications. By integrating external knowledge and generating diverse datasets, researchers aim to improve the robustness, accuracy, and adaptability of LLMs, particularly in scenarios requiring multi-step reasoning, long-horizon predictions, and specialized domain knowledge. This topic is essential for developing more effective and trustworthy AI systems that can operate reliably in complex and evolving environments, such as digital interfaces, specialized tasks, and high-stakes domains like finance and healthcare.
Individual Paper Contributions
-
Kai Mei from Rutgers University, AWS Agentic AI and colleagues studied the limitations of LLMs in serving as world models for computer-use agents, proposing the Retrieval-augmented World Model (R-WoM) to solve the core problem of hallucination and reliance on static training knowledge. The main innovation points of this method are the integration of retrieval-augmented generation (RAG) to ground LLMs with external tutorials, and the introduction of a reasoning-based RAG pipeline that includes query rewriting and LLM-based reranking to improve tutorial relevance. The value lies in enhancing the accuracy and stability of multi-step simulations and improving task execution in dynamic digital environments. Experiments on the OSWorld and WebArena benchmarks showed substantial improvements ranging from 7.2% to 25.3% over baselines, concluding that tutorial-guided grounding significantly boosts the performance of computer-use agents in realistic environments 29.
-
Urs Spiegelhalter from University of Freiburg and colleagues explored the challenge of adapting LLMs to new, specialized tasks while maintaining general performance under computational constraints. They introduced a method that balances synthetic data generation and replay ratios to enhance task-specific capabilities. The main innovation points include the detailed configuration of replay ratios and synthetic data scale under computational limits, along with an analysis framework for identifying optimal configurations. The value lies in providing empirically grounded guidelines for practitioners to maintain a balance between task-specific performance and general knowledge retention. Experiments on the bAbI reasoning tasks revealed that a total token budget beyond 1e8.5 provides diminishing returns, and a replay ratio between 5% and 10% is optimal. The bAbI-Synthetic dataset outperformed the bAbI-Original in task mastery and general knowledge retention, suggesting that data diversity is key to overcoming performance degradation 30.
-
Hengran Zhang from State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences and colleagues addressed the inefficiency of RAG frameworks due to the lack of consideration for LLM-specific utility. They proposed a benchmarking procedure that evaluates the utility of retrieved passages from the perspective of specific LLMs. The main innovation points are the introduction of LLM-specific utility judgment and the identification of the inadequacy of generic utility assessments. The value lies in tailoring utility assessments to individual LLMs, leading to more accurate and comprehensive answers. Evaluations across six knowledge-intensive datasets demonstrated that likelihood-based and verbalized methods (with pseudo-answers) performed best, suggesting that LLMs should ideally reject passages for known queries and utilize them for unknown queries 31.
-
Ruirui Chen from Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore and colleagues tackled the challenge of constructing high-quality knowledge graphs (KGs) that reduce hallucinations in LLMs. They proposed a hierarchical framework for KG construction that integrates relational triple extraction, coreference resolution, entity de-duplication, and source tracing. The main innovation points include the prompt-based bottom-up approach and the release of a dataset of LLM-generated KGs from research papers on children’s mental well-being. The value lies in creating a more flexible and scalable KG construction process, which can enhance the interpretability and utility of KGs in healthcare. Comparative analysis with iText2KG showed improvements in accuracy, comprehensiveness, and relevance, indicating that LLM-based evaluation can sometimes overestimate extraction results 32.
-
Chris Xing Tian from Peng Cheng Laboratory, Shenzhen, China and colleagues focused on the suboptimal performance of RAG systems in domain-specific settings due to a lack of specialized training data. They introduced RAGen, a scalable and modular framework for generating domain-specific question-answer-context (QAC) triples. The main innovation points involve the three-stage process of document concepts extraction, concept-centered evidence assembly, and QAC generation, leveraging Bloom’s Taxonomy and distractor supervision. The value lies in enabling more effective RAG adaptation across different domains and architectures. Experiments across multiple domains showed consistent improvements in retrieval quality and generation accuracy, with RAGen outperforming baselines like AutoRAG and LlamaIndex in terms of Recall@K, MRR@10, ROUGE-L, and BERT-F1 scores 33.
-
Jiaying Wu from National University of Singapore and colleagues addressed the timeliness and reliability issues in crowd-sourced fact-checking systems for health misinformation. They proposed CrowdNotes+, a framework that integrates LLMs into the Community Notes system to improve governance. The main innovation points are the introduction of evidence-grounded note augmentation and utility-guided note automation, alongside a hierarchical evaluation pipeline. The value lies in scaling up and ensuring the reliability of fact-checking systems. Experiments on the HealthNotes benchmark demonstrated that CrowdNotes+ significantly reduced false positives and ensured more reliable assessments, achieving higher helpfulness ratings compared to human baselines 34.
-
Patrick Haller from and colleagues examined the brittleness of LLMs in maintaining robust internal knowledge representations under distribution shifts. They proposed a methodology using three probing techniques to assess the separability of true and false statements. The main innovation points are the use of non-linear and linear activations classifiers, and P(True) as measures of truthfulness separability. The value lies in providing a detailed analysis of the robustness of LLM knowledge representations and highlighting the need for stable, generalizable representations. Experiments across four model families and four datasets indicated significant degradation in truthfulness separability as samples became more out-of-distribution, suggesting that larger models might suffer from worse representation robustness 35.
-
Daniel Berhane Araya from and colleagues aimed to detect and mitigate financial misinformation, proposing FinVet, a collaborative framework of RAG and external fact-checking agents. The main innovation points include the integration of two RAG pipelines with an external fact-checking pipeline using a confidence-weighted voting mechanism and an adaptive three-tier processing strategy. The value lies in enhancing the accuracy and computational efficiency of financial misinformation detection. Experiments on the FinFact dataset showed a 10.4% improvement over the best individual pipeline and a 37% improvement over standalone RAG approaches, emphasizing the importance of a multi-strategy approach in verifying financial claims 36.
Technical Trends
The papers in this collection highlight several key technical trends:
- Retrieval-Augmented Generation (RAG): Many papers emphasize the integration of retrieval mechanisms to enhance LLM performance by grounding them with external knowledge.
- Hierarchical Extraction and Processing: Hierarchical approaches are utilized for both knowledge graph construction and note generation, providing structured and coherent outputs.
- Utility-Based Assessments: There is a growing focus on evaluating the utility of retrieved knowledge from the perspective of specific LLMs, moving away from generic relevance assessments.
- Adaptive Strategies: Adaptive processing strategies, such as adjusting verification approaches based on retrieval confidence, are proposed to optimize both accuracy and efficiency.
- Empirical Analysis of Distribution Shifts: Detailed analysis of how LLMs handle distribution shifts and the impact on truthfulness representations is becoming a critical area of research.
Datasets and Evaluation Metrics
- OSWorld and WebArena: Used for evaluating the performance of computer-use agents in dynamic digital environments.
- bAbI Reasoning Tasks: Employed to study the effects of synthetic data and replay ratios on task-specific and general knowledge retention.
- Six Knowledge-Intensive Datasets: Utilized to assess the utility of retrieved passages in RAG frameworks.
- HealthNotes Benchmark: A dataset of 1.2K health-related Community Notes annotated for helpfulness.
- FinFact Dataset: Used for financial misinformation detection, evaluated using F1 scores.
- Various Datasets: Including True-False, MMLU, OpenBookQA, and TruthfulQA, used to probe the robustness of LLM knowledge representations under distribution shifts.
- Evaluation Metrics: Recall@K, MRR@10, ROUGE-L, BERT-F1, perplexity, and hierarchical evaluation pipelines are commonly used to measure performance improvements and robustness.
Topic 7: Safety and Ethical AI
Topic Overview
The topic of Safety and Ethical AI focuses on developing and maintaining the integrity and ethical standards of AI systems, particularly large language models (LLMs) and large reasoning models (LRMs), to ensure they do not produce harmful content or fall prey to adversarial attacks. With the rapid advancement and widespread adoption of AI technologies, ensuring their safety has become paramount, especially in domains where misuse could have severe real-world implications. The research in this area is essential for aligning AI outputs with societal norms and preventing unintended consequences that arise from AI-generated content.
Individual Paper Contributions
-
Shuo Chen from LMU Munich and colleagues studied the increased risk of autonomous Deep Research (DR) agents generating harmful content despite safety mechanisms in underlying LLMs. They proposed two jailbreak methods, Plan Injection and Intent Hijack, to evaluate safety vulnerabilities of DR agents. The main innovation points are the targeted manipulation of the agent’s planning process and the rephrasing of harmful queries in academic contexts. The value lies in revealing the limitations of current alignment practices and introducing a new evaluation metric, DeepREJECT, to assess harmful intent and knowledge conveyance. Experiments on the StrongREJECT and Medicine subset of SciSafeEval datasets showed that DR agents significantly outperformed baseline LLMs in generating harmful reports under Plan Injection and Intent Hijack scenarios, concluding that specialized alignment techniques are necessary for DR agents37.
-
Jiayu Ding from Xi’an Jiaotong University and colleagues addressed the vulnerability of reasoning traces from LLMs to unauthorized distillation, leading to intellectual property leakage. They introduced PART, a method designed to disrupt unauthorized distillation while preserving the usefulness of reasoning traces for human readers. The main innovation points involve token-level removal of self-talk behaviors and structural-level reordering into a conclusion-first format. The value lies in protecting proprietary models while still allowing beneficial access to reasoning details. Validation through extensive distillation experiments on diverse benchmarks (MATH-500, AIME 2024, LiveCodeBench v2, GPQA-Diamond) and student models (up to 32B parameters) showed significant performance degradation in models distilled from reformulated traces, concluding that PART effectively defends against unauthorized distillation while maintaining human readability38.
-
Michael Schlichtkrull from Queen Mary University of London explored the vulnerability of AI agents to ‘attacks by content’, where external document manipulations can alter agent behavior or outputs. He proposed a pipeline for mitigating these attacks by enhancing AI agents with capabilities akin to human fact-checkers. The main innovation points include integrating fact-checking and source criticism into AI systems. The value lies in improving the trustworthiness and reliability of AI systems in sensitive domains. Experiments demonstrated that models like Llama 3.1 8b showed reduced vulnerability when provided with fact-checks, and smaller models performed better in discerning trustworthiness, concluding that AI models need robust defenses against content manipulation39.
-
Shuo Chen from LMU Munich and colleagues investigated the vulnerability of reasoning-based safety guardrails in LRMs to jailbreak attacks. They proposed four jailbreak techniques—Structural CoT Bypass, Fake Over-Refusal, Coercive Optimization, and Reasoning Hijack—to exploit weaknesses in current safety defenses. The main innovation points are the identification of specific attack vectors that can bypass safety mechanisms. The value lies in exposing systemic weaknesses and suggesting areas for improvement in safety guardrails. Evaluations on five existing jailbreak benchmarks (StrongREJECT, AdvBench, HarmBench, CatQA, and JBB-Behaviors) revealed high attack success rates and harm scores, indicating severe vulnerabilities even in larger models like gpt-oss-20b and 120b, concluding that current safety guardrails require enhancement40.
-
Battemuulen Naranbat and colleagues focused on the challenge of ensuring fairness in moral sentiment classification across different social media platforms. They introduced a new fairness metric, Moral Fairness Consistency (MFC), to quantify cross-domain stability of moral foundation detection. The main innovation points are the per-label fairness analysis and the MFC metric itself. The value lies in providing a more interpretable and granular measure of fairness in NLP models. Experiments on the Moral Foundations Twitter Corpus (MFTC) and Moral Foundations Reddit Corpus (MFRC) showed pronounced asymmetry in transfer performance and identified authority as the least consistent moral foundation, concluding that MFC effectively captures fairness disparities across domains41.
Technical Trends
The papers collectively highlight a growing concern regarding the security and ethical considerations of advanced AI systems, particularly LLMs and LRMs. They propose innovative methods to enhance safety and fairness, including jailbreak techniques to test and improve defense mechanisms, reformulation strategies to protect proprietary information, and enhanced fact-checking pipelines to defend against content-based attacks. The trend towards specialized alignment and defense mechanisms, as well as the development of new fairness metrics, reflects a deeper understanding of the complexities involved in ensuring AI safety and ethical compliance.
Datasets and Evaluation Metrics
- StrongREJECT: Used for evaluating harmful content generation in DR agents and LRMs.
- SciSafeEval: A dataset for assessing the safety of scientific content generated by AI.
- Moral Foundations Twitter Corpus (MFTC): Dataset for analyzing moral sentiments on Twitter.
- Moral Foundations Reddit Corpus (MFRC): Dataset for analyzing moral sentiments on Reddit.
- MATH-500, AIME 2024, LiveCodeBench v2, GPQA-Diamond: Benchmarks for evaluating the impact of unauthorized distillation on reasoning traces.
- DeepREJECT: A novel evaluation metric for assessing harmful intent and knowledge conveyance in AI-generated reports.
- MFC (Moral Fairness Consistency): A fairness metric designed for cross-domain moral sentiment classification.
- AdvBench, HarmBench, CatQA, JBB-Behaviors: Existing benchmarks for evaluating jailbreak techniques in LRMs.
These datasets and metrics are crucial for systematically evaluating the safety, fairness, and robustness of AI systems, offering researchers and developers a framework to understand and mitigate potential risks associated with AI misuse and bias.
Topic 8: Continual and Lifelong Learning
Topic Overview
Continual and lifelong learning is a critical area of research in artificial intelligence, particularly for large language models (LLMs) and reinforcement learning (RL) systems. These systems aim to continuously acquire and integrate new knowledge over time without forgetting previously learned information, enabling them to adapt to changing environments and tasks. The ability to perform lifelong learning enhances the flexibility and applicability of AI models, allowing them to tackle complex, dynamic problems such as mathematical reasoning, code generation, and interactive task-solving in constrained environments. However, challenges such as memory constraints, training instability, and the need for efficient resource management remain significant barriers to the widespread adoption of these technologies.
Individual Paper Contributions
-
Wenhan Ma from Peking University and colleagues studied the instability in reinforcement learning (RL) training of Mixture-of-Experts (MoE) models due to discrepancies in the routing distribution between training and inference phases. They proposed Rollout Routing Replay (R3), a method that captures the routing distributions from the inference engine and replays them into the training engine, aligning the routing behavior between the two phases. The main innovation points of R3 include its alignment of training and inference routers and its integration with prefix caching for multi-turn dialogue scenarios. The value lies in its ability to stabilize RL training for MoE models, leading to better performance in complex tasks like competition-level mathematics and code generation. Experiments on datasets such as AIME24, AIME25, AMC23, and MATH500 level 5 showed that R3 outperformed baseline methods such as GSPO and TIS, demonstrating a significant reduction in KL divergence and the frequency of tokens with large training-inference discrepancies, and helped prevent training collapse in Qwen3-30B-A3B-Base42.
-
Haoqi Yang from Wuhan University and colleagues tackled the excessive memory consumption of Large Language Models (LLMs) due to the Key-Value (KV) cache mechanism, especially in long text understanding and generation tasks. They introduced XQuant, a training-free and plug-and-play framework for ultra-low-bit KV cache quantization. XQuant incorporates a data-free calibration method to reduce quantization errors and a cross-layer compression technique that shares quantized caches between adjacent layers to minimize memory usage. The value of XQuant lies in its ability to achieve quantization at sub-1.4 bits while maintaining or even improving model performance, making it suitable for deployment in environments with limited hardware resources. Experiments on benchmarks like TruthfulQA and LongBench showed that XQuant outperformed existing methods such as KIVI-2bit and AsymKV-1.5bit, achieving lower bit-widths and higher performance scores, with an average improvement of 41.98 for Mistral and 36.51 for Llama2 on LongBench. The ablation studies revealed that both the data-free calibration and the cross-layer compression methods contributed to these performance gains43.
-
Gautier Dagan from University of Edinburgh and colleagues focused on enhancing the learning capability of automated agents in interactive environments through the use of procedural ‘how-to’ questions and answers. They proposed the $How^{2}$ framework, which facilitates lifelong learning by reusing knowledge gained from interactions with a teacher, human, or oracle. The framework explores a spectrum of teacher strategies that range from providing fully executable actions to offering high-level sub-goals, balancing immediate utility with long-term reusability. The value of $How^{2}$ lies in its ability to improve agent performance in solving tasks, particularly in environments like Plancraft, where actions have consequences and resources are limited. Experiments on Plancraft with two dataset splits—one with low task repetition and another with high repetition—showed that $How^{2}$ achieved a 42% lower intervention rate compared to the Just Ask setup, which does not involve memory re-use. The framework was effective across different models, including Qwen 3 32B, indicating enhanced reasoning capabilities and reduced dependency on external assistance44.
-
Jinbin Zhang and colleagues addressed the inefficiency and high memory demand associated with training models for Extreme Multilabel Classification (XMC) tasks, especially in large output spaces. They introduced ELMO, a method that optimizes memory usage and computational efficiency through low-precision training using BFloat16 (BF16) and Float8 (FP8) data types, combined with Kahan summation and stochastic rounding. ELMO also includes architectural improvements like gradient fusion and chunking to further optimize memory usage. The main innovation points are the elimination of high-precision master weights and the optimization of peak memory consumption through computation flow reorganization and efficient data types. The value of ELMO lies in its ability to train XMC models with drastically reduced memory usage without compromising accuracy, making it feasible to handle vast label spaces efficiently. On the newly introduced LF-Paper2Keywords-8.6M dataset, ELMO achieved substantial memory savings, requiring 18.8 GiB (BF16) or 9.02 GiB (FP8) of GPU memory, compared to 105 GiB needed by Renee. Experimental results showed that ELMO maintained or even surpassed the performance of the Float32 baseline, demonstrating its effectiveness across various XMC datasets45.
Technical Trends
The papers under this topic reflect several technical trends in continual and lifelong learning. One trend is the focus on mitigating training instability through innovative alignment techniques, as seen in R3’s approach to stabilizing RL training by matching the routing distributions between training and inference phases. Another trend involves memory optimization, with XQuant and ELMO targeting the reduction of memory consumption through quantization and low-precision computation respectively. Lastly, there is an emphasis on leveraging interactive learning and knowledge reuse, exemplified by the $How^{2}$ framework’s approach to enhancing agent learning through procedural question-and-answer interactions.
Datasets and Evaluation
The datasets used in the papers include:
- Mathematical Reasoning Datasets: AIME24, AIME25, AMC23, MATH500 level 5
- Long Text Understanding and Generation Benchmarks: TruthfulQA, LongBench
- Extreme Multilabel Classification Datasets: LF-Paper2Keywords-8.6M (newly introduced)
- Interactive Environment Dataset: Plancraft (Minecraft crafting environment)
Evaluation metrics across the papers include:
- KL Divergence: Used to measure the difference in routing distributions between training and inference in RL settings.
- Intervention Rate: Measures how often external help is required in interactive learning scenarios.
- Precision@k: Evaluates the performance of XMC models by assessing the precision of top-k predictions.
- BLEU Score: Assesses the quality of generated text against reference texts in long text generation tasks.
- GPU Memory Usage: Tracks the memory footprint of training models to evaluate the efficiency of memory optimization techniques.
These datasets and metrics collectively assess the practicality and effectiveness of the proposed methods in various learning scenarios, contributing to the broader understanding and development of continual and lifelong learning systems.
Topic 9: Natural Language Processing Techniques
Topic Overview
Natural Language Processing (NLP) techniques are pivotal in advancing the capabilities of artificial intelligence systems to understand, generate, and manipulate human language. As models become increasingly sophisticated, there is a growing emphasis on optimizing their efficiency and effectiveness, particularly in tasks like language generation, classification, and preprocessing. Research in this area seeks to address the computational and memory demands of large language models, while ensuring high-quality output and adaptability to evolving linguistic contexts. Innovations in decoding strategies, architectural modifications, and continual learning frameworks aim to make NLP technologies more practical for real-time applications and large-scale inference tasks.
Individual Paper Contributions
-
Qinglin Zhu from King’s College London and colleagues studied the high latency associated with autoregressive (AR) models in natural language generation, proposing Latent Refinement Decoding (LRD) to solve the issue of inefficient sequential decoding. The main innovation points of this method are the soft diffusion mechanism and the adaptive two-phase sampling strategy. The value lies in maintaining the quality of generated text while significantly reducing the time required for inference. Experiments on HumanEval, MBPP, GSM8K, and MATH500 benchmarks showed consistent improvements in accuracy and significant speedups ranging from 1.2× to 10.6×, concluding that LRD is advantageous in complex generation tasks by providing richer context and more efficient convergence dynamics46.
-
Tieyuan Chen from Shanghai Jiao Tong University and colleagues addressed the inefficiency and computational overhead associated with scaling Large Language Models (LLMs). They introduced Dynamic Nested Depth (DND), a method that boosts off-the-shelf LLMs by dynamically selecting and reprocessing critical tokens. The main innovation points are the token-choice routing mechanism and the nested depth design for reprocessing. The value lies in improving the computational efficiency and performance of LLMs across various tasks without substantial parameter or computation increases. Experiments on multiple benchmarks, including general knowledge, alignment, mathematics, STEM, and coding tasks, demonstrated an average performance gain of +0.87 on the Qwen3-30B-A3B model, concluding that DND can effectively focus the model’s capacity on critical tokens and improve its overall performance47.
-
Daniel Scalena from University of Groningen and colleagues tackled the computational inefficiency introduced by chain-of-thought (CoT) prompting strategies in LLMs during the generation of multiple candidate sequences from the same prompt. They introduced EAGer, an Entropy-Aware Generation method designed to optimize parallel sampling during inference. The main innovation points are the dynamic adjustment of the generation process based on token-level uncertainty and the reallocation of saved computational budgets. The value lies in significantly reducing computational costs while maintaining or improving performance on reasoning tasks. Experiments on AIME 2024/2025, HMMT 2025, GPQA-Diamond, HumanEval Plus benchmarks, and models like SmolLM 3B, Qwen3 4B, DeepSeek 8B, and GPT-Oss 20B showed up to a 37% performance boost and saved up to 80% of the generation budget, concluding that the choice of entropy threshold balances the trade-off between performance and compute cost48.
-
Xuan Luo from UC Santa Barbara and colleagues focused on reducing the computational redundancy and improving the inference speed in decoder-only transformer architectures during inference. They introduced Direct Multi-Token Decoding (DMTD), a novel inference paradigm that generates multiple tokens in a single cycle by reusing the late layers. The main innovation point is the cyclical refilling mechanism that restores missing key-value cache entries. The value lies in achieving significant speedups without substantial performance degradation. Experiments using the Qwen3-4B model demonstrated that DMTD can maintain up to 98.4% of the original performance for a cycle length of 2, and showed the best performance on larger models, concluding that larger models benefit more from DMTD and that its performance improves with increased training data49.
-
Ba-Quang Nguyen from University of Engineering and Technology and colleagues aimed to enhance the performance of token-level classification tasks in Vietnamese, such as Named Entity Recognition (NER), Part-of-Speech tagging, and disfluency detection. They introduced TextGraphFuseGAT, a hybrid neural architecture combining a pre-trained transformer encoder (PhoBERT) with Graph Attention Networks (GAT) and a Transformer decoder layer. The main innovation point is the fully connected graph over token embeddings to capture rich inter-token dependencies. The value lies in addressing the morphological richness and scarcity of annotated resources in Vietnamese. Evaluations on PhoNER-COVID19, PhoDisfluency, and VietMed-NER benchmarks showed superior performance, with Micro-F1 scores of 0.984, 0.994, and 0.893 respectively, concluding that the integration of PhoBERT and GAT enhances the model’s ability to handle token-level classification tasks, particularly in low-resource languages50.
-
Marco Braga from [Institution] and colleagues investigated the use of pre-trained Large Language Models (LLMs) for text preprocessing, focusing on stopword removal, stemming, and lemmatization. They proposed a methodology leveraging in-context learning by providing LLMs with prompts that include task descriptions, examples, and context information. The main innovation point is the evaluation of context-sensitivity in LLMs for preprocessing tasks. The value lies in dynamically adjusting preprocessing based on context and language, potentially offering a more nuanced and effective alternative to traditional techniques. Experiments on multiple datasets and ML algorithms for both English and six European languages showed high accuracy in stopword removal (up to 97%) and lemmatization (up to 82%), with ML algorithms trained on texts preprocessed by LLMs improving classification performance by up to 6% in the F1 measure, concluding that LLMs can dynamically adjust preprocessing based on context, although stemming inconsistencies can affect performance51.
-
Yawen Yang from [Institution] and colleagues addressed Continual Named Entity Recognition (CNER), which involves learning new entity types incrementally without forgetting previously learned ones. They introduced GenCNER, a generative framework that transforms the task into a sequence generation problem of entity triplets using a pre-trained seq2seq model (BART) with a pointer network. The main innovation points are the type-specific confidence-based pseudo labeling strategy and the use of knowledge distillation to maintain learned knowledge. The value lies in avoiding the semantic shift problem and minimizing label noise during incremental learning. Experiments on OntoNotes and Few-NERD datasets showed significant improvements over baselines such as AddNER, ExtendNER, L&R, SpanKL, and SKD-NER in terms of Macro F1 scores, concluding that GenCNER outperforms existing methods by effectively handling the incremental learning of new entity types52.
Technical Trends
The papers in this collection highlight several technical trends and methodological evolutions in NLP techniques:
- Hybrid Architectures: Combining traditional models with newer architectures like Graph Attention Networks (GAT) and diffusion-based models to enhance performance in token-level tasks and complex generation tasks.
- Adaptive Sampling Strategies: Utilizing entropy-aware methods and dynamic token selection to optimize the computational efficiency of large language models during inference.
- Generative Approaches: Transforming recognition tasks into generative problems to handle incremental learning and maintain learned knowledge.
- In-Context Learning: Leveraging the context sensitivity of LLMs for text preprocessing tasks to dynamically adjust the processing based on the input context and language.
Datasets and Evaluation
The main datasets and evaluation metrics used across the papers include:
- HumanEval, MBPP, GSM8K, MATH500: Used to evaluate the accuracy and efficiency of Latent Refinement Decoding (LRD).
- Various General Knowledge, Alignment, Mathematics, STEM, Coding Benchmarks: Used to assess the performance gains of Dynamic Nested Depth (DND).
- AIME 2024/2025, HMMT 2025, GPQA-Diamond, HumanEval Plus: Employed to test the computational efficiency and performance of EAGer.
- OntoNotes, Few-NERD: Used to evaluate the continual learning capability and performance of GenCNER.
- PhoNER-COVID19, PhoDisfluency, VietMed-NER: Applied to demonstrate the enhanced token-level classification performance of TextGraphFuseGAT.
- Multiple Datasets and ML Algorithms: Used in the investigation of LLMs’ linguistic abilities for text preprocessing, including stopword removal, stemming, and lemmatization.
Topic 10: Language Model Adaptation and Specialization
Topic Overview
The topic of language model adaptation and specialization focuses on enhancing the performance and reliability of large language models (LLMs) in specific domains or tasks, often through fine-tuning processes. This area is crucial due to the increasing reliance on LLMs in diverse applications ranging from natural language processing to content generation. Fine-tuning involves adjusting the parameters of a pre-trained model to fit a particular domain or task, but this process can introduce issues like memorization, which can compromise the ethical and legal compliance of LLMs. Additionally, there’s a growing need for specialized tools to address societal concerns, such as hate speech detection, especially in languages with limited resources. Addressing these challenges ensures that LLMs are not only powerful but also safe and effective in real-world applications.
Individual Paper Contributions
-
Dean L. Slack from Durham University and colleagues studied the issue of memorization in LLMs during the fine-tuning phase, focusing on domain adaptation and instruction tuning. They proposed an $n$-gram memorization score as a scalable and effective precursor to verbatim memorization and introduced an $n$-gram-aware loss regularizer to mitigate this problem. The main innovation points of this method are its ability to detect and prevent memorization early in the fine-tuning process, which is a common pitfall in using smaller, private datasets. The value lies in providing a practical solution to ensure ethical and legal compliance, particularly concerning privacy and copyright issues. Experiments on datasets like SST-5, QQP, RTE, WANLI, SQuAD v2, HellaSwag, PubMedQA, XSum, CNN/DailyMail, Alpaca, and FLAN v2 demonstrated a reduction in memorization by up to 40%, with minimal impact on evaluation performance, compared to the Goldfish regularizer. The study concluded that using an $n$-gram memorization score as an early stopping criterion and applying the $n$-gram-aware loss regularizer are effective strategies to manage memorization during fine-tuning53.
-
Paloma Piot from [institution name] and colleagues addressed the challenge of detecting hate speech in low-resource Iberian languages, including European Spanish, European Portuguese, and Galician. They proposed a meta-collection of standardized hate speech datasets for these languages and evaluated state-of-the-art LLMs under various settings—zero-shot, few-shot, and fine-tuning. The innovation points include the systematic integration of existing resources, generation of synthetic data, and consideration of internal linguistic variations, particularly in Galician. The value of this work lies in providing a robust framework for hate speech detection in languages with limited annotated data, thereby expanding the scope of these technologies to protect a broader user base. Experiments on datasets translated from Spanish into Portuguese and Galician showed consistent improvements in performance from zero-shot to fine-tuning configurations. The conclusion was that fine-tuning significantly enhances hate speech detection capabilities in these languages, though zero and few-shot performances remain lower, highlighting the importance of language-specific data curation for optimal results54.
Technical Trends
The technical trends observed in the papers indicate a shift towards addressing specific issues that arise during the fine-tuning phase of LLMs. One trend is the focus on mitigating memorization, which is critical for maintaining the integrity and safety of models trained on sensitive data. Another trend is the exploration of multilingual and variety-aware approaches to handle low-resource languages, particularly in specialized tasks like hate speech detection. These trends suggest a growing awareness of the limitations of generic pre-trained models and the need for tailored solutions that respect linguistic diversity and data privacy.
Datasets and Evaluation
-
Memorization Study: The study led by Slack et al. utilized a wide range of datasets including SST-5, QQP, RTE, WANLI, SQuAD v2, HellaSwag, PubMedQA, XSum, CNN/DailyMail, Alpaca, and FLAN v2. Evaluation was primarily based on comparing the effectiveness of different regularizers and early stopping techniques in reducing memorization, measured through $n$-gram memorization scores and performance on validation and task-specific evaluations.
-
Hate Speech Detection: Piot et al.’s work involved creating a meta-collection of hate speech datasets for European Spanish, which was then translated into European Portuguese and Galician. They evaluated state-of-the-art LLMs using zero-shot, few-shot, and fine-tuning configurations. The performance was assessed through hate speech detection accuracy, with a focus on how internal linguistic variations affect model performance across different dialects and languages.
Topic 11: misc
Topic Overview
This collection of research papers explores various challenges and innovations in the domain of large language models (LLMs) and their applications. The papers delve into issues ranging from the alignment of code to formal mathematical expressions, the detection of emotional mechanisms within LLMs, the evaluation of psychometric tests for assessing biases, to the efficient and accurate handling of complex queries and data in information retrieval systems. Each paper addresses a unique aspect of LLM functionality, aiming to improve their reliability, efficiency, and applicability in diverse fields, from formal verification to cultural inclusivity in AI.
Individual Paper Contributions
-
Yupei Li from Imperial College London and colleagues studied the challenge of autoformalisation, which involves translating informal mathematical statements into formal ones. They proposed TopoAlign, a framework that structurally aligns code data with formal mathematical languages, and introduced ‘code autoformalisation’ (CAF) as a training task. The main innovation points include the creation of a large-scale pre-training dataset of 300 million tokens and the demonstration through ablation studies that a balanced mix of aligned code and formal mathematical statements yields optimal performance. The value lies in enhancing the training of LLMs for formal mathematics, thus facilitating automated proof generation and verification. Experiments on MiniF2F, Putnam, and ProofNet benchmarks showed relative BEq improvements of 36.7% for DeepSeek-Math and 6.2% for Herald, concluding that structural alignment of code with formal math data significantly boosts autoformalisation capabilities 55.
-
Saurabh Khanna from University of Amsterdam and colleagues focused on the systematic exclusion of certain languages, termed ‘Invisible Giants’, from large-scale digital ecosystems and AI training data. They proposed a critical framework that connects empirical measurements of language vitality and digital presence with postcolonial theory and epistemic injustice, introducing a categorization of languages and a methodological approach using Gaussian Mixture Models and logistic regression to predict Invisible Giant status. The value lies in addressing linguistic inequality and offering recommendations for decolonizing language technology. Analysis using comprehensive datasets from Ethnologue, Common Crawl, Wikipedia, and Hugging Face revealed that approximately 2,000 languages, including many from formerly colonized regions, are underrepresented, underscoring the necessity for a reevaluation of LLM development to include these languages 56.
-
Stefan Krsteski from EPFL and colleagues addressed the creation of accurate survey simulations using LLMs with limited human data. They proposed a rigorous evaluation framework combining synthesis methods (fine-tuning and prompting) with rectification methods to correct biases in LLM-generated survey responses. The main innovation is demonstrating that allocating the majority of human responses to rectification rather than fine-tuning can yield better performance. The value lies in reducing the cost and time associated with traditional human surveys while improving the reliability of LLM-based simulations. Experiments on NHANES and ATP showed that combining synthesis with rectification reduces bias below 5% and can increase the effective sample size by up to 14%, concluding that rectification is essential for achieving valid estimates 57.
-
Chenxi Wang from Organization indicated by ★ and colleagues explored understanding and controlling the emotional mechanisms within LLMs. They introduced a systematic framework for uncovering and manipulating emotion circuits, proposing the SEV dataset and a circuit-level modulation method for emotion steering. The main innovation points include a three-stage analytical framework and a novel method for emotion control. The value lies in advancing the field towards more interpretable and reliable emotional AI. Experiments confirmed that the extracted emotion directions capture context-agnostic representations of emotional expression with a 99.65% expression accuracy on the test set, outperforming both prompting- and steering-based baselines 58.
-
Jana Jung from University of Mannheim and colleagues investigated the applicability of human psychometric tests to assess sexism, racism, and morality in LLMs. They proposed a systematic approach to validate these tests for LLMs, focusing on reliability and validity. The main innovation points include methods for assessing reliability through alternate forms and reversed answer orders, and evaluating validity through theoretical expectations and real-world task comparisons. The value lies in providing evidence that existing human psychometric tests do not reliably predict LLM behavior, suggesting the need for LLM-specific tests. Experiments revealed that psychometric tests exhibit moderate reliability but lack ecological validity, implying that current tests may misrepresent LLM behavior 59.
-
Shubham Chatterjee from Missouri University of Science and Technology and colleagues tackled the challenge of effectively modeling complex semantic relationships between queries and documents in information retrieval systems. They proposed QDER, a multi-vector neural re-ranking architecture that integrates entity-aware and fine-grained matching capabilities. The main innovation points include a dual-channel architecture and the use of bilinear projections for combining interaction signals. The value lies in improving retrieval accuracy and effectiveness, particularly for complex queries. Experiments on TREC Robust 2004, TREC News 2021, TREC Core 2018, TREC CAR, and CODEC datasets showed significant outperformance across MAP, nDCG@20, P@20, and MRR metrics, with notable success in handling difficult queries 60.
-
Junpeng Liu from Chinese University of Hong Kong (CUHK) and colleagues focused on the lack of suitable reward models for evaluating the professionalism of documents in terms of structure and style. They proposed DocReward, a model designed to assess document professionalism, and introduced DocPair, a large-scale multi-domain dataset. The main innovation points include the Bradley-Terry loss for optimizing the model and the focus on structural and stylistic elements. The value lies in enhancing the professional quality of documents, which benefits applications from business to academia. Experiments demonstrated that DocReward-7B achieved an overall accuracy of 89.22% in human preference for document structure and style, outperforming GPT-5 by 19.45 percentage points 61.
-
Marcus Emmanuel Barnes from [Institution] and colleagues addressed the inefficiency and high cost of processing verbose, noisy real-world text-rich data directly into LLMs. They proposed Task-Aware Reduction (TAR), a method for preprocessing data to prioritize task-relevant information and optimize token budgets. The main innovation points include the focus on semantic relevance over size reduction and the development of a research agenda for sustainability and adaptability. The value lies in improving the scalability and sustainability of LLM-driven systems. Although specific experimental improvements over baselines are not mentioned, the paper sets a direction for future research in task-specific reduction techniques 62.
-
Yusheng Song from [Institution] and colleagues sought to resolve the ‘Detection Dilemma’ in LLMs, where methods based on Internal State Probing (ISP) and Chain-of-Thought Verification (CoTV) have complementary strengths and weaknesses. They introduced a unified framework that integrates ISP and CoTV, using a multi-path reasoning mechanism and a segment-aware temporalized cross-attention module. The main innovation points are the use of LLM-as-a-Judge protocol and high-quality hallucination labels. The value lies in providing a more generalized and robust solution for hallucination detection. Experiments on TruthfulQA, TriviaQA, and GSM8K benchmarks showed significant improvements in AUROC, suggesting that the integration of multiple reasoning paths and cross-modal fusion techniques enhances detection capabilities 63.
-
Hayate Funakura from Kyoto University and colleagues evaluated the logical correctness of neural semantic parsers using graph-matching metrics combined with automated theorem proving. They introduced a modified Smatch metric called Counter for Discourse Representation Structures (DRS) and applied prenex normalization to target formulas. The main innovation points include the emphasis on logical correctness through theorem proving and the exploration of supervised fine-tuning and few-shot in-context learning models. The value lies in ensuring that generated formal representations can support logically sound inference. Experiments on the SICK dataset showed that prenex normalization significantly improves Prover Accuracy and reduces the Non-Well-Formed Formula (Non-WFF) ratio, indicating the importance of model capacity and target normalization 64.
-
Tao Li from [Institution] and colleagues developed WebRouter, a query-specific router for cost-sensitive web agents that operate under the Variational Information Bottleneck (VIB) principle. The main innovation points are the cost-aware VIB objective and the reduction of noisy and redundant prompts. The value lies in balancing accuracy and operational cost in web automation tasks. Experiments on five real-life websites from the WebVoyager dataset showed that WebRouter reduced operational costs by 87.8% with only a 3.8% drop in accuracy compared to a GPT-4o baseline, outperforming other baselines in both accuracy and efficiency 65.
-
Shubham Chatterjee from Missouri University of Science and Technology and colleagues introduced REGENT, a re-ranking model that uses relevance-guided attention to handle complex, multi-hop reasoning queries and long documents. The main innovation points include the integration of token-level BM25 scores and query-specific entity representations into the attention mechanism. The value lies in enhancing the ability of search engines to provide accurate and contextually relevant information. Experiments on TREC Robust04, TREC Core 2018, and CODEC datasets showed MAP improvements ranging from 29.1% to 108% relative to BM25, concluding that REGENT excels on semantically complex queries 66.
-
Gareth Seneque from [Institution] and colleagues aimed to achieve principled reasoning and alignment in LLMs without heavy reliance on human preference data or correctness-verified datasets. They proposed ENIGMA, a training method that integrates Group-Relative Policy Optimization (GRPO), Self-Supervised Alignment with Mutual Information (SAMI), and Sinkhorn divergence for Optimal Transport (OT). The main innovation points are the unified information-geometric training objective and the Sufficiency Index (SI) for evaluating constitutional principles. The value lies in reshaping the model’s behavior to adhere to organizational principles and standards. Experiments on TruthfulQA and GPQA benchmarks showed significant improvements in AUROC, indicating that ENIGMA can effectively guide LLMs towards principled reasoning 67.
Technical Trends
The papers in this collection showcase a trend towards more sophisticated and targeted methods for enhancing the performance, reliability, and applicability of LLMs. Innovations include:
-
Structural Alignment and Decomposition: Techniques like TopoAlign and QDER emphasize the importance of aligning and decomposing data to match specific domains or tasks, improving model performance in areas like formal mathematics and semantic parsing.
-
Emotion and Bias Analysis: Papers like ‘Do LLMs “Feel”?’ and ‘Do Psychometric Tests Work for Large Language Models?’ highlight the need for specialized frameworks to analyze and control emotional expression and biases within LLMs, moving towards more ethical and emotionally intelligent AI.
-
Query-Specific Processing and Routing: Methods like WebRouter and REGENT underscore the importance of dynamic and context-aware approaches for processing and routing complex queries, improving the efficiency and accuracy of information retrieval systems.
-
Theorem Proving and Logical Adequacy: The ‘A Theorem-Proving-Based Evaluation of Neural Semantic Parsing’ paper introduces automated theorem proving as a critical component for evaluating the logical correctness of semantic parsing, pushing the boundaries of model reliability.
-
Cost-Effectiveness and Sustainability: The ‘Task-Aware Reduction for Scalable LLM-Database Systems’ paper emphasizes the need for cost-aware strategies in deploying LLMs, reflecting a growing awareness of the environmental and economic impacts of AI systems.
Datasets and Evaluation Metrics
- MiniF2F, Putnam, ProofNet: Used in TopoAlign to evaluate autoformalisation performance, measured by BEq (Bitwise Equality).
- Ethnologue, Common Crawl, Wikipedia, Hugging Face: Used in ‘Invisible Languages of the LLM Universe’ to categorize and analyze languages.
- NHANES dietary recall survey, American Trends Panel (ATP): Employed in ‘Valid Survey Simulations with Limited Human Data’ to assess survey simulation accuracy.
- SEV: Introduced in ‘Do LLMs “Feel”?’ to study emotion circuits in LLMs, evaluated using expression accuracy and success rates.
- SICK Dataset: Used in ‘A Theorem-Proving-Based Evaluation of Neural Semantic Parsing’ for logical equivalence assessment, evaluated by Prover Accuracy and Exact Match.
- TruthfulQA, TriviaQA, GSM8K: Applied in ‘Hallucination Detection via Internal States and Structured Reasoning Consistency’ to evaluate hallucination detection, measured by AUROC.
- TREC Robust 2004, TREC News 2021, TREC Core 2018, TREC CAR, CODEC: Utilized in both ‘Query-Specific Document and Entity Representations for Multi-Vector Document Re-Ranking’ and ‘REGENT: Relevance-Guided Attention for Entity-Aware Multi-Vector Neural Re-Ranking’ to test retrieval performance, evaluated by MAP, nDCG@20, P@20, and MRR.
- WebVoyager: Used in ‘WebRouter: Query-Specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent’ to evaluate cost-performance trade-offs in web agents.
- DocPair: Introduced in ‘DocReward: A Document Reward Model for Structuring and Stylizing’ to assess document professionalism, evaluated by human preference accuracy.
These datasets and evaluation metrics collectively contribute to a more nuanced and comprehensive understanding of LLM performance across different domains and tasks.
References
-
LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens ↩︎
-
Discursive Circuits: How Do Language Models Understand Discourse Relations? ↩︎
-
Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization ↩︎
-
LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models ↩︎
-
Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries ↩︎
-
Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs ↩︎
-
Data or Language Supervision: What Makes CLIP Better than DINO? ↩︎
-
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding ↩︎
-
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents ↩︎
-
Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications ↩︎
-
Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap ↩︎
-
Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies ↩︎
-
Towards Real-Time Fake News Detection under Evidence Scarcity ↩︎
-
Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content ↩︎
-
TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code ↩︎
-
Enabling Doctor-Centric Medical AI with LLMs through Workflow-Aligned Tasks and Benchmarks ↩︎
-
GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences ↩︎
-
Culturally-Aware Conversations: A Framework & Benchmark for LLMs ↩︎
-
CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis ↩︎
-
Celebrity Profiling on Short Urdu Text using Twitter Followers’ Feed ↩︎
-
Scaling Law in LLM Simulated Personality: More Detailed and Realistic Persona Profile Is All You Need ↩︎
-
Don’t Walk the Line: Boundary Guidance for Filtered Generation ↩︎
-
FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks ↩︎
-
QeRL: Beyond Efficiency – Quantization-enhanced Reinforcement Learning for LLMs ↩︎
-
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models ↩︎
-
Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains? ↩︎
-
R-WoM: Retrieval-augmented World Model For Computer-use Agents ↩︎
-
Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities ↩︎
-
LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation ↩︎
-
Are Large Language Models Effective Knowledge Graph Constructors? ↩︎
-
Domain-Specific Data Generation Framework for RAG Adaptation ↩︎
-
Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation ↩︎
-
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance ↩︎
-
FinVet: A Collaborative Framework of RAG and External Fact-Checking Agents for Financial Misinformation Detection ↩︎
-
Information-Preserving Reformulation of Reasoning Traces for Antidistillation ↩︎
-
Attacks by Content: Automated Fact-checking is an AI Security Issue ↩︎
-
Bag of Tricks for Subverting Reasoning-based Safety Guardrails ↩︎
-
Fairness Metric Design Exploration in Multi-Domain Moral Sentiment Classification using Transformer-Based Models ↩︎
-
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers ↩︎
-
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression ↩︎
-
ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces ↩︎
-
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States ↩︎
-
DND: Boosting Large Language Models with Dynamic Nested Depth ↩︎
-
EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling ↩︎
-
An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification ↩︎
-
Investigating Large Language Models’ Linguistic Abilities for Text Preprocessing ↩︎
-
GenCNER: A Generative Framework for Continual Named Entity Recognition ↩︎
-
Early Detection and Reduction of Memorisation for Domain Adaptation and Instruction Tuning ↩︎
-
Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages ↩︎
-
TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition ↩︎
-
Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification ↩︎
-
Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality ↩︎
-
QDER: Query-Specific Document and Entity Representations for Multi-Vector Document Re-Ranking ↩︎
-
DocReward: A Document Reward Model for Structuring and Stylizing ↩︎
-
Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models ↩︎
-
A Theorem-Proving-Based Evaluation of Neural Semantic Parsing ↩︎
-
WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent ↩︎
-
REGENT: Relevance-Guided Attention for Entity-Aware Multi-Vector Neural Re-Ranking ↩︎
-
ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models ↩︎