NLP Paper Roundup for October 9, 2025 (English)
- Topic 1: Large Language Model Optimization and Fine-Tuning (8 papers)
- Topic 2: Multimodal and Multilingual Reasoning (4 papers)
- Topic 3: Reinforcement Learning and Policy Optimization for LLMs (8 papers)
- Topic 4: Knowledge Representation and Reasoning (7 papers)
- Topic 5: Speech and Audio Processing with LLMs (5 papers)
- Topic 6: Evaluation and Benchmarking of LLMs (6 papers)
- Topic 7: Reasoning and Logical Generalization (4 papers)
- Topic 8: Privacy and Security in LLMs (6 papers)
- Topic 9: Human-like Reasoning and Dialogue (8 papers)
- Topic 10: Data and Training Strategies for LLMs (5 papers)
- Topic 11: misc (39 papers)
Topic 1: Large Language Model Optimization and Fine-Tuning
Topic Overview
Large Language Model (LLM) optimization and fine-tuning have become central themes in the advancement of AI applications across various domains. These models, characterized by their massive scale and versatile capabilities, offer unprecedented opportunities for improving task-specific performance. However, challenges remain in understanding how fine-tuning impacts these models, especially in specialized fields like healthcare, and in enhancing their efficiency, reasoning, and adaptability to different tasks and data formats. Addressing these issues can lead to more efficient, reliable, and trustworthy AI systems, which are essential for applications ranging from healthcare to e-commerce and beyond.
Individual Paper Contributions
-
Eshaan Tanwar from the University of Copenhagen and colleagues studied the effects of domain-specific fine-tuning on LLMs, focusing on the medical domain. They introduced a framework called ‘tuning vectors’, inspired by task vectors, to analyze how fine-tuning modifies a model’s parameter space. The main innovations are the focus on larger, domain-specific models and the examination of fine-tuning’s impact across multiple performance axes, including instruction-following and generation quality. The value lies in providing a general, interpretable method for analyzing specialization and in showing the potential for combining tuning vectors from different domains to enhance generalization. Experiments on medical benchmarks and text generation tasks showed significant performance drops when tuning vectors were removed, and the authors concluded that fine-tuning primarily writes new directional information into the MLP layers while amplifying existing directions in attention heads1.
-
Hairu Wang from University of Science and Technology of China and colleagues tackled the challenge of product pricing in C2C e-commerce platforms for second-hand goods. They proposed LLP, a system that uses LLMs for generating price suggestions based on a ‘retrieval-then-reasoning’ paradigm. The main innovation is the use of post-training techniques such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) to enhance domain-specific knowledge and reasoning abilities. The value lies in improving transaction efficiency and success rates by providing accurate price estimations for non-standardized products. Evaluation on a real-world Xianyu dataset showed significant improvements in metrics like RMSLE, MALE, SAR, and DAR, highlighting the robust generalization and effectiveness of LLP over existing methods2.
-
Tao Feng from Monash University and colleagues addressed the problem of performing causal discovery in the absence of structured tabular data. They introduced IRIS, a hybrid framework that automates the process of collecting relevant documents and extracting variable values from unstructured text to form structured data for causal discovery. The innovation lies in the integration of statistical algorithms and LLM-based methods, along with a missing variable proposal component that expands causal graphs. The value is in enabling causal discovery without the need for high-quality structured data, which is often costly and time-consuming to obtain. Experiments across Cancer, Respiratory Disease, Diabetes, Obesity, ADNI, and Insurance datasets demonstrated that IRIS achieves the highest F1 scores and lowest NHD ratios, outperforming strong baselines like 0-shot, CoT, and RAG3.
-
Changjiang Gao from the National Key Laboratory for Novel Software Technology, Nanjing University and colleagues focused on the degradation in reasoning capabilities of translation-enhanced models. They proposed a layer-selective tuning method to maintain reasoning abilities while improving translation performance. The innovations include starting from instruct models and selectively tuning certain layers using minimal parallel data. The value is in creating more versatile models that can handle both translation and reasoning tasks efficiently, without needing specialized task data. Qwen3-XPlus-8B and Qwen3-XPlus-14B outperformed their base models and other baselines on 7 multilingual tasks and on reasoning tasks, showing that selective tuning of the bottom and top layers is particularly effective for low-resource languages4.
-
Xu Pan from the Center for Brain Science, Harvard University and colleagues aimed to bridge the data-efficiency gap between autoregressive LLMs (arLLMs) and masked diffusion LLMs (dLLMs). They introduced a masked fine-tuning paradigm that mimics the diffusion-style mask reconstruction loss, enhancing arLLMs’ ability to generalize knowledge from fine-tuning to QA tasks without extensive paraphrase data. The innovation lies in emulating the data efficiency of dLLMs for arLLMs. The value is in making arLLMs more adaptable and efficient learners, which is crucial for dynamic environments. Experiments on NameDescription, Biography, and Wiki datasets showed that the proposed method significantly improved arLLMs’ performance on QA tasks, closing the gap with dLLMs5.
-
Weiqing Luo from the University of North Carolina at Chapel Hill and colleagues sought to address the difficulty and cost associated with curating large reasoning datasets for LLMs. They introduced Prompting Test-Time Scaling (P-TTS), a method that uses a small pool of manually selected reasoning instances and varies exemplar augmentation through instructional prompts at test time. The innovation is in leveraging test-time data augmentation to enhance reasoning performance. The value lies in reducing the curation cost by an order of magnitude while achieving competitive or superior performance. Experiments on mathematical reasoning benchmarks like AIME2024 and AIME2025 showed absolute accuracy gains of up to 30.00 percentage points, demonstrating the effectiveness of P-TTS in improving zero-shot generalization6.
-
Xixi Wang and colleagues focused on the extraction of implicit information from unstructured crash narratives. They proposed a methodology to adapt pre-trained language models (PLMs) and LLMs using parameter-efficient fine-tuning (PEFT) with LoRA for the specific task of crash narrative analysis. The innovation is in using PEFT to adapt open-source PLMs to the traffic safety domain, addressing the challenge of limited crash-specific knowledge. The value is in automating crash narrative analysis, reducing manual labor and errors, and supporting larger-scale traffic safety studies. Experiments on the CISS dataset showed significant improvements in accuracy and macro F1 scores, even with smaller models, indicating the effectiveness of LoRA and fine-tuning in crash narrative analysis7.
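The ‘tuning vectors’ analysis in the first item above is described as inspired by task vectors, which are the element-wise difference between fine-tuned and base weights. The following is a minimal sketch of that arithmetic on toy weight dictionaries. The function names, the `alpha` scaling parameter, and the toy values are illustrative assumptions, not the authors’ implementation.

```python
def tuning_vector(base, tuned):
    """Per-parameter difference: fine-tuned weights minus base weights."""
    return {
        name: [t - b for b, t in zip(base[name], tuned[name])]
        for name in base
    }


def apply_vector(base, vector, alpha=1.0):
    """Add alpha * vector to the base weights.

    alpha=1.0 reproduces the fine-tuned weights; alpha=0.0 returns the base
    weights, which corresponds to removing the vector entirely.
    """
    return {
        name: [b + alpha * v for b, v in zip(base[name], vector[name])]
        for name in base
    }


# Toy "models" with two parameter groups (hypothetical names and values).
base = {"mlp.w": [0.5, -1.0], "attn.w": [2.0, 0.0]}
tuned = {"mlp.w": [0.75, -1.5], "attn.w": [2.5, 0.0]}

vec = tuning_vector(base, tuned)
restored = apply_vector(base, vec, alpha=1.0)
ablated = apply_vector(base, vec, alpha=0.0)
```

Because the vector lives in the same space as the weights, vectors from different domains can in principle be summed before being applied, which is the kind of combination the summary mentions.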
Technical Trends
The papers collectively highlight several key trends in LLM optimization and fine-tuning:
- Domain-Specific Adaptation: Techniques such as ‘tuning vectors’ and parameter-efficient fine-tuning (PEFT) with LoRA are being developed to better adapt LLMs to specialized domains like healthcare and e-commerce.
- Efficient Learning Methods: Layer-selective tuning and masked fine-tuning paradigms aim to reduce the need for extensive data and resources, making LLMs more efficient learners.
- Inference-Time Augmentation: Methods like P-TTS demonstrate the potential of using test-time data augmentation to improve reasoning performance with minimal additional data.
- Hybrid Approaches: Combining statistical algorithms with LLM-based methods, as seen in IRIS, opens new avenues for causal discovery and handling unstructured data.
- Adaptive Resolution Selection: For Visual Large Language Models (VLLMs), adaptive resolution selection strategies are emerging to optimize performance based on the specific requirements of vision-language tasks.
Datasets and Evaluation
The primary datasets and evaluation metrics used across the papers include:
- Medical Benchmarks and Text Generation Tasks: Used to assess the impact of domain-specific fine-tuning on LLMs.
- Xianyu Dataset: Real-world e-commerce data used for evaluating the effectiveness of LLP in product pricing.
- Cancer, Respiratory Disease, Diabetes, Obesity, ADNI, and Insurance Datasets: Used for causal discovery and assessing IRIS’s performance.
- NameDescription, Biography, and Wiki Datasets: Employed to compare the data efficiency of arLLMs and dLLMs and to evaluate the masked fine-tuning paradigm.
- Mathematical Reasoning Datasets (AIME2024, AIME2025, MATH500, GPQA-Diamond): Utilized to test the effectiveness of P-TTS in enhancing LLM reasoning.
- Crash Investigation Sampling System (CISS) Dataset: Applied for evaluating the performance of domain-adapted models in implicit information extraction from crash narratives.
Evaluation metrics commonly used include accuracy, macro F1 scores, Normalized Hamming Distance (NHD), RMSLE, MALE, SAR, DAR, and task-specific metrics like spBLEU and xCOMET for translation tasks. Each metric quantifies the performance improvements brought about by the proposed methods in their respective application areas.
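Two of the metrics named above have compact definitions that are easy to check by hand. The sketch below computes RMSLE in its standard form and a Normalized Hamming Distance between two adjacency matrices. The NHD normalization used here (mismatched entries divided by the number of matrix entries) is one common convention and is an assumption; the papers may normalize differently.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both sides."""
    squared = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


def normalized_hamming_distance(graph_a, graph_b):
    """Fraction of adjacency-matrix entries on which two graphs disagree.

    Both inputs are square 0/1 matrices over the same m nodes (assumed).
    """
    m = len(graph_a)
    mismatches = sum(
        graph_a[i][j] != graph_b[i][j] for i in range(m) for j in range(m)
    )
    return mismatches / (m * m)


# Hypothetical two-node causal graphs: the second adds a reverse edge.
predicted = [[0, 1], [1, 0]]
reference = [[0, 1], [0, 0]]
nhd = normalized_hamming_distance(predicted, reference)
```

Lower is better for both metrics: RMSLE penalizes relative rather than absolute price errors, and NHD counts missing, extra, and reversed edges alike.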
Topic 2: Multimodal and Multilingual Reasoning
Topic Overview
Multimodal and multilingual reasoning is an emerging area of research in natural language processing (NLP) that focuses on enhancing the capabilities of large language models (LLMs) to process and understand information presented in multiple forms (e.g., text, images, videos) and across various languages. This topic is crucial for developing AI systems capable of interacting effectively in global, multicultural environments where users may input data in a variety of formats and languages. Improvements in this domain can lead to more robust, versatile, and inclusive AI technologies, which are increasingly necessary as digital communication becomes more complex and widespread.
Individual Paper Contributions
-
Yihong Liu from LMU Munich and colleagues studied the robustness of large language models against multilingual typographical errors, proposing MulTypo, a multilingual typo generation algorithm that simulates realistic human-like errors. The main innovation is grounding typos in language-specific keyboard layouts and typing patterns, which enables a more naturalistic simulation of user input errors. The value lies in providing a comprehensive evaluation framework that assesses LLMs’ performance under noisy conditions, thus offering insights into their practical reliability and usability. Experiments on datasets such as XNLI, Belebele, MMMLU, MGSM, and FLORES200 showed that typographical errors significantly reduced LLM performance, especially in generative tasks and reasoning, and that model size did not necessarily correlate with higher robustness. Human evaluation confirmed that MulTypo-generated typos were rated as more natural than those of a naive baseline in most languages tested, highlighting the importance of realistic error simulation in robustness assessments8.
-
Zhenhailong Wang from University of Illinois Urbana-Champaign and colleagues addressed the challenge of efficiently integrating complex policies into multimodal conversational agents. They proposed the Multimodal Policy Internalization (MPI) task and a three-stage training framework named TriMPI, which includes Visually-Masked Continual Pretraining (VM-CPT), Supervised Finetuning with Chain-of-thought (CoT SFT), and a Reinforcement Learning stage utilizing PolicyRollout (PoRo). The innovation lies in the direct integration of policy knowledge into model parameters, bypassing the need for in-context policy inclusion during inference. The value is in enhancing the reliability and efficiency of conversational agents in handling complex, reasoning-intensive policies. Experiments on newly introduced datasets, ClevrPolicy and GTAPolicy, demonstrated that TriMPI achieved up to 70.7% and 79.4% absolute gains in accuracy over the CoT SFT baseline and in-context settings, respectively, showcasing significant improvements in policy adherence and generalization9.
-
Kaiwen Wei from Chongqing University and colleagues tackled the issue of fine-grained multimodal comprehension in video-based Retrieval-Augmented Generation (MRAG) systems. They introduced CFVBench, a comprehensive video benchmark for evaluating fine-grained multimodal retrieval and generation capabilities, along with the Adaptive Visual Refinement (AVR) framework. The key innovation is the use of adaptive frame interpolation and on-demand tool invocation strategies to enhance multimodal reasoning and detail extraction. The value of this work is in improving the factual accuracy, knowledge grounding, and explainability of MLLMs in complex video content. Experiments on CFVBench showed that the AVR framework notably improved the performance of models like Gemma-3-27b and InternVL3_5-30B, increasing recall and visual utilization scores, and demonstrating better handling of internal hallucinations and attention issues10.
-
Mir Tafseer Nayeem from University of Alberta and colleagues analyzed the limitations of fertility as a metric for evaluating multilingual tokenization and introduced the Single Token Retention Rate (STRR) metric. STRR measures the proportion of words preserved as single tokens across languages, complementing fertility by offering a more interpretable measure of cross-lingual fairness. The main contribution is the identification and quantification of implicit biases in tokenizers towards certain languages, such as English, and the fragmentation of others, like Hindi. The value lies in promoting more equitable and efficient multilingual tokenizers, which can improve LLM performance and cost efficiency in multilingual and code-mixed scenarios. Cross-lingual evaluation across seven languages and two domains revealed that STRR provided a clearer picture of vocabulary allocation biases, suggesting the need for refined tokenization evaluation metrics11.
Technical Trends
The papers collectively highlight several evolving trends in multimodal and multilingual reasoning:
- Error Simulation and Robustness Testing: There is a growing emphasis on simulating realistic errors (such as typographical errors) to test and improve the robustness of LLMs in real-world conditions.
- Policy Integration and Multimodality: Efforts are being made to integrate complex policies into multimodal conversational agents, emphasizing the need for advanced training frameworks that can handle diverse data types and reasoning tasks without excessive computational costs.
- Fine-grained Detail Extraction: The focus on fine-grained multimodal comprehension, particularly in video content, underscores the importance of developing sophisticated tools and frameworks to enhance model performance in extracting and utilizing detailed information.
- Cross-lingual Fairness and Tokenization: Research is increasingly directed towards evaluating and ensuring fair vocabulary allocation across different languages, using innovative metrics like STRR to identify and mitigate biases inherent in multilingual tokenizers.
Datasets and Evaluation Metrics
- XNLI, Belebele, MMMLU, MGSM, FLORES200: Used to evaluate the robustness of LLMs against multilingual typographical errors.
- ClevrPolicy, GTAPolicy: Introduced for assessing the ability of multimodal conversational agents to internalize and adhere to complex policies.
- CFVBench: A large-scale video-based MRAG benchmark designed to test fine-grained multimodal comprehension capabilities.
- STRR, Fertility, Subword Entropy, Characters-per-Token: Metrics proposed or utilized to evaluate the effectiveness and fairness of multilingual tokenizers.
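STRR is described above as the proportion of words preserved as single tokens, while fertility is conventionally the average number of tokens per word. The sketch below computes both over a word list. The `toy_tokenize` function is a hypothetical greedy longest-match segmenter standing in for a real subword tokenizer, and the vocabulary and word list are made up for illustration.

```python
def toy_tokenize(word, vocab):
    """Greedy longest-match segmentation (hypothetical stand-in tokenizer).

    Characters not covered by any vocabulary entry become one-character tokens.
    """
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def strr(words, tokenize):
    """Single Token Retention Rate: share of words kept as exactly one token."""
    return sum(len(tokenize(w)) == 1 for w in words) / len(words)


def fertility(words, tokenize):
    """Average number of tokens produced per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)


vocab = {"the", "cat", "run", "ning"}
words = ["the", "cat", "running", "xy"]


def tokenize(word):
    return toy_tokenize(word, vocab)


strr_score = strr(words, tokenize)            # 2 of 4 words stay whole
fertility_score = fertility(words, tokenize)  # (1 + 1 + 2 + 2) / 4
```

The two numbers answer different questions: fertility averages over all words, so a few heavily fragmented words can dominate it, whereas STRR reports directly how much of a language’s vocabulary the tokenizer keeps intact.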
Topic 3: Reinforcement Learning and Policy Optimization for LLMs
Topic Overview
Reinforcement Learning (RL) and Policy Optimization for Large Language Models (LLMs) represent a critical area of research aimed at enhancing the adaptability, decision-making, and overall performance of AI agents in complex, long-horizon tasks. These tasks often require the integration of simulation and reasoning capabilities, as well as the ability to handle sparse rewards effectively. By improving RL techniques and policy optimization methods, researchers aim to create more robust, versatile, and human-aligned AI systems capable of navigating diverse and challenging environments.
Individual Paper Contributions
-
Xiao Yu from Columbia University and colleagues studied how to improve AI agents’ performance in complex environments that demand multi-step reasoning, proposing Dyna-Mind, a two-stage training framework that teaches agents to simulate the environment by learning from real experiences. The two main components are ReSim and Dyna-GRPO: ReSim constructs simulation-guided reasoning traces using search trees derived from real environment interactions, and Dyna-GRPO incorporates future state information as textual signals to refine the policy model during online reinforcement learning. The value lies in improving the agents’ ability to construct accurate world models and simulate potential future scenarios, thereby enhancing their decision-making and adaptability. Experiments on synthetic benchmarks (Sokoban and ALFWorld) and a realistic benchmark (AndroidWorld) showed significant performance improvements, particularly in out-of-distribution test sets, where Dyna-GRPO achieved an average success rate of 90.8% on ALFWorld and 31.8% on AndroidWorld, demonstrating the effectiveness of direct learning from real experiences over synthetic data generation.12
-
Xingyu Lin from Baidu Inc and colleagues addressed the issue of entropy collapse and model collapse in Group Relative Policy Optimization (GRPO) during mathematical reasoning tasks, proposing TEPO (Token-Level Policy Optimization). The main innovations are the use of Markov Likelihood to link group-level rewards to token-level aggregation and a computation graph that stabilizes training while enabling effective token-level policy optimization. The value lies in improving training stability and performance in environments with sparse rewards, which are common in mathematical reasoning tasks. Experiments on the MATH-500 benchmark demonstrated a 4.8-point gain over GRPO, highlighting TEPO’s superior performance and stability.13
-
Yongding Tao from Peking University and colleagues focused on detecting data contamination during the RL post-training phase of LLMs, introducing Self-Critique. The main innovation is the utilization of token-level entropy sequences to identify RL-induced data contamination. The value lies in providing the first systematic approach to detecting contamination within the RL phase, supported by a new benchmark, RL-MIA, designed to simulate and evaluate this contamination. Experiments across various datasets and models showed up to a 30% improvement in AUC over existing baselines, confirming the effectiveness of entropy-based probes in the RL context.14
-
Chenyang Gu from Sapiens AI and colleagues tackled the instability and inefficiency in training LLMs for agentic search and reasoning tasks using RL, proposing DSPO (Dynamic-filter Sequence-level Policy Optimization). The key innovations are sequence-level policy optimization and a dynamic filtering mechanism that ensures every training batch has a meaningful learning signal. The value lies in improving training stability and efficiency without needing additional supervised demonstration data, which is crucial for complex, open-ended domains like interactive search. Experiments on multiple QA benchmarks, including HotpotQA, NQ, TriviaQA, and others, showed a 34.1% relative improvement over a comparable 7B baseline model and outperformed a 14B model, indicating robustness and effectiveness in maintaining a stable learning trajectory.15
-
Chengyu Wang and colleagues addressed the challenge of applying RL to align diffusion large language models (dLLMs) with human preferences or task-specific rewards, proposing SPG (Sandwiched Policy Gradient). The main innovations are the optimization of sandwiched variational bounds based on reward and a block-wise masking strategy that improves the stability and efficiency of policy optimization. The value lies in providing a tractable log-likelihood estimation and better alignment with desired outcomes. Experiments on reasoning benchmarks like GSM8K, MATH500, Countdown, and Sudoku showed significant improvements in test accuracy, ranging from 2.6% to 27.0% over state-of-the-art RL algorithms for diffusion language models, underscoring SPG’s effectiveness in improving model alignment with human preferences and task-specific objectives.16
-
Haoran Sun and colleagues explored the dynamic updating of user preference memory in LLM agents for long-term interactions, proposing PAMU (Preference-Aware Memory Update). The main innovation is a Preference Change Perception Module that uses sliding window averages (SW) and exponential moving averages (EMA) to capture both short-term and long-term user preference trends. The value lies in maintaining consistent and personalized responses over extended periods, enhancing user satisfaction and dialogue quality. Experiments on the LoCoMo dataset, which evaluates memory and consistency in extended multi-session interactions, showed statistically significant improvements in F1 and BLEU-1 scores across various baselines and task scenarios, validating PAMU’s effectiveness and adaptability.17
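The PAMU summary above mentions combining sliding window averages (SW) with exponential moving averages (EMA) to track preference trends. The sketch below shows the two estimators on a hypothetical per-turn preference score. Treating SW as the short-term signal and EMA as the long-term one, and using their gap as a change indicator, are assumptions made for illustration rather than details confirmed by the paper.

```python
def sliding_window_average(values, window):
    """Mean of the most recent `window` observations."""
    recent = values[-window:]
    return sum(recent) / len(recent)


def exponential_moving_average(values, alpha):
    """EMA over the full history; smaller alpha gives older turns more weight."""
    ema = values[0]
    for v in values[1:]:
        ema = alpha * v + (1 - alpha) * ema
    return ema


def preference_shift(values, window, alpha):
    """Gap between the recent-window estimate and the smoothed history.

    A large gap suggests the user's preference has recently moved
    (hypothetical change signal, not the paper's formula).
    """
    return sliding_window_average(values, window) - exponential_moving_average(
        values, alpha
    )


# Hypothetical score per dialogue turn, e.g. how strongly the user asks
# for concise answers (0.0 = never, 1.0 = always).
scores = [0.0, 0.0, 1.0, 1.0]

short_term = sliding_window_average(scores, window=2)
long_term = exponential_moving_average(scores, alpha=0.5)
shift = preference_shift(scores, window=2, alpha=0.5)
```

In this toy trace the window average has already jumped to the new preference while the EMA still remembers the earlier turns, so the positive gap flags a recent change.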
Technical Trends
The papers in this collection showcase a variety of technical trends aimed at advancing RL and policy optimization for LLMs:
- Integration of Simulation and Reasoning: Methods like Dyna-Mind emphasize the importance of leveraging real experience data to build more accurate simulations, thus enhancing reasoning capabilities in complex environments.
- Handling Sparse Rewards: Innovations such as TEPO and DSPO focus on addressing the instability and inefficiency caused by sparse reward signals in RL, particularly in domains like mathematical reasoning and agentic search.
- Detection of Data Contamination: Papers like Self-Critique highlight the necessity of developing systematic methods to detect and mitigate data contamination during the RL phase, ensuring model integrity and reliable performance.
- Efficient Long-Context Reasoning: DELTA introduces a novel sparse attention mechanism that aims to reduce computational costs while preserving the accuracy of long-context reasoning, crucial for practical deployment of reasoning-intensive models.
- Visual Perception and Multi-Image Reasoning: VisRAG 2.0 advances the field by focusing on enhancing visual perception and reasoning in multi-image scenarios, aiming to provide more accurate and reliable answers in complex vision-language tasks.
- Memory and User Preference Management: Preference-Aware Memory Update (PAMU) emphasizes the dynamic updating of user preference memory to ensure consistent and personalized responses over long-term interactions, a vital aspect for real-world applications involving sustained human-computer interactions.
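The sparse-reward issue noted in the list above can be made concrete. In GRPO, each sampled response’s advantage is its reward normalized by the mean and standard deviation of its group, so a group whose rewards are all identical gives every response an advantage of zero and contributes no gradient. The sketch below computes those advantages and drops such groups. The filtering rule is an assumption about what a dynamic filter might do, not a description of DSPO’s actual mechanism, and the population standard deviation and `eps` term are implementation choices.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def filter_informative_groups(groups):
    """Keep only groups whose rewards differ (hypothetical filtering rule)."""
    return [g for g in groups if max(g) != min(g)]


# Each inner list holds the rewards of several sampled responses to one prompt.
groups = [
    [1.0, 1.0, 1.0],  # all correct: no learning signal
    [0.0, 1.0],       # mixed outcomes: informative
    [0.0, 0.0],       # all wrong: no learning signal
]

kept = filter_informative_groups(groups)
advantages = [group_advantages(g) for g in kept]
```

With binary rewards, uniform groups become common when prompts are too easy or too hard for the current policy, which is one reason batch composition matters for training stability.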
Datasets and Evaluation
The papers utilize a range of datasets to evaluate their proposed methods:
- Synthetic and Realistic Benchmarks: Sokoban, ALFWorld, and AndroidWorld were used to assess the effectiveness of Dyna-Mind.
- Mathematical Reasoning Datasets: MATH-500 was utilized to evaluate TEPO.
- Reinforcement Learning Induced Contamination Benchmark: RL-MIA was introduced for Self-Critique.
- Question Answering Datasets: HotpotQA, NQ, TriviaQA, PopQA, 2WikiMultiHopQA, Musique, Bamboogle, and others were used for DSPO.
- Reasoning Datasets: GSM8K, MATH500, Countdown, and Sudoku were used to test SPG.
- Visual Question Answering Datasets: ChartQA, InfoVQA, DocVQA, SlideVQA, and ViDoSeek were utilized for VisRAG 2.0.
- Long-Term Interaction Dataset: LoCoMo was used to evaluate PAMU.
Evaluation metrics vary according to the specific tasks and datasets, including success rates, AUC (Area Under the Curve), accuracy, F1 scores, BLEU-1 scores, and decoding latency. These metrics collectively measure the performance, stability, and efficiency of the proposed methods in enhancing LLMs’ reasoning and decision-making capabilities.
Topic 4: Knowledge Representation and Reasoning
Topic Overview
Knowledge Representation and Reasoning (KRR) is a critical area in artificial intelligence that deals with the representation of knowledge in a way that supports automated reasoning. Advances in KRR can significantly enhance the performance of machine learning models in various applications, including social network analysis, anomaly detection, recommendation systems, commonsense reasoning, nutrition question answering, and multilingual video corpus retrieval. By improving how knowledge is structured and processed, researchers aim to develop more effective, reliable, and inclusive AI systems capable of handling complex tasks with limited labeled data or across diverse linguistic contexts.
Individual Paper Contributions
-
Ziyu Zheng from Xidian University and colleagues studied the limitations of existing graph prompt tuning methods, which focus on a single granularity (node-level, edge-level, or subgraph-level) and fail to capture multi-scale structural dependencies in real-world graphs. They proposed Multi-Scale Graph Chain-of-Thought (MSGCOT) to integrate multi-scale information into graph prompt tuning. Its main innovations are a lightweight coarsening network with a low-rank matrix architecture that reduces parameters while maintaining performance, and a backtracking-based progressive prompt optimization strategy. The value lies in enhancing the effectiveness of Graph Neural Networks (GNNs) in tasks with scarce labeled data, improving alignment between pre-training and downstream objectives. Experiments on eight benchmark datasets for node and graph classification tasks showed MSGCOT achieved the highest accuracy on all datasets, with significant improvements ranging from 5% to 20%, particularly on the COX2 dataset where it surpassed DAGPrompt by 18.62%. The paper concluded that multi-scale prompts are crucial for performance improvement, especially in graph classification tasks, and that MSGCOT maintains high parameter and computational efficiency.18
-
Francesco Maria Molfese from Sapienza University of Rome and colleagues aimed to address the limitations in evaluating small language models (SLMs) in commonsense reasoning tasks, focusing on the reliance on final answer accuracy rather than reasoning process validity. They introduced ReTraceQA, the first benchmark for evaluating reasoning traces in SLMs. The benchmark includes 2,421 reasoning traces manually annotated with step-level errors and qualitative categorizations. The value lies in providing a more comprehensive assessment of SLM capabilities, revealing that up to 24% of flawed reasoning traces still produce correct final answers. Through experiments, they demonstrated that LLMs as judges could detect global correctness but struggled to localize reasoning errors. The best performing LLM-as-a-judge, o1-mini, achieved an average F1 score of 60.8% in reference-based evaluation, underscoring the necessity of considering the reasoning process in model assessments.19
-
Chi Seng Cheang from Singapore Management University and colleagues explored how large language models (LLMs) internally process factual queries and generate outputs, distinguishing between factual associations (FAs), associated hallucinations (AHs), and unassociated hallucinations (UHs). They presented a theoretical analysis and an experimental framework using causal analysis and knowledge triples from Wikidata. The main innovations are the categorization of knowledge and the examination of hidden-state geometries. The value lies in challenging the assumption that LLMs can reliably distinguish between correct predictions and AHs, revealing limitations in current hallucination detection methods. Experiments on LLaMA-3-8B and Mistral-7B-v0.3 showed AHs and FAs are harder to distinguish from each other than from UHs, leading to a consistent 18.6 percentage point drop in performance scores when reasoning-aware evaluation is considered.20
-
Adity Khisa from IIT, University of Dhaka and colleagues focused on the underrepresentation and poor performance of language models for the critically low-resource Chakma language. They introduced a novel corpus of contextually coherent Bangla-transliterated Chakma and proposed a method of fine-tuning multilingual and regional transformer models on masked language modeling (MLM) tasks using this dataset. The value lies in demonstrating that high-quality, manually corrected data significantly improves the effectiveness of transfer learning for Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Experiments revealed that fine-tuned models perform better than their pre-trained counterparts, especially with manual data, suggesting that data quality is crucial for enhancing model performance.21
-
Kaiwen Shi from the University of Notre Dame and colleagues addressed the limitations of single agents for domain-specific reasoning and the complexity of multi-agent systems in the context of Nutrition Question Answering (QA). They proposed the Nutritional-Graph Router (NG-Router), a framework that integrates multi-agent systems into knowledge graphs for task-aware routing. The main innovations are a heterogeneous Graph Neural Network (GNN) that produces routing distributions over agents and a gradient-based subgraph retrieval mechanism that filters salient entities. The value lies in demonstrating robust improvements over single-agent and ensemble baselines, particularly in sparse datasets where F1 scores increased by more than 50%. NG-Router also showed strong transferability in binary classification and natural text generation tasks.22
-
Wenbin Guo from Tianjin University and colleagues tackled the challenge of integrating knowledge graph (KG) embeddings into large language models (LLMs) for knowledge graph completion (KGC) tasks. They proposed ReaLM, a unified framework that uses residual vector quantization and ontology-guided class constraints to transform high-dimensional KG embeddings into compact discrete codes. The main innovations are the transformation method and the incorporation of class-level constraints for semantic consistency. The value lies in enhancing KGC tasks by improving coverage and reliability, as demonstrated on benchmarks like FB15k-237 and WN18RR. ReaLM achieved an MRR of 0.608 and a Hits@10 of 0.699 on WN18RR, and an MRR of 0.467 and a Hits@10 of 0.603 on FB15k-237, surpassing previous methods across all metrics.23
-
Yu Wang and colleagues aimed to improve the retrieval of relevant instructional videos from a multilingual medical archive, particularly when the query language differs from the video’s subtitles. They introduced DIMA, a multi-stage framework for multilingual video corpus retrieval (mVCR) that uses semantic chunking, domain-specific KG enrichment, and multilingual embeddings. The main innovations are the hierarchical index construction using LaBSE embeddings and the dynamic tree-pruning strategy for efficient search. The value lies in scaling to long videos and bridging language gaps efficiently. Experiments on a specialized medical video dataset showed DIMA outperformed various baselines in Recall@1, Recall@10, Recall@50, Mean Reciprocal Rank (MRR), and overall scores, indicating the critical role of the LLM re-ranking component in achieving nuanced relevance scores.24
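The ReaLM summary above relies on residual vector quantization (RVQ) to turn continuous embeddings into short sequences of discrete codes. The sketch below shows the generic RVQ encode and decode loop: each stage picks the nearest codeword to the current residual and passes the remainder to the next stage. The hand-written two-stage codebooks are purely illustrative assumptions (real codebooks are learned), and nothing here reflects ReaLM’s ontology-guided constraints.

```python
def nearest(codebook, vector):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def squared_distance(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vector))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage, each stage encoding the previous residual."""
    residual, codes = list(vector), []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected codewords."""
    reconstruction = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        reconstruction = [r + c for r, c in zip(reconstruction, codebook[index])]
    return reconstruction


# Hypothetical codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode([1.25, 1.0], codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

The resulting code sequence is what makes the embedding usable as tokens: each stage index can be mapped to a vocabulary entry, and later stages refine the approximation made by earlier ones.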
Technical Trends
The papers collectively highlight several key trends in knowledge representation and reasoning (KRR):
- Multi-scale Integration: Incorporating multi-scale structural dependencies into graph models to improve performance in node and graph classification tasks.
- Reasoning Process Evaluation: Shifting focus from final answer accuracy to the validity of reasoning processes, especially in commonsense question answering.
- Cross-Linguistic Adaptation: Utilizing transliteration and multilingual embeddings to enhance model performance for underrepresented languages.
- Collaborative Multi-Agent Systems: Integrating multi-agent systems into knowledge graphs to improve task-specific reasoning and decision-making.
- Residual Vector Quantization: Bridging the gap between KG embeddings and LLMs through residual quantization to preserve semantic information and enhance generative precision.
- Hierarchical Structures: Employing hierarchical indexing and lightweight models for efficient and precise retrieval of information from large, unstructured datasets.
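To make the residual vector quantization trend concrete, the sketch below encodes a vector as a short list of discrete codes: each stage picks the nearest codebook entry and passes the leftover residual to the next stage. It is a minimal illustration with tiny hand-written codebooks. The names `CODEBOOKS`, `rvq_encode`, and `rvq_decode` are hypothetical, and ReaLM's learned codebooks and ontology-guided class constraints are not reproduced here.

```python
from typing import List, Sequence

Vector = Sequence[float]

# Hypothetical two-stage codebooks (coarse, then fine); real ones would be learned.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],
]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean)."""
    def dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage codes the residual of the previous one."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct an approximation of x by summing the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


codes = rvq_encode([1.4, 0.6], CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
```

The discrete codes are what a language model could consume as tokens; adding more stages shrinks the reconstruction error at the cost of a longer code sequence.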
Datasets and Evaluation
- Node and Graph Classification Datasets: Eight benchmark datasets were used to evaluate MSGCOT, focusing on node and graph classification tasks.
- Commonsense Reasoning Datasets: ReTraceQA introduced a new benchmark with 2,421 manually annotated reasoning traces for evaluating small language models.
- Nutrition Question Answering Datasets: NG-Router was evaluated on three datasets from the NGQA benchmark: sparse, standard, and complex.
- Knowledge Graph Completion Datasets: ReaLM was tested on standard benchmarks such as FB15k-237 and WN18RR.
- Multilingual Medical Video Dataset: DIMA was evaluated on a specialized medical video dataset, with metrics including Recall@1, Recall@10, Recall@50, MRR, and overall scores.
These datasets and evaluation metrics reflect the diversity of KRR applications and the importance of contextually appropriate benchmarks for assessing model performance and reasoning capabilities.
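The ranking metrics reported above for ReaLM and DIMA (MRR, Hits@10, Recall@k) can all be computed from the rank at which the correct item appears. The following is a minimal sketch assuming one gold item per query and 1-based ranks; under that assumption Recall@k coincides with Hits@k. The function names and the sample ranks are illustrative and do not come from any of the papers.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over queries, where rank is the 1-based position of the gold item."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold item is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Illustrative ranks for four queries (not real results).
sample_ranks = [1, 2, 4, 20]
mrr = mean_reciprocal_rank(sample_ranks)
hits10 = hits_at_k(sample_ranks, 10)
```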
Topic 5: Speech and Audio Processing with LLMs
Topic Overview
The research topic of “Speech and Audio Processing with LLMs” is critical for advancing the capabilities of large language models (LLMs) in understanding and generating speech in real-time. Enhancing speech recognition, dialogue state tracking, and real-time reasoning in spoken language models can significantly improve human-computer interaction experiences. This topic is not only about technical improvements but also about making speech technologies more accessible and robust across different accents and dialects, which is essential for global applications. Furthermore, the development of unsupervised lexicon learning from speech contributes to the advancement of zero-resource speech technologies, enabling the creation of speech recognition systems for languages with limited textual resources.
Individual Paper Contributions
-
Donghang Wu from Nanyang Technological University and colleagues studied the high latency associated with Chain-of-Thought (CoT) reasoning in real-time spoken language models (SLMs), proposing the Mind-Paced Speaking (MPS) architecture to solve this core problem. The main innovation points of this method include a dual-brain framework consisting of a Formulation Brain and an Articulation Brain, which enables concurrent thinking and speaking. The value lies in its ability to reduce latency while maintaining semantic coherence, thus making SLMs more natural and effective for human-computer interactions. Experiments on benchmarks such as Spoken-MQA and URO-Bench showed that the proposed methods outperform existing techniques in terms of response quality and latency, demonstrating the effectiveness of the dual-brain approach in real-time reasoning 25.
-
Yavuz Durmazkeser from TU Delft and colleagues addressed the efficient selection of the best Large Language Model (LLM) for specific tasks with minimal annotation effort. They introduced LLM Selector, a framework designed for active model selection that uses limited annotations to identify the optimal LLM. The key innovation is the use of a judge-based annotation process with pairwise comparisons and noisy annotations via weak judges, minimizing the need for expensive reference answers. The value lies in its annotation efficiency and ability to operate under strict budget constraints, making it suitable for various application domains. Validation across six benchmarks on 151 LLMs demonstrated that LLM Selector achieved 100% model identification probability with up to 58.33% fewer annotated queries compared to the best baseline on some datasets, outperforming or matching previous baselines 26.
-
Mohammad Hossein Sameti from [Institution] and colleagues tackled the sensitivity of pre-trained transformer-based models to accent and dialectal variations in ASR systems. They proposed a saliency-driven spectrogram masking framework that integrates accent and dialect classification into the ASR pipeline using a spectrogram-based CNN classifier, with Grad-CAM saliency maps computed over that classifier. The innovation points include the localization of accent-specific features and a probabilistic masking strategy to enhance the model’s focus on accent-neutral linguistic features. The practical value is in improving the robustness of ASR systems to both known and unseen accents, enhancing their utility in diverse real-world scenarios. Experiments on English (LibriSpeech, EdAcc, CommonAccent) and Persian (CommonVoice-fa, PDID) datasets showed consistent reductions in Word Error Rate (WER) and Character Error Rate (CER), demonstrating the framework’s effectiveness in handling diverse accents 27.
-
Nizar El Ghazal from [Institution] and colleagues focused on enhancing spoken dialog state tracking (DST) by reducing reliance on ASR modules within traditional cascade systems. They proposed new context management strategies for spoken DST using Speech-LLMs, evaluating the effectiveness of full spoken context and compressed spoken context methods. The innovation lies in the introduction of attention-pooling-based compression to maintain competitive accuracy with a reduced context size. The value is in improving the DST system’s ability to reason over dialog states and effectively fulfill user requests, particularly in scenarios involving proper nouns and domain-specific terminology. Experiments on the SpokenWOZ dataset showed significant improvements in Joint Goal Accuracy (JGA) with the full spoken context method achieving 39.32% and the compressed method achieving 36.49%, both outperforming previous baselines 28.
-
Danel Adendorff from [Institution] and colleagues explored the limitations in unsupervised lexicon learning from speech, specifically focusing on the impact of speech representations versus clustering methods. They proposed and evaluated various self-supervised learned (SSL) speech representations combined with different clustering approaches, emphasizing the role of continuous representations and graph clustering. The innovation points are the detailed investigation into the effectiveness of SSL representations and the examination of clustering methods under idealized conditions. The value lies in providing a deeper understanding of the current limitations in unsupervised word discovery systems. Experiments on English (LibriSpeech test-clean) and Mandarin data revealed that WavLM Large features, when used with graph clustering, achieve the highest lexicon quality, indicating that speech representations are the primary constraint rather than clustering methods 29.
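The attention-pooling-based context compression mentioned in the spoken DST summary above can be pictured as a small set of query vectors, each taking a softmax-weighted average over the frame embeddings. The sketch below is one common reading of attention pooling, not the authors' implementation: `attention_pool`, the frame values, and the zero query are all hypothetical, and in practice the queries would be learned.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames: Sequence[Vector], queries: Sequence[Vector]) -> List[List[float]]:
    """Compress len(frames) embeddings into len(queries) pooled vectors."""
    pooled = []
    for query in queries:
        # Dot-product score between the query and every frame.
        weights = softmax([sum(q * f for q, f in zip(query, frame)) for frame in frames])
        # Weighted average of the frames, one output dimension at a time.
        pooled.append([
            sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(len(query))
        ])
    return pooled


# Four toy frame embeddings compressed into a single vector.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0]]
compressed = attention_pool(frames, queries=[[0.0, 0.0]])
```

With an all-zero query the weights are uniform, so the pooled vector is simply the mean of the frames; a learned query would instead emphasize the frames most relevant to the dialog state.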
Technical Trends
The papers in this collection reflect a trend towards more sophisticated and efficient architectures in speech and audio processing. Innovations include dual-brain frameworks for concurrent thinking and speaking, active model selection strategies to optimize resource usage, and advanced spectrogram masking techniques to handle accent variations. Additionally, there is a growing emphasis on context management strategies in spoken dialog systems and the exploration of self-supervised learning (SSL) for unsupervised lexicon learning. These advancements aim to address challenges such as high latency, error propagation, and robustness to linguistic diversity.
Datasets and Evaluation
The papers utilized a variety of datasets and evaluation metrics to validate their contributions:
- Mind-Paced Speaking (MPS): Evaluated on Spoken-MQA and URO-Bench datasets.
- LLM Selector: Tested across six benchmarks: Arena-Hard, AlpacaEval, MT-Bench, Flickr30k, Bingo, and MediQA.
- Accent-Invariant ASR: Utilized English datasets (LibriSpeech, EdAcc, CommonAccent) and Persian datasets (CommonVoice-fa, PDID).
- Spoken DST with Speech-LLMs: Evaluated on the SpokenWOZ dataset using Joint Goal Accuracy (JGA) as the primary metric.
- Unsupervised Lexicon Learning: Used LibriSpeech test-clean for English and an unspecified Mandarin dataset, evaluating the quality of learned lexicons using purity, V-measure, and Normalized Edit Distance (NED).
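Two of the metrics listed above have simple standard definitions that a short sketch can make explicit: Word Error Rate is the word-level edit distance divided by the reference length, and Joint Goal Accuracy counts a dialog turn as correct only when the whole predicted state matches the gold state. The helper names and the toy inputs below are illustrative assumptions, not taken from the papers.

```python
from typing import Dict, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def joint_goal_accuracy(predicted: Sequence[Dict[str, str]],
                        gold: Sequence[Dict[str, str]]) -> float:
    """Fraction of turns whose full predicted slot-value state equals the gold state."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
jga = joint_goal_accuracy(
    [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "3"}],
    [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}],
)
```

Character Error Rate follows the same recipe with characters in place of words.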
Topic 6: Evaluation and Benchmarking of LLMs
Topic Overview
The evaluation and benchmarking of Large Language Models (LLMs) have become increasingly important as these models find applications in a wide range of domains. Accurate and comprehensive benchmarks are essential to measure the performance, reliability, and capability of LLMs in specific tasks, thereby guiding improvements and fostering trust in their use. This report summarizes papers that advance LLM evaluation across different scenarios, including academic promotion, statistical reasoning, pre-training dynamics, travel planning, narrative understanding, and safety guardrails for agentic systems.
Individual Paper Contributions
-
Qiguang Chen from Harbin Institute of Technology and colleagues studied the inefficiency and lack of precision in creating promotional materials for academic papers, proposing the AutoPR task and the PRBench benchmark to address this problem. The main innovation points include PRAgent, a three-stage framework for content generation, and a multimodal dataset for benchmarking. The value lies in automating the creation of high-quality academic promotion content, which can enhance the visibility and impact of research. Experiments on PRBench showed significant improvements of at least 7.15% across various metrics, and real-world testing on RedNote demonstrated a 604% increase in total watch time and a 438% increase in likes. The authors conclude that PRAgent effectively addresses the limitations of current LLMs in generating accurate and engaging promotional content30.
-
Yuchen Lu from Shanghai University of Finance and Economics and colleagues addressed the underrepresentation and inadequate evaluation of LLMs in statistical reasoning, introducing StatEval. The main innovation points are a comprehensive dataset covering foundational and research-level statistical reasoning problems and a scalable multi-agent pipeline for automated extraction and curation. The value lies in providing a rigorous assessment tool for LLMs’ statistical capabilities, filling a gap in current benchmarking efforts. Experiments revealed that GPT-5 achieved the highest overall score of 82.85% on foundational tasks, but performance on research-level tasks was lower, indicating the need for focused training in advanced statistical reasoning31.
-
Jiapeng Wang from AntGroup and colleagues tackled the instability in the evaluation process during LLM pre-training, proposing the MaP framework. The main innovation points include checkpoint merging and the use of the Pass@k metric to stabilize evaluation against sampling randomness. The value lies in offering a reliable evaluation pipeline that accurately measures model capabilities and guides future research. Experiments showed that MaP significantly enhances the stability of the learning trajectory and predictability of downstream performance, as evidenced by improved Kendall’s rank correlation coefficient ($\tau$) and Pairwise Ranking Reversal Rate (PRR) on benchmarks like GSM8K, MATH, HumanEval, and MBPP, concluding that MaP provides a clearer signal of model progress during pre-training32.
-
Yincen Qu from Trip.com Group and colleagues focused on the evaluation and improvement of travel planning capabilities of LLMs, proposing TripScore. The main innovation points are the integration of multiple criteria into a single reward score and the use of reinforcement learning (RL) to enhance model performance. The value lies in providing a more nuanced and reliable evaluation metric for travel planning scenarios, emphasizing real-world applicability. Experiments demonstrated that the Qwen3-14B model fine-tuned with GRPO performed best in terms of delivery rate (DR) and commonsense constraint pass rate (CPR), concluding that TripScore effectively evaluates and rewards models for their ability to generate feasible and engaging travel itineraries33.
-
Yue Huang from University of Notre Dame and colleagues aimed to build a foundational guardrail for general agentic systems via synthetic data, proposing AuraGen and Safiron. The main innovation points include a synthetic data engine for generating diverse risky trajectories and a compact guardian model for flagging and categorizing risks at the planning stage. The value lies in ensuring the safe execution of agentic systems, particularly in high-stakes domains like healthcare. Experiments on Pre-Exec Bench showed that Safiron outperformed baselines in terms of detection accuracy, fine-grained categorization, and interpretability, concluding that the proposed framework addresses safety concerns effectively and provides a scalable solution for pre-execution risk assessment34.
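The two ingredients named in the MaP summary above, checkpoint merging and Pass@k, can each be sketched in a few lines. The code assumes uniform parameter averaging for merging and the widely used unbiased Pass@k estimator (n samples per problem, c of them correct); whether MaP uses exactly these variants is an assumption, and the function names are hypothetical.

```python
from math import comb
from typing import Dict, List, Sequence

Checkpoint = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Uniformly average each named parameter across checkpoints (assumed scheme)."""
    count = len(checkpoints)
    return {
        name: [sum(values) / count
               for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


merged = merge_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
score = pass_at_k(n=10, c=3, k=1)
```

Averaging over several samples per problem is what makes Pass@k less sensitive to a single unlucky generation than a one-shot accuracy score.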
Technical Trends
The papers in this collection showcase a variety of technical approaches and methodological advancements in the evaluation and benchmarking of LLMs:
- Automation and Multimodality: AutoPR leverages a multimodal dataset and a structured framework for automating academic promotion, emphasizing the importance of content adaptation for different platforms.
- Domain-Specific Challenges: StatEval highlights the necessity of specialized benchmarks for evaluating LLMs in complex domains like statistics, demonstrating the need for domain-specific training and evaluation.
- Evaluation Stability: MaP introduces methods to stabilize the evaluation process during pre-training, focusing on consistent performance measurement and comparison.
- Unified Scoring Mechanisms: TripScore proposes a unified scoring mechanism that integrates multiple criteria, reflecting the multifaceted nature of real-world applications.
- Safety and Risk Assessment: NarraBench and Building a Foundational Guardrail for General Agentic Systems via Synthetic Data emphasize comprehensive benchmarking and safety guardrails, respectively, indicating a growing concern for ethical and practical considerations in deploying LLMs.
Datasets and Evaluation Metrics
- PRBench: A multimodal dataset with 512 paired samples linking peer-reviewed papers to promotional posts.
- StatEval: Includes a foundational knowledge dataset of 13,817 problems and a research-level dataset of 2,374 proof-based questions.
- MaP: Utilizes benchmarks such as GSM8K, MATH, HumanEval, and MBPP to assess the stability of learning trajectories.
- TripScore: Employs a dataset of 4,870 travel planning queries, focusing on real-world user requests.
- Pre-Exec Bench: Tailored for evaluating pre-execution safety in agentic systems, with diverse and labeled risky trajectories.
- Evaluation Metrics: Across the papers, metrics such as Kendall’s rank correlation coefficient ($\tau$), Pairwise Ranking Reversal Rate (PRR), delivery rate (DR), commonsense constraint pass rate (CPR), and specific performance indicators like factual fidelity, engagement, and alignment are utilized to assess model performance comprehensively.
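Kendall's rank correlation coefficient, listed among the metrics above, compares every pair of items and asks whether two orderings agree on it. The sketch below computes the tie-free tau-a variant and also returns the share of discordant pairs, which is only a rough stand-in for a reversal-rate style quantity; the exact definition of PRR used by MaP is not given in this summary, so that correspondence is an assumption. Names and sample values are illustrative.

```python
from itertools import combinations
from typing import Sequence, Tuple


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (tau_a, discordant_fraction) for two paired sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        agreement = (x[i] - x[j]) * (y[i] - y[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
        # Pairs tied in either sequence count toward neither total.
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs, discordant / pairs


# Training step order versus benchmark score at each checkpoint (toy numbers).
steps = [1, 2, 3, 4]
scores = [1, 3, 2, 4]
tau, reversed_share = kendall_tau_a(steps, scores)
```

A tau close to 1 means the score rises almost monotonically with training, which is the kind of stable trajectory the evaluation aims to expose.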
Topic 7: Reasoning and Logical Generalization
Topic Overview
Reasoning and logical generalization are fundamental aspects of human intelligence and are increasingly being explored in artificial intelligence (AI) research to enhance the capabilities of large language models (LLMs). These models have shown remarkable proficiency in various reasoning tasks, but they often struggle with maintaining consistency and precision in structured output formats or when dealing with less familiar or encrypted forms of input. Addressing these challenges is crucial for advancing AI applications in areas such as complex decision-making, scientific research, and ensuring AI safety. Research in this domain aims to improve the robustness, reliability, and generalization abilities of LLMs in logical reasoning tasks.
Individual Paper Contributions
-
Yiqi Li from Shanghai Jiao Tong University and colleagues studied the difficulty LLMs face in adhering to user-specified output formats during reasoning tasks, proposing DICE, a lightweight framework to guide small language models (SLMs) to refine the outputs of LLMs. The main innovation points of DICE include its two-stage dataset construction process and dual-tuning strategy that optimizes SLMs without altering LLM parameters. The value lies in enhancing the instruction-following abilities of LLMs without compromising their reasoning capabilities, thereby improving their applicability in practical scenarios. Experiments on five reasoning benchmarks (GSM8K, MATH, CSQA, MedQA-zh, and StrategyQA) with XML, JSON, and YAML output format requirements showed significant gains in format accuracy (F-Acc) and content accuracy (C-Acc) compared to LLMs using In-Context Learning (ICL) and other baselines. The results indicate that DICE can effectively balance output format adherence and reasoning performance, achieving near-perfect format accuracy and improved content accuracy35.
-
Shiyuan Guo from Anthropic Fellows Program and colleagues addressed the challenge of maintaining reasoning capability in LLMs when operating with ciphered text. They introduced the concept of ‘ciphered reasoning’ and presented a detailed study on this topic, proposing new evaluation tasks and methods for generating ciphered text. The main innovation points include the introduction of new evaluation benchmarks and the exploration of scaling laws for ciphered reasoning. The value lies in understanding the impact of ciphered text on model performance, which is crucial for AI safety and interpretability. Experiments conducted on models such as GPT, Qwen2.5, Claude Sonnet, and Opus revealed a significant asymmetry between comprehension and reasoning in ciphered text, with a drop in math accuracy for lesser-known ciphers. The paper concludes that sufficiently capable general CoT monitors can still decode ciphers fluently, suggesting potential avenues for effective monitoring and preventing adversarial attacks36.
-
Manuel Vargas Guzmán from University of Warsaw and colleagues focused on the limitation of LLMs in performing logical reasoning, specifically the lack of compositionality. They proposed a hybrid architecture integrating neural networks with symbolic reasoning to create a reliable logical prover. The main innovation points involve the development of a new research program that evaluates the logical generalization capabilities of LLMs using syllogistic logic and explores how different reasoning components affect neural model performance. The value lies in bridging the gap between neural and symbolic reasoning systems, aiming to achieve logical completeness and correctness while maintaining computational efficiency. Experiments showed that hybrid models incorporating neural assistants trained under different generalization regimes required fewer steps to prove hypotheses compared to purely symbolic models, demonstrating substantial improvements in efficiency and generalization37.
-
Zheng Zhao and colleagues tackled the issue of verifying Chain-of-Thought (CoT) reasoning in LLMs, proposing Circuit-based Reasoning Verification (CRV), a white-box method that constructs and analyzes attribution graphs to detect reasoning errors. The main innovation points include the creation of a benchmark dataset with step-level correctness labels and computational traces, and the focus on structural fingerprints of computational graphs to provide deeper insights into reasoning failures. The value lies in enabling a more thorough understanding of LLM reasoning mechanisms, which is essential for improving model reliability and trustworthiness. Experiments on synthetic Boolean and Arithmetic tasks, as well as the GSM8K dataset, demonstrated that CRV significantly outperformed black-box and gray-box methods, though it showed reduced performance in cross-domain verification scenarios. The authors concluded that CRV can identify error signatures that are highly predictive of reasoning failures, and that specific interventions can correct faulty computations38.
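For readers unfamiliar with what "ciphered text" looks like in the ciphered reasoning study above, a letter-shift cipher is the simplest example. The sketch below implements a generic Caesar shift, of which ROT13 is the shift-by-13 case. It is used here only as a familiar illustration; which ciphers the paper actually evaluates is not stated in this summary, and the `caesar` helper is hypothetical.

```python
import string


def caesar(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions, leaving other characters untouched."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            base = ord("a")
        elif ch in string.ascii_uppercase:
            base = ord("A")
        else:
            out.append(ch)  # digits, punctuation, and spaces pass through
            continue
        out.append(chr(base + (ord(ch) - base + shift) % 26))
    return "".join(out)


ciphered = caesar("Hello, World!", 13)   # ROT13
restored = caesar(ciphered, -13)         # shifting back recovers the input
```

A model that reasons directly in such text must do arithmetic and logic over shifted tokens, which is a different demand from merely translating the cipher back to plain text.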
Technical Trends
The papers in this topic reflect a growing trend towards developing hybrid and modular systems that integrate the strengths of neural and symbolic reasoning. Innovations include lightweight frameworks like DICE for guiding LLM outputs, new evaluation paradigms such as ciphered reasoning, and white-box verification techniques like CRV for understanding and correcting reasoning errors. There is a clear shift towards addressing the limitations of LLMs in structured output adherence, handling unfamiliar text formats, and verifying the correctness of reasoning processes, all of which are critical for advancing AI safety and reliability.
Datasets and Evaluation
- DICE: Utilized five reasoning benchmarks—GSM8K, MATH, CSQA, MedQA-zh, and StrategyQA—to evaluate format and content accuracy (F-Acc, C-Acc).
- All Code, No Thought: Used the MATH 500 problem set and PRM800K dataset to measure ciphered reasoning capability and cipher translation capability.
- Hybrid Models for Natural Language Reasoning: Employed syllogistic logic as a benchmark to evaluate the logical generalization capabilities of LLMs, focusing on compositionality and recursiveness.
- Verifying Chain-of-Thought Reasoning via Its Computational Graph: Introduced a new benchmark dataset encompassing synthetic Boolean and Arithmetic tasks, alongside the GSM8K dataset, to analyze step-level correctness and computational traces, emphasizing the use of attribution graphs for error detection.
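A format accuracy score such as the F-Acc reported for DICE can be approximated by checking whether each model output parses in the requested format. The sketch below does this for JSON and XML using only the Python standard library; YAML is omitted because it needs a third-party parser. Treating "parses successfully" as "format-correct" is an assumption, since the paper's exact F-Acc criteria are not given here, and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from typing import Sequence


def is_well_formed(text: str, fmt: str) -> bool:
    """Return True if `text` parses as the requested format ("json" or "xml")."""
    if fmt not in ("json", "xml"):
        raise ValueError(f"unsupported format: {fmt}")
    try:
        if fmt == "json":
            json.loads(text)
        else:
            ET.fromstring(text)
    except (json.JSONDecodeError, ET.ParseError):
        return False
    return True


def format_accuracy(outputs: Sequence[str], fmt: str) -> float:
    """Share of outputs that are well-formed in the requested format."""
    return sum(is_well_formed(o, fmt) for o in outputs) / len(outputs)


f_acc = format_accuracy(['{"answer": 42}', '{"answer": 42'], "json")
```

Content accuracy would then be measured separately, by extracting the answer field from the well-formed outputs and comparing it with the gold answer.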
Topic 8: Privacy and Security in LLMs
Topic Overview
Privacy and security in large language models (LLMs) are critical concerns as these models become increasingly integrated into everyday applications. Issues such as data leakage, re-identification attacks, and vulnerabilities to adversarial attacks pose significant risks to the confidentiality and integrity of personal and sensitive information. Ensuring the trustworthiness of these models is paramount, especially in contexts where the reasoning process must be transparent and reliable. This includes enhancing interpretability, faithfulness, and reliability in reasoning models, safeguarding against poisoning and contamination attacks in retrieval-augmented generation (RAG) systems, and addressing vulnerabilities in multimodal models that process both textual and visual data.
Individual Paper Contributions
-
Chung-En Sun from University of California San Diego and colleagues studied the lack of trustworthiness in large reasoning models, proposing ReFIne, a novel training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO). The main innovation points include structured, tag-based reasoning traces for interpretability, explicit disclosure of decisive information for faithfulness, and self-assessment of derivation soundness and confidence for reliability. The value lies in improving the usability of these models beyond mere accuracy, ensuring they can be trusted in critical applications. Experiments on mathematical benchmarks showed significant improvements in interpretability (+44.0%), faithfulness (+18.8%), and reliability (+42.4%) across Qwen3 models of varying scales, concluding that optimizing for trustworthiness metrics is essential for practical and reliable AI reasoning systems39.
-
Xiaonan Si from the Institute of Software, Chinese Academy of Sciences and colleagues aimed to address the vulnerability of RAG systems to corpus poisoning and contamination attacks. They introduced SeCon-RAG, a two-stage framework that involves semantic filtering and conflict-free integration of knowledge into the model’s generation process. The main innovation points are the semantic filtering to remove irrelevant or harmful knowledge sources and the conflict-free integration to maintain reliable generation outcomes. The value lies in balancing security and utility without aggressive filtering that sacrifices useful information. While specific experimental details are not provided, the framework likely demonstrated improved robustness and reliability of RAG systems against simulated attacks, suggesting that SeCon-RAG fills a gap where existing defenses are too restrictive or ineffective40.
-
Raoyuan Zhao and colleagues explored the behavior and effectiveness of Chain-of-Thought (CoT) reasoning in multilingual settings. They proposed a method for evaluating multilingual CoT reasoning that includes crosslingual thinking trace interchanging to measure semantic consistency. The value lies in introducing new metrics for assessing language compliance, final-answer accuracy, and consistency across a diverse set of languages. Using the MMMLU and MGSM datasets, the experiments revealed that models do not reliably follow explicit instructions to think in specific languages, particularly low-resource ones, and that prompt hacking can improve language compliance but sometimes reduces final-answer accuracy. The study concluded that performance gaps across languages are influenced by data exposure during training and that consistency in model predictions is higher among typologically similar languages41.
-
Lucas Georges Gabriel Charpentier from University of Oslo and colleagues focused on enhancing the effectiveness of re-identification attacks against text de-identification techniques. They proposed two strategies: varying the order in which personally identifiable information (PII) spans are re-identified and employing LLMs with reasoning abilities. The main innovation is the introduction of a dense retriever trained specifically for re-identification, addressing biases in pre-trained models towards named entities. Using the Text Anonymization Benchmark (TAB) dataset with European Court of Human Rights (ECHR) court cases, the experiments showed that the reasoning-optimized version of the Qwen3 infilling model significantly boosts re-identification performance, especially in scenarios with extensive background knowledge. The conclusion underscores the importance of considering the order of re-identification and the role of reasoning-optimized models in strengthening attacks42.
-
Natalia Tomashenko and colleagues addressed the anonymization of a specific target speaker’s voice in multi-speaker conversational audio recordings. They proposed a target speaker anonymization (TSA) framework that combines target speaker extraction (TSE) methods with anonymization techniques. The innovation lies in introducing the time-constrained minimum-permutation word error rate (tcpWER) and scale-invariant signal-to-distortion ratio (SI-SDR) as evaluation metrics for privacy and utility. The paper showed that the anonymization process degrades the equal error rate (EER) and word error rate (WER) of the target speaker’s voice, with WeSep BSRNN showing less degradation than Conformer TSE. The experiments concluded that improving voice activity masking and joint training of ASR and TSE models can mitigate these issues, and that the anonymization process can still leave exploitable traces of the original speaker’s voice43.
-
Ruizhe Zhu examined the vulnerability of vision language models (VLMs) to text prompt injection attacks. The paper proposed a systematic algorithm for embedding text prompts into images to exploit VLM vulnerabilities. The main innovation is the identification of high-color-consistency regions for prompt insertion, minimizing detectability while maximizing effectiveness. The value lies in providing a detailed exploration of this attack method, which outperforms gradient-based attacks in terms of success rate and computational efficiency. Experiments on the Oxford-IIIT Pet Dataset showed significant increases in untargeted and targeted attack success rates (ASRs) compared to baseline methods, suggesting that text prompt injection is highly effective for high-resolution images and requires fewer computational resources44.
Technical Trends
The papers collectively highlight a trend towards developing comprehensive frameworks and algorithms that enhance the security and trustworthiness of large language models and vision language models. Innovations focus on improving interpretability, reliability, and faithfulness through structured reasoning, semantic filtering, and order-sensitive re-identification. There is also an emphasis on evaluating models across multilingual and multimodal contexts, reflecting the increasing complexity and diversity of real-world applications.
Datasets and Evaluation
- Mathematical Benchmarks: Used for evaluating the interpretability, faithfulness, and reliability of reasoning models.
- MMMLU and MGSM Datasets: Employed to assess the performance, consistency, and faithfulness of multilingual chain-of-thought reasoning.
- Text Anonymization Benchmark (TAB): Focused on court cases from the European Court of Human Rights (ECHR) to test re-identification attacks.
- Oxford-IIIT Pet Dataset: Utilized for conducting experiments on text prompt injection attacks on vision language models.
- tcpWER and SI-SDR Metrics: Introduced for evaluating the utility and quality of anonymized speech in multi-speaker recordings.
These datasets and metrics provide a robust foundation for evaluating the effectiveness and security of various models and frameworks in their respective domains, ensuring that advancements in privacy and security are rigorously tested and validated.
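Of the metrics listed above, SI-SDR has a compact closed form: the estimate is projected onto the reference signal, and the energy of that projection is compared with the energy of what remains. The sketch below follows the standard definition on plain Python lists and skips the mean removal that some implementations apply first; `si_sdr` and the toy signals are illustrative assumptions rather than the evaluation code used in the paper.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal must not be silent")
    # Optimal scaling of the reference that best explains the estimate.
    scale = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [scale * r for r in reference]
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    if noise_energy == 0:
        return math.inf    # perfect reconstruction up to scale
    if target_energy == 0:
        return -math.inf   # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / noise_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
quality_db = si_sdr(estimate, reference)
```

Because of the projection step, multiplying the estimate by any nonzero constant leaves the score unchanged, which is the "scale-invariant" part of the name.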
Topic 9: Human-like Reasoning and Dialogue
Topic Overview
The research topic of “Human-like Reasoning and Dialogue” focuses on enhancing the capabilities of large language models (LLMs) to interact with humans more naturally and effectively. This involves improving the models’ ability to understand and respond to ambiguous instructions, personalize interactions, detect and mitigate biases, and evaluate the quality of AI-generated content. The importance of this research lies in addressing the limitations of LLMs in mimicking human reasoning and dialogue, which is crucial for their broader adoption in areas such as customer service, education, and mental health support.
Individual Paper Contributions
-
Mert İnan from Northeastern University and colleagues studied the resolution of ambiguities in user goals for data visualization code generation. They proposed a taxonomy for categorizing types of ambiguity and introduced several metrics to quantify these ambiguities, demonstrating that these metrics correlate better with human annotations than traditional uncertainty baselines. The main innovation is the integration of multi-turn dialogue, inspired by theories from linguistic pragmatics, to reduce ambiguity and improve code accuracy. The value lies in providing a methodological framework for evaluating and mitigating ambiguity that goes beyond typical model uncertainty. Experiments on the DS-1000 dataset showed that the pragmatic cooperative strategy outperformed others, achieving a pass@1 score of 79.44% compared to the baseline of 68.38%, concluding that pragmatic dialogue significantly enhances task success in resolving semantic ambiguities and underspecifications45.
-
Seiya Ishikura from Institute of Science Tokyo and colleagues focused on modeling and replicating individual personality traits in text chat dialogues using LLMs. They proposed augmenting dialog data with think-aloud utterances (TAUs) to reflect internal psychological states, which are then used to fine-tune LLMs for better alignment with specific personality traits. The main innovation points are the use of the RealPersonaChat (RPC) dataset and the application of LLMs like Qwen2.5-72B-Instruct and gpt-4o-2024-08-06 for TAU augmentation. The value lies in enabling more personalized and natural interactions in text-based applications. Experiments on the RPC dataset showed that LLMs fine-tuned with TAU-augmented data demonstrated better conformity with personality traits, particularly Agreeableness and Neuroticism, concluding that integrating TAUs into dialog data can significantly enhance the accuracy of personality trait representation46.
-
Yijin Ni from Georgia Institute of Technology and Peng Qi from Uniphore introduced abductive preference learning, a new fine-tuning paradigm for LLMs aimed at enhancing sensitivity to prompt variations and reducing overconfidence issues. The main innovation points are the reversal of conditioning direction in preference optimization and the empirical validation across multiple datasets including HaluEval, AlpacaEval, and a multimodal HumorDB dataset. The value lies in complementing conventional preference learning and improving the adaptability and reliability of LLMs. Experiments showed that abductive preference learning achieved up to 85.0% abductive accuracy on A-HaluEval and improved sarcasm detection accuracy from 50.0% to 87.0% on HumorDB, concluding that multitask training combining traditional and abductive preference learning can achieve high accuracy and robustness47.
-
Nafiseh Nikeghbal from Technical University of Munich and colleagues developed the CoBia methods and dataset to detect societal biases in LLMs through constructed conversations. The main innovation points are lightweight adversarial attacks that use just one query to expose biases and the creation of the CoBia dataset with negative descriptors for various social categories. The value lies in identifying and mitigating biases in user-friendly conversation settings. Experiments revealed that models like llama3.3:70b, command-r:35b, and qwen2.5:7b exhibited heavy bias, while gemma2:27b and deepseek-v2:16b showed lower bias scores, concluding that the combination of HCC and SCC methods is effective in uncovering biases that might otherwise go unnoticed48.
-
Steve Han from NVIDIA Corporation and colleagues analyzed the capability of LLMs to act as judges for assessing the accuracy of RAG systems or agentic pipelines against ground truth answers. They introduced the Judge’s Verdict Benchmark, which evaluates LLMs using Cohen’s Kappa and z-scores. The main innovation points are the classification of LLM judges into distinct performance tiers and the evaluation of 54 LLMs. The value lies in making the evaluation process more efficient and scalable. Results showed that 27 models achieved Tier 1 performance, with 23 models mimicking human judgment patterns and 4 exhibiting super-consistent behavior, concluding that there is a trade-off between human-like judgment and inter-rater consistency49.
-
Weibin Cai from Syracuse University and colleagues addressed the challenge of accurately classifying hateful memes by proposing the SHIELD framework. This framework includes the Presupposed Context Module (PCM) and the False Claims Module (FACT) to capture presupposed context and identify false claims in memes. The main innovation points are the integration of philosophical and psychological theories to understand hate speech and the development of cross-modal reference modules to improve classification accuracy. The value lies in enhancing the detection of hateful content in social media. Experiments on datasets like FHM and Harm-P showed that SHIELD outperformed existing baselines, concluding that incorporating societal knowledge can significantly improve detection accuracy and generalization across different hate targets50.
-
Yanran Chen from University of Technology Nuremberg and colleagues explored the impact of AI-driven emotional framing on human fallacy detection. They proposed a method to alter the emotional framing of arguments using LLMs while maintaining logical structure. The main innovation points are the systematic alteration of emotional framing and the use of the LOGIC dataset for assessing cognitive impact. The value lies in understanding the risks posed by emotionally charged content in AI systems. Experiments showed that emotional framing decreased human fallacy detection performance by 14% in F1 score, with fallacies detected more reliably under enjoyment framing than under fear or sadness framing, concluding that emotional framing can impair human reasoning and fallacy detection51.
-
Xi Fang from Amazon and colleagues investigated the potential for social bias in LLMs due to long-term user memory. They proposed a method for generating diverse user profiles and applied validated emotional intelligence tests to evaluate the emotional reasoning capabilities of LLMs. The main innovation points are the explicit manipulation of social capital and intersectional control of demographic variables. The value lies in preventing the reinforcement of societal biases in personalized AI systems. Experiments revealed that user memory systematically altered emotional reasoning, with advantaged profiles receiving more accurate interpretations, concluding that personalization mechanisms can embed social hierarchies into models’ emotional reasoning52.
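The pass@1 scores quoted for the DS-1000 experiments in this list can be made concrete with the widely used unbiased pass@k estimator. The following is a minimal sketch, assuming `n` programs are sampled per problem and `c` of them pass the tests; the function names and the sample counts are illustrative and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct program.
        return 1.0
    # 1 minus the probability that a random size-k subset has no correct program.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of sampled programs that pass, which is why pass@1 is often read as plain accuracy.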
Technical Trends
The papers in this collection adopt several technical trends and methodological evolutions:
- Dialogue Integration: Using multi-turn dialogue to clarify ambiguous user goals in data visualization code generation, and augmenting dialogue data with think-aloud utterances for personality trait modeling.
- Bias Detection: Utilizing constructed conversations and lightweight adversarial attacks to expose and mitigate societal biases in LLMs.
- Preference Learning: Developing new paradigms such as abductive preference learning to enhance model sensitivity and adaptability.
- Emotional Framing: Systematically altering the emotional tone of content to study its impact on human cognition and fallacy detection.
- Evaluation Benchmarks: Creating new benchmarks and metrics to assess the human-likeness and effectiveness of LLMs as judges and in other roles.
Datasets and Evaluation Metrics
- DS-1000: Used for evaluating the effectiveness of dialogue strategies in resolving ambiguities in data visualization code generation.
- RealPersonaChat (RPC): A Japanese corpus of casual text chat dialogs used for personality trait modeling with LLMs.
- HaluEval, AlpacaEval, HumorDB: Datasets employed to validate abductive preference learning methods.
- CoBia Dataset: Contains 112 social groups with negative descriptors across six socio-demographic categories for bias detection.
- LOGIC Dataset: Comprises human-written, short, standalone fallacious arguments used to study the impact of emotional framing on fallacy detection.
- FHM, Harm-C, Harm-P: Datasets utilized for the evaluation of the SHIELD framework in hateful meme classification.
- Judge’s Verdict Benchmark: Uses Cohen’s Kappa and z-scores to assess LLMs as judges for RAG systems.
- STEU, STEM Tests: Validated emotional intelligence tests applied to evaluate the emotional reasoning of LLMs based on user memory.
These datasets and metrics collectively contribute to a comprehensive evaluation of LLMs in various aspects of human-like reasoning and dialogue, providing a solid foundation for future research and practical applications.
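Cohen's Kappa, listed above for the Judge's Verdict Benchmark, measures agreement between two raters after correcting for the agreement expected by chance. The sketch below shows the standard two-rater formula on categorical labels; the label lists are made up for illustration, and the benchmark's full protocol (including its z-score step) is not reproduced here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels: 1 = "answer judged correct", 0 = "answer judged incorrect".
human_labels = [1, 1, 0, 0]
judge_labels = [1, 0, 0, 0]
kappa = cohens_kappa(human_labels, judge_labels)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which makes the score more informative than raw accuracy when one label dominates.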
Topic 10: Data and Training Strategies for LLMs
Topic Overview
The research topic of “Data and Training Strategies for LLMs” focuses on advancing the methodologies and datasets employed in training and refining large language models (LLMs). These strategies aim to enhance LLMs’ performance, reliability, and adaptability across various applications, including automated essay scoring, reasoning tasks, prompt dataset analysis, and abstractive summarization. The importance of this topic lies in addressing the inherent limitations of LLMs, such as their tendency to generate inaccurate or unreliable outputs, and in developing methods that allow for more autonomous and efficient model training and evaluation. By doing so, these studies contribute to making LLMs more effective and trustworthy in real-world scenarios.
Individual Paper Contributions
-
Keno Harada from The University of Tokyo and colleagues studied the suboptimal performance of LLMs in Automated Essay Scoring (AES) caused by the static nature of scoring rubrics. They proposed an iterative refinement approach, named “Reflect-and-Revise,” to dynamically adjust scoring rubrics based on discrepancies between LLM scores and human scores. The main innovation is the method’s ability to autonomously improve scoring accuracy by reflecting on rationales and refining rubrics. The value lies in achieving better alignment between LLM and human evaluators, thereby providing more accurate and consistent feedback in educational settings. Experiments on the TOEFL11 and ASAP datasets showed QWK score improvements of up to 0.19 and 0.47, respectively, compared to detailed human-authored rubrics, concluding that LLMs can autonomously refine their scoring criteria to match human judgment53.
-
Jiaqi Wei from Zhejiang University and colleagues addressed the fragmentation of research on tree search algorithms and reward design for LLM reasoning. They introduced a unified mathematical framework to compare and dissect various tree search algorithms based on their core components: Search Mechanism, Reward Formulation, and Transition Function. The main innovation is the creation of a systematic, component-based taxonomy that organizes these algorithms. The value lies in providing a cohesive basis for understanding and comparing methods across Test-Time Scaling (TTS) and self-improvement paradigms, facilitating advancements in LLM reasoning capabilities. Although no new methods were proposed, the survey and its analysis offer insights into the challenges and opportunities in the field of LLM reasoning54.
-
Ines Altemir Marinas from EPFL, Lausanne, Switzerland, and colleagues tackled the challenge of systematically examining and curating the vast amounts of training data used for LLMs, focusing on the deployment of Elasticsearch for indexing large-scale datasets like Common Crawl. Their main contribution is the demonstration of Elasticsearch’s full-text indexing capabilities on ARM64 architecture, providing detailed configurations and solutions to overcome compatibility issues. The value lies in promoting transparency and safety in AI systems by allowing for efficient indexing and querying of large datasets. Experiments revealed that multilingual datasets are indexed much more slowly due to their increased complexity, and that deduplication can slow down indexing even though it reduces duplicate content. The analysis also showed the scaling behavior of Elasticsearch under different query lengths, providing insights into optimizing full-text search in LLM training data55.
-
Yuanming Zhang from School of Computer Science and Technology, Beijing Jiaotong University, and colleagues examined the lack of comprehensive analysis of prompt datasets used for LLM interactions. They proposed a hierarchical taxonomy and a systematic process for discovering and refining prompt datasets from diverse sources, totaling over 1.22 TB of data and 673M prompt instances. The main innovation is the extensive compilation and analysis of prompt datasets, revealing compositional patterns and linguistic properties through multi-level linguistic analysis. The value lies in improving prompt engineering and understanding prompt structures’ impact on LLM performance. The analysis uncovered domain-specific variations in prompt datasets, with medical prompts emphasizing specificity and business prompts favoring conciseness, concluding that a thorough examination of prompt datasets can significantly enhance LLM usability across various applications56.
-
Sicong Huang from University of California, Santa Cruz and colleagues focused on the issue of unfaithfulness in abstractive summarization generated by LLMs. They proposed and evaluated three fine-tuning methods—gradient ascent, unlikelihood training, and task vector negation—to improve summarization faithfulness. The main innovation points are the construction of a novel dataset annotated at the span-level for faithfulness and the evaluation of fine-tuning methods directly on generated summaries. The value lies in mitigating hallucinations and enhancing the reliability of summarization tools, making them more suitable for precise information extraction tasks. Experiments showed that unlikelihood training and task vector negation yielded substantial improvements in faithfulness compared to gradient ascent, with unlikelihood training displaying higher stability across different hyperparameter settings. The paper also noted inconsistencies in BARTScore as an evaluation metric for summarization faithfulness57.
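Two of the fine-tuning methods named in the summarization paper above, unlikelihood training and task vector negation, can be illustrated with a small sketch. It assumes per-token probabilities and a span-level mask marking unfaithful tokens are already available; the function names, the averaging scheme, and the flat weight lists are illustrative assumptions, and the paper's exact objectives are not given in this summary.

```python
import math


def span_level_loss(token_probs, unfaithful_mask, eps: float = 1e-8) -> float:
    """Mean per-token loss: -log p on faithful tokens, -log(1 - p) on flagged ones."""
    if len(token_probs) != len(unfaithful_mask) or not token_probs:
        raise ValueError("need one mask entry per token and at least one token")
    total = 0.0
    for p, flagged in zip(token_probs, unfaithful_mask):
        if flagged:
            # Unlikelihood term: penalise probability mass on an unfaithful token.
            total += -math.log(max(1.0 - p, eps))
        else:
            # Standard likelihood term for a faithful token.
            total += -math.log(max(p, eps))
    return total / len(token_probs)


def negate_task_vector(base, finetuned, scale: float = 1.0):
    """Return base - scale * (finetuned - base), elementwise over flat weight lists."""
    if len(base) != len(finetuned):
        raise ValueError("weight lists must have the same length")
    return [b - scale * (f - b) for b, f in zip(base, finetuned)]
```

In this sketch the unlikelihood term grows as the model assigns more probability to a flagged token, while task vector negation moves the weights away from the direction learned on undesirable data.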
Technical Trends
The papers in this collection showcase a variety of technical approaches to improve LLM performance. Harada et al. focus on dynamic rubric refinement for AES, leveraging LLMs’ ability to reflect on their own scoring processes. Wei et al. propose a unified framework for comparing tree search algorithms and reward designs in LLM reasoning, aiming to standardize and enhance these methodologies. Marinas et al. advance the use of Elasticsearch for full-text indexing of large-scale datasets, offering insights into the scalability and efficiency of such systems. Zhang et al. emphasize the importance of analyzing prompt datasets to understand and improve LLM interaction patterns. Lastly, Huang et al. explore fine-tuning methods to enhance summarization faithfulness, constructing a novel dataset with span-level annotations and noting inconsistencies in BARTScore as a faithfulness metric.
Datasets and Evaluation
- Automated Essay Scoring (Harada et al.): TOEFL11, ASAP
- Tree Search Algorithms and Reward Design (Wei et al.): No specific datasets; focus on conceptual and mathematical frameworks.
- Full-Text Search for LLM Training Data (Marinas et al.): Common Crawl (pure English vs. multilingual)
- Prompt Datasets Analysis (Zhang et al.): Seven large-scale, diverse, and representative datasets including Self-Instruct and medical-o1.
- Abstractive Summarization (Huang et al.): Novel dataset with span-level annotations for faithfulness.
Evaluation metrics used include:
- Quadratic Weighted Kappa (QWK) for essay scoring accuracy.
- BARTScore, G-Eval, and AlignScore for summarization faithfulness.
- Various statistical and machine learning methods for prompt dataset analysis.
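Quadratic Weighted Kappa, the essay scoring metric listed above, penalises disagreements between two raters by the squared distance between their scores. The sketch below implements the standard definition for integer scores; the score range and the function name are illustrative, and any dataset-specific preprocessing used in the paper is not shown.

```python
def quadratic_weighted_kappa(human, model, min_score: int, max_score: int) -> float:
    """QWK between two equally long lists of integer scores in [min_score, max_score]."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score levels")
    if any(not min_score <= s <= max_score for s in list(human) + list(model)):
        raise ValueError("score outside the declared range")
    n = len(human)
    # Confusion matrix of human score (row) against model score (column).
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0
    den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            den += weight * hist_h[i] * hist_m[j] / n
    if den == 0.0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - num / den
```

A value of 1 indicates identical scores, 0 indicates chance-level agreement, and negative values indicate systematic disagreement, so the reported improvements can be read as movement toward human scoring behaviour.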
Topic 11: misc
Topic Overview
This research topic encompasses a variety of studies focused on enhancing the capabilities of large language models (LLMs) and diffusion large language models (DLLMs) in various complex reasoning tasks, as well as addressing challenges related to their deployment, safety, and interpretability. The importance of this topic lies in its potential to advance the practical applicability and efficiency of AI systems, making them more reliable, versatile, and suitable for real-world tasks that require sophisticated reasoning and understanding. Additionally, the topic delves into specialized applications such as medical diagnostics, crisis communication, and software engineering, where the performance and reliability of LLMs are critical for improving outcomes and user experiences.
Individual Paper Contributions
-
Qiguang Chen from Harbin Institute of Technology and colleagues studied the inherent contradiction between the parallel processing capability of diffusion large language models (DLLMs) and the need for sequential reasoning in complex tasks, known as the Parallel-Sequential Contradiction (PSC). They proposed several novel mitigation strategies to enhance DLLM reasoning, including parallel-encouraging prompting, diffusion early stopping, and leveraging the parallel scaling law. The main innovation points are the introduction of new scaling dimensions for inference time specifically designed for DLLMs—parallel, diffusion, and sequential scaling—and the demonstration of their impact on the model’s reasoning abilities. The value lies in filling gaps in the understanding of DLLMs’ limitations and offering practical solutions to improve their performance in reasoning tasks. Experiments showed that these strategies can substantially alleviate the constraints imposed by PSC, enhancing the models’ reasoning performance 58.
-
Feifan Song from Peking University and colleagues addressed the issue of overthinking in large reasoning models (LRMs) during the inference phase, proposing Group Relative Segment Penalization (GRSP), a novel method targeting the overthinking problem at the reasoning-step level. The main innovation points are the introduction of length-aware penalties to control the behavior of LRMs and the use of confidence-based segmentation. The value lies in reducing computational costs while maintaining or improving task performance. Experiments on benchmarks like MATH 500, AIMO Prize 1, and Omni-MATH 500 demonstrated that GRSP effectively mitigates overthinking, achieving superior performance in terms of both task accuracy and token efficiency 59.
-
Fang Yuan from National University of Defense Technology and colleagues developed the NL2GenSym framework, which integrates large language models (LLMs) with the SOAR cognitive architecture to automate the generation and continuous optimization of executable symbolic rules from natural language. The main innovation points are the Execution-Grounded Generator-Critic mechanism and the Self-Evolving Domain Knowledge Base. The value lies in democratizing access to cognitive architectures like SOAR, making it easier for non-experts to develop and refine human-like intelligent agents. Experiments on the Water Jug Problem (WJP) dataset revealed that NL2GenSym significantly outperforms baseline methods, achieving high success rates and efficient decision cycles 60.
-
Yunxiang Zhang from University of Michigan and colleagues introduced Switch Generation, a method for collaborative inference among diverse model checkpoints (pretrained, finetuned, and aligned models) that addresses the trade-off between the benefits of alignment training and the skills it can erode in language models. The main innovation is the use of a switcher language model to dynamically select which model should generate the next segment of text. The value lies in enhancing the flexibility and adaptability of AI systems to diverse user needs and contexts. Experiments with 8 model collaboration baselines and 18 datasets demonstrated that Switch Generation significantly outperforms individual models and other collaboration strategies, achieving an average improvement of 12.9% across tasks 61.
-
Yu-Chen Lu from National Yang Ming Chiao Tung University and colleagues tackled the deployment challenge of LLMs on resource-constrained hardware by proposing the Fine-grained Low-Rank Compressor (FLRC) framework. The main innovation points are the Fisher-based Layer-wise Rank Allocation algorithm and the Progressive Low-rank Decoding strategy. The value lies in optimizing the compression ratio for each layer and projection of LLMs, ensuring efficient use of resources without compromising performance. Experiments on benchmarks like DialogSum, CNN/DM, and Wikitext2 showed that FLRC can achieve up to a 17.35% improvement in ROUGE-L scores on summarization tasks while maintaining high BERTScore values even at higher compression rates 62.
-
Jianuo Huang from Shanghai Jiao Tong University and colleagues introduced MaskKV, a novel KV cache eviction framework specifically designed for diffusion LLMs, aiming to reduce memory and computation overhead without sacrificing accuracy. The main innovation points are the concepts of ‘Mask-Voting’ and a two-stage budget allocation scheme. The value lies in adapting cache eviction strategies to the unique characteristics of diffusion models. Experiments on the LongBench benchmark with LLaDA-8B and Dream-7B models demonstrated that MaskKV can reduce memory and computation overhead substantially while maintaining high accuracy, achieving 31 times faster decoding and 65% lower peak memory usage 63.
-
Kohei Oda from Japan Advanced Institute of Science and Technology and colleagues proposed DualCSE, a dual-semantic contrastive sentence embedding framework that assigns two embeddings to each sentence—one for explicit semantics and another for implicit semantics. The main innovation points are the dual-embedding approach and the novel contrastive loss function tailored for explicit and implicit semantics. The value lies in capturing both explicit and implicit meanings of sentences, improving performance in tasks requiring nuanced understanding. Experiments on the Recognizing Textual Entailment (RTE) and Estimating Implicitness Score (EIS) tasks showed that DualCSE outperforms SimCSE (INLI) and SimCSE (SNLI+MNLI) in handling implicit semantics 64.
-
Jiuheng Lin from Peking University and colleagues introduced CLARity, a novel reinforcement learning framework aimed at enhancing the logical consistency and reasoning quality of LLMs trained on multiple-choice questions (MCQs). The main innovation points are the consistency-aware learning mechanism and the two-stage refine-then-monitor pipeline. The value lies in providing a cost-effective solution for improving response consistency and reliable reasoning accuracy without requiring large-scale teacher LLMs or expert-annotated datasets. Experiments demonstrated improvements of 16.5% in response consistency and 7.5% in reliable reasoning accuracy over standard RL baselines 65.
-
Wei Zhou from Bosch Center for Artificial Intelligence and colleagues provided a comprehensive survey on Table Question Answering (TQA) in the era of LLMs. The main innovation points are the coverage of diverse TQA setups, recent advances, and emerging themes in the LLM era. The value lies in offering insights into fine-tuning methods and agentic setups, as well as the importance of robustness and reasoning correctness evaluations. While no specific experimental results are provided, the survey highlights the challenges and opportunities in TQA with LLMs, suggesting the viability of retrieval-augmented generation (RAG) for managing large tables 66.
-
Jiale Guo from Nanyang Technological University and colleagues presented a comprehensive survey on benchmarks and solutions in software engineering empowered by LLMs. The main innovation points are the proposed taxonomy bridging benchmarks and solutions, categorizing solutions into prompt-based, fine-tuning-based, and agent-based paradigms. The value lies in providing a holistic view of LLM-empowered software engineering, identifying critical research gaps and suggesting future research directions. The survey covers a broad spectrum of software engineering tasks, distinguishing itself from existing surveys by its unified approach 67.
-
Yutao Mou from Peking University and colleagues proposed LoRA-based Refusal-training as a method for achieving cost-efficient and performance-preserving safety alignment in LLMs. The main innovation points are the use of LoRA to train safety patches using only safety data and the theoretical explanation of how LoRA decouples safety into a low-rank subspace. The value lies in maintaining general performance while improving safety, which is crucial for building trustworthy AI systems. Experiments showed that LoRA-based Refusal-SFT trained solely on safety data achieves a better trade-off between safety and general performance than full-parameter training 68.
-
Fanwei Zhu from Hangzhou City University and colleagues developed a unified, layout-aware framework for efficient and accurate resume information extraction and evaluation. The main innovation points are the layout-aware parsing model, an inference-efficient LLM extraction strategy, and a robust two-stage automated evaluation framework. The value lies in significantly improving the screening process for enterprises, making it faster, more cost-effective, and less error-prone. Experiments on the SynthResume and RealResume datasets demonstrated that the fine-tuned Qwen-0.6B model outperforms top-tier models like Claude-4 while offering 3-4 times faster inference 69.
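The LoRA-based safety patch described earlier in this list relies on the standard LoRA update, in which a frozen weight matrix W is adjusted by a low-rank product B·A. The sketch below shows that additive update on toy nested lists; the matrices, the rank of 1, and the `scale` parameter are illustrative assumptions, and the paper's training procedure is not shown.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, scale: float = 1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying the base weight."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2).
base_weight = [[1.0, 0.0], [0.0, 1.0]]
adapter_b = [[1.0], [2.0]]
adapter_a = [[0.5, 0.0]]
patched = apply_lora(base_weight, adapter_b, adapter_a)
unpatched = apply_lora(base_weight, adapter_b, adapter_a, scale=0.0)
```

Because the base weights are left untouched, the adapter can be attached or detached at will, which is one reason a low-rank patch is attractive for adding safety behaviour without retraining the full model.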
Technical Trends
The papers in this collection adopt a variety of technical approaches and methodologies to enhance the capabilities of large language models and diffusion large language models. Key trends include:
- Mitigation Strategies for Limitations: Several papers focus on addressing inherent limitations of DLLMs and LLMs, such as the Parallel-Sequential Contradiction (PSC) 58 and overthinking 59, by proposing innovative prompting methods and penalty schemes.
- Multimodal Integration: Contributions like NL2GenSym 60 and ShiZhi 70 emphasize the integration of multimodal data (e.g., natural language and symbolic rules, visual and textual inputs) to enhance model performance in specialized tasks.
- Efficient Inference Techniques: Papers like FLRC 62 and MaskKV 63 introduce methods for reducing computational and memory overhead, enabling more efficient deployment of large models on resource-constrained hardware.
- Safety and Ethical Considerations: Studies such as LoRA-based Refusal-training 68 and Alif 71 highlight the importance of ensuring safety and ethical compliance in AI systems, proposing methods to enhance model safety without compromising performance.
- Interpretability and Explainability: Papers like iBERT 72 and Narrative Learning 73 focus on improving the interpretability and explainability of language models, which is essential for building trust and ensuring compliance in regulated fields.
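For the efficient inference trend above, the arithmetic behind low-rank compression is easy to state: replacing a d_out × d_in weight with two factors of rank r stores r · (d_out + d_in) parameters instead of d_out · d_in. The sketch below computes the resulting savings and the largest rank that meets a target; the function names and layer sizes are illustrative, and FLRC's Fisher-based rank allocation itself is not reproduced.

```python
def factorized_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters stored by a rank-`rank` factorization of a d_out x d_in matrix."""
    return rank * (d_out + d_in)


def compression_ratio(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of parameters removed; negative if the factorization is larger."""
    return 1.0 - factorized_params(d_out, d_in, rank) / (d_out * d_in)


def max_rank_for_ratio(d_out: int, d_in: int, target_ratio: float) -> int:
    """Largest rank that still removes at least target_ratio of the parameters."""
    if not 0.0 <= target_ratio < 1.0:
        raise ValueError("target_ratio must be in [0, 1)")
    budget = (1.0 - target_ratio) * d_out * d_in
    return int(budget // (d_out + d_in))
```

Different layers and projections tolerate different ranks, which is why a per-layer allocation, rather than one global rank, is the design question such compressors try to answer.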
Datasets and Evaluation
The main datasets and evaluation metrics used in the papers are:
- Water Jug Problem (WJP): Used to evaluate NL2GenSym’s symbolic rule generation capabilities 60.
- Omni-MATH 500: Utilized to test the effectiveness of GRSP in controlling overthinking 59.
- CrisiText: A dataset of warning messages for LLM training in emergency communication scenarios 74.
- Inf-Streams-Train and Inf-Streams-Eval: Sports commentary datasets for evaluating real-time video understanding in StreamingVLM 75.
- Kaggle and Pandora: Datasets for evaluating HIPPD’s personality detection capabilities 76.
- Synthetic Multilingual, BLiMP, SICK: Used to assess the effectiveness of RISE in mapping semantic relationships across languages and models 77.
- LiveOIBench: A competitive coding benchmark for evaluating LLMs’ performance in solving Informatics Olympiad tasks 78.
- LongBench: Used to evaluate the memory and computation overhead of MaskKV 63.
- GeoOLID and Amazon Reviews 2023: Datasets for testing ranking reliability methods 79.
- Urdu-Instruct: A high-quality synthetic dataset for training the Alif-1.0-8B-Instruct model 71.
- MetricAlign: A dataset of 300 Chinese-English sentence pairs for evaluating web novel translation quality 80.
Evaluation metrics include:
- ROUGE-L and BLEU-1: Commonly used for summarization and translation tasks [^27, ^30, ^44].
- BERTScore: Measures the quality of generated text by comparing its semantic similarity to human-written references [^27, ^44].
- Accuracy, Veracity, Helpfulness, and Consistency Scores: Used in DyReMe for evaluating LLMs in medical diagnostics 81.
- Macro-F1 and Accuracy: Metrics for evaluating the robustness of fake news detection models against adversarial comments 82.
- Effective Update Ratio (EUR) and Update Consistency (UC): Metrics for assessing the efficiency of reinforcement learning models in navigating reasoning paths 83.
- STE (Style Transfer Evaluation): Evaluates the effectiveness of style-specific signals learned by iBERT 72.
- Spearman Rank Correlation: Measures the reliability of performance ranking methods 79.
These studies collectively advance the field by addressing critical limitations, proposing novel methodologies, and establishing new benchmarks and evaluation frameworks for assessing and improving the capabilities of large language models in diverse applications.
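Spearman rank correlation, listed among the metrics above for ranking reliability, is the Pearson correlation computed on ranks rather than raw values. The sketch below is a dependency-free version with average ranks for ties; the function names and the example scores are illustrative and not taken from any of the papers.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y) -> float:
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

When comparing two rankings of the same models, a value near 1 means the two orderings largely agree, regardless of how the underlying scores are scaled.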
References
-
IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data ↩︎
-
LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning ↩︎
-
Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs ↩︎
-
Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation ↩︎
-
Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives ↩︎
-
Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors ↩︎
-
Multimodal Policy Internalization for Conversational Agents ↩︎
-
CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation ↩︎
-
Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation ↩︎
-
Dyna-Mind: Learning to Simulate from Experience for Better AI Agents ↩︎
-
Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood ↩︎
-
Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models ↩︎
-
DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning ↩︎
-
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models ↩︎
-
Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph ↩︎
-
ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering ↩︎
-
Large Language Models Do NOT Really Know What They Don’t Know ↩︎
-
Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language ↩︎
-
NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering ↩︎
-
ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models ↩︎
-
Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval ↩︎
-
Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models ↩︎
-
Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking ↩︎
-
The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach ↩︎
-
Unsupervised lexicon learning from speech is limited by representations rather than clustering ↩︎
-
StatEval: A Comprehensive Benchmark for Large Language Models in Statistics ↩︎
-
MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics ↩︎
-
TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation ↩︎
-
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data ↩︎
-
DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction ↩︎
-
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language ↩︎
-
Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic ↩︎
-
Verifying Chain-of-Thought Reasoning via Its Computational Graph ↩︎
-
ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability ↩︎
-
SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG ↩︎
-
A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages ↩︎
-
Stronger Re-identification Attacks through Reasoning and Aggregation ↩︎
-
Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation ↩︎
-
Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM ↩︎
-
CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs ↩︎
-
Judge’s Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement ↩︎
-
Unpacking Hateful Memes: Presupposed Context and False Claims ↩︎
-
Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection ↩︎
-
The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs ↩︎
-
Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise ↩︎
-
Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey
-
Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
-
Large Language Model Prompt Datasets: An In-depth Analysis and Insights
-
Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning
-
Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models
-
NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models
-
FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
-
Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference
-
One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations
-
CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts
-
Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation
-
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
-
Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models
-
Layout-Aware Parsing Meets Efficient LLMs: A Unified, Scalable Framework for Resume Information Extraction and Evaluation
-
ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation
-
Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation
-
iBERT: Interpretable Style Embeddings via Sense Decomposition
-
It’s 2025 – Narrative Learning is the new baseline to beat for explainable machine learning
-
CrisiText: A dataset of warning messages for LLM training in emergency communication
-
StreamingVLM: Real-Time Understanding for Infinite Video Streams
-
HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection
-
Steering Embedding Models with Geometric Rotation: Mapping Semantic Relationships Across Languages and Models
-
LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
-
Can We Reliably Rank Model Performance across Domains without Labeled Data?
-
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
-
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation
-
Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments
-
HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness