NLP Paper Digest for October 6, 2025 (English)
- Topic 1: Reasoning and Explanation in LLMs (1 paper)
- Topic 2: Large Language Models and Their Applications (13 papers)
- Topic 3: Bias and Fairness in AI (7 papers)
- Topic 4: Machine Translation and Multilingual Systems (6 papers)
- Topic 5: Data Handling and Annotation (8 papers)
- Topic 6: Security and Privacy in AI (8 papers)
- Topic 7: Evaluation and Benchmarking of AI Models (10 papers)
- Topic 8: Human Interaction with AI (5 papers)
- Topic 9: Content Generation and Moderation (8 papers)
- Topic 10: AI Development Techniques and Methods (3 papers)
- Topic 11: Miscellaneous (17 papers)
Topic 1: Reasoning and Explanation in LLMs
Topic Overview
The topic of “Reasoning and Explanation in LLMs” focuses on understanding and evaluating the reasoning capabilities of Large Language Models (LLMs). These models have shown remarkable proficiency in generating human-like text but often fail to demonstrate consistent logical reasoning, especially when dealing with out-of-domain tasks or complex problems like mathematics. Ensuring that LLMs can provide clear, coherent, and logically sound explanations is crucial for their reliability and broader application in fields requiring analytical rigor. This research area seeks to develop methods and metrics to assess and improve the reasoning structures within LLM outputs, thereby enhancing their utility in diverse domains.
Individual Paper Contributions
- Minju Gwak from Yonsei University and colleagues studied the effectiveness of large language models (LLMs) in generating reasoning traces, specifically addressing the issue of whether these models perform genuine reasoning or simply create superficially coherent text. They proposed the application of the Uniform Information Density (UID) hypothesis, originally from human communication theory, to analyze LLM reasoning. The main innovation points of this method include the introduction of information-theoretic metrics based on entropy to measure the uniformity of information density at both the step and trace levels. The value lies in providing a new framework for evaluating the quality of reasoning traces generated by LLMs, particularly on challenging mathematical benchmarks. Experiments on these benchmarks showed that reasoning traces with low global uniformity but high local uniformity tend to produce correct answers, contradicting the initial hypothesis. This suggests that trends in local and global uniformity can serve as indicators for assessing reasoning quality and could guide future improvements in model reasoning and evaluation[^paper_id_16].
Technical Trends
The main technical approach in the discussed paper involves the adaptation of information-theoretic principles to the domain of LLM reasoning. Specifically, the use of entropy-based metrics to evaluate the structure of reasoning traces represents a significant shift towards more quantitative and rigorous assessments of LLM performance in logical tasks. This methodological evolution moves beyond traditional qualitative evaluations to incorporate quantitative measures that can reveal deeper insights into the reasoning processes of LLMs.
Datasets and Evaluation
The primary dataset used in the paper by Minju Gwak and colleagues consists of challenging mathematical reasoning tasks, which serve as a stringent test for the reasoning abilities of LLMs. Evaluation metrics include the introduced information-theoretic measures for global and local uniformity of information density. These metrics aim to provide a nuanced assessment of reasoning quality, distinguishing between superficial coherence and genuinely logical reasoning processes.
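To make the metric family concrete, the sketch below shows one way step- and trace-level uniformity could be computed from token-level log-probabilities. The step segmentation and the specific uniformity statistics are illustrative assumptions, not the authors' exact formulas.

```python
import numpy as np

def surprisals(token_logprobs):
    """Per-token information content (in nats) from model log-probabilities."""
    return -np.asarray(token_logprobs)

def step_densities(token_logprobs, step_boundaries):
    """Mean information density of each reasoning step.

    step_boundaries: list of (start, end) token index pairs, one per step
    (an illustrative segmentation; the paper's notion of a 'step' may differ).
    """
    s = surprisals(token_logprobs)
    return np.array([s[a:b].mean() for a, b in step_boundaries])

def global_uniformity(densities):
    """Higher when information is spread evenly over the whole trace
    (negative variance, so 0 means perfectly uniform)."""
    return -np.var(densities)

def local_uniformity(densities):
    """Higher when adjacent steps carry similar information density."""
    diffs = np.diff(densities)
    return -np.mean(diffs ** 2)
```

Under this reading, a trace can be globally non-uniform (densities drift across the trace) while remaining locally uniform (adjacent steps stay smooth), which is the pattern the paper associates with correct answers.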
This summary report highlights the evolving methodologies and insights into assessing and improving the reasoning capabilities of LLMs, emphasizing the unique contributions of the paper by Minju Gwak and colleagues. Their work offers a fresh perspective through the lens of information density, suggesting potential avenues for enhancing LLM reasoning quality.
Topic 2: Large Language Models and Their Applications
Topic Overview
Large Language Models (LLMs) have emerged as powerful tools in various applications, from natural language processing to multimodal reasoning and beyond. However, their deployment and effectiveness in real-world scenarios are contingent upon several factors, including the reliability of their outputs, the ability to integrate external tools, and their performance in specialized tasks such as historical document OCR and educational content generation. Research in this domain aims to uncover vulnerabilities, enhance performance, and explore new applications of LLMs to ensure they meet the demands of diverse and complex tasks. This report summarizes recent research efforts that address these challenges, contributing to a more comprehensive understanding of LLMs and their applications.
Individual Paper Contributions
- Riku Mochizuki from Keio University and colleagues studied the vulnerability of Generative Engines (GEs) to poisoning attacks via manipulated web content, proposing a classifier λ(c) that categorizes web sources into primary and secondary information types to evaluate the accuracy and reliability of information delivered by GEs. The main innovation points include the introduction of publisher attribute evaluation criteria and a focus on the political domain to safeguard democratic processes. The value lies in providing a more nuanced understanding of citation vulnerabilities and suggesting strategies to mitigate them. Experiments on political questions in Japan and the U.S. revealed that GEs in Japan rely more on official party sites, whereas those in the U.S. depend more on secondary sources, emphasizing the need for increased exposure of primary information sources to protect against poisoning attacks[^1]. (A toy sketch of such a primary/secondary source classifier appears after this list.)
- Milad Aghajohari from Mila and colleagues aimed to solve the inefficiency of reinforcement learning (RL) in training reasoning LLMs due to the unbounded state space, proposing to optimize RL thinking environments by bounding state spaces or improving attention-based policies. The main innovation lies in tackling the computational challenges of long chains of thought, whose attention cost grows quadratically with sequence length. The value lies in enhancing the scalability and efficiency of RL for training LLMs with extensive reasoning capabilities. Specific datasets and quantitative improvements are not reported, but the novelty of addressing this quadratic growth in computational cost is highlighted[^2].
- Yitao Long from New York University and colleagues introduced PuzzlePlex, a benchmark for evaluating the reasoning and planning abilities of foundation models through puzzle solving. The main innovation points include supporting a wide variety of puzzle types and environments, as well as employing a flexible framework that can adapt to evolving model capabilities. The value lies in providing a rigorous testbed for assessing models’ performance in complex, dynamic environments. Experiments demonstrated that reasoning models generally outperform non-reasoning models in instruction-based settings, but face challenges in the code-based setting, with DeepSeek-R1 achieving the highest normalized score of 0.62 in instruction-based scenarios[^3].
- Zhanke Zhou from Hong Kong Baptist University and colleagues addressed the limited reasoning capacity of foundation models (FMs) in complex domains like biology, chemistry, and healthcare, proposing AlphaApollo, an agentic reasoning system that integrates FMs with professional tools using a novel rollout framework and computational/retrieval modules. The main innovation points involve a hybrid error-correction mechanism and the Model Context Protocol (MCP) for tool integration. The value lies in enhancing FM reasoning and decision-making capabilities for real-world scientific tasks. Experiments on mathematics benchmarks showed up to 23.34% improvement in Average@32 scores and doubling of Pass@32 scores on larger models like Qwen3-235B-A22B and Llama3.3-70B-Instruct, indicating significant advancements in FM-driven complex reasoning[^4].
- Bryan R. Christ from the University of Virginia and colleagues tackled the challenge of generating standards-aligned educational math word problems (MWPs) using LLMs. The main innovation points include the development of the Standards-Targeted Educational Math (STEM) dataset and a joint human expert-LLM judge approach for evaluation. The value lies in automating MWP generation aligned with specific educational standards, enhancing personalized learning experiences. Experiments on the STEM dataset showed that EDUMATH 12B matches the performance of larger models, while EDUMATH 30B outperforms existing baselines, bridging the gap between smaller and larger LLMs in MWP generation[^5].
- Xuhang Chen from the University of Cambridge and colleagues focused on the inefficiency and redundancy in Multi-Agent Debate (MAD) systems for LLMs and Multimodal LLMs, proposing SID, a framework that uses self-signals for early-exit and compression mechanisms to improve debate efficiency and performance. The main innovation points are the use of internal confidence and semantic focus signals to drive debate dynamics. The value lies in reducing computational costs and improving accuracy in MAD systems. Experiments across various benchmarks and datasets demonstrated up to 40% reduction in token consumption and improved accuracy compared to existing MAD approaches[^6].
- André Greiner-Petter from Georg-August-Universität and colleagues addressed the issue of generative plagiarism in scientific articles, proposing a new framework for evaluating plagiarism detection systems in the era of AI-generated content. The main innovation points include the introduction of a synthetic dataset and the use of semantic similarity based on LLMs for detection. The value lies in adapting plagiarism detection methods to handle AI-generated content, ensuring academic integrity. Experiments showed that Linq outperformed other baselines and participant submissions on the new dataset, though there were challenges with recall on certain subsets of data[^7].
- Zecheng Tang from Soochow University and colleagues explored the degradation of reward model (RM) performance in long-context scenarios, proposing LongRM, a multi-stage training strategy, and Long-RewardBench, a benchmark for evaluating RMs under long-context conditions. The main innovation points involve the Short-to-Long Dataset Synthesis and Consistency Majority Voting methods. The value lies in preserving RM performance in short contexts while enhancing capabilities in long contexts. An 8B-parameter LongRM model surpassed 70B-scale baselines and matched Gemini 2.5 Pro’s performance on long-context evaluations, suggesting effective improvements in handling extended contexts[^8].
- Kaixiang Mo from Shopee and colleagues surveyed mid-training strategies for LLMs, proposing a unified paradigm encompassing data quality refinement, learning rate scheduling, and context length extension. The main innovation points include a taxonomy of mid-training strategies and a discussion on their mutual reinforcement. The value lies in providing a systematic review that integrates mid-training strategies to enhance LLM performance efficiently. Evaluation benchmarks included general knowledge, reasoning, mathematics, coding, multilingual understanding, and long-context processing, showing consistent performance gains across different dimensions of model capabilities[^9].
- Maria Levchenko from the Italian Institute of Germanic Studies and the University of Bologna studied the evaluation of LLMs for OCR tasks on historical documents, proposing a methodological framework with new metrics HCPR and AIR. The main innovation points include the development of contamination-aware dataset creation protocols and the use of context-enhanced prompts. The value lies in assessing LLM performance on non-standard typographies and evolving orthographic conventions. Experiments on a dataset of 1,029 scanned pages from 428 unique 18th-century books showed that Gemini-2.5-Pro performed best with a CER of 3.36%, indicating the potential of LLMs in historical OCR tasks[^10].
- Yunzhong Xiao from Carnegie Mellon University and colleagues introduced ToolMem, a framework for multimodal agents to develop learnable tool capability memory, addressing the lack of dynamic, learnable memory in tool selection and performance. The main innovation points are the structured capability memory and dynamic memory update mechanisms. The value lies in optimizing tool selection and improving task performance for multimodal agents. Experiments on text and image generation tasks showed significant reductions in error rates and improvements in correlation scores, particularly beneficial for weaker models[^11].
- Tarek Naous from Microsoft Research and colleagues focused on simulating realistic human users in multi-turn conversations with LLMs, proposing User Language Models (User LMs) trained on real human-assistant conversations. The main innovation points involve the use of <|endconversation|> tokens and evaluation on out-of-domain datasets. The value lies in providing a more realistic evaluation environment for assistant LMs. UserLM-8b outperformed baselines in generating diverse first turns, decomposing intent across turns, and recognizing natural conversation termination, indicating the effectiveness of User LMs in simulating human behavior[^12].
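As referenced in the first item above, a rule-based primary/secondary source classifier in the spirit of λ(c) can be sketched in a few lines. The domain lists, hostname heuristics, and function name here are illustrative assumptions rather than the paper's implementation.

```python
from urllib.parse import urlparse

# Illustrative seed lists; a real system would curate these per country and domain.
PRIMARY_DOMAINS = {"jimin.jp", "democrats.org", "gop.com"}
SECONDARY_HINTS = ("news", "blog", "wiki")

def classify_source(cited_url: str) -> str:
    """Toy λ(c): label a citation as 'primary' (the official publisher itself)
    or 'secondary' (reporting or aggregation about the publisher)."""
    host = urlparse(cited_url).netloc.lower().removeprefix("www.")
    if host in PRIMARY_DOMAINS:
        return "primary"
    if any(hint in host for hint in SECONDARY_HINTS):
        return "secondary"
    return "secondary"  # conservative default for unknown publishers
```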
Technical Trends
The papers reviewed here showcase a range of technical trends and methodological evolutions in the application of LLMs:
- Citation and Plagiarism: Efforts to enhance citation reliability and detect generative plagiarism using advanced classification and semantic similarity techniques.
- Reasoning and Planning: Development of benchmarks and frameworks to assess and improve the reasoning and planning capabilities of LLMs through puzzles and debates.
- Multimodal Integration: Exploration of how to integrate and optimize the use of external tools and multimodal data to enhance LLMs’ performance in specialized tasks.
- Instruction Following: Creation of synthetic datasets and training methodologies to refine LLMs’ instruction-following abilities and adapt to varying context lengths.
- Historical Document Processing: Introduction of new metrics and evaluation frameworks to assess LLM performance in OCR tasks for historical documents with unique challenges.
Datasets and Evaluation Metrics
- PuzzlePlex: Includes 15 curated puzzles in text-only and text-image formats, evaluated under instruction-based and code-based settings.
- Standards-Targeted Educational Math (STEM): Teacher-annotated dataset for math standards-aligned MWPs, evaluated using solvability, accuracy, educational appropriateness, and standards alignment.
- Long-RewardBench: First benchmark for evaluating RMs under long-context conditions up to 128K tokens.
- GenAI-Bench and BiGGen Bench: Used for evaluating image and text generation tools, respectively.
- WildChat and PRISM: Conversational datasets for evaluating User LMs in simulating realistic human users.
- 18th-century Russian Texts: Containing 1,029 scanned pages from 428 unique books, evaluated using CER, WER, HCPR, and AIR.
- MMLU-Pro, MATH, GPQA, ScienceQA, and MMStar: Datasets used to evaluate the performance of SID in various domains.
- AlpacaEval 2.0 and Arena-Hard: Used to assess the effectiveness of PiKa in post-training alignment.
These contributions collectively advance the understanding and application of LLMs in diverse fields, addressing key issues related to their reliability, efficiency, and specialized task performance.
Topic 3: Bias and Fairness in AI
Topic Overview
Bias and fairness in AI have become critical areas of focus as the technology becomes more deeply integrated into various aspects of society. Issues related to bias can manifest in different ways, from data-driven biases to biases introduced through training algorithms and methodologies. Addressing these concerns is vital for ensuring that AI systems do not perpetuate or exacerbate societal inequalities. Research in this area aims to develop methods and frameworks that can detect, quantify, and mitigate biases, ultimately leading to more equitable and fair AI applications.
Individual Paper Contributions
- Elena Senger from LMU Munich and colleagues studied the challenge of achieving robust automatic term extraction (ATE) across multiple domains without extensive human-annotated datasets. They proposed DiSTER (Distant Supervision for Term Extraction with Robustness), a framework that utilizes synthetic data generated via pseudo-labels from a large language model (LLM) to train smaller, open-source models. The main innovation points include the reduction of noise in synthetic data through manual annotation of common entity types and LLM classification of less common types, along with domain-aware data augmentation. The value lies in enabling more efficient and scalable ATE, enhancing various NLP applications such as document tagging, ontology construction, and patent analysis. Experiments spanning seven diverse domains demonstrated significant improvements in both corpus-level and document-level performance, with the fine-tuned LLaMA model achieving the highest overall performance and surpassing sequence-labeling and few-shot prompting methods[^13].
- Mingkang Zhu from The Chinese University of Hong Kong and colleagues addressed the cross-stratum bias in reinforcement learning (RL) for large language model (LLM) search agents. They introduced Stratified GRPO, which incorporates Stratified Advantage Normalization (SAN) to partition trajectories into homogeneous strata based on structural properties, thus mitigating cross-stratum bias. The main innovation is the rigorous theoretical analysis proving SAN’s effectiveness and practical stability under finite-sample regimes. The value lies in enhancing the training process and performance of LLM search agents, particularly in solving intricate, multi-step tasks. Experiments on seven diverse question-answering benchmarks showed that Stratified GRPO outperformed various baselines by up to 11.3 points in average performance, demonstrating higher training rewards, stability, and learning of more effective search policies[^14]. (A minimal sketch of stratified advantage normalization appears after this list.)
- Geng Liu from Politecnico di Milano and colleagues investigated social identity biases in Chinese large language models (LLMs). They developed a Mandarin-specific evaluation framework to detect biases in both base and instruction-tuned models, covering gendered pronouns and social groups relevant to the Chinese context. The main innovation is extending previous methodologies to a new linguistic and cultural context. The value lies in addressing the gap in research regarding how Chinese LLMs interact with social biases, contributing to fairness and ethical considerations in their deployment. Experiments revealed systematic biases in Chinese LLMs, including stronger outgroup hostility in pretrained models and gender asymmetry in negativity towards female outgroups. The analysis of naturalistic conversations from the WildChat corpus further confirmed the persistence and sometimes intensification of biases in real-world interactions[^15].
- Qinhao Zhou from Huazhong University of Science and Technology and colleagues focused on optimizing input prompts for LLMs in NLG and NLU tasks, particularly machine translation. They proposed the Rewriting Original Inputs (ROI) method, which uses smaller parameter models trained via a back-translation strategy to rewrite input data, enhancing alignment with LLM preferences. The innovation points include the specific targeting of the input component and the use of a filtering algorithm to maintain data quality. The value lies in improving LLM performance in real-world applications without extensive retraining. Experiments across various datasets showed significant performance improvements, such as a 2.9 BLEU score increase in the Medical dataset and a 0.28 RougeL score improvement in Xsum summarization[^16].
- Elle from University of Oxford and colleagues examined the sociodemographic biases encoded in reward models (RMs) used for aligning LMs with human preferences. They proposed a framework to measure and investigate these biases through attitudes, opinions, and values using multiple-choice questions derived from sociodemographically labeled datasets. The innovation lies in focusing on RMs, which have been relatively unexplored but are crucial for AI alignment and safety. The value is in providing a systematic analysis of RM perspectives compared to demographic groups, which can help prevent the reinforcement of harmful stereotypes. Experiments on established datasets like BBQ, OpinionQA, PRISM, and StereoSet revealed significant biases in RMs, consistent across different models[^17].
- Fan Zhou from KU Leuven and colleagues tackled the generation of stylistic text with specific attributes using diffusion models, proposing RegDiff, an attribute-regularized diffusion framework. The innovation involves integrating attribute control into diffusion models during training rather than inference, leveraging a VAE-based encoder-decoder architecture for reconstruction fidelity. The value lies in reducing computational costs and addressing the limitations of existing CFG and CG methods. Experiments on five datasets covering different stylistic attributes demonstrated that RegDiff outperformed baselines like Qwen2-0.5B and ParaGuide, achieving competitive style transfer accuracy and maintaining higher semantic stability[^18].
- Ingroj Shrestha from University of Iowa and colleagues aimed to align LLMs’ outputs with desired distributions, whether equal or reflective of real-world statistics. They introduced a weighted adaptive KL loss approach for bias mitigation, dynamically adjusting the loss based on group-specific dynamics and stability-aware weighting. The innovation lies in extending the traditional template-based approach to measure bias in MLMs to ALMs and combining KL divergence loss with MLM loss. The value is in achieving fairness while preserving language modeling capabilities. Experiments showed significant bias reduction with minimal impact on language modeling performance, especially in larger models like Llama3.1-8B-Instruct[^19].
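As noted in the Stratified GRPO item above, the core of Stratified Advantage Normalization is easy to state in code: advantages are normalized within each stratum instead of across the whole batch, so structurally different trajectories are never baselined against each other. This is a minimal sketch under that reading; the choice of stratum key is an assumption, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

def stratified_advantages(rewards, strata_keys, eps=1e-8):
    """Normalize each trajectory's reward within its own stratum.

    rewards:     array of scalar trajectory rewards
    strata_keys: hashable structural label per trajectory, e.g. the
                 number of search-tool calls it made (illustrative choice)
    """
    rewards = np.asarray(rewards, dtype=float)
    groups = defaultdict(list)
    for i, key in enumerate(strata_keys):
        groups[key].append(i)

    adv = np.empty_like(rewards)
    for idx in groups.values():
        r = rewards[idx]
        adv[idx] = (r - r.mean()) / (r.std() + eps)  # per-stratum baseline
    return adv
```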
Technical Trends
The papers collectively demonstrate a growing interest in developing methodologies that address bias at multiple levels of AI systems, including data preprocessing, model training, and post-processing. Innovations range from distant supervision techniques to improve cross-domain robustness, to reinforcement learning methods that handle structural heterogeneity, and to the integration of attribute control during training phases of diffusion models. Additionally, there is a trend towards creating culturally and linguistically specific evaluation frameworks and methods for detecting and mitigating biases, particularly in the context of non-English languages like Chinese.
Datasets and Evaluation
- DiSTER utilized a diverse set of seven domains to evaluate both corpus-level and document-level performance.
- Stratified GRPO was tested on seven question-answering benchmarks.
- Probing Social Identity Bias employed a dataset of over 297,600 generated texts and analyzed naturalistic conversations from the WildChat corpus.
- Rewriting Original Inputs (ROI) involved datasets such as IT, Medical, Koran, Law for translation tasks, and Xsum for summarization tasks.
- Reward Model Perspectives used datasets like BBQ, OpinionQA, PRISM, and StereoSet to assess biases.
- Controllable Stylistic Text Generation experimented with five datasets covering different stylistic attributes.
- Desired Distributions applied a weighted adaptive KL loss method across various models, comparing it with baselines using GLUE scores and MLM losses (a minimal sketch of such a loss follows).
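As referenced in the last item, a weighted KL objective for aligning group distributions can be sketched briefly in PyTorch; the group-token extraction and the form of the stability-aware weights are illustrative assumptions about the method.

```python
import torch
import torch.nn.functional as F

def weighted_kl_bias_loss(logits, group_token_ids, target_dist, group_weights):
    """KL between the model's distribution over group terms (e.g. gendered
    words at a masked or next-token slot) and a desired target distribution,
    with per-group adaptive weights.

    logits:          (batch, vocab) model outputs at the slot of interest
    group_token_ids: tensor of vocabulary ids for the group terms
    target_dist:     (G,) desired distribution (equal or real-world)
    group_weights:   (G,) adaptive weights, assumed given here
    """
    probs = F.softmax(logits, dim=-1)[..., group_token_ids]
    probs = probs / probs.sum(dim=-1, keepdim=True)  # renormalize over groups
    # per-group divergence terms, weighted before summing
    terms = probs * (probs.clamp_min(1e-12).log() - target_dist.log())
    return (group_weights * terms).sum(dim=-1).mean()
```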
These papers collectively emphasize the importance of rigorous testing across a variety of datasets to understand and mitigate biases effectively, highlighting the need for both quantitative metrics and qualitative assessments to ensure fairness and robustness in AI applications.
Topic 4: Machine Translation and Multilingual Systems
Topic Overview
Machine Translation and Multilingual Systems represent a critical area of research in Natural Language Processing (NLP), focusing on the development of systems capable of translating text between multiple languages with high accuracy and efficiency. These systems are essential for breaking down language barriers in global communication, education, and research. As NLP evolves, the challenge shifts towards optimizing these systems for specific domains and low-resource languages, ensuring that they can handle diverse and specialized vocabularies while maintaining computational feasibility.
Individual Paper Contributions
- Nouman Ahmed from University of Oxford and colleagues studied the identification of optimal word representation and tokenization methods for the scientific domain in NLP. They proposed a comprehensive evaluation framework utilizing the Iris.AI’s Abstracts dataset for training and evaluating word embeddings and tokenization methods. The main innovation points are the focused examination of the scientific domain and the comparison of low-resource models like Word2Vec and FastText with high-compute models like SciBERT and transformer-based architectures. The value lies in the development of an evaluation suite that includes multiple downstream NLP tasks, providing both intrinsic and extrinsic assessments. Experiments on the UNMSRS, SemEval, and Clinical STS datasets showed that Word2Vec models, particularly those using Skipgram with a 200-dimensional vector, achieved the best results in word and sentence similarity tasks, while SciBERT outperformed others in Named Entity Recognition (NER) tasks. The study concluded that static embeddings like Word2Vec can be highly effective for certain NLP tasks in scientific contexts, whereas context-aware embeddings are superior for more complex tasks like NER[^20].
- Phuong Tuan Dat from Hanoi University of Science and Technology and colleagues tackled the issue of improving synthetic speech detection within Automatic Speaker Verification (ASV) systems. They introduced XLSR-Kanformer, a model that integrates Kolmogorov-Arnold Networks (KANs) into the XLSR-Conformer architecture, replacing traditional MLP components. The novelty of this work lies in the application of KANs to SSL architectures, aiming to enhance feature learning and robustness in synthetic speech detection. The value is in the demonstrated improvements in Equal Error Rate (EER) and min t-DCF metrics using the ASVspoof2021 dataset. Experiments showed that XLSR-Kanformer outperforms several state-of-the-art models, achieving a 60.55% relative improvement in EER on the ASVspoof 21LA and DF sets. The conclusion was that KANs, especially when applied to the convolution module, significantly improve the model’s performance and adaptability[^21].
- Cheng-Han Chiang from National Taiwan University and colleagues addressed the limitation of current Spoken Language Models (SLMs) in generating timely and accurate responses during real-time interactions. They proposed Shanks, a framework enabling SLMs to process streaming speech input in chunks and simultaneously generate thinking chunks, thereby improving real-time interaction capabilities. The innovation points are the simultaneous hearing and thinking approach, which balances early API call accuracy with response quality. The value lies in the framework’s ability to enhance the responsiveness and interactivity of SLMs, crucial for applications such as tutoring systems and task-oriented dialogues. Experiments on math problem-solving and travel planning tasks demonstrated a 71% increase in interruption accuracy on the wrong subset and a 37.1% improvement in interruption validity. Combining Shanks with a ‘call-after-listen’ approach yielded the best performance, indicating a balanced strategy for real-time interaction[^22].
- Aryan Kumar Singh from Indian Institute of Science (IISc) and colleagues developed Prakriti200, a questionnaire-based dataset of 200 Ayurvedic Prakriti assessments. The dataset is bilingual (English-Hindi) and includes mandatory responses with automated backend scoring. The main innovation is the structured, standardized approach to collecting and analyzing Ayurvedic Prakriti data, which bridges traditional practices with modern data-driven research. The value lies in enabling statistical analyses and AI-based predictive modeling in the context of personalized health and lifestyle recommendations. Analysis of the dataset indicated a predominance of Pitta constitutions among young adult participants, reflecting typical demographic distributions and suggesting avenues for future research in personalized health analytics[^23].
- Toshiki Nakai from Saarland University and colleagues aimed to address the challenge of improving machine translation quality for low-resource languages into high-resource languages. They introduced TRepLiNa, a method that combines Centered Kernel Alignment (CKA) and REPINA to enforce cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM). The innovation lies in the systematic study of layer-wise alignment in a decoder-only LLM for low-resource machine translation. The value is in providing a low-cost, effective approach for enhancing translation quality, particularly in data-scarce settings. Experiments on the MMLoSo benchmark using the Aya-23 8B model showed that TRepLiNa, especially at layer 15, outperformed baselines in translations from Mundari and Santali to Hindi and English. The study concluded that layer-wise alignment is a promising technique for improving translation quality in low-resource language scenarios[^24]. (A minimal linear-CKA sketch appears after this list.)
- Vaibhav Srivastav and colleagues addressed the need for a standardized and transparent evaluation of automatic speech recognition (ASR) systems, especially for multilingual and long-form speech recognition. They introduced the Open ASR Leaderboard, an interactive platform that compares over 60 ASR models across 11 datasets, focusing on accuracy and efficiency metrics. The main innovation is the inclusion of a wide variety of models and languages, along with both WER and RTFx (inverse real-time factor) metrics. The value is in providing a comprehensive and transparent evaluation framework that supports informed decision-making for ASR system selection. Insights from the leaderboard highlighted trade-offs between model specialization and broad language coverage, with Conformer encoders and LLM-based decoders offering high accuracy but lower efficiency, while self-supervised learning models supported many languages but performed poorly on English transcription. Whisper Large v3 emerged as the best-performing open-source model for long-form transcription, though with slower inference times compared to its distilled variants[^25].
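As referenced in the TRepLiNa item above, a minimal linear-CKA computation over paired hidden states is sketched below. How TRepLiNa converts the score into a training objective and how sentence pairs are batched are not reproduced here; the auxiliary-loss line is an illustrative assumption.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two sets of hidden states.

    X: (n, d1) hidden states for source-language inputs at some layer
    Y: (n, d2) hidden states for the paired target-language inputs
    """
    X = X - X.mean(axis=0, keepdims=True)  # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# One illustrative way to use the score as an auxiliary alignment objective
# at a chosen layer (e.g. layer 15 in the paper's best setting):
# loss = task_loss + lam * (1.0 - linear_cka(h_src_layer15, h_tgt_layer15))
```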
Technical Trends
The papers collectively demonstrate a trend towards domain-specific and resource-efficient methodologies. Nouman Ahmed and colleagues emphasize the importance of tailored word representations for specialized domains like science, while Toshiki Nakai and colleagues focus on improving low-resource language translation through innovative alignment techniques. Phuong Tuan Dat and colleagues highlight the integration of advanced network structures (like KANs) to enhance synthetic speech detection, showcasing the evolution towards more sophisticated and robust models. Cheng-Han Chiang and colleagues push the boundaries of interactive spoken language models by enabling simultaneous processing and reasoning, indicative of efforts to simulate human-like conversational capabilities. Finally, Vaibhav Srivastav and colleagues underscore the need for transparent and comprehensive benchmarking platforms, reflecting a growing emphasis on standardized evaluation and comparison frameworks.
Datasets and Evaluation
- Scientific Domain: Iris.AI’s Abstracts dataset
- Synthetic Speech Detection: ASVspoof2021 dataset
- Spoken Language Models: Math problem-solving and travel planning tasks datasets
- Ayurvedic Prakriti: Prakriti200 dataset
- Low-Resource Machine Translation: MMLoSo benchmark, Aya-23 8B model
- Automatic Speech Recognition: Open ASR Leaderboard, including datasets for German, French, Italian, Spanish, Portuguese, and short-form English transcription.
Evaluation metrics include (a short computational sketch of a few of these follows the list):
- Scientific Domain: Pearson correlation for word and sentence similarity, F-Beta score for NER
- Synthetic Speech Detection: Equal Error Rate (EER), min t-DCF
- Spoken Language Models: Interruption accuracy and validity, API call success rate, response latency
- Ayurvedic Prakriti: Rule-based scoring for Prakriti assessment
- Low-Resource Machine Translation: Weighted composite score (0.6 × BLEU + 0.4 × chrF)
- Automatic Speech Recognition: Word Error Rate (WER), Real-Time Factor (RTFx)
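The sketch below shows common formulations of EER, RTFx, and the weighted composite score, assuming score arrays and timings are already collected; the papers' exact implementations may differ.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """EER: operating point where false-acceptance and false-rejection rates meet.
    Assumes higher scores indicate genuine speech."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: values above 1 mean faster than real time."""
    return audio_seconds / processing_seconds

def composite_mt_score(bleu, chrf):
    """Weighted composite used in the low-resource MT experiments."""
    return 0.6 * bleu + 0.4 * chrf
```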
These summaries encapsulate the key contributions and insights provided by each paper, highlighting advancements in embedding frameworks, synthetic speech detection, interactive spoken language models, Ayurvedic health analytics, low-resource machine translation, and ASR evaluation methodologies.
Topic 5: Data Handling and Annotation
Topic Overview
Data handling and annotation play a critical role in the development and deployment of large language models (LLMs). These processes involve not only the management and preprocessing of vast amounts of data but also ensuring that the data is annotated with precision and consistency. Effective data handling and annotation are essential for improving the reliability, security, and interpretability of LLMs, particularly in domains where data privacy and model transparency are paramount. This report summarizes recent research efforts aimed at addressing these challenges through innovative methodologies and frameworks.
Individual Paper Contributions
- Bharti Meena and colleagues studied the reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts using LLMs, proposing a scalable multilingual data curation framework for high-quality PII annotation. The main innovation points of this framework are its phased, human-in-the-loop annotation methodology and the use of inter-annotator agreement metrics and root-cause analysis to enhance annotation quality. The value lies in its structured approach to address the challenges of multilingual PII labeling, thereby improving model performance and ensuring compliance with privacy regulations. Experiments on various textual corpora showed significant improvements in Recall and reductions in False Positive Rate (FPR) for most locales, concluding that the framework effectively reduces annotation errors and improves PII detection over time[^26].
- Shuo Shao from Zhejiang University and colleagues addressed the problem of reliably identifying whether a third-party LLM is derived from a copyrighted source model under black-box conditions. They introduced ZeroPrint, a novel black-box LLM fingerprinting framework that leverages zeroth-order gradient estimation to approximate the Jacobian matrix as a unique fingerprint. The main innovation points are the use of semantically preserved variations of base queries and the demonstration that gradients contain more information than outputs for fingerprinting purposes. The value lies in its robustness against adaptive attacks and superior performance over existing methods. Experiments on LeaFBench showed that ZeroPrint outperforms other state-of-the-art black-box fingerprinting methods across various metrics, concluding that leveraging gradients in a black-box setting enhances the reliability of model fingerprinting[^27].
- Zaid Alyafeai from KAUST and colleagues tackled the challenge of extracting metadata accurately and efficiently from scientific papers. They introduced MeXtract, a family of lightweight language models fine-tuned from Qwen 2.5 counterparts, employing schema-based preprocessing, supervised instruction tuning, and preference optimization techniques. The main innovation points are the creation of lightweight yet state-of-the-art models and the introduction of preference optimization to adhere to metadata format constraints. The value lies in its ability to generalize across different schemas and maintain performance consistency over time. Experiments demonstrated that the 3B variant of MeXtract achieved state-of-the-art results on the MOLE benchmark, concluding that lightweight models can effectively meet the demands of metadata extraction without pre-training contamination[^28].
- Philipp Mondorf from LMU Munich and colleagues explored new ensemble strategies for circuit localization within LLMs, proposing parallel and sequential ensembling techniques. The main innovation points are the introduction of the EAP-IG-inputs method for warm-starting edge pruning and the integration of different attribution methods to recover sign information. The value lies in enhancing the interpretability of LLMs by accurately localizing circuits responsible for specific tasks. Experiments on the MIB benchmark revealed that the hybrid ensembling strategy, combining parallel and sequential approaches, achieved the best performance, concluding that ensemble methods can significantly improve the accuracy of circuit localization[^29].
- I-Fan Lin from Leiden University and colleagues focused on the challenge of short text clustering in a training-free and label-free environment. They proposed TWIST, a method that iteratively refines sparse vector representations without requiring labeled data, contrastive learning, or fine-tuning. The main innovation points are the ability to approximate the number of clusters dynamically and the use of knowledge distillation to scale implementation. The value lies in its flexibility and low resource requirement, making it suitable for real-time applications. Experiments across multiple benchmarks and clustering algorithms demonstrated that TWIST outperformed existing baselines, concluding that the method significantly enhances the performance of short text clustering[^30].
- Yisha Wu from Airbnb Inc. and colleagues aimed to improve the efficiency and accuracy of customer support summarization by developing an incremental summarization system that generates real-time summary notes. They introduced a fine-tuned Mixtral-8x7B language model for summarization and a DeBERTa-based classifier to filter non-essential information, alongside an Agent-Edits Learning Framework that incorporates agent feedback for continuous model refinement. The main innovation points are the real-time summarization capability and the incorporation of human feedback into the model learning process. The value lies in reducing agent workload and improving summary quality, leading to faster case resolution. Experiments showed a 3% reduction in case handling time and significant improvements in summary quality, concluding that the system enhances productivity and maintains high customer satisfaction[^31].
- Manuel Frank from Munster Technological University and colleagues investigated the overfitting of sentence embedding models to static benchmarks. They introduced PTEB, a dynamic evaluation paradigm that uses LLMs to generate stochastic paraphrases at evaluation time, ensuring more robust and contamination-resistant testing. The main innovation points are the dynamic generation of semantically equivalent but textually distinct problem instances and the demonstration of model robustness across different sizes and domains. The value lies in providing a more realistic assessment of model performance under varied input conditions. Experiments revealed that embedding models perform worse on PTEB compared to static benchmarks, concluding that the method effectively stresses the semantic invariance of embedding models and highlights their true robustness[^32].
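The PTEB evaluation loop is compact enough to sketch. Below, `paraphrase` is a hypothetical stand-in for the LLM paraphraser, the model name is only an example, and the correlation drop between static and paraphrased runs is one illustrative way to quantify benchmark overfitting.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def paraphrase(text: str) -> str:
    """Hypothetical LLM-backed paraphraser: returns a semantically
    equivalent but textually distinct rewrite of `text`."""
    raise NotImplementedError

def paraphrased_similarity_drop(model_name, sent_pairs, gold_sims):
    """Compare an embedding model's STS correlation on the static pairs
    against freshly paraphrased versions of the same pairs."""
    model = SentenceTransformer(model_name)  # e.g. "all-MiniLM-L6-v2"

    def sims(pairs):
        a = model.encode([p[0] for p in pairs], normalize_embeddings=True)
        b = model.encode([p[1] for p in pairs], normalize_embeddings=True)
        return (a * b).sum(axis=1)  # cosine similarity of unit vectors

    static = np.corrcoef(sims(sent_pairs), gold_sims)[0, 1]
    shaken = [(paraphrase(a), paraphrase(b)) for a, b in sent_pairs]
    dynamic = np.corrcoef(sims(shaken), gold_sims)[0, 1]
    return static - dynamic  # a positive drop suggests benchmark overfitting
```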
Technical Trends
The papers collectively demonstrate a shift towards more sophisticated, adaptable, and robust methodologies for handling and annotating data. Key trends include:
- Human-in-the-loop methodologies: Incorporating human judgment and feedback into the annotation and summarization processes to enhance reliability and accuracy.
- Ensemble techniques: Combining multiple models or methods to improve performance and reduce bias in tasks such as circuit localization and summarization.
- Dynamic and stochastic approaches: Utilizing LLMs to generate dynamic variations of tasks and data at runtime to test robustness and avoid overfitting.
- Lightweight models: Developing smaller, more efficient models that can still achieve state-of-the-art performance, particularly beneficial in resource-constrained environments.
- Interpretability and transparency: Enhancing the interpretability of LLMs through methods like circuit localization, which helps in understanding model behavior and improving trustworthiness.
Datasets and Evaluation
The datasets and evaluation metrics used in the papers include:
- Scalable multilingual PII annotation: Various textual corpora, including user-generated content, were used. Evaluation metrics included inter-annotator agreement, FPR, and Recall.
- ZeroPrint: LeaFBench, a benchmark for LLM copyright auditing, was utilized. Metrics such as AUC, pAUC, TPR@1%FPR, and MD were employed.
- MeXtract: The MOLE benchmark was extended with model-specific metadata. Performance was evaluated using accuracy and adherence to metadata format constraints.
- BlackboxNLP-2025 MIB Shared Task: The MIB benchmark was used for evaluating circuit localization methods. CMD and CPR scores were the primary metrics.
- TWIST: Multiple benchmarks and clustering algorithms (HDBSCAN, K-means) were used. NMI and ACC were key evaluation metrics.
- Incremental Summarization for Customer Support: Real-world customer support cases from Airbnb were analyzed. Summary quality was assessed using completeness, truthfulness, and Diff-in-Diff analysis.
- PTEB: Static benchmarks were complemented with dynamic paraphrases generated at evaluation time. Factual accuracy and semantic consistency were prioritized over lexical overlap metrics.
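NMI comes directly from scikit-learn (`normalized_mutual_info_score`), but clustering accuracy (ACC) needs an optimal one-to-one matching between predicted clusters and gold labels. A standard Hungarian-matching formulation is sketched below; TWIST's evaluation code may differ in details.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping of predicted
    cluster ids to gold label ids (Hungarian matching)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1  # co-occurrence of predicted cluster p and label t
    rows, cols = linear_sum_assignment(-counts)  # maximize matched pairs
    return counts[rows, cols].sum() / len(y_true)
```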
These evaluations underscore the importance of using diverse datasets and metrics to ensure that LLMs and related methodologies perform well under varied and real-world conditions.
Topic 6: Security and Privacy in AI
Topic Overview
Security and privacy in AI, particularly in the context of Large Language Models (LLMs), have become increasingly important as these models find widespread application in sensitive domains such as healthcare, finance, and personal data management. Ensuring that AI systems respect user privacy and maintain robust security measures is essential for their ethical and safe deployment. This topic explores innovative methods and frameworks aimed at mitigating privacy risks, enhancing model reliability, and developing sophisticated cybersecurity defenses using AI technologies.
Individual Paper Contributions
- Junki Mori from NEC Corporation and colleagues studied the privacy risks associated with Retrieval-Augmented Generation (RAG) systems interacting with sensitive databases. They proposed DP-SynRAG, a differentially private synthetic text generation method for RAG databases. The main innovation points of this method are the generation of synthetic data in advance, which can be reused without additional privacy budget consumption, and the use of soft clustering based on keyword and document embeddings. The value lies in maintaining high utility in RAG applications while preserving privacy, particularly in domains requiring strict data confidentiality. Experiments on Medical Synth, Movielens, and SearchQA datasets showed that DP-SynRAG outperformed existing DP-RAG and DP-Synth methods in terms of accuracy and privacy preservation, concluding that DP-SynRAG effectively mitigates privacy risks by reducing sensitive information leakage[^33].
- Nyal Patel from Imperial College London and colleagues focused on the lack of interpretability and safety in LLMs due to hidden reward signals during Reinforcement Learning from Human Feedback (RLHF). They introduced FA-IRL, a failure-aware inverse reinforcement learning algorithm, to infer more accurate and stable reward functions. The main innovation points include a dual-path reward model with a correction head for handling failures and a curriculum that increases the complexity of addressed failures during training. The value lies in significantly reducing the non-identifiability problem in standard IRL and improving the detoxification capabilities of LLMs. Experiments using the RealToxicityPrompts and Jigsaw Toxicity datasets demonstrated that FA-IRL outperformed standard IRL baselines in various metrics, including classification accuracy, F1 score, ROC-AUC, and STARC, concluding that FA-IRL enhances the safety and reliability of LLMs[^34].
- Boyi Zeng from Shanghai Jiao Tong University and colleagues tackled the challenge of identifying LLMs to verify their origins and prevent intellectual property theft. They developed AWM, a training-free fingerprinting method that uses Linear Assignment Problem (LAP) and Centered Kernel Alignment (CKA) similarity to create a robust and high-fidelity similarity metric. The main innovation points are the resilience against post-training modifications and the low false-positive rate. The value lies in protecting the intellectual property of LLMs and ensuring the authenticity of models in the market. Experiments on a dataset of 60 positive (base-offspring) and 90 negative (independent) model pairs confirmed the effectiveness of AWM in distinguishing between related and unrelated models, achieving perfect scores on all classification metrics across various scenarios, concluding that AWM offers superior discriminative power and reliability[^35].
- Tiancheng Xing from National University of Singapore and colleagues explored the vulnerability of LLMs as rerankers in information retrieval systems, where minor changes to item descriptions can manipulate rankings. They introduced Rank Anything First (RAF), a gradient-guided prompt optimization framework. The main innovation points include optimizing prompts token-by-token to balance attack success and text fluency. The value lies in exposing and analyzing the weaknesses of LLMs as rerankers, which is crucial for improving system reliability and user trust. Experiments against strong baselines like SRP and STS across various LLMs and product categories showed that RAF achieved lower average ranks and perplexity while maintaining a low bad word ratio, concluding that RAF is both robust and stealthy[^36].
- Sri Durga Sai Sowmya Kadali from University of California, Riverside and colleagues investigated detecting jailbreak prompts in LLMs that aim to bypass safety mechanisms. They proposed a tensor-based methodology using CP decomposition to identify latent factors that differentiate jailbreak from benign prompts. The main innovation points are the focus on internal model representations and the use of tensor decomposition methods. The value lies in enhancing the robustness and trustworthiness of conversational AI systems by preventing unauthorized access to sensitive content. Experiments using the HuggingFace Jailbreak Classification dataset revealed that the tensor decomposition method, combined with simple classifiers, outperformed baselines, especially in the middle layers of both GPT-J and Mamba2 models, concluding that the middle layers are particularly informative for jailbreak detection[^37].
- Pierre Lison from the Norwegian Computing Center and colleagues addressed the risk of search-based linkage attacks on de-identified documents, proposing a method to prevent such attacks by rephrasing unique N-grams. The main innovation points include the use of an inverted index to identify and rephrase infrequent N-grams, ensuring semantic integrity. The value lies in providing a comprehensive solution to linkage attacks, a gap in current de-identification practices. Experiments on a dataset of 13,759 English-language court cases from the European Court of Human Rights (ECHR) showed that the proposed method significantly reduced linkage risks while maintaining high semantic similarity and fluency, concluding that the method effectively prevents linkage attacks[^38]. (A minimal inverted-index sketch for flagging unique N-grams appears after this list.)
- Muris Sladić from Czech Technical University in Prague and colleagues improved the realism and interactivity of AI-based deception systems in cybersecurity by introducing VelLMes. The main innovation points are the simulation of multiple network protocols and services, and the use of prompt engineering to ensure realistic interactions. The value lies in enhancing the effectiveness of cyber deception strategies by better engaging and monitoring attackers. Unit tests and human attacker evaluations involving 89 participants, along with real-life deployment, confirmed that LLMs can accurately simulate network protocols and services, with shelLM correctly responding to over 90% of commands issued by real attackers, concluding that integrating LLMs into deception frameworks improves their realism and engagement[^39].
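As referenced in the linkage-attack item above, the inverted-index step that finds attack-enabling N-grams is simple to sketch. The n-gram length and document-frequency threshold below are illustrative assumptions; the actual rephrasing would be done by a model under semantic-similarity constraints.

```python
from collections import defaultdict

def build_inverted_index(docs, n=5):
    """Map each word n-gram to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return index

def flag_unique_ngrams(docs, n=5, max_df=1):
    """Return, per document, the n-grams appearing in at most `max_df`
    documents; these are the phrases a search-based linkage attack
    could use as keys, and hence candidates for rephrasing."""
    index = build_inverted_index(docs, n)
    flagged = defaultdict(list)
    for ngram, doc_ids in index.items():
        if len(doc_ids) <= max_df:
            for doc_id in doc_ids:
                flagged[doc_id].append(" ".join(ngram))
    return flagged
```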
Technical Trends
The papers in this collection highlight several evolving technical trends in addressing security and privacy challenges in AI:
- Differential Privacy: Methods like DP-SynRAG leverage differential privacy to ensure data confidentiality in interactive AI systems, emphasizing the importance of maintaining utility while preserving privacy.
- Inverse Reinforcement Learning: FA-IRL uses inverse reinforcement learning to understand and improve the safety and interpretability of LLMs, particularly in detoxification tasks.
- Fingerprinting Techniques: AWM employs advanced fingerprinting techniques to verify the origins of LLMs, safeguarding against intellectual property theft.
- Prompt Optimization: RAF focuses on optimizing prompts to manipulate and analyze LLMs’ ranking capabilities, revealing the models’ vulnerabilities.
- Internal Layer Analysis: The tensor-based approach in “Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?” examines internal model dynamics for enhanced security.
- Deception Frameworks: VelLMes integrates LLMs into cybersecurity deception frameworks to simulate various network services realistically and engage human attackers effectively.
Datasets and Evaluation
The papers utilized a diverse set of datasets and employed various evaluation metrics to assess their proposed methods:
- Medical Synth, Movielens, and SearchQA: Used to validate DP-SynRAG’s effectiveness in generating synthetic text under differential privacy constraints.
- RealToxicityPrompts and Jigsaw Toxicity: Employed to test FA-IRL’s detoxification capabilities and compare it with standard IRL methods.
- GPT-J and Mamba2: Models analyzed to identify patterns for jailbreak detection using tensor decomposition.
- European Court of Human Rights (ECHR): Dataset of 13,759 court cases used to evaluate the effectiveness of methods in preventing search-based linkage attacks.
- Various Open-Source LLMs and Product Categories: Used to benchmark RAF’s performance in manipulating rankings.
- Unit Tests and Human Attacker Evaluations: Conducted to assess the realism and interactivity of VelLMes in simulating network services.
Evaluation metrics included classification accuracy, F1 score, ROC-AUC, STARC, average rank, perplexity, bad word ratio, similarity scores, and probing accuracy, reflecting the varied goals of each paper, from privacy preservation to model reliability and cybersecurity enhancement.
Topic 7: Evaluation and Benchmarking of AI Models
Topic Overview
The evaluation and benchmarking of AI models, particularly large language models (LLMs), is a critical area of research aimed at improving their reliability, ethical alignment, and performance across diverse applications. These studies focus on addressing specific challenges such as aligning AI-generated content with user preferences, enhancing the quality and coherence of generated texts, mitigating overthinking and computational inefficiencies, and ensuring factual accuracy in specialized domains like finance. By proposing innovative benchmarks and methodologies, these papers contribute to the broader goal of making AI models more adaptable, efficient, and trustworthy for various real-world scenarios.
Individual Paper Contributions
- Kshitish Ghate from University of Washington and colleagues studied the lack of steerability of large language models (LLMs) and reward models (RMs) towards diverse user preferences and values. They proposed EValueSteer, a benchmark designed to measure the steerability of reward models towards user-defined value and stylistic preferences. The main innovation points are the use of a synthetic dataset based on prompts from the PRISM corpus and value-loading questions from the World Values Survey (WVS) to systematically vary value and style dimensions. The value lies in providing a novel method for assessing the alignment of AI systems with diverse human values and preferences, enhancing ethical considerations and usability in global contexts. Experiments on six different models and two types of reward models showed significant improvements in pairwise preference accuracy with full user context, reaching around 75%, compared to 42% without context, and revealed biases towards certain values and styles[^40].
- Mingzhe Zheng from The Hong Kong University of Science and Technology and colleagues addressed the challenge of generating movie scripts with emotional depth, thematic meaning, and narrative coherence using large language models (LLMs). They introduced CML-Bench, a comprehensive evaluation framework that includes nine interpretable metrics to assess dialogue coherence, character consistency, and plot reasonableness. The key innovation is the development of CML-Instruction, a prompting strategy to enhance the quality of generated scripts. The value lies in systematically defining and measuring the qualities essential for compelling screenplays, thereby advancing the potential of LLMs in creative writing industries. Experiments demonstrated that instruction-tuned models nearly matched human-level coherence, with Qwen3-30B achieving the highest score in Narrative Innovation (PR3) and strong correlation with human judgments[^41].
- Jiakang Wang from Kuaishou Technology and colleagues tackled the issue of overthinking in large reasoning models (LRMs), which can degrade performance and increase computational costs. They proposed Gold-Switch, a training-free superposition strategy that balances slow-thinking LRMs and fast-thinking LLMs using low-rank approximation to selectively apply an overthinking component. The main innovation is the entropy-based method for determining the optimal rank of the low-rank approximation. The value lies in reducing overthinking while preserving complex reasoning capabilities, leading to more efficient and practical models for real-world applications. Experiments on datasets like ASDIV, GSM8K, AIME, and GPQA showed up to 2.7× speedup and negligible performance drops, with the hard superposition method being more effective in controlling overthinking[^42]. (A sketch of one entropy-based rank choice appears after this list.)
- Haiquan Lu from National University of Singapore and colleagues focused on the inefficiency and redundancy in elaborate reasoning in large reasoning models (LRMs) for chain-of-thought (CoT) processes. They introduced MixReasoning, a dynamic framework that adjusts the depth of reasoning based on task complexity, using lightweight LoRA adapters to control reasoning modes. The innovation lies in integrating thinking and non-thinking abilities without degrading the base model, and a hardware-friendly design allowing KV-cache reuse. The value is in improving the accuracy-efficiency trade-offs in reasoning tasks. Experiments on GSM8K, MATH-500, and AIME24 showed significant reductions in token usage while maintaining or improving accuracy, confirming the effectiveness of MixReasoning in managing reasoning depth[^43].
- Markus Reuter from Technical University of Darmstadt and colleagues addressed the ‘Document-Level Retrieval Mismatch’ (DRM) in Retrieval-Augmented Generation (RAG) systems for large-scale legal document datasets. They proposed Summary-Augmented Chunking (SAC), a lightweight technique that enriches text chunks with synthetic summaries to maintain global context during retrieval. The value lies in improving the reliability of RAG systems in legal applications, reducing DRM and enhancing retrieval precision and recall. Experiments on privacy policies, non-disclosure agreements, and merger agreements revealed that SAC outperformed standard RAG approaches, reducing DRM rates and improving overall retrieval performance[^44]. (A minimal chunking sketch appears at the end of this topic.)
-
Luca Giordano from ScaDS.AI Dresden/Leipzig and colleagues examined the challenge of extracting implicit factual knowledge from LLMs into a structured, explicit format using the GPTKB methodology. They introduced miniGPTKBs, domain-specific subcrawls of the LLM knowledge base, and developed a systematic method to evaluate reproducibility and robustness. The innovation lies in providing a proof of termination for GPTKB and proposing ensembling techniques to improve output stability. The value is in making LLM knowledge more accessible and reliable for various applications. Experiments confirmed the termination of the GPTKB approach and showed improved output stability through ensembling, with high semantic similarity and yield consistency across different topics.45
-
Elena Chistova from [institution not specified] and colleagues aimed to develop a unified Rhetorical Structure Theory (RST)-style discourse parser that can handle various treebanks across different languages. They proposed UniRST, which uses two training strategies: Multi-Head (MH) and Masked-Union (MU), to respect individual treebank relation inventories while leveraging shared representations. The innovation lies in the Masked-Union strategy and a data augmentation technique for end-to-end mono-treebank parsing. The value is in improving the robustness and applicability of discourse analysis tools across languages and genres. Experiments indicated that MU outperforms MH in parsing accuracy and efficiency, especially for overlapping relations, and UniRST surpassed mono-treebank baselines in most datasets.46
-
Pranav Gupta from Lowe’s and colleagues addressed the scarcity of high-quality specialized training and evaluation datasets for LLMs used in STEM education, particularly at the college level and for languages other than English. They introduced OpenStaxQA, a multilingual dataset derived from open-source college textbooks, and proposed the use of quantized low-rank adapters (QLoRa) for fine-tuning LLMs. The innovation lies in the structured approach to dataset creation and the focus on end-of-chapter exercises in multiple languages. The value is in enhancing LLM performance for complex, college-level STEM materials and promoting generalization across domains. Fine-tuning experiments showed improved performance on OpenStaxQA but varying results on zero-shot AI2RC datasets.47
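The entropy-based rank selection in Gold-Switch is described above only at a high level; the following is a minimal sketch of how such a choice could work, assuming the component to be switched out is available as a weight-difference matrix. The function names and the coverage heuristic are illustrative, not taken from the paper.

```python
import numpy as np

def entropy_rank(delta_w: np.ndarray, coverage: float = 0.9) -> int:
    """Choose a rank from the spectral entropy of a weight-difference matrix.

    The normalized singular values are treated as a probability
    distribution; exp(entropy) gives an 'effective rank', scaled here
    by a coverage factor. This is an illustrative heuristic only.
    """
    s = np.linalg.svd(delta_w, compute_uv=False)
    p = s / s.sum()                      # normalized spectrum
    h = -(p * np.log(p + 1e-12)).sum()   # Shannon entropy of the spectrum
    return max(1, int(round(coverage * np.exp(h))))

def low_rank_component(delta_w: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-k approximation of delta_w (Eckart-Young theorem)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

# Toy usage: remove the low-rank 'overthinking' component from a layer delta.
rng = np.random.default_rng(0)
delta = rng.standard_normal((64, 64))
w_fast = delta - low_rank_component(delta, entropy_rank(delta))
```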
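SAC itself is straightforward to express in code. Below is a minimal sketch under the assumption that an arbitrary text splitter and an LLM summarizer are available; `Chunk`, `chunker`, and `summarize` are illustrative names, not the paper’s API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str          # original chunk, returned to the generator
    index_text: str    # summary-prefixed text, embedded for retrieval

def summary_augmented_chunks(doc_id, document, chunker, summarize):
    """Summary-Augmented Chunking, sketched: every chunk is indexed
    together with a short synthetic summary of its parent document,
    so retrieval keeps document-level context and document-level
    retrieval mismatch is reduced."""
    summary = summarize(document)  # one synthetic summary per document
    return [
        Chunk(doc_id, text, f"[Document summary] {summary}\n{text}")
        for text in chunker(document)
    ]
```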
Technical Trends
The papers in this collection adopt a range of technical approaches and methodologies to address the evaluation and benchmarking of AI models:
- Benchmark Development: Several papers introduce new benchmarks to measure specific aspects of AI model performance, such as steerability towards values and preferences, screenplay generation quality, and knowledge materialization.
- Prompting Strategies: Some papers emphasize the role of tailored prompting techniques to guide model behavior, such as EVALUESTEER’s use of value-loading questions and CML-Bench’s CML-Instruction strategy.
- Adaptive Reasoning Techniques: Papers like MixReasoning and Gold-Switch propose methods to dynamically adjust reasoning depth or switch between fast and slow thinking modes, optimizing both performance and efficiency.
- Retrieval Methods: The SAC approach highlights advances in retrieval mechanisms that improve the reliability and accuracy of generated content in specialized domains such as legal and contractual documents.
- Knowledge Extraction: Studies such as Foundations of LLM Knowledge Materialization focus on extracting and structuring implicit knowledge from LLMs, aiming to enhance transparency and reliability.
- Cross-Linguistic Analysis: The UniRST paper showcases efforts to develop unified discourse parsers capable of handling multiple languages and genres, bridging the gap between linguistic diversity and model generalization.
Datasets and Evaluation Metrics
- EVALUESTEER: Utilizes a synthetic dataset based on PRISM corpus prompts and WVS questions to evaluate steerability towards values and stylistic preferences.
- CML-Bench: Employs a specialized dataset derived from 100 classic, high-rated movie scripts, using nine interpretable metrics to assess screenplay qualities.
- ASPO: Uses established mathematical reasoning and coding benchmarks such as AIME24, AIME25, AMC23, MATH-500, Minerva Math, OlympiadBench, and LiveCodeBench versions 5 and 6 to evaluate training stability and performance.
- MixReasoning: Tests on reasoning benchmarks including GSM8K, MATH-500, and AIME24, focusing on accuracy-efficiency trade-offs.
- Towards Reliable Retrieval in RAG Systems for Large Legal Datasets: Evaluates on LegalBench-RAG, a benchmark that isolates and measures the retrieval component of RAG systems for legal documents.
- Foundations of LLM Knowledge Materialization: Utilizes miniGPTKBs for three topics (Ancient Babylon, The Big Bang Theory, and DAX 40) to evaluate reproducibility and robustness.
- Gold-Switch: Conducts experiments on datasets such as ASDIV, GSM8K, AIME, and GPQA to measure overthinking reduction and performance retention.
- OpenStaxQA: Derives from 43 open-source college textbooks in English, Spanish, and Polish, covering STEM fields, and evaluates on the AI2 Reasoning Challenge (AI2RC) dev dataset.
- Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser: Employs 18 treebanks across 11 languages, testing parsing accuracy and efficiency using Full F1 scores.
Topic 8: Human Interaction with AI
Topic Overview
The topic of human interaction with AI is critical in the development of more intuitive, safe, and effective artificial intelligence systems. As AI, particularly large language models (LLMs), becomes more integrated into daily life and professional settings, understanding and optimizing these interactions is essential for enhancing user experience, ensuring safety, and aligning AI outputs with human values and preferences. Research in this area seeks to address the challenges of designing AI systems that can adapt to the varied needs and conversational styles of users, while also ensuring that these systems behave ethically and reliably across diverse applications.
Individual Paper Contributions
- Renee Shelby from Google Research and colleagues studied the inadequacy of current taxonomies in capturing the nuanced nature of human-AI interactions, proposing the Taxonomy of User Needs and Actions (TUNA) to address this problem. TUNA’s main innovation is its integration of both instrumental goals and conversational strategies, covering 57 request types mapped onto 14 distinct strategies and six high-level interaction modes. The value lies in a more systematic approach to identifying user confusion and misuse patterns, informing the design of safer and more responsive AI systems. Experiments on the WildChat and ShareGPT dialogue corpora, together with qualitative analysis of 1193 public conversation logs, confirmed that the taxonomy classifies user turns effectively without requiring new categories, concluding that TUNA offers a robust classification scheme for user actions in AI interactions.[^48]
- Prateek Humane from Mila, Québec AI Institute, and colleagues tackled the challenge of defining quality in chain-of-thought (CoT) data used to fine-tune LLMs for reasoning, particularly in mathematics. They introduced a method that uses influence functions (IFs) to measure the causal effect of individual CoT examples on downstream accuracy, proposing influence-based pruning as a technique for selecting high-quality fine-tuning data; a first-order sketch of such scoring appears after this list. The main innovation is the direct measurement of the impact of training examples on model performance. The value lies in more efficient fine-tuning and improved model performance through a focus on impactful data. Evaluations on the LIMO dataset against baselines such as Random, Mid-PPL, and RDS+ on math reasoning benchmarks (AIME24, AMC23, OlympiadBench, and GSM8k) showed that the ‘Combined’ pruning strategy achieved the highest performance on GSM8k and OlympiadBench, while the ‘Correct’ strategy performed well on AMC23, indicating that influence functions can guide effective data selection for fine-tuning LLMs on reasoning tasks.[^49]
- Matthieu Bou from Imperial College London and colleagues focused on the opacity of LLM training objectives, a significant barrier to diagnosing and mitigating issues such as reward hacking and preference inconsistencies. They proposed The Alignment Auditor, a Bayesian Inverse Reinforcement Learning (IRL) framework that recovers distributions over training objectives and provides a structured verification process. The main innovation is the use of variational inference to approximate the posterior distribution over reward functions, offering a more nuanced understanding of LLM objectives. The value lies in a more principled and transparent approach to aligning LLMs with human values. Experiments on the AllenAI RealToxicityPrompts dataset demonstrated that the framework recovers an uncertainty-aware reward signal and effectively quantifies and reduces ambiguity in the reward function, supporting its use for diagnosing and mitigating alignment issues and building safer, more reliable AI systems.[^50]
- Shangjian Yin from University of California, Riverside and colleagues addressed the high cost and inefficiency of traditional reinforcement learning methods for aligning LLMs with human preferences. They proposed Self-Alignment Optimization (SAO), a fully self-synthetic method in which the LLM generates its own prompts and responses and then ranks those responses through self-judgment; one such self-synthesis round is sketched after this list. The main innovation is the elimination of external data collection and annotation in favor of synthetic data generated by the model itself. The value lies in making LLM alignment more accessible and scalable. Experiments on benchmarks such as AlpacaEval 2.0, MT-Bench, and Arena-Hard showed that Gemma-2-9B-it tuned with SAO outperformed vanilla models and models trained on externally labeled datasets, making SAO a promising approach to cost-effective LLM alignment that does not compromise performance on downstream NLP tasks.[^51]
- Haofei Yu from University of Illinois Urbana-Champaign and colleagues developed TinyScientist, an interactive, extensible, and controllable framework for building research agents, aiming to tame the complexity of extending and maintaining automatic research workflows that integrate LLMs and multi-agent systems. The main innovations are a modular, tabular-based interface for interactivity, the Model Context Protocol (MCP) for tool integration, and built-in safety and budget controllers for controllability. The value lies in making advanced research tools more accessible and adaptable to new technologies while maintaining safety and ethical standards. Evaluations against the Agent Laboratory showed that TinyScientist significantly improved the quality of research outputs, especially in the biological domain, providing a user-friendly and effective platform for generating research papers.[^52]
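To make the influence-based pruning idea concrete, here is a first-order sketch: each candidate CoT example is scored by the alignment between its loss gradient and a precomputed validation-loss gradient. Full influence functions also involve an inverse-Hessian term, omitted here; all names are illustrative.

```python
import torch

def influence_scores(model, loss_fn, candidates, val_grad):
    """Score each candidate example by the dot product between its loss
    gradient and a precomputed, flattened validation-loss gradient.
    Higher scores suggest a gradient step on the example should reduce
    validation loss. First-order approximation only.
    """
    scores = []
    for x, y in candidates:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        scores.append(torch.dot(g, val_grad).item())
    return scores  # prune by keeping the top-scoring examples
```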
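One round of SAO’s self-synthesis loop can be sketched as follows, assuming a generic text-in/text-out `llm` callable; the prompt wording and verdict parsing are placeholders, and the resulting preference pairs would feed a preference optimizer such as DPO.

```python
def self_alignment_round(llm, judge_template, n_prompts=4):
    """Fully self-synthetic preference data, sketched: the model writes
    its own prompts, answers each twice, then ranks its answers by
    self-judgment. No external data or annotation is used."""
    pairs = []
    for _ in range(n_prompts):
        prompt = llm("Write one diverse, realistic user instruction.")
        a, b = llm(prompt), llm(prompt)  # two candidate responses
        verdict = llm(judge_template.format(prompt=prompt, a=a, b=b))
        # Crude parse: assume the judge answers 'A' or 'B'.
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```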
Technical Trends
The papers collectively reflect a trend towards more sophisticated and nuanced approaches in understanding and managing human-AI interactions. They emphasize the importance of transparency, efficiency, and adaptability in AI systems. Methodologically, there is a shift towards leveraging advanced statistical and machine learning techniques such as Bayesian IRL and influence functions to better understand and control the behavior of LLMs. Additionally, there is a focus on developing frameworks that can seamlessly integrate with existing tools and technologies, promoting a more modular and interactive design philosophy.
Datasets and Evaluation
- Taxonomy of User Needs and Actions: Utilized WildChat and ShareGPT dialogue corpora, analyzed 1193 public human-AI conversation logs.
- Influence Functions for Efficient Data Selection in Reasoning: Evaluated on the LIMO dataset, with benchmark tests on AIME24, AMC23, OlympiadBench, and GSM8k.
- The Alignment Auditor: Used the AllenAI RealToxicityPrompts dataset for studying detoxification capabilities.
- Aligning Large Language Models via Fully Self-Synthetic Data: Tested on AlpacaEval 2.0, MT-Bench, and Arena-Hard benchmarks.
- TinyScientist: Compared against the Agent Laboratory, evaluated on tasks in biological and machine learning domains.
Topic 9: Content Generation and Moderation
Topic Overview
The research topic of content generation and moderation is critical in today’s digital landscape, where large language models (LLMs) are increasingly used to create and filter textual content. This topic encompasses a wide range of applications, from generating role-playing dialogues and educational materials to moderating online discussions and analyzing lyrical content for inappropriate material. The advancements in LLMs have brought about new challenges, such as the need for more versatile benchmarks, the issue of ‘hallucination’, and the requirement for more efficient and reliable reinforcement learning datasets. Addressing these challenges is vital for ensuring that LLMs can be deployed safely and effectively across various domains, enhancing user experiences and promoting responsible AI use.
Individual Paper Contributions
- Haotian Wu from The Hong Kong University of Science and Technology (Guangzhou) and colleagues studied the limitations of existing benchmarks for evaluating role-playing conversational agents (RPCAs), proposing FURINA-Builder and FURINA-Benchmark to address their inadequate adaptability and narrow scope. The main innovations are the integration of both established and synthesized characters in group-chat scenarios, detailed evaluation criteria, and a judge model that selects the most appropriate evaluation dimension. The value lies in testing RPCAs across diverse user-specified character personas, varying dialogue structures, and evolving evaluation dimensions, improving both their RP abilities and their reliability. Experiments on FURINA-Benchmark revealed that o3 and DeepSeek-R1 achieved the strongest overall performance on English and Chinese RP tasks, respectively, and that established characters outperformed synthesized ones across all models. The study also highlighted a trade-off between RP performance and reliability, suggesting that future models must balance the two.[^53]
- Zhepeng Cen from Salesforce AI Research and Carnegie Mellon University and colleagues tackled the scarcity and limited diversity of reinforcement learning (RL) datasets for training LLMs, proposing Webscale-RL to address the problem. The main innovation is an automated pipeline that converts large-scale pretraining corpora into diverse, verifiable RL datasets while preserving the scale and diversity of web data; the core conversion step is sketched after this list. The value lies in enhancing the general reasoning capabilities of LLMs and making them more adaptable to unseen situations. Experiments on benchmarks such as MMLU-pro, BigBench, and GPQA-D showed that models trained on Webscale-RL data outperformed continual pretraining and advanced data-refinement baselines, and on math reasoning tasks such as MATH500 reached comparable performance with 100 times fewer tokens.[^54]
- R. Alexander Knipper from Bridge-AI Lab@UCF and colleagues focused on the misalignment between third-party instructional materials for virtual labs and teachers’ educational goals, proposing an instructional goal-aligned framework for question generation. The main innovations are understanding instructional goals via teacher–LLM dialogue, constructing a structured representation of lab knowledge units, and employing a question taxonomy together with the TELeR taxonomy to control prompt detail. The value lies in leveraging LLMs to create high-quality, pedagogically aligned questions for educational purposes. Evaluations of over 1,100 questions across 19 LLMs showed that larger models produce higher-quality questions, with open-ended and relationally grounded questions fostering higher-order thinking.[^55]
- Ranjan Mishra from University of Amsterdam and colleagues aimed to address the lack of interpretability in recommender systems, particularly those based on collaborative filtering (CF). They reproduced and extended the XRec framework, using Llama 3 as an alternative to GPT-3.5-turbo for evaluation, and explored the impact of MoE modules and GNN embeddings on generated explanations. The value lies in enhancing user trust and providing meaningful explanations for user-item interactions, which is crucial for transparency in AI applications. Experiments showed that while XRec generates unique explanations and injecting collaborative information improves explainability and stability, it does not always outperform baseline models across all metrics and datasets.[^56]
- Rohitash Chandra and colleagues addressed the lack of longitudinal studies of abusive content trends in popular music, specifically the Billboard Music Charts, proposing a deep learning and LLM-based framework for analyzing lyrical content over time. The main innovations are the use of BERT and RoBERTa models for sentiment analysis and abuse detection, together with the SenWave and RAL-E datasets. The value lies in a robust, adaptable approach to detecting inappropriate content in evolving linguistic environments. The analysis indicated a significant increase in explicit content in popular music since 1990, peaking at 65% in the late 2010s, and highlighted shifting thematic focus in lyrics across the decades.[^57]
- Aisha Alansari and Hamzah Luqman provided a comprehensive survey of hallucination in LLMs, detailing its causes, detection, and mitigation strategies. The main innovations are taxonomies for detection and mitigation techniques and a deep analysis of hallucination causes throughout the LLM development lifecycle. The value lies in a thorough review of the current state of hallucination research, which is crucial for developing more reliable LLMs in sensitive applications. While no new experimental results are reported, the survey emphasizes the challenges posed by multilingual and low-resource contexts.[^58]
- Mattia Samory and colleagues tackled the inconsistency and complexity of predicting rule infractions in online content moderation, proposing ModQ, a novel QA framework with ModQ-Extract and ModQ-Select model variants. The main innovation is the design of lightweight, interpretable models that can handle an open set of community-specific rules. The value lies in improving the governance and automation of content moderation, ensuring transparent and consistent practice. Experiments on Reddit and Lemmy datasets showed that ModQ models outperform state-of-the-art baselines in identifying moderation-relevant rule violations, with ModQ-Select achieving higher F1 scores in most categories.[^59]
- Yining Wang and colleagues addressed length bias in RLHF for improving LLM reasoning, proposing $\lambda$-GRPO to unify previous GRPO frameworks with learnable token preferences. The main innovation is adaptive reweighting of token contributions according to response length, in contrast to the fixed heuristics of prior methods; a simplified aggregation sketch follows this list. The value lies in a more adaptable and efficient optimization process, improving training stability and performance across reasoning tasks. Empirical evaluations of Qwen2.5 models at different scales on eight reasoning benchmarks showed that $\lambda$-GRPO consistently outperforms GRPO and DAPO, achieving higher accuracy and more diverse outputs without inflating response lengths.[^60]
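The Webscale-RL pipeline is described above only at the level of its goal; a minimal sketch of the core conversion step might look like the following, where `llm` and `verifier` are placeholder callables and the real pipeline adds diversity and difficulty filtering.

```python
def passage_to_rl_example(passage, llm, verifier):
    """Convert a pretraining passage into a verifiable QA pair, sketched.
    Keeping only pairs that pass an independent check preserves a
    verifiable reward signal for RL training."""
    question = llm(f"Write one question answerable from this passage:\n{passage}")
    answer = llm(f"Answer concisely, using only the passage:\n{passage}\n\nQ: {question}")
    if verifier(passage, question, answer):
        return {"question": question, "answer": answer}
    return None  # discard unverifiable pairs
```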
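The unification behind $\lambda$-GRPO can be illustrated with a simplified aggregation step. The sketch below is one plausible parameterization, assuming per-response token log-probs and group-relative advantages; the clipping and KL terms of the full objective are omitted, and the exact form in the paper may differ.

```python
import torch

def lambda_grpo_loss(token_logps, advantages, lam):
    """Length-aware policy-gradient aggregation, sketched: each response's
    token log-probs are summed and normalized by |y|**lam, where lam is
    a learnable scalar. lam = 1 recovers a per-response token mean
    (GRPO-style); lam = 0 leaves raw sums, akin to global token
    averaging up to a constant (DAPO-style).

    token_logps: list of 1-D tensors (one per sampled response)
    advantages:  1-D tensor of group-relative advantages
    """
    per_response = torch.stack(
        [lp.sum() / (lp.numel() ** lam) for lp in token_logps]
    )
    return -(advantages.detach() * per_response).mean()
```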
Technical Trends
The papers collectively reflect a trend towards more sophisticated and adaptive methodologies for content generation and moderation. Innovations such as FURINA-Builder and Webscale-RL emphasize the need for scalable and diverse datasets that can accommodate evolving user needs and interaction paradigms. The instructional goal-aligned framework for question generation and ModQ showcase the importance of aligning AI-generated content with human-defined goals and community-specific rules. The survey on hallucination in LLMs highlights the necessity for robust detection and mitigation strategies, while $\lambda$-GRPO underscores the value of context-aware and flexible optimization schemes.
Datasets and Evaluation
- FURINA-Benchmark: A role-playing benchmark for evaluating conversational agents.
- Webscale-RL Dataset: Contains 1.2 million QA pairs across nine domains, used for scaling up RL training for LLMs.
- Amazon-books, Yelp, Google-reviews Datasets: Used for evaluating the XRec framework in recommendation systems.
- SenWave and RAL-E Datasets: Utilized for sentiment analysis and hate speech detection in lyrical content analysis.
- Reddit and Lemmy Datasets: Employed for training and evaluating ModQ in online content moderation.
- Qwen2.5 Models: Various scales (1.5B, 3B, and 7B) were used to evaluate $\lambda$-GRPO in reinforcement learning scenarios.
Evaluation metrics included LlamaScore, BERTScore precision (BERTP) and F1 (BERTF1), BLEURT, F1 scores, and Unique Sentence Ratio (USR) across different tasks and domains, reflecting the varied nature of content generation and moderation challenges.
Topic 10: AI Development Techniques and Methods
Topic Overview
The research topic of AI development techniques and methods encompasses a wide range of advancements aimed at improving the efficiency, reliability, and adaptability of AI models across various tasks. This includes innovations in training paradigms, model architectures, and integration with external tools to enhance AI’s ability to reason and perform complex operations. Specifically, the papers discussed here focus on advancing reasoning capabilities in large language models (LLMs) and enhancing automatic speech recognition (ASR) and text-to-speech (TTS) systems, which are pivotal for applications ranging from numerical analysis and logical reasoning to voice-based interfaces in everyday technology.
Individual Paper Contributions
- Jiaru Zou from UIUC and colleagues studied the challenge of providing reliable step-level supervision for Large Reasoning Models (LRMs) on tabular reasoning tasks. They proposed TaTToo, a Table Thinking Process Reward Model (PRM) that integrates tool capabilities for precise supervision. The main innovations are a dual-stage training approach combining supervised fine-tuning and reinforcement learning with a custom reward-shaping scheme, together with a large-scale data curation pipeline that generates high-quality supervision instances. The value lies in addressing the limitations of existing PRMs in verifying table retrieval and schema interaction, thereby improving the performance of downstream policy models. Experiments on five tabular reasoning benchmarks showed that TaTToo improved performance by 30.9% on average and achieved up to 9x parameter efficiency compared with baselines such as Qwen-2.5-Math-PRM-72B and GenPRM-32B; the dual-stage training paradigm yielded a 10.2% improvement over standard PRM training, with consistent gains across test-time scaling strategies.[^61]
- Lei Xu from Idiap Research Institute and colleagues addressed the limitation of static neuro-symbolic methods in NLP, which restrict the flexibility of AI systems in employing diverse reasoning strategies. They introduced an adaptive framework that dynamically composes logical solvers according to the type of reasoning a task requires, using large language models (LLMs) for problem decomposition and routing to select appropriate formal solvers; the routing step is sketched after this list. The main innovation lies in the dynamic selection and application of solvers, enabling more versatile and robust handling of reasoning tasks. The value of the method is its scalability and adaptability, overcoming the constraints of static solver integration. Experiments on five diverse benchmarks (ProntoQA, ProofWriter, FOLIO, LogDed7, and TRECtrials) and a newly designed multi-question stress test showed that the adaptive framework reached 92.1% accuracy on the Mixed dataset, significantly outperforming competing baselines; the results also suggest that supervised fine-tuning can lift the performance of smaller models in adaptive neuro-symbolic reasoning.[^62]
- Mingxuan Wang and colleagues focused on enhancing ASR and TTS systems with TokenChain, a fully discrete speech chain that uses semantic tokens instead of continuous intermediates such as mel-spectrograms or waveforms, simulating the human perception-production loop through closed-loop training. The key innovation couples semantic-token ASR with a two-stage TTS process and uses semantic distillation to guide acoustic tokenization towards more meaningful linguistic representations. The value of TokenChain is its potential to improve the accuracy and efficiency of ASR and TTS models for practical applications such as voice assistants and speech-to-text transcription. Experiments on LibriSpeech and TED-LIUM showed that TokenChain converged 2-6 epochs faster and achieved 5-13% lower error rates than baselines at equal epochs; under domain adaptation it reduced ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting of the source domain. The paper concluded that the choice of discrete interface, specifically ST-argmax versus ST-Gumbel with appropriate temperature settings, significantly influences the effectiveness of ASR and TTS systems.[^63]
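The routing step of the adaptive framework can be sketched in a few lines, with `llm` as a generic generation call and the solver registry purely hypothetical; the published system also decomposes problems into sub-problems before routing.

```python
def adaptive_reasoning(problem, llm, solvers):
    """Dynamic solver composition, sketched: an LLM labels the kind of
    reasoning a problem needs, formalizes it, and the matching formal
    solver is invoked; unrouted problems fall back to the LLM itself."""
    kinds = "/".join(solvers)
    kind = llm(f"Label the reasoning type ({kinds}) of: {problem}").strip()
    if kind not in solvers:
        return llm(problem)  # fallback: answer directly
    formal = llm(f"Formalize this problem for a {kind} solver: {problem}")
    return solvers[kind](formal)

# Hypothetical registry; names and callables are illustrative only.
# solvers = {"deductive": run_prover, "constraint": run_smt_solver}
```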
Technical Trends
The papers in this collection showcase evolving trends in AI development techniques, particularly emphasizing the integration of specialized tools and methodologies to enhance model performance in specific domains. TaTToo exemplifies the trend towards tool-grounded reasoning in large models, while the adaptive framework by Lei Xu and colleagues highlights the move towards dynamic and multi-paradigm neuro-symbolic integration. TokenChain represents a shift towards discrete-token-based modeling in speech processing, aiming to simulate human cognitive processes more closely. These trends indicate a growing emphasis on hybrid approaches that combine symbolic reasoning with deep learning, as well as the importance of custom training strategies and data curation to refine AI models.
Datasets and Evaluation
The datasets utilized across the papers vary with the research domain. For tabular reasoning, the five benchmarks used by TaTToo (not individually named in the summary) span numerical analysis, fact-checking, and question answering. For neuro-symbolic reasoning, ProntoQA, ProofWriter, FOLIO, LogDed7, and TRECtrials were used to evaluate the adaptive framework across reasoning types. For speech processing, LibriSpeech and TED-LIUM were used to assess TokenChain on ASR and TTS tasks. Evaluation metrics included accuracy for the neuro-symbolic reasoning tasks, and Character Error Rate (CER) and Word Error Rate (WER) for ASR and TTS performance.
Topic 11: misc
Topic Overview
The research papers collected under the “misc” topic focus on advancing the capabilities of large language models (LLMs) and multimodal large language models (MLLMs) through innovative methodologies and structured frameworks. They address challenges such as interpretability, compositional reasoning, syntactic competence, and robustness to linguistic variations, as well as practical deployment considerations in specific domains like legal and industrial applications. The importance of this research lies in improving the reliability, accuracy, and efficiency of AI systems in handling complex tasks and ensuring that they align with human expectations and standards, which is crucial for their broader adoption and trustworthiness in real-world applications.
Individual Paper Contributions
- Angie Boggust from MIT CSAIL and colleagues studied precision and consistency problems in the automated interpretability of LLM features via natural language descriptions. They proposed semantic regexes, a structured language for describing feature activations that captures diverse activation patterns while retaining the conciseness and consistency of structured notation. The value lies in facilitating the detection of redundant features and in helping users build mental models of LLM features. Experiments on features from GPT-2-RES-25k, Gemma-2-2B-RES-16k, and Gemma-2-2B-RES-65k showed that semantic regexes matched the accuracy of natural language descriptions and outperformed baselines on nine of twelve tested features, indicating that structured language can improve interpretability.[^64]
- Qihua Dong from Adobe Research and colleagues addressed the compositional reasoning deficiency of MLLMs on complex referring expressions in vision-and-language tasks. They introduced CoT Referring (CoTR), a reasoning mechanism inspired by Chain-of-Thought prompting, and RefLM, a new MLLM architecture, to improve localization for complex referring expressions. The main innovations are the structured reasoning approach and a new evaluation benchmark for composite referring expressions. The value lies in enhancing the ability of AI systems to interpret and interact with multimodal inputs. Experiments on the Composite Referring Benchmark demonstrated significant improvements in IoU@Box and gIoU@Mask over baselines such as GLaMM-7B and OMG-LLAVA-7B, confirming that CoTR and RefLM improve localization accuracy.[^65]
- Kaichun Yang from University of Illinois Urbana-Champaign and colleagues examined how different prompting strategies and LLMs perform at interpreting charts and visualizations. They found that GPT-5 outperforms GPT-4o and GPT-4V across all datasets, and that detailed chart descriptions generated by GPT-5 do not significantly enhance performance, sometimes even decreasing accuracy on complex datasets. The main innovation is a rigorous statistical analysis using generalized estimating equations (GEE) and linear mixed models (LMM) to assess the impact of model type and prompting conditions. The value lies in showing that enhanced prompts matter far less than advances in model capability: model type is a significant factor in performance, while prompting conditions have only a minor impact.[^66]
- Zhi Zhang from The Hong Kong Polytechnic University and colleagues aimed to automate the end-to-end scientific research process through a Double-Loop Multi-Agent (DLMA) framework in which leader and follower loops evolve and execute research plans. The main innovations are the use of double-loop learning in multi-agent systems and the departure from previous single-loop approaches. The value lies in improving the efficiency and quality of scientific discovery. Experiments on the ACLAward and Laboratory datasets showed that DLMA generates research papers rated highly for soundness, excitement, and overall quality, outperforming other models and multi-agent systems.[^67]
- Chengwei Wu from Beijing Academy of Artificial Intelligence and colleagues tackled the lack of a benchmark for comprehensive evaluation of Chinese LLMs. They proposed the Chinese Data-Text Pair (CDTP) dataset and the Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) to assess Chinese LLMs on knowledge-intensive tasks. The main innovations are the structured representation of Chinese corpora and coverage of four critical domains. The value lies in a detailed evaluation framework for Chinese NLP tasks. Experiments revealed that larger models perform better on tasks such as T2T and KGC, and that Supervised Fine-Tuning (SFT) improves performance across all tasks, with the QA task showing the most substantial gains.[^68]
- Leonardo Bertolazzi from University of Trento and colleagues investigated the systematic bias whereby plausibility influences LLMs’ logical validity judgments. They introduced steering vectors, along with the metrics content effect (CE) and steering power (SP), to mitigate these biases; a difference-of-means sketch of the construction appears after this list. The main innovations are a detailed representational analysis and a training-free mitigation method. The value lies in enhancing the reliability and precision of AI systems on logical reasoning tasks. Experiments showed that chain-of-thought (CoT) prompting greatly reduces content effects, and that task-difference steering vectors improve reasoning accuracy and nearly eliminate content effects in Qwen3-14B.[^69]
- Jiqun Pan from [institution] and colleagues focused on enhancing safety and reliability in industrial question-answering systems through a Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD) framework. The main innovations are the use of a knowledge graph to enrich state representation and the integration of structured knowledge into the distillation process. The value lies in improving the reasoning capability and output reliability of industrial QA systems. Experiments demonstrated that KG-MASD outperforms other MAS-assisted distillation methods and single-LLM baselines on BLEU-4, ROUGE-1, ROUGE-2, and ROUGE-L, confirming the effectiveness of integrating KG-guided priors.[^70]
- Timothy Pistotti from University of Auckland and colleagues explored the impact of stimulus quality on LLM performance in syntactic assessments, specifically parasitic gaps (PGs). They proposed a new dataset and a direct minimal-pair analysis method to evaluate LLMs’ syntactic competence more accurately. The main innovations are the refined dataset and the ‘wh-effect’ analysis method. The value lies in a clearer picture of LLMs’ syntactic capabilities. Experiments showed substantial improvements in GPT-2’s accuracy on PGs when evaluated on refined stimuli, suggesting that unintended complexity in stimuli can mask underlying model sensitivity.[^71]
- Qin Dong from [institution] and colleagues addressed the representational bottleneck in Low-Rank Adaptation (LoRA) for fine-tuning LLMs. They proposed Multi-$A$ Shared Adaptation (MASA), a novel PEFT architecture that uses an ensemble of down-projection matrices ($A$) to capture diverse features; the structure is sketched in code after this list. The main innovations are the ‘multi-$A$, single-$B$’ structure and the Asymmetric Cross-layer Sharing (ACS) strategy. The value lies in improving the efficiency and effectiveness of parameter-efficient fine-tuning. Experiments on the MMLU, GSM8k, Fingpt-fineval, and BBH benchmarks showed that MASA outperforms LoRA and its variants, achieving higher accuracy with similar parameter overhead.[^72]
- Timothy Pistotti from University of Auckland and colleagues further investigated the syntactic competence of LLMs using the refined PG dataset and the ‘wh-effect’ analysis method. The main innovations are the direct minimal-pair analysis and the refined dataset generated for PGs. The value lies in a more transparent and interpretable evaluation of syntactic knowledge. Experiments revealed that GPT-2 demonstrates robust knowledge of filler-gap licensing principles, with significant improvements over previous evaluation metrics, indicating the effectiveness of the ‘wh-effect’ method for diagnosing syntactic competence.[^73]
- Hudson de Martim from the Federal Senate of Brazil introduced the SAT-Graph API for deterministic querying of legal norms within a structured knowledge graph. The main innovations are a formal query execution layer and canonical actions for high-precision hybrid search and causal tracing. The value lies in accurate and explainable legal information retrieval, critical for high-stakes decision-making. While no specific experimental results are reported, the paper underlines the need for precise and auditable legal information retrieval.[^74]
- Pontakorn Trakuekul from Jasmine Technology Solution and colleagues developed OpenJAI-v1.0, an open-source LLM for Thai and English focused on instruction following, long-context understanding, and tool use. The main innovations are the data curation methodologies and the adaptation of existing benchmarks to Thai. The value lies in enhancing the robustness and practical utility of LLMs for underrepresented languages. Experiments showed that OpenJAI-v1.0 outperforms other leading open-source Thai models across a range of benchmarks, indicating its versatility and practical usefulness.[^75]
- Chenpeng Wang from [institution] proposed Adaptive Tool Generation with Models as Tools and Reinforcement Learning (MTR), a simulation-first training framework for the scalable and reliable training of tool-augmented language models. The main innovations are the simulation-first training approach and a multi-agent architecture with dedicated roles for tool interaction. The value lies in more efficient and stable training environments for complex reasoning tasks. Experiments on HotpotQA, MuSiQue, 2WikiMultiHopQA, and Bamboogle demonstrated performance competitive with live-API systems, especially on reasoning-intensive tasks, confirming the necessity of the two-stage training approach.[^76]
- Cheonkam Jeong from Savassan and colleagues explored encoding semantic understanding using Montague Grammar as a robust type system for LLMs. The main innovation is the shift towards type-theoretic semantics to improve alignment with ethical and legal standards. The value lies in enhancing the reliability and transparency of AI systems that handle sensitive information. The paper proposes a legal knowledge graph as a navigable semantic space, though no specific experimental results are provided.[^77]
- Seng Pei Liew from SB Intuitions and colleagues investigated how effectively bootstrapped pretraining reduces the computational cost and inefficiency of training LLMs from scratch. The main innovation is a detailed scaling-law model with an interaction term between the token counts of the first and second pretraining stages; a generic illustration of such a term follows this list. The value lies in optimizing compute usage and giving practical guidance on when bootstrapping pays off. Experiments revealed a saturation effect in which the scaling exponent decreases as base models are pretrained for longer.[^78]
- Zihao Li from University of Helsinki and colleagues studied the impact of test-time scaling (TTS) on reasoning models (RMs) for machine translation (MT). The main innovations are a method for regulating test-time reasoning and a comprehensive set of MT benchmarks. The value lies in understanding how TTS can improve translation quality in various settings, including post-editing. Experiments concluded that TTS offers limited benefit for direct MT with general-purpose RMs but significantly improves performance in post-editing contexts and for domain-specific fine-tuned models.[^79]
- Neeraja Kirtane from Got It Education and colleagues introduced MathRobust-LV, a methodology for assessing LLMs’ robustness to linguistic variations in mathematical reasoning tasks. The main innovations are controlled variation generation and a focus on high-school-level problems. The value lies in exposing the brittleness of LLMs under benign rephrasings, suggesting a reliance on surface-level features rather than deep understanding. Experiments showed that parameter scaling improves both absolute accuracy and robustness, but the gains are not linear, and even larger models suffer significant performance drops under variation.[^80]
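For the content-effect study, the steering-vector construction can be sketched with a standard difference-of-means recipe; the layer choice, scaling, and hook placement below are assumptions rather than the paper’s exact settings.

```python
import torch

def steering_vector(acts_validity, acts_plausibility):
    """Difference-of-means steering vector, sketched: mean hidden state
    under a 'judge logical validity' framing minus the mean under a
    'judge plausibility' framing, taken at one chosen layer."""
    return acts_validity.mean(dim=0) - acts_plausibility.mean(dim=0)

def add_steering_hook(block, vec, strength=4.0):
    """Shift a transformer block's hidden states along the steering
    vector at inference time (training-free mitigation)."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vec
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)
```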
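The ‘multi-$A$, single-$B$’ structure of MASA maps directly onto a small PyTorch module. The sketch below follows the structural description above; the initialization, the number of $A$ matrices, and the averaging rule are assumptions, and the Asymmetric Cross-layer Sharing of $A$ ensembles is not shown.

```python
import torch
import torch.nn as nn

class MASALinear(nn.Module):
    """'Multi-A, single-B' adapter, sketched: an ensemble of down-
    projections A_1..A_m captures diverse low-rank features, and a
    single shared up-projection B maps their average back to the
    output space, added to the frozen base layer."""
    def __init__(self, base: nn.Linear, rank: int = 8, num_a: int = 4,
                 alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.As = nn.ModuleList(
            nn.Linear(base.in_features, rank, bias=False)
            for _ in range(num_a)
        )
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        delta = sum(a(x) for a in self.As) / len(self.As)
        return self.base(x) + self.scale * self.B(delta)
```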
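The interaction term in the bootstrapped-pretraining study can be illustrated, though not reproduced, with a generic two-stage form; the equation below shows what ‘interaction term’ means here and is not the paper’s fitted law:

$$\log L(D_1, D_2) = a + b \log D_1 + c \log D_2 + d\,\log D_1 \log D_2$$

with $D_1$ and $D_2$ the token counts of the first and second pretraining stages. Since $c < 0$ when the second stage helps, a positive $d$ shrinks the effective second-stage exponent $c + d \log D_1$ towards zero as $D_1$ grows, which is one way to express the reported saturation effect.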
Technical Trends
The papers collectively showcase a trend towards developing structured and controlled methodologies to enhance the interpretability, robustness, and efficiency of LLMs and MLLMs. Innovations include structured languages for interpretability, compositional reasoning mechanisms, refined evaluation datasets, and novel fine-tuning architectures. There is a common emphasis on leveraging structured knowledge, whether through knowledge graphs, syntactically informed templates, or type-theoretic frameworks, to improve model performance and reliability. Additionally, several papers focus on optimizing the training process through simulation-first approaches and understanding the saturation effects in bootstrapped pretraining.
Datasets and Evaluation Metrics
- GPT-2-RES-25k, Gemma-2-2B-RES-16k, Gemma-2-2B-RES-65k: Used for evaluating feature descriptions and interpretability.
- Composite Referring Benchmark, refCOCO, refCOCO+, refCOCOg: Used for evaluating the localization accuracy of MLLMs.
- CHART-6, GGR, VLAT, HOLF: Used for assessing chart reading and visualization understanding.
- ACLAward, Laboratory: Used for evaluating the effectiveness of research automation frameworks.
- CDTP dataset: Used for evaluating the performance of Chinese LLMs on knowledge-intensive tasks.
- MMLU, GSM8k, Fingpt-fineval, BBH: Used for testing the performance and efficiency of fine-tuning methods.
- IFBench-TH, IFBench-EN, MT-Bench-TH, MT-Bench-EN, LongBench-v2, BFCL-v3-TH, BFCL-v3-EN, MMLU-ProX-lite-EN, MMLU-ProX-lite-TH: Used for evaluating the performance of OpenJAI-v1.0 on Thai and English applications.
- HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle: Used for assessing the performance of MTR in multi-hop question answering.
- Slimpajama-DC, Stack/StarCoder, OpenWebMath: Used for understanding the scaling behavior of bootstrapped LLM pretraining.
- MATH dataset, AoPS: Used for assessing the robustness of LLMs to linguistic variations in mathematical reasoning tasks.
- BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, IoU@Box, gIoU@Mask, exact match score, content effect (CE), steering power (SP): Commonly used metrics for evaluating model performance, reliability, and robustness across different tasks.
References
- PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
- AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning
- EDUMATH: Generating Standards-aligned Educational Math Word Problems
- LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
- Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
- ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory
- Flipping the Dialogue: Training and Evaluating User Language Models
- Crossing Domains without Labels: Distant Supervision for Term Extraction
- Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
- Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups
- Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks
- Reward Model Perspectives: Whose Opinions Do Reward Models Reward?
- Controllable Stylistic Text Generation with Train-Time Attribute-Regularized Diffusion
- LLM Bias Detection and Mitigation through the Lens of Desired Distributions
- XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection
- SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
- Prakriti200: A Questionnaire-Based Dataset of 200 Ayurvedic Prakriti Assessments
- TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B
- Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation
- Scalable multilingual PII annotation for responsible AI in LLMs
- Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation
- MeXtract: Light-Weight Metadata Extraction from Scientific Papers
- BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods
- TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs
- Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback
- PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
- Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)
- Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
- AWM: Accurate Weight-Matrix Fingerprint for Large Language Models
- Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization
- Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?
- Protecting De-identified Documents from Search-based Linkage Attacks
- EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference
- CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation
- Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs
- Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
- Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness
- Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser
- OpenStaxQA: A multilingual dataset based on open-source college textbooks
- Influence Functions for Efficient Data Selection in Reasoning
- The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
- Aligning Large Language Models via Fully Self-Synthetic Data
- TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents
- FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
- Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
- Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?
- Reproducibility Study of “XRec: Large Language Models for Explainable Recommendation”
- Language models for longitudinal analysis of abusive content in Billboard Music Charts
- Large Language Models Hallucination: A Comprehensive Survey
- Asking For It: Question-Answering for Predicting Rule Infractions in Online Content Moderation
- $\lambda$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences
- TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
- Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition
- TokenChain: A Discrete Speech Chain via Semantic Token Modeling
- Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
- CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
- GPT-5 Model Corrected GPT-4V’s Chart Reading Errors, Not Prompting
- Evolving and Executing Research Plans via Double-Loop Multi-Agent Collaboration
- CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
- How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
- Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets
- Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance
- MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation
- Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments
- Deterministic Legal Retrieval: An Action API for Querying the SAT-Graph RAG
- Adaptive Tool Generation with Models as Tools and Reinforcement Learning
- The Algebra of Meaning: Why Machines Need Montague More Than Moore’s Law
- From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining
- Test-Time Scaling of Reasoning Models for Machine Translation
- MathRobust-LV: Evaluation of Large Language Models’ Robustness to Linguistic Variations in Mathematical Reasoning