NLP Paper Digest for October 10, 2025 (English)
- Topic 1: Reasoning and Logical Training (8 papers)
- Topic 2: Multilingual and Cross-lingual Models (6 papers)
- Topic 3: Human Interaction and Alignment (6 papers)
- Topic 4: Large Language Model Safety and Guardrails (6 papers)
- Topic 5: Generative Models and Text Generation (8 papers)
- Topic 6: Model Fine-tuning and Calibration (7 papers)
- Topic 7: Speech and Audio Processing (4 papers)
- Topic 8: Evaluation and Auditing of Models (5 papers)
- Topic 9: Neural Network Architectures and Techniques (8 papers)
- Topic 10: Healthcare and Medical Applications (4 papers)
- Topic 11: misc (19 papers)
Topic 1: Reasoning and Logical Training
Topic Overview
The research topic of “Reasoning and Logical Training” focuses on advancing the reasoning capabilities of large language models (LLMs) across various domains and tasks. This includes enhancing inductive reasoning, improving the evaluation of game worthiness, ensuring safe and ethical reasoning trajectories, and boosting mathematical reasoning accuracy. The importance of this topic lies in creating more versatile and human-aligned AI systems that can reason effectively, much like humans, in complex and diverse scenarios. This is critical for deploying AI systems in real-world applications where decision-making, reliability, and ethical considerations are paramount.
Individual Paper Contributions
- Katherine M. Collins from the University of Cambridge and colleagues studied the evaluation of AI systems’ ability to assess the worthiness of games, focusing on expected payoffs and funness. They proposed a framework that uses a large-scale dataset of over 121 novel board games and over 450 human judgments to compare the evaluations produced by various language and reasoning models against human assessments. The main innovation is a methodological approach that accounts for the computational complexity of evaluation and the difficulty of quantifying subjective qualities like funness. The value lies in providing a deeper understanding of AI systems’ reasoning capabilities beyond mere gameplay. Experiments revealed that reasoning models align better with human judgments on expected payoffs but struggle with quantifying funness, highlighting the need for more resource-rational models [1].
- Zhuowei Chen from Guangdong University of Foreign Studies and colleagues addressed the development of effective LLM safeguards that can detect malicious requests across multiple languages, including low-resource ones. They introduced ConsistentGuard, a framework that enhances explainability and knowledge transfer through reasoning and cross-lingual alignment. The main innovation points are the use of novel rewards for controlling reasoning length and diversity and the Constrained Alignment Optimization (CAO) method. The value lies in enabling safer and more interpretable AI systems in multilingual settings. With only 1,000 training samples and a 3B-parameter model, ConsistentGuard outperformed baselines like Llama Guard and ShieldGemma across six languages [2].
- Kedi Chen from East China Normal University and colleagues conducted a comprehensive survey on inductive reasoning for LLMs. They categorized methods into post-training, test-time scaling, and data augmentation, and introduced a new taxonomy and a unified sandbox-based evaluation approach with a fine-grained observation coverage metric (OC). The main innovation points are the systematic review and the proposed evaluation strategy. The value lies in providing a structured overview and a new evaluation framework for future research in inductive reasoning. Although no specific experimental results are provided, the paper offers insights into the importance of synthetic data creation, IRL-style optimization, and human intervention in enhancing LLMs’ inductive reasoning [3].
- Hua Cai from UniDT and colleagues tackled the limitations of LLMs in legal reasoning due to inconsistent legal data and lack of transparency. They proposed a two-stage training framework combining Supervised Fine-Tuning (SFT) with Reinforcement Learning (RL) via Group Relative Policy Optimization (GRPO) and an iterative inference mechanism. The main innovation points are the introduction of a high-quality legal reasoning dataset (Unilaw-R1-Data) and the explicit legal iterative inference mechanism. The value lies in improving the legal conformity and reasoning accuracy of LLMs. Unilaw-R1 demonstrated superior performance on benchmarks like Unilaw-R1-Eval, LawBench, and LexEval, outperforming Qwen-2.5-7B-Instruct by an average margin of 6.6% [4].
- Taiqiang Wu from The University of Hong Kong and colleagues aimed to achieve efficient reasoning in LLMs without sacrificing performance. They revisited the model interpolation (MI) method to merge Thinking and Instruct models and provided a systematic framework to understand the dynamics of weight interpolation. The main innovation points are the fine-grained ablation studies and the identification of a three-stage paradigm guiding the creation of models with desired reasoning behaviors. The value lies in optimizing LLMs for real-world applications by balancing reasoning capability and token efficiency. The interpolated model achieved the best trade-off between effectiveness and efficiency in the second stage, outperforming merging baselines such as Task Arithmetic (TA) and TIES across benchmarks including AIME’25, IFEval, and GPQA-Diamond [5]. A minimal weight-interpolation sketch appears after this list.
- Jidong Li from [Institution] and colleagues focused on the false premise problem in multimodal large language models (MLLMs). They proposed a new benchmark named JBA and introduced JBA-GRPO, a reinforcement learning framework incorporating a ‘reasoning reward’ to enhance MLLMs’ ability to recognize and reject false premises. The main innovation points are the hierarchical taxonomy of false premises and the JBA-GRPO framework. The value lies in increasing the reliability and trustworthiness of MLLMs. Experiments showed that the Qwen2.5-VL-7B-Instruct model fine-tuned with JBA-GRPO achieved higher scores in False Premise Coverage (FPC), False Premise Detection Precision (FPDP), and True Premise Identification Rate (TPIR) compared to baselines [6].
- Samir Abdaljalil from Texas A&M University and colleagues introduced the Audit-of-Understanding (AoU) framework to address hallucinations in LLM-generated solutions, particularly in mathematical reasoning. The framework decomposes queries into candidate assumptions, audits their support, and performs inference based only on validated subsets. The main innovation points are the posterior-constrained inference and the theoretical guarantees provided. The value lies in enhancing the logical consistency and reliability of LLMs. Empirical evidence showed substantial reductions in hallucinations and improvements in accuracy on benchmarks like GSM8K, MultiArith, and SVAMP [7].
- Yuyi Huang from University of Macau and colleagues explored the phenomenon of ‘Path Drift’ in large reasoning models (LRMs), where reasoning trajectories can deviate from aligned paths, leading to harmful content generation. They proposed a Path Drift Induction Framework and introduced path-level defense strategies. The main innovation points are the definition of Path Drift and the three-stage attack framework. The value lies in identifying deeper structural vulnerabilities and proposing nuanced mitigation strategies. Experiments confirmed that behavioral triggers such as first-person commitments significantly reduced the refusal rates of generating unsafe content, demonstrating the need for trajectory-level alignment oversight [8].
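The core of the model-interpolation study [5] is simple enough to state in code: merged weights are an element-wise convex combination of two checkpoints that share an architecture. Below is a minimal sketch assuming PyTorch-style state dicts and an illustrative coefficient `lam`; the paper's own notation and pipeline may differ.

```python
def interpolate_state_dicts(instruct_sd, thinking_sd, lam=0.5):
    """Element-wise linear interpolation of two same-architecture checkpoints.

    lam = 0.0 reproduces the Instruct model, lam = 1.0 the Thinking model;
    intermediate values trade token efficiency against reasoning depth.
    """
    assert instruct_sd.keys() == thinking_sd.keys()
    return {
        name: (1.0 - lam) * instruct_sd[name] + lam * thinking_sd[name]
        for name in instruct_sd
    }

# Usage sketch: load both checkpoints, merge, and reuse either model shell.
# merged = interpolate_state_dicts(instruct.state_dict(), thinking.state_dict(), 0.4)
# instruct.load_state_dict(merged)
```

Sweeping `lam` over [0, 1] is the kind of ablation that surfaces the three-stage paradigm the authors describe.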
Technical Trends
The papers in this collection adopt several key technical trends:
- Framework Development: Multiple papers propose frameworks (e.g., ConsistentGuard, Unilaw-R1, JBA-GRPO, AoU) that integrate various methodologies such as reinforcement learning, fine-grained ablation studies, and post-training safeguards.
- Reinforcement Learning: Several papers leverage reinforcement learning techniques, such as Group Relative Policy Optimization (GRPO) and Constrained Alignment Optimization (CAO), to enhance model performance and reasoning accuracy.
- Evaluation Metrics and Datasets: There is a strong emphasis on creating new datasets and metrics to evaluate reasoning capabilities more comprehensively, such as the Unilaw-R1-Data, JBA dataset, and the observation coverage metric (OC).
Datasets and Evaluation
- Evaluating Language Models’ Evaluations of Games [1]: over 121 novel board games and over 450 human judgments.
- Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data [2]: extended safety benchmarks covering six languages.
- A Survey of Inductive Reasoning for Large Language Models [3]: no new datasets; uses existing benchmarks.
- Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference [4]: Unilaw-R1-Data, LawBench, LexEval.
- Revisiting Model Interpolation for Efficient Reasoning [5]: AIME’25, IFEval, GPQA-Diamond.
- Judge Before Answer: Can MLLM Discern the False Premise in Question? [6]: JBA dataset.
- Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models [7]: GSM8K, MultiArith, SVAMP.
- Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety [8]: various LRMs evaluated on high-risk domain tasks.
These papers collectively contribute to the advancement of reasoning and logical training in LLMs, highlighting the importance of methodological innovation, comprehensive evaluation frameworks, and specialized datasets in addressing the unique challenges of AI reasoning.
Topic 2: Multilingual and Cross-lingual Models
Topic Overview
Multilingual and Cross-lingual Models represent a critical area in Natural Language Processing (NLP) that seeks to bridge the gap between languages, enabling models to perform tasks across different linguistic systems effectively. This topic is vital for advancing global communication and information access, especially for underrepresented languages. By addressing issues such as factual recall inconsistencies, data efficiency, and human-model performance gaps, these models aim to achieve parity with human capabilities in understanding and generating text across languages. The research in this field also explores innovative ways to unify diverse neural network operations to enhance model performance in both computer vision and NLP tasks.
Individual Paper Contributions
- Yihong Liu from LMU Munich and colleagues studied the inconsistencies in factual recall across different languages by large language models (LLMs), particularly between languages with different scripts or linguistic structures. They proposed a dedicated entity translation probing task and two prompting-based remedies, SubSub and SubInj, which incorporate English translations of subjects into prompts to improve factual recall and consistency. The main innovation points are the focus on entity-level alignment and the use of a more stringent consistency metric. The value lies in improving the reliability and effectiveness of multilingual LLMs in applications such as cross-lingual translation and question answering. Experiments on the KLAR dataset, which covers 17 languages and spans 2,619 language-agnostic facts, showed significant improvements in factual recall and consistency across all models tested, with SubInj providing stronger improvements than SubSub, though the smallest OLMo model showed less pronounced gains [9]. A prompt-construction sketch of the two remedies appears after this list.
- Jaap Jumelet from University of Groningen and colleagues addressed the lack of developmentally plausible datasets and models for multilingual language learning. They introduced BabyBabelLM, a multilingual benchmark consisting of curated datasets from child-directed speech, educational content, children’s books, and subtitles, covering 45 languages with varying tiers of data size. The main innovation is the creation of a developmentally plausible dataset pipeline for future expansions and a community-driven model of dataset curation. The value lies in facilitating research towards achieving human-like learning efficiency and cognitive plausibility in multilingual settings. Monolingual models trained on BabyBabelLM data performed well on syntactic benchmarks but struggled on knowledge-intensive tasks, while bilingual models showed performance gains when English was included as a second language, except for Dutch on INCLUDE, possibly due to domain mismatch [10].
- Adnan El Assadi from Carleton University and colleagues tackled the issue of evaluating text embedding models without human performance references, which limits the interpretation of model performance. They introduced HUME, a framework designed to measure human performance in text embedding tasks across 16 datasets from the MTEB suite, ensuring a broad coverage of linguistic diversity and task complexity. The main innovation is the provision of human performance baselines and insights into task difficulty. The value lies in offering a human-centered evaluation method that aids in understanding model capabilities and limitations, especially in low-resource languages and complex tasks. Humans achieved an average performance of 77.6%, slightly below the best embedding model’s 80.1%, highlighting competitive but not dominant performance. Notably, humans outperformed models in non-English sentiment analysis and Arabic semantic similarity, indicating challenges in capturing cultural nuances [11].
- Hehe Fan from Zhejiang University and colleagues focused on the inefficiencies of current deep learning models, specifically CNNs and Transformers, in identifying and encoding relevant elements adaptively and relatively. They proposed Translution, a new operation that integrates the adaptive relevance of self-attention with the relative encoding of convolution, and a lightweight variant α-Translution. The main innovation points are the combination of self-attention and convolution and the introduction of α-Translution to mitigate parameter increases. The value lies in enhancing the performance of deep learning models in understanding spatial relationships and context variability. Experiments on dynamic MNIST, ImageNet-1K, and OpenWebText demonstrated significant improvements in accuracy and lower perplexity over self-attention alone, indicating better handling of location variability and context-dependent tasks [12].
- James Ald Teves from Silliman University and colleagues aimed to address the underrepresentation of the Hiligaynon language in NLP research, focusing on the lack of annotated corpora and baseline models for Named Entity Recognition (NER). They created the first publicly available NER corpus and fine-tuned two multilingual Transformer models, mBERT and XLM-RoBERTa, for Hiligaynon texts. The main innovation is the creation of a large annotated dataset and fine-tuning models for a minority language. The value lies in providing foundational resources for further NLP research in Hiligaynon and other underrepresented languages, and demonstrating cross-lingual transfer potential. Both models achieved strong F1-scores exceeding 80%, with person-based named entities recognized with the highest accuracy and organizations being the most challenging [13].
- Prawaal Sharma from Infosys and colleagues focused on the lack of effective unsupervised OCR methodologies for very low-resource scripts, specifically the Takri script. They proposed VOLTAGE, a contrastive learning-based OCR methodology featuring an automated Glyph Feature Recommendation System (GFRS) for labeling symbols. The main innovation is the automation of glyph feature extraction and enhancement of data augmentation techniques using GANs. The value lies in digitizing endangered scripts, thereby preserving cultural heritage and promoting their continued use. Experiments showed VOLTAGE achieving high accuracy rates (95% for machine-printed and 87% for handwritten samples) on the Takri script, with consistent performance across different Indic scripts, indicating its general applicability [14].
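To make the two prompting remedies [9] concrete, here is a minimal sketch of how SubSub (substituting the subject with its English translation) and SubInj (injecting the translation alongside the original subject) could be realized. The `[X]` placeholder convention and the example template are illustrative assumptions, not the paper's exact format.

```python
def subsub(template: str, subject_en: str) -> str:
    """SubSub: replace the in-language subject with its English translation."""
    return template.replace("[X]", subject_en)

def subinj(template: str, subject: str, subject_en: str) -> str:
    """SubInj: keep the original subject and inject the English translation."""
    return template.replace("[X]", f"{subject} ({subject_en})")

# Hypothetical Russian template for "Tim Cook works for ___":
template = "[X] работает в компании"
print(subsub(template, "Tim Cook"))             # Tim Cook работает в компании
print(subinj(template, "Тим Кук", "Tim Cook"))  # Тим Кук (Tim Cook) работает в компании
```

Either way the English surface form gives the model an anchor for entity-level alignment, consistent with the finding above that SubInj, which keeps both forms, is the stronger remedy.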
Technical Trends
The papers collectively emphasize the importance of developing robust and adaptable models that can handle the complexities of multilingual and cross-lingual tasks. Innovations include the integration of human performance metrics for model evaluation, the unification of different neural network operations for enhanced context understanding, and the creation of developmentally plausible datasets to simulate efficient human learning processes. There is also a focus on addressing the underrepresentation of minority and endangered languages through the creation of annotated corpora and specialized OCR methodologies.
Datasets and Evaluation
- KLAR: Used to evaluate entity-level alignment and cross-lingual consistency in LLMs, covering 17 languages and spanning 2,619 language-agnostic facts [9].
- BabyBabelLM: A multilingual benchmark of developmentally plausible training data, covering 45 languages with varying tiers of data size [10].
- MTEB: The suite of datasets used by HUME to measure human performance against embedding models, covering a wide range of linguistic diversity and task complexity [11].
- Dynamic MNIST, ImageNet-1K, OpenWebText: Datasets used to evaluate the performance of Translution and α-Translution in computer vision and NLP tasks [12].
- HiligayNER Corpus: An annotated dataset of over 8,000 sentences for Hiligaynon NER, tagged using the BIO encoding scheme [13].
- Takri Dataset: The largest labeled Takri dataset, with approximately 226,000 symbols spanning both machine-printed and handwritten samples, used for evaluating VOLTAGE [14].
These datasets and frameworks provide a comprehensive basis for evaluating the performance of models across different linguistic and task environments, emphasizing the need for culturally sensitive and contextually accurate representations.
Topic 3: Human Interaction and Alignment
Topic Overview
Human interaction and alignment with artificial intelligence (AI) systems have become increasingly critical as AI technologies penetrate various sectors including academia, healthcare, and journalism. The importance of this research topic lies in developing AI systems that not only function efficiently but also align with human values, intentions, and needs, thereby ensuring the reliability, safety, and effectiveness of AI applications in real-world scenarios. This involves creating methodologies that enable AI to understand and respond to human queries accurately, generate content that meets human standards of quality and relevance, and interact in ways that are respectful and trustworthy.
Individual Paper Contributions
- Yu Chao from Tsinghua University and colleagues studied the generation of high-quality, long-form academic surveys using AI agent systems, proposing LLM×MapReduce-V3, a hierarchically modular agent system. The main innovation points of this method are its multi-agent architecture with independent Model Context Protocol (MCP) servers and a high-level planner agent that supports human-in-the-loop interactions. The value lies in streamlining the creation of academic surveys, allowing researchers to focus on analysis rather than manual compilation. Human evaluations comparing the output with other systems showed that LLM×MapReduce-V3 generates more informative outlines and higher-quality, more in-depth survey articles, particularly in broader literature coverage and longer contexts, outperforming Gemini DeepResearch and Manus AI [15].
- Bahar İlgen from Robert Koch Institute and colleagues addressed the inadequacy of current readability evaluation metrics in assessing human-centered qualities of simplified health texts. They introduced the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework integrating automatic measures with structured human feedback. The innovation points include the focus on multidimensional readability in health communication and the emphasis on human feedback in the evaluation process. The value lies in enhancing the usability and trustworthiness of health-focused NLP systems for diverse users, especially those with limited health literacy. While no specific experimental results were presented, the paper sets up a protocol for a more nuanced evaluation of health text simplification systems [16].
- Guy Mor-Lan from Hebrew University of Jerusalem and colleagues tackled the identification and classification of epistemic factual appeals in news media. They introduced FactAppeal, a manually annotated dataset of 3,226 English-language news sentences, and proposed the task of Epistemic Appeal Identification. The main innovation points involve fine-grained span-level annotations capturing the type of source invoked and the method of appeal. The value lies in understanding the persuasive force of factual claims and improving the credibility assessment of media reporting. Experiments with token-level multi-label classification using encoder models and generative decoder models achieved a macro-F1 score of 0.73, with decoder models showing less variability in performance across different categories [17].
- Huanjin Yao from Nanyang Technological University and colleagues surveyed Agentic Multimodal Large Language Models (MLLMs), focusing on their capabilities in handling complex, dynamic real-world tasks. The paper provides a comprehensive analysis of existing studies and resources, organizing the field through a detailed taxonomy and highlighting the differences between traditional MLLMs and agentic ones. The main innovation points include categorizing agentic MLLMs into internal intelligence, external tool invocation, and environment interaction. The value lies in guiding future research towards more versatile and intelligent AI systems capable of autonomous interaction and reasoning in various applications [18].
- Mihir Gupte from General Motors and colleagues explored the efficient representation and retrieval of knowledge from hierarchical data structures, specifically trees, for use with Retrieval-Augmented Generation (RAG) systems. They proposed a novel method for generating implicit knowledge from hierarchical structures, using large language models (LLMs) to create templated summaries at each level of the hierarchy. The innovation points are the bottom-up aggregation of information and preservation of semantic context. The value lies in improving the scalability and efficiency of RAG systems when dealing with complex data structures. Experiments on a proprietary code repository showed that the proposed method generated nearly four times fewer documents in the vector database while maintaining comparable response quality to conventional RAG methods [19]. A bottom-up summarization sketch appears after this list.
- Yujie Ren and colleagues focused on detecting hallucinations in large language models (LLMs) during authentic human-LLM interactions. They introduced AuthenHallu, a benchmark derived from real-world interactions, annotated for hallucination occurrence and categorized into three types. The innovation points include the development of a realistic benchmark and the categorization of hallucinations. The value lies in providing a more accurate evaluation of hallucination detection methods, crucial for sensitive applications like medicine and law. Experiments revealed that vanilla LLMs struggle with hallucination detection, achieving F1-scores ranging from 38.46% to 63.91%, with Qwen-3-32B failing to detect nearly 30% of hallucinations. The study concluded that while there is some promise, current LLM performance is insufficient for reliable real-world deployment [20].
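The tree-summarization idea behind the RAG contribution [19] can be sketched in a few lines: summaries are built bottom-up, so each node's document already condenses its subtree, and far fewer documents reach the vector store. The names and templated prompt below are illustrative assumptions; `llm_summarize` stands in for whatever model call the authors use.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    content: str = ""                      # leaf payload, e.g. a source file
    children: list["Node"] = field(default_factory=list)

def summarize_bottom_up(node: Node, llm_summarize) -> list[dict]:
    """Produce one vector-store document per tree node, aggregating children first."""
    docs, child_summaries = [], []
    for child in node.children:
        child_docs = summarize_bottom_up(child, llm_summarize)
        docs.extend(child_docs)
        child_summaries.append(child_docs[-1]["summary"])  # child's own summary is last
    prompt = (f"Summarize component '{node.name}'.\n"
              f"Own content: {node.content}\n"
              f"Sub-component summaries: {child_summaries}")
    docs.append({"node": node.name, "summary": llm_summarize(prompt)})
    return docs
```

Because each internal node replaces many raw chunks with a single semantically contextualized summary, the document count shrinks with the tree's fan-out, in line with the roughly four-fold reduction reported.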
Technical Trends
The papers collectively highlight a trend towards more sophisticated AI architectures that incorporate human feedback and interaction, emphasizing the need for AI systems to not only produce accurate outputs but also to understand and respond to human needs effectively. Techniques such as multi-agent systems, human-centered evaluation frameworks, and taxonomies for categorizing AI functionalities are being developed to enhance the alignment between AI and human expectations. Additionally, there is a growing interest in improving the efficiency and context-awareness of knowledge retrieval and generation processes, particularly for complex data structures and real-world interaction scenarios.
Datasets and Evaluation
- FactAppeal: Manually annotated dataset of 3,226 English-language news sentences, evaluated using token-level multi-label classification and span-level annotations for epistemic appeal identification 17.
- AuthenHallu: New benchmark derived from authentic LLM-human interactions, including 400 dialogues with 800 query-response pairs, annotated for hallucination occurrence and categorized into three types, evaluated using F1-scores for hallucination detection and categorization 20.
- Proprietary Code Repository: Used for evaluating the implicit knowledge generation method for tree-like structures, assessed using BLEU-1, E-F1 Score, and EM Formula for response quality 19.
Topic 4: Large Language Model Safety and Guardrails
Topic Overview
Large Language Model Safety and Guardrails is a critical area of research aimed at mitigating biases, ensuring ethical deployment, and enhancing the reliability of AI systems, particularly in high-stakes applications such as hiring, research synthesis, and question-answering. As LLMs become more integrated into societal functions, addressing their inherent biases and ensuring they do not propagate harmful content becomes imperative. This research topic seeks to develop methodologies and frameworks that can evaluate, detect, and mitigate these issues, thereby promoting fair and safe AI usage.
Individual Paper Contributions
- Mahika Phutane from Cornell University and colleagues studied intersectional disability bias in large language models (LLMs) used in hiring scenarios. They proposed the ABLEIST metrics and fine-tuned the Llama-3.1-8B-Instruct model to detect covert ableism and intersectional harms. The main innovation points are the focus on intersectional aspects of disability bias and the expansion beyond Western-centric perspectives. The value lies in providing a cost-efficient and reusable tool for detecting ABLEIST harms and a framework grounded in disability studies and intersectionality literature. Experiments on 2,820 hiring scenarios revealed up to 58 times more ABLEIST harm for disabled candidates compared to baseline candidates without identity markers, concluding that widely used safety tools fail to detect these covert intersectional biases and that specialized models are necessary for accurate detection [21].
- Wei-Chieh Huang from University of Illinois Chicago and colleagues addressed the lack of sufficient evaluation procedures and stage-specific protections in deep research frameworks. They introduced DeepResearchGuard, a multi-stage guardrail framework that includes four guard agents overseeing input, planning, research, and output stages. The main innovation points are the detailed taxonomy and guard rules enabling systematic evaluation and intervention. The value lies in ensuring high factual correctness, comprehensive coverage, and safety of content generated by deep research frameworks. Experiments on the DRSafeBench benchmark demonstrated a significant improvement in the Defense Success Rate (DSR) by 18.16%, with minimal impact on the Over-Refusal Rate (ORR), concluding that DeepResearchGuard effectively identifies and mitigates harmful content, enhancing the safety and quality of research outputs [22].
- Aashiq Muhamed from Carnegie Mellon University and colleagues tackled the issue of selective refusal in retrieval-augmented generation (RAG) systems. They introduced RefusalBench, a generative evaluation framework using 176 perturbation strategies across six categories to dynamically assess refusal capabilities. The main innovation points are the contamination-resistant evaluation and systematic probing of selective refusal. The value lies in providing a reliable method to measure and improve the refusal accuracy of RAG systems, crucial for their safe deployment. Experiments on RefusalBench-NQ and RefusalBench-GaRAGe showed a trade-off between caution and permissiveness, with no model performing well on both answer accuracy and refusal accuracy simultaneously, concluding that selective refusal is a trainable capability sensitive to model alignment and scale [23].
- Thi-Nhung Nguyen from Monash University and colleagues investigated the emergence, propagation, and amplification of stereotypical bias in multi-agent systems (MAS) where LLMs collaborate. They proposed a comprehensive evaluation framework that includes both system-level and individual agent-level assessments, utilizing three stereotypical bias benchmarks and a directed graph representation of MAS interactions. The main innovation points are the focus on MAS and the role of communication protocols in bias mitigation. The value lies in understanding how biases evolve through agent interactions and identifying cooperative and debate-based protocols as effective in reducing bias amplification. Experiments showed that GPT-4.1-mini exhibits the highest robustness against bias, and larger models within the same family are generally more robust, concluding that while competitive communication is less robust, it can limit bias emergence and propagation [24].
- Ki Jung Seo from Hanyang University and colleagues focused on the overconfidence of LLMs in their generated answers, particularly relevant in critical domains like law and healthcare. They introduced ADVICE, a fine-tuning method that encourages LLMs to better consider their answers when expressing confidence. The main innovation points are the focus on understanding the reasons behind overconfidence and developing a method to improve verbalized confidence estimation. The value lies in enhancing the trustworthiness of LLMs by reducing their overconfidence without compromising answer accuracy. Experiments on TriviaQA, SciQ, MMLU, and LogiQA datasets showed significant reductions in Expected Calibration Error (ECE) and Net Calibration Error (NCE) compared to baseline methods, concluding that ADVICE effectively calibrates confidence scores without affecting task performance [25]. A reference ECE computation appears after this list.
- Utsav Maskey from Macquarie University and colleagues worked on over-refusal in RAG systems, where LLMs unnecessarily reject benign requests due to aggressive safety filters. They proposed SafeRAG-Steering, a method that steers intermediate representations towards a safe region, along with a domain-stratified benchmark named RagRefuse. The main innovation points are the dual focus on reducing unsafe completions and over-refusal. The value lies in improving the usability and safety of RAG systems without sacrificing response rates to legitimate queries. Experiments demonstrated a significant reduction in over-refusal rates across various domains, concluding that SafeRAG-Steering effectively mitigates over-refusal while maintaining safety standards [26].
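Expected Calibration Error, the headline metric in the ADVICE evaluation [25], has a standard definition worth keeping in view: predictions are bucketed by stated confidence, and ECE is the sample-weighted gap between accuracy and mean confidence per bucket. The sketch below implements that standard formula, not the paper's own code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted |accuracy - mean confidence| across confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# An overconfident model: says 90% but is right half the time -> ECE = 0.4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```

Lowering this number without hurting answer accuracy is precisely the trade-off ADVICE targets.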
Technical Trends
The papers collectively highlight several technical trends:
- Fine-Tuning and Specialized Models: Many contributions involve fine-tuning existing models or developing specialized models to address specific safety concerns.
- Comprehensive Audits and Metrics: There is a move towards creating detailed metrics and benchmarks that capture a broad spectrum of safety and bias issues, rather than relying on simple exact-match evaluations.
- Multi-Stage Evaluation Frameworks: Several papers emphasize the importance of evaluating systems at multiple stages to identify and mitigate harmful content effectively.
- Understanding Interaction Dynamics: In MAS, understanding how biases emerge and propagate through agent interactions is gaining attention, with a focus on communication protocols and underlying model robustness.
Datasets and Evaluation
- ABLEIST: Uses a set of 2,820 hiring scenarios spanning diverse candidate profiles to measure intersectional disability bias.
- DeepResearchGuard: Employs DRSafeBench, an 828-query benchmark designed to test the safety and quality of deep research models.
- RefusalBench: Utilizes RefusalBench-NQ and RefusalBench-GaRAGe, transforming QA datasets into dynamic benchmarks with 176 perturbation strategies.
- ADVICE: Evaluates on TriviaQA, SciQ, MMLU, and LogiQA datasets, focusing on calibration quality metrics like ECE and NCE.
- SafeRAG-Steering: Tests over-refusal behaviors using the domain-stratified RagRefuse benchmark, covering various domains such as Chemical, Medical, Legal, and Financial.
Each paper employs distinct datasets and metrics tailored to their specific research objectives, showcasing the evolving methodologies in assessing and mitigating biases and safety issues in LLMs and related systems.
Topic 5: Generative Models and Text Generation
Topic Overview
Generative Models and Text Generation represent a critical area of research in Natural Language Processing (NLP) and Machine Learning (ML). These models are designed to create human-like text by learning patterns from vast amounts of data. The focus of recent studies has been on enhancing the quality and efficiency of these models, addressing issues such as data scarcity, security threats, creative output generation, and the integration of domain-specific knowledge. Advances in these areas not only improve the foundational capabilities of generative models but also enable more sophisticated applications in fields ranging from automated theorem proving to interactive database querying.
Individual Paper Contributions
- Zichun Yu from Carnegie Mellon University and colleagues studied the scarcity of high-quality pretraining data for large language models (LLMs), proposing RePro, a method that uses reinforcement learning to recycle low-quality web data into high-quality pretraining material. The main innovation points of this method are the combination of quality and faithfulness reward functions, with the quality reward based on the DataMan score and the faithfulness reward ensuring semantic preservation. The value lies in its ability to offer a substantial speedup and maintain semantic fidelity, making it possible to enhance LLMs’ capabilities without relying on large-scale external resources. Experiments showed that RePro improved the DCLM Core score by 4.7% to 14.0% relative to organic data baselines, outperforming state-of-the-art techniques like ReWire and WRAP, concluding that the method effectively recycles web data while preserving semantic integrity [27].
- Liang Lin from NTU and colleagues addressed the problem of backdoor removal in LLMs, proposing Locphylax, a defense framework that aggregates backdoor representations in the model’s space. The main innovation is the deliberate injection of known backdoors to cluster both known and unknown backdoors, followed by fine-tuning to neutralize them. This approach stands out by not requiring access to unknown backdoor triggers or specific input/output features, making it more practical than previous methods. Evaluations on SST2, AGNews, and SafeRLHF datasets demonstrated that Locphylax significantly reduced the Attack Success Rate to 4.41%, outperforming existing baselines by 28.1% to 69.3%, while maintaining clean accuracy within 0.5% of the original model. The paper concludes that Locphylax effectively mitigates backdoor threats without compromising legitimate model performance [28].
- Meiru Zhang from the University of Cambridge and colleagues tackled the automation of formalizing mathematical statements into machine-verifiable forms, proposing DRIFT, a four-stage framework that decomposes, retrieves, illustrates, and formalizes theorems. The innovation lies in its decomposition-driven approach and the use of exemplar theorems to illustrate the application of retrieved definitions, overcoming the limitations of current retrieval methods. DRIFT was evaluated on ProofNet, MiniF2F-test, and ConNF benchmarks, achieving state-of-the-art performance in dependency retrieval and autoformalization tasks, especially on the ConNF benchmark, which tests out-of-distribution generalization. The paper concludes that DRIFT improves the syntactic and semantic correctness of generated formal statements, suggesting that contextualized retrieval is crucial for successful autoformalization [29].
- Peiyuan Gong from Renmin University of China and colleagues focused on improving long-tail query rewriting on short-video platforms like Kuaishou, proposing CardRewriter. The method involves summarizing multi-source knowledge into knowledge cards and employs a two-stage training framework with supervised fine-tuning and group relative policy optimization. The innovation lies in leveraging platform-specific knowledge to enhance query relevance and retrieval effectiveness. Experiments showed that CardRewriter outperformed baselines in semantic relevance and retrieval expansion, with significant improvements in Hitrate@300, LVR, and CTR metrics, and a reduction in IQRR. The conclusion is that CardRewriter significantly enhances the search experience on short-video platforms by improving the quality of query rewriting [30].
- Jiajing Guo from Bosch Research North America and colleagues evaluated different inference-based test-time scaling strategies for improving Text2SQL systems, proposing methods like Divide-and-Conquer and Result Verification. The innovation is in the comprehensive assessment of lightweight workflows suitable for practical deployment, using metrics such as Soft F1-Score, Execution Error Rate, and Inference Time. Experiments on the BIRD Mini-Dev benchmark indicated that strategies like Divide-and-Conquer and Result Verification significantly improve SQL generation quality, with a notable trade-off between accuracy and latency. The study concludes that certain test-time scaling strategies can effectively enhance Text2SQL systems without requiring extensive retraining or computational resources [31]. A generate-and-verify sketch appears after this list.
- Omid Reza Heidari from Concordia University and colleagues introduced AgentiQL, a multi-expert framework for text-to-SQL generation, addressing the limitations of monolithic LLM architectures in handling complex reasoning and diverse database schemas. The innovation points include the Divide-and-Merge module, Column Selection refinement, and Adaptive Routing mechanism. AgentiQL’s value lies in its modular design, which enhances interpretability and scalability, using smaller open-source LLMs to achieve competitive performance. Experiments on the Spider benchmark revealed that AgentiQL reached up to 86.07% execution accuracy, especially with the Planner-and-Executor merging strategy, highlighting the framework’s robustness in managing complex schemas and reasoning tasks. The paper concludes that AgentiQL offers a more efficient and interpretable solution for NL2SQL tasks compared to monolithic models [32].
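The Result Verification strategy evaluated in the Text2SQL study [31] amounts to a generate-execute-repair loop at inference time. Here is a minimal sketch under assumed names: `llm` is any text-completion callable, and the prompt wording is illustrative rather than taken from the paper.

```python
import sqlite3

def sql_with_verification(question, schema, llm, db_path, max_tries=3):
    """Generate SQL, execute it, and ask the model to repair failing queries."""
    prompt = f"Schema:\n{schema}\nQuestion: {question}\nSQL:"
    sql = ""
    for _ in range(max_tries):
        sql = llm(prompt)
        try:
            with sqlite3.connect(db_path) as conn:
                return sql, conn.execute(sql).fetchall()  # executable query found
        except sqlite3.Error as err:
            # Feed the error back so the next attempt can repair the query.
            prompt = (f"Schema:\n{schema}\nQuestion: {question}\n"
                      f"This SQL failed with '{err}':\n{sql}\nFixed SQL:")
    return sql, None                                       # still failing after max_tries
```

Each retry costs another LLM call and another execution, which is exactly the accuracy-versus-latency trade-off the paper measures.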
Technical Trends
The papers reviewed highlight several technical trends in the field of generative models and text generation:
- Data Enhancement and Recycling: Techniques like RePro emphasize the importance of recycling and enhancing available data to improve model performance, indicating a shift towards more efficient use of data resources.
- Security Measures: With Locphylax, there is a growing emphasis on securing generative models against backdoor attacks, showcasing the need for robust defenses in practical applications.
- Contextual Learning: Methods such as DRIFT and CardRewriter leverage contextual information to improve the quality of generated outputs, suggesting a trend towards integrating domain-specific knowledge into generative processes.
- Efficient Workflow Optimization: Papers like “Rethinking Agentic Workflows” and AgentiQL explore ways to optimize the inference and generation processes, aiming for a balance between accuracy and computational efficiency.
Datasets and Evaluation
The papers utilized a range of datasets and evaluation metrics to assess their proposed methodologies:
- Datasets: SST2, AGNews, SafeRLHF, ProofNet, MiniF2F-test, ConNF, BIRD Mini-Dev, Spider
- Evaluation Metrics: DataMan Score, DCLM Core score, Attack Success Rate (ASR), Hitrate@300, Long-View Rate (LVR), Click-Through Rate (CTR), Initiative Query Reformulation Rate (IQRR), Soft F1-Score, Execution Error Rate, Inference Time, Number of LLM Calls, Token Count, Execution Accuracy (EX; a reference implementation is sketched at the end of this subsection)
These metrics and datasets collectively provide a comprehensive evaluation of the proposed methods’ effectiveness in various scenarios, from data recycling to theorem formalization and SQL generation.
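Among these metrics, Execution Accuracy (EX) leaves the most room for implementation drift, so a reference sketch helps: a prediction counts as correct when its result set matches the gold query's. The order-insensitive comparison below is one common convention (benchmarks differ in details such as value normalization), not any single paper's exact scorer.

```python
import sqlite3

def execution_accuracy(pairs, db_path):
    """EX: fraction of (predicted, gold) SQL pairs returning identical result sets."""
    conn = sqlite3.connect(db_path)
    hits = 0
    for pred_sql, gold_sql in pairs:
        try:
            pred = sorted(map(tuple, conn.execute(pred_sql).fetchall()))
            gold = sorted(map(tuple, conn.execute(gold_sql).fetchall()))
            hits += int(pred == gold)
        except sqlite3.Error:
            pass          # a prediction that fails to execute counts as wrong
    conn.close()
    return hits / len(pairs)
```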
Topic 6: Model Fine-tuning and Calibration
Topic Overview
The topic of model fine-tuning and calibration focuses on optimizing and adapting large language models (LLMs) for specialized tasks and ensuring their performance remains robust even after compression or when dealing with noisy data. This area is crucial for enhancing the practicality of LLMs in diverse applications, from mathematical reasoning and web interaction to medical order extraction and multimodal fusion. The goal is to improve model efficiency, maintain or enhance accuracy, and promote output diversity, making LLMs more deployable on resource-limited devices and more reliable in real-world scenarios.
Individual Paper Contributions
- Zhiwen Ruan from Southern University of Science and Technology and colleagues studied the inefficiency and lack of diversity in LLMs during supervised fine-tuning for specialized tasks like mathematical reasoning. They proposed Critical Token Fine-tuning (CFT) to solve the issue of uniform token supervision suppressing alternative reasoning paths. The main innovation points of this method are its selective application of gradient updates to critical tokens, identified through a counterfactual perturbation process. The value lies in enhancing the accuracy and diversity of LLMs in reasoning tasks without requiring additional training models. Experiments on datasets like GSM8K and MATH showed significant improvements in accuracy and maintained higher output diversity compared to traditional SFT and other baselines, concluding that CFT provides better initialization for reinforcement learning, supporting sustained performance gains and enhanced exploration capabilities [33]. A masked-loss sketch of CFT appears after this list.
- Tao Yu from CASIA, UCAS, and the University of Waterloo and colleagues addressed the limitation of current web agents that depend on external tools to parse and summarize dynamic web environments. They introduced BrowserAgent, a scalable end-to-end framework that learns directly from real-time web interactions, using a minimal set of atomic browser operations and a two-stage training pipeline (SFT and RFT). The main innovation points include the use of an explicit memory mechanism to store key conclusions across steps, balancing long-term reasoning with real-time perception. The value lies in reducing the data and infrastructure requirements needed for training, improving the efficiency and effectiveness of web agents. Experiments on datasets such as HotpotQA, 2Wiki, and Bamboogle demonstrated significant performance improvements over baselines like Search-R1, particularly in multi-hop reasoning tasks [34].
- Bowei He from the Department of Computer Science, City University of Hong Kong and colleagues tackled the challenge of preserving LLM capabilities after post-training compression techniques. They proposed a Calibration Data curation framework (COLA) to optimize the preservation of LLM capabilities during compression. The main innovation points are the systematic selection and processing of calibration data to maximize representativeness and diversity in the activation space. The value lies in providing a more comprehensive understanding and a novel method to improve the efficacy of post-training compression techniques, enabling efficient deployment on devices with limited computational resources. Experiments on models like LLaMA3-8B and Qwen2.5-7B across various compression methods showed significant improvements in complex reasoning tasks [35].
- Heming Xia from the Department of Computing, The Hong Kong Polytechnic University and colleagues aimed to mitigate the overthinking issue in large reasoning models and closed-source APIs, which leads to excessive computational costs and memory usage. They introduced AdvPrompt, an iterative refinement framework to generate high-quality adversarial prompts that elicit concise reasoning. The main innovation points include the exploration of persuasive prompts, role-playing prompts, and detailed instructions to reduce response length without degrading performance. The value lies in addressing the overthinking issue and improving the efficiency-performance trade-off, providing a black-box solution that does not require additional training. Experiments on benchmarks like GSM8K, MATH-500, AMC 2023, and AIME 2024 showed a 3-fold reduction in average response length for Qwen3 models and up to 47% reduction in token usage for commercial APIs like Gemini-2.5, while maintaining reasoning accuracy [36].
- Bo Yuan from Zhejiang University and colleagues focused on the issue of learning with noisy labels (LNL) in the context of parameter-efficient fine-tuning (PEFT) of LLMs. They proposed Delora, a novel framework that employs dual Low-Rank Adaptation (LoRA) modules to decouple sample selection from model training, thereby avoiding the propagation of initial inaccuracies. The main innovation points are the dynamic regularization technique and the separation of clean and noisy samples. The value lies in providing a robust solution that can generalize across different types of noise and classifier models, enhancing the robustness and reliability of LLMs in practical scenarios. Experiments on synthetic and real-world noisy datasets like Trec, SST-5, and Yorùbá demonstrated superior performance compared to baselines like Co-Teaching, SelfMix, and NoiseAL [37].
- Euhid Aman from NTUST, Taiwan, and colleagues addressed the challenge of deploying large-scale multimodal vision-language models on edge devices with limited resources. They introduced BitMar, a deployable quantized multimodal language model that integrates low-bit encoders for text and vision with an external episodic memory system. The main innovation points include the use of a 1.58-bit quantized text encoder and a vision encoder with additional quantization-aware compression, alongside a cross-modal fusion module and per-layer conditioning. The value lies in achieving a strong balance between quality and speed, suitable for edge deployment. Experiments showed BitMar’s competitive performance on language understanding and multimodal tasks despite its small model footprint and low-latency operation [38].
- A H M Rezaul Karim from George Mason University and colleagues evaluated the effectiveness of LLMs in extracting structured medical orders from unstructured clinical text, specifically multi-turn doctor-patient conversations. They assessed the use of Meta’s LLaMA-4 Scout 17B model in a few-shot prompting setup, demonstrating that general-purpose, instruction-tuned LLMs can achieve competitive performance on specialized clinical NLP tasks. The main innovation points are the reliance on prompt engineering alone, avoiding the need for additional clinical domain pretraining or retrieval augmentation. The value lies in showing that even minimal task-specific guidance can enhance the model’s accuracy in identifying orders and their provenance, contributing to improved decision support and automation in clinical workflows. Experiments on the MEDIQA-OE 2025 shared task revealed significant improvements in the description and order_type subtasks, with an average F1 score of 37.76, but the reason extraction subtask remained challenging [39].
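Critical Token Fine-tuning [33] is, at bottom, ordinary cross-entropy with the loss masked to a subset of positions. The sketch below assumes the critical-token mask has already been produced by the paper's counterfactual perturbation step, which is abstracted away here; everything else is standard PyTorch.

```python
import torch
import torch.nn.functional as F

def critical_token_loss(logits, labels, critical_mask):
    """Cross-entropy restricted to critical tokens.

    logits: (batch, seq, vocab); labels and critical_mask: (batch, seq).
    Non-critical positions receive zero gradient, so alternative phrasings
    of the same reasoning are left unpenalized.
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        reduction="none",
    ).view(labels.shape)
    mask = critical_mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```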
Technical Trends
The papers collectively highlight a trend towards developing more efficient, robust, and adaptable fine-tuning techniques for LLMs. Innovations range from selective fine-tuning strategies to address inefficiencies in reasoning tasks, to frameworks that enable direct learning from dynamic web environments and efficient memory utilization. Additionally, there is a focus on preserving model capabilities after compression and managing noisy labels through decoupled sample selection and regularization. These advancements reflect a growing emphasis on balancing model performance with computational and memory efficiency, as well as the integration of external memory systems and adversarial prompting techniques to enhance model robustness and output diversity.
Datasets and Evaluation
- Mathematical Reasoning: GSM8K, MATH-500, AMC 2023, AIME 2024
- Web Interaction: HotpotQA, 2Wiki, Bamboogle
- Noisy Label Detection: Trec, SST-5, SST-2, Yorùbá, Hausa, AlleNoise
- Multimodal Understanding: DINOv2 features, datasets aligned with language benchmarks like ARC-Easy, BoolQ, HellaSwag, WinoGrande, CommonsenseQA, MMLU
- Medical Order Extraction: MEDIQA-OE 2025 shared task datasets
Evaluation metrics vary by paper and dataset but commonly include accuracy, response length, memory usage, and sample efficiency. Specific metrics noted include F1 scores for medical order extraction, token reduction rates for overthinking mitigation, and precision and recall for noisy label detection.
Topic 7: Speech and Audio Processing
Topic Overview
Speech and audio processing encompasses a broad range of technologies aimed at enabling machines to understand, analyze, and generate speech. This field is crucial for applications ranging from voice assistants and automated customer service to more sophisticated tasks like emotion recognition and speech translation. As large language models (LLMs) advance, there is growing interest in how they can be integrated into speech processing pipelines to improve their performance and capabilities, especially in areas requiring nuanced understanding and generation, such as emotion detection and multilingual speech translation. Additionally, efforts are being made to enhance the interpretability of these models, particularly for high-stakes applications in healthcare and finance.
Individual Paper Contributions
- Jingyi Chen from The Ohio State University and colleagues studied the reliance of large audio language models (LALMs) on lexical versus acoustic cues for emotion understanding in speech. They proposed LISTEN, a new benchmark that measures LALMs’ ability to process emotions through controlled cue manipulation and multimodal evaluation. The main innovation points include the creation of diverse and theoretically grounded test samples based on psycholinguistic theory and the use of a zero-shot evaluation protocol with standardized prompts generated by GPT-5. The value lies in providing a deeper understanding of LALMs’ processing capabilities, which is essential for improving AI systems in social and emotional interactions. Experiments on various datasets showed that LALMs exhibit a consistent lexical dominance, limiting their true listening ability; gaps between actual accuracy and a prediction-marginal baseline indicate how much genuine lexical or acoustic information the models actually use, leading to the conclusion that models need further development to better exploit acoustic information [40].
- Nam Luu from Charles University and colleagues addressed the challenge of developing an end-to-end Speech Translation (ST) system that integrates speech foundational models and LLMs. They proposed an architecture that combines pre-trained acoustic models (HuBERT and Whisper encoders) with LLMs to perform both ASR and ST without an intermediate step. The main innovation points involve the use of a Projection layer to align speech feature dimensions with the LLM’s embedding space and exploring length adapters to handle sequence length discrepancies. The value lies in simplifying the overall architecture and potentially improving performance and efficiency for real-world multilingual communication. Evaluations on MuST-C, IWSLT, and LibriSpeech datasets indicated that models using Gemma 2 9B as the decoder show superior ASR and ST performance, especially when paired with the Whisper encoder, surpassing the cascaded Whisper+NLLB system in most metrics. This suggests the potential of integrating LLMs with speech foundational models for end-to-end speech translation [41]. A projection-layer sketch appears after this list.
- Yibo Yang from The Hong Kong University of Science and Technology and colleagues tackled the issue of interpretability in deep learning models, particularly in high-stakes domains like healthcare and finance. They introduced the Concept Language Model Network (CLMN), a neural-symbolic framework that enhances interpretability through continuous concept embeddings and fuzzy logic-based reasoning. The main innovation points include projecting concepts into an interpretable embedding space and introducing adaptive concept interaction modeling. The value lies in achieving a balance between interpretability and performance in NLP tasks, which is critical for trust and accountability in sensitive applications. Experiments on the augmented CEBaB dataset (aug-CEBaB-yelp) demonstrated that CLMN outperforms existing concept-based methods in accuracy and explanation quality when applied to various PLMs, showing significant improvements in concept prediction accuracy and macro F1 scores while maintaining competitive final prediction accuracy. Ablation studies further confirmed the effectiveness of integrating concept loss and neural symbolic components [42].
- Jianjin Wang and colleagues focused on enhancing the quality of speech-to-speech translation by addressing the issue of sparse semantic representations in speech tokens. They proposed MTP-S2UT, a model that incorporates multi-token prediction (MTP) loss into intermediate layers where Connectionist Temporal Classification (CTC) loss is computed. The main innovation points are the application of MTP loss at earlier layers to strengthen hidden representations and reduce prediction uncertainty. The value lies in improving the applicability and accuracy of speech-to-speech translation systems in practical scenarios such as international conferences and cross-border communication. Experiments on the CVSS-C benchmark dataset for French to English and Spanish to English speech translation tasks revealed that MTP-S2UT significantly improves ASR-BLEU scores and reduces speech token prediction uncertainty compared to baseline and other MTP variants, concluding that integrating MTP loss into S2UT models can effectively enhance speech translation performance [43].
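The glue in the end-to-end ST architecture [41] is a small trainable module that maps encoder frames into the LLM's embedding space while shortening the sequence. Below is a minimal sketch with illustrative dimensions (Whisper-large encoder states are 1280-d and Gemma 2 9B embeddings 3584-d); the paper's exact adapter design may differ.

```python
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Length-adapt and project acoustic states into an LLM embedding space."""

    def __init__(self, d_speech=1280, d_llm=3584, stride=2):
        super().__init__()
        # Strided 1D conv halves the frame count (a simple length adapter).
        self.adapter = nn.Conv1d(d_speech, d_speech, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(d_speech, d_llm)

    def forward(self, x):                        # x: (batch, frames, d_speech)
        x = self.adapter(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)                      # (batch, frames // stride, d_llm)

# The projected states are then consumed by the LLM alongside text embeddings.
```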
Technical Trends
The papers in this collection reflect evolving trends in speech and audio processing, particularly in the integration of large language models with specialized speech processing techniques. There is a clear emphasis on enhancing model performance through innovative architectural designs and loss functions, such as the use of projection layers, length adapters, and multi-token prediction. Additionally, there is a growing focus on interpretability and the fusion of neural and symbolic reasoning to make model decisions more transparent and understandable, which is vital for trust in high-stakes applications.
Datasets and Evaluation
- LISTEN Benchmark: Used for measuring LALMs’ reliance on lexical versus acoustic cues in emotion understanding; no specific underlying datasets are mentioned for this benchmark.
- MuST-C, IWSLT, and LibriSpeech Datasets: Employed in the evaluation of end-to-end speech translation systems, assessing both ASR and ST performance using Word Error Rate (WER; a reference implementation appears at the end of this subsection), BLEU, and COMET-family metrics.
- Augmented CEBaB Dataset (aug-CEBaB-yelp): Enriched with human-annotated and ChatGPT-generated concepts, used for training and evaluating sentiment classification models with CLMN.
- CVSS-C Benchmark Dataset: Utilized for evaluating speech-to-unit translation models in French to English and Spanish to English tasks, focusing on ASR-BLEU scores and entropy as key metrics.
These datasets and evaluation metrics provide a robust foundation for assessing the advancements and improvements proposed by the respective models, contributing to the broader field of speech and audio processing.
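Since WER anchors the ASR side of these evaluations, a compact reference implementation is worth having. This is the textbook word-level edit distance, not any paper-specific variant.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                    # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                    # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # 0.33: one insertion over three words
```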
Topic 8: Evaluation and Auditing of Models
Topic Overview
The evaluation and auditing of models are crucial aspects of artificial intelligence research, ensuring that models perform effectively and ethically across various domains. With the increasing complexity and widespread adoption of AI models, particularly large language models (LLMs), there is a growing need for benchmarks and frameworks that can comprehensively assess these models’ capabilities, limitations, and impacts. This includes not only their performance on practical tasks but also their ability to contribute to scientific research and handle real-world data efficiently. Moreover, understanding how these models integrate and update knowledge dynamically, and how they align with user expectations and evolving societal needs, is vital for their continued development and deployment.
Individual Paper Contributions
-
Qiran Zou from National University of Singapore and colleagues studied the evaluation of automatic machine learning (ML) research agents on fundamental ML problems, proposing FML-bench to move assessment of these agents beyond application-focused tasks. The main innovation points of this method are its five-dimensional evaluation protocol covering utility, diversity, academic contribution rate, cost, and step success rate, as well as the formalization of a unified optimization framework for iterative refinement. The value lies in providing a more holistic assessment of research agents’ capabilities, emphasizing their potential to contribute to scientific discovery and innovation. Experiments on various tasks within FML-bench showed that broad exploration strategies, exemplified by TheAIScientist, led to more effective solutions compared to specialized or general-purpose agents, though the latter were more efficient in terms of resource consumption 44.
-
Shaobo Wang from EPIC Lab, SJTU, and colleagues addressed the redundancy and inefficiency in LLM benchmark datasets, proposing EssenceBench to solve this problem. EssenceBench employs a Genetic Algorithm (GA) to iteratively identify and eliminate redundant samples based on text-level and ranking-level redundancies. The main innovation points are the coarse-to-fine framework and the combination of fitness-based subset search with attribution-based sample search (a toy sketch of this subset search appears after this list). The value lies in making LLM evaluations more scalable and efficient, crucial for rapid development cycles and resource-limited environments. Comparative results across datasets like HellaSwag, GSM8K, ARC, WinoGrande, and MMLU demonstrated that EssenceBench achieved superior compression results with lower reconstruction errors and higher efficiency compared to existing methods, indicating its effectiveness in reducing dataset sizes while maintaining evaluation fidelity 45.
-
Federica Bologna from Cornell University and colleagues tackled the challenge of reliably evaluating long-form clinical QA systems under resource constraints, introducing LongQAEval. The framework addresses the difficulty in achieving consistent human judgments and the high cost of expert annotators by proposing a dataset of 300 question-answer pairs annotated by six medical experts. The main innovation points include a systematic comparison of coarse-grained and fine-grained annotation designs and the examination of LLMs’ roles in the evaluation process. The value lies in its tailored approach to clinical QA evaluation, enhancing the reliability and fairness of system-level ratings. Experiments on the K-QA dataset revealed that fine-grained annotations improved inter-annotator agreement for correctness and mitigated biases related to response length, while LLMs like GPT-4 and Llama-3.1-Instruct-405B performed comparably to physicians on primary care questions but fell short on safety assessments 46.
-
Jingyi Wu from Zhejiang University and colleagues explored the mismatch between content creator themes and audience preferences on digital platforms, focusing on TED Talks. They proposed the use of Latent Dirichlet Allocation (LDA) modeling and introduced the ‘difference index’ to quantify the gap between speaker and audience preferences. The main innovation point is the emphasis on temporal dynamics in shaping consumer engagement, challenging traditional views that prioritize content features. The value lies in providing a nuanced understanding of how timing and thematic content affect audience engagement, which is essential for optimizing content strategies in competitive digital environments. Analysis of 4,475 TED Talks from 2006 to 2022 showed that temporal factors have a more significant impact on audience engagement than thematic content, suggesting that content creators must consider timing alongside content quality to maximize engagement 47.
-
Geunyeong Jeong from Konkuk University and colleagues focused on the static nature of knowledge in LLMs, proposing the Steam framework to enhance the semantic-level integration of edited knowledge. Steam introduces Latent Positioning and Latent-Level Alignment components to improve the coherence and reliability of knowledge editing processes. The main innovation point is the use of semantic anchors to guide the internal representation of edited facts. The value lies in addressing the limitations of token-level likelihood optimization, promoting a more coherent knowledge editing process. Experiments on the CounterFactPlus dataset demonstrated that Steam significantly enhanced the Portability metric, showing notable improvements in multi-hop reasoning tasks and overall semantic consistency across different models and batch sizes, without compromising local factual accuracy or generalization 48.
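As promised above, here is a toy sketch of fitness-based benchmark subset search in the spirit of EssenceBench; the fitness function (Spearman correlation between model rankings on the subset and on the full benchmark), the population size, and the mutation operator are all simplified assumptions rather than the paper's implementation.

```python
# Toy GA for selecting a small benchmark subset that preserves model rankings.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((20, 500))            # hypothetical: 20 models x 500 samples
full_rank = np.argsort(scores.mean(axis=1))

def fitness(mask: np.ndarray) -> float:
    # Spearman correlation between model rankings on the subset vs. the full set.
    sub_rank = np.argsort(scores[:, mask].mean(axis=1))
    return float(np.corrcoef(np.argsort(full_rank), np.argsort(sub_rank))[0, 1])

def mutate(mask: np.ndarray, n_swap: int = 5) -> np.ndarray:
    # Swap a few selected samples for unselected ones, keeping subset size fixed.
    child = mask.copy()
    on, off = np.flatnonzero(child), np.flatnonzero(~child)
    child[rng.choice(on, n_swap, replace=False)] = False
    child[rng.choice(off, n_swap, replace=False)] = True
    return child

# Evolve a population of 50-sample subsets, keeping the fittest half each round.
pop = [np.isin(np.arange(500), rng.choice(500, 50, replace=False)) for _ in range(16)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:8] + [mutate(p) for p in pop[:8]]
print("best fitness:", fitness(pop[0]))
```

The same skeleton accommodates the coarse-to-fine idea by first searching with large swaps and then shrinking `n_swap` as the subset stabilizes.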
Technical Trends
The papers collectively highlight a trend towards developing more sophisticated evaluation protocols and frameworks for assessing AI models, particularly in niche areas such as fundamental ML research, clinical QA, and knowledge editing in LLMs. There is an evident move away from simplistic, application-oriented benchmarks towards more complex, scientifically rigorous evaluation methods. Additionally, there is a growing interest in leveraging advanced algorithms like Genetic Algorithms and Latent Dirichlet Allocation (LDA) to refine and understand model behavior and performance.
Datasets and Evaluation Metrics
-
FML-bench: Uses real-world codebases to evaluate automatic ML research agents on fundamental ML problems such as generalization, data efficiency, representation learning, continual learning, causality, robustness, privacy, and fairness. Evaluation metrics include utility, diversity, academic contribution rate, cost, and step success rate.
-
EssenceBench: Employs HellaSwag, GSM8K, ARC, WinoGrande, and MMLU datasets. Evaluation metrics focus on compression ratio, reconstruction error, and ranking preservation.
-
LongQAEval: Utilizes the K-QA dataset with 300 question-answer pairs. Evaluation metrics include inter-annotator agreement (IAA), response length bias, and system-level ratings across correctness, relevance, and safety.
-
Steam: Uses the CounterFactPlus dataset for evaluating the effectiveness of knowledge editing. Metrics include Edit score, which is the harmonic mean of Efficacy, Paraphrase, Neighborhood, and Portability scores, with a particular focus on Portability.
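For reference, the Edit score described above is the harmonic mean of its four component scores; writing Efficacy, Paraphrase, Neighborhood, and Portability as $E$, $P$, $N$, and $Pt$:

$$\text{Edit} = \frac{4}{\frac{1}{E} + \frac{1}{P} + \frac{1}{N} + \frac{1}{Pt}}$$

As with any harmonic mean, a single weak component (often Portability) dominates the aggregate, which is why improvements there move the Edit score disproportionately.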
These papers collectively emphasize the necessity for adaptable and comprehensive evaluation methodologies to ensure that AI models meet the evolving demands of their intended applications and environments.
Topic 9: Neural Network Architectures and Techniques
Topic Overview
Neural network architectures and techniques are at the forefront of advancements in artificial intelligence, particularly in natural language processing (NLP). As these models grow in complexity and size, there is a pressing need to explore innovative ways to enhance their functionality, efficiency, and ethical considerations. This collection of papers delves into various aspects of neural network architectures, including scaling context lengths for diffusion models, improving reproducibility and performance in sequence labeling, detecting sarcasm using deep learning, mitigating biases in large language models, creating lightweight baselines for medical abstract classification, designing efficient multilingual neural machine translation systems, and enhancing reasoning capabilities in large language models.
Individual Paper Contributions
-
Guangxin He from HKUST and colleagues studied the challenge of extending the context window of diffusion large language models (LLMs) beyond their original training limits, proposing UltraLLaDA to solve this problem. The main innovation points of this method are the adaptation of the Neural Tangent Kernel (NTK) method for diffusion LLMs, and the exploration of adaptive masking and end-of-document concatenation strategies for post-training. The value lies in broadening the applicability of diffusion LLMs to handle longer texts efficiently, maintaining coherence and complex dependency management. Experiments on benchmarks such as PPL-128K, NIAH-128K, LongBench-16K, and RULER-32K showed that UltraLLaDA outperformed both training-free baselines and the original LLaDA base model, achieving superior performance in long-range information recall and consistency maintenance 49.
-
Anirudh Ganesh from The Ohio State University and colleagues focused on improving the reproducibility and understanding of the BiLSTM-CNN-CRF architecture for sequence labeling tasks, specifically named entity recognition (NER) and part-of-speech (POS) tagging. Their contribution is a thorough reproducibility study and open-source PyTorch implementation of the architecture. The value lies in verifying the effectiveness of the existing architecture independently, providing a detailed breakdown of its components, and ensuring future researchers can reliably build upon this work. The ablation study revealed that the combination of character-level CNN, BiLSTM, and CRF layers is crucial for high performance, achieving 91.18% F1-score on the CoNLL-2003 NER dataset and 97.52% accuracy on Penn Treebank WSJ POS tagging, closely matching original results 50.
-
Yawen Yang and colleagues addressed the issue of discontinuous named entity recognition in biomedical texts, proposing the Gap-aware grid tagging model (GapDNER). The key innovation is treating context gaps as special spans and utilizing biaffine mechanisms, linear attention, and criss-cross attention to capture intra- and inter-span features. The value lies in enhancing the precision and recall of entity recognition in complex biomedical structures, which is essential for downstream tasks like entity linking and relation extraction. Experiments on datasets like CADEC, ShARe13, and ShARe14 showed that GapDNER outperforms other grid tagging models, seq2seq-based methods, and baselines, with an improvement of up to 1.86% F1 score on ShARe14, highlighting its effectiveness in recognizing discontinuous entities 51.
-
Manas Zambre and colleagues tackled the problem of accurately detecting sarcasm in textual communications, particularly in social media, proposing a modular deep learning framework that incorporates DCNNs and contextual models like BERT. The framework includes specialized modules for sentiment analysis, contextual embedding, linguistic feature extraction, and emotion detection, allowing for flexible and extensible architecture. The value lies in its ability to detect sarcasm effectively even when textual cues are ambiguous, with the addition of visual cues further boosting accuracy. Evaluations on a multimodal dataset of text-image tweet pairs demonstrated that BERT alone achieved 88.6% accuracy, while the combined multimodal model reached 93.2%, showcasing the benefits of multimodal learning 52.
-
Tingxu Han and colleagues investigated the systematic bias in LLMs generated through different prompting strategies, introducing DiffHeads to analyze and mitigate this bias. The method involves assigning importance scores to attention heads to identify those responsible for biased outputs and selectively masking them. The value lies in providing a deeper understanding of the internal mechanisms that cause bias and offering a novel framework to address fairness issues within LLMs without compromising overall performance. Experiments on eight representative LLMs showed an average 44.85% improvement in fairness for models like Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct, with minimal impact on performance metrics 53.
-
Jiaqi Liu and colleagues developed lightweight baselines for the classification of medical abstracts, focusing on DistilBERT with cross-entropy loss. The innovation lies in a systematic fine-tuning approach with controlled computational budgets, evaluating the impact of different loss functions on model performance. The value is in achieving competitive performance with reduced computational costs, making the model suitable for healthcare settings with limited resources. Experiments concluded that DistilBERT with standard cross-entropy loss yields the best balance of accuracy (64.61%) and macro-F1 score (64.38%), outperforming BERT-base and alternative loss functions 54.
-
Hong Su from Chengdu University of Information Technology and colleagues explored the enhancement of LLMs’ reasoning capabilities, proposing a Layered Intuition–Method Model with Scope Extension. The model combines intuition-based reasoning with method-based reasoning, introducing scope extension strategies across multiple dimensions. The value lies in the potential to improve LLMs’ adaptability to indirect or unseen issues, which is crucial for robust and versatile applications. The entropy of method extension was introduced as a metric to measure adaptability, revealing that systems with higher entropy perform better in handling diverse and unseen problems 55.
-
Mukul Lokhande and colleagues presented Bhasha-Rupantarika, an algorithm-hardware co-design approach for multilingual neural machine translation, aiming to create a lightweight system suitable for deployment in resource-limited settings. The innovation is in the distillation of the NLLB-200 model at ultra-low precision levels (INT4 and FP4), integrated with FPGA accelerators for rapid inference (a toy INT4 quantization sketch follows this list). The value is in achieving significant reductions in model size and latency while maintaining translation quality, making it deployable on edge devices and IoT. Experiments demonstrated a 4.1 times reduction in model size, a 4.2 times speedup in inference speed, and a 2.2 times increase in throughput compared to OPU, indicating its suitability for low-resource environments 56.
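As referenced above, here is a minimal sketch of symmetric INT4 quantization, the kind of ultra-low-precision step Bhasha-Rupantarika builds on; per-tensor scaling and round-to-nearest are assumptions, and the paper's FPGA pipeline is far more involved.

```python
# Minimal symmetric INT4 quantize/dequantize round trip (illustrative only).
import numpy as np

def quantize_int4(w: np.ndarray):
    # Symmetric per-tensor scale: map max |w| onto the positive INT4 limit 7.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit range [-8, 7]
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4(w)
print("max abs error:", np.abs(w - dequantize_int4(q, s)).max())
```

In practice such weights are packed two-per-byte and per-channel or per-group scales replace the single per-tensor scale used here.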
Technical Trends
The papers in this collection highlight several evolving trends in neural network architectures and techniques. These include the use of specialized positional embeddings to scale context lengths, modular deep learning frameworks for tackling specific challenges like sarcasm detection, the application of hardware co-design principles to optimize computational efficiency, and the development of systematic methods to address bias and improve reasoning capabilities. There is a noticeable shift towards designing models that are not only powerful but also lightweight, efficient, and ethically sound, reflecting the broader goals of making AI technologies more accessible and reliable.
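As a concrete instance of the positional-embedding trend, the NTK-aware context-extension trick that UltraLLaDA adapts can be sketched as a rescaling of the RoPE base; the generic form below may differ in detail from the paper's diffusion-specific variant.

```python
# NTK-aware RoPE base rescaling: enlarge the base so low-frequency components
# stretch to cover a context `scale` times longer than the training window.
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0, scale: float = 1.0) -> np.ndarray:
    ntk_base = base * scale ** (dim / (dim - 2))
    return 1.0 / ntk_base ** (np.arange(0, dim, 2) / dim)

print(rope_inv_freq(dim=128, scale=1.0)[:3])   # original frequencies
print(rope_inv_freq(dim=128, scale=32.0)[:3])  # stretched for ~32x longer context
```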
Datasets and Evaluation Metrics
- UltraLLaDA: Benchmarks include PPL-128K, NIAH-128K, LongBench-16K, and RULER-32K.
- BiLSTM-CNN-CRF Reproducibility Study: Datasets used are CoNLL-2003 for NER and Penn Treebank WSJ for POS tagging.
- GapDNER: Datasets include CADEC, ShARe13, and ShARe14.
- Sarcasm Detection Using DCNNs: Evaluated on a multimodal dataset of text-image tweet pairs.
- DiffHeads: Tested across eight representative LLMs, with metrics including Code-BLEU and accuracy on mathematical reasoning and knowledge comprehension tasks.
- Lightweight Baselines for Medical Abstract Classification: Utilizes the medical_abstracts dataset, with evaluation metrics including accuracy and macro-F1 score.
- Bhasha-Rupantarika: Evaluations conducted on 1000 queries per language, assessing memory footprint, latency, and throughput.
- Layered Intuition–Method Model: No specific datasets were mentioned; the paper focuses on theoretical contributions and the entropy of method extension as a novel metric.
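The summary does not give a formula for the entropy of method extension; assuming a standard Shannon entropy over the empirical distribution $p_i$ of extension-method usage, it would take the form

$$H = -\sum_i p_i \log p_i,$$

with higher $H$ indicating a more diverse repertoire of extension methods and, per the paper's claim, better adaptability to unseen problems.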
Topic 10: Healthcare and Medical Applications
Topic Overview
The field of healthcare and medical applications is rapidly evolving, driven by advancements in artificial intelligence (AI) and natural language processing (NLP). These technologies are increasingly being utilized to enhance the efficiency, accuracy, and accessibility of medical services. From improving coreference resolution in clinical narratives to developing multi-agent systems for medical consultations and creating inclusive communication tools, the research efforts aim to address critical challenges within the medical domain. The importance of these studies lies in their potential to improve patient care, streamline medical workflows, and reduce barriers to accessing health information for underserved populations.
Individual Paper Contributions
-
Kangyang Luo from Tsinghua University and colleagues studied the limitations of current coreference resolution (CR) methods in balancing performance and computational efficiency. They proposed ImCoref-CeS, an improved lightweight pipeline that integrates a supervised model with LLM-based reasoning to refine coreference clusters. The main innovation points include a lightweight bridging module (LBM) for long-text encoding, a biaffine scorer for capturing positional information, and a hybrid mention regularization (HyMR) strategy for training efficiency. The value lies in the framework’s ability to achieve higher accuracy while maintaining a balance with computational efficiency, making it suitable for a range of applications including text summarization and knowledge graph construction. Experiments on datasets like OntoNotes and WikiCoref showed an Avg. F1 score of 86.0% and 73.2%, respectively, outperforming baselines like Maverick${}_{\text{mes}}$. The paper concludes that integrating LLMs as reasoning-augmented components significantly improves CR, especially in cross-domain generalization, but at the cost of increased inference time 57.
-
Lei Gu from Peking University and colleagues focused on diagnosing and quantifying collaborative failure modes in medical multi-agent systems used for consultations. They introduced MedAgentAudit, a structured taxonomy and empirical study encompassing 3,600 interaction logs across six medical datasets and six multi-agent frameworks. The main innovation is the development of a mixed-methods approach that combines qualitative and quantitative analysis, along with a quantitative auditing framework to assess information loss, viewpoint shift attribution, collaboration quality, and conflict resolution. The value lies in providing transparency and reliability to the decision-making processes within these systems, which is essential for gaining clinical and public trust. While the paper does not explicitly compare against baselines or provide specific metrics, it identifies several failure patterns and underscores the importance of transparent reasoning pathways in medical AI 58.
-
Prawaal Sharma from Infosys and colleagues addressed the digital communication barrier faced by semi-literate individuals. They proposed NIM, a neuro-symbolic ideographic metalanguage designed to facilitate inclusive communication by decomposing complex ideas into simpler, atomic concepts. The innovation points include the use of BERT embeddings and BIRCH clustering for ontology creation, and the employment of LLMs for handling out-of-vocabulary (OOV) concepts. The value lies in promoting socio-digital inclusion by enabling underprivileged populations with limited formal education to engage more effectively in digital communication. Empirical evaluation demonstrated over 80% semantic comprehensibility and an 8.5 times improvement in learnability compared to benchmarking studies, indicating high user satisfaction and interest in the system 59.
-
Zilong Wang from Ningbo Institute of Digital Twin and colleagues tackled the challenge of efficient and accurate extraction of structured information from copy-heavy documents. They developed a hybrid OCR-LLM framework tailored for different document structures and modalities, employing four LLM-based paradigms—Direct, Replace, Table, and Multimodal—and integrating multiple OCR engines. The innovation points are the document-aware method selection strategy and the diverse empirical evaluation covering various document formats. The value lies in achieving high precision with sub-second response times, which is critical for enterprise-scale operations. Experiments revealed that table-based extraction methods delivered perfect F1 scores for structured documents with latencies under 1 second, while PaddleOCR achieved an F1 score of 0.997 for image-based documents with a significant speedup over multimodal methods. The conclusion highlights the necessity of format-aware extraction strategies to optimize both accuracy and efficiency in production pipelines 60.
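The document-aware method selection can be pictured as a simple dispatcher; the routing rules and method names below are assumptions loosely based on the four paradigms described (Direct, Replace, Table, Multimodal), not the paper's actual policy.

```python
# Illustrative format-aware routing for a hybrid OCR-LLM extraction pipeline.
from pathlib import Path

def select_extraction_method(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix in {".xlsx", ".csv"}:
        return "table"        # structured: table-based extraction
    if suffix in {".png", ".jpg", ".jpeg"}:
        return "replace"      # image: OCR text substituted into the LLM prompt
    if suffix == ".pdf":
        return "multimodal"   # mixed layout: vision-capable LLM
    return "direct"           # plain text: feed content to the LLM directly

for f in ["orders.xlsx", "scan.png", "contract.pdf", "notes.docx"]:
    print(f, "->", select_extraction_method(f))
```

The reported results (perfect F1 for table-based extraction on structured files, PaddleOCR at 0.997 F1 for images) are exactly the kind of evidence such a dispatcher encodes.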
Technical Trends
The papers collectively demonstrate a trend towards integrating large language models (LLMs) with traditional machine learning techniques to enhance performance in healthcare and medical applications. There is a clear emphasis on improving the robustness and reliability of AI systems through innovative architectural designs, such as the lightweight bridging module in ImCoref-CeS and the hybrid OCR-LLM framework in the document information extraction study. Additionally, there is a growing focus on ensuring transparency and interpretability in AI-driven medical consultation systems, as seen in the MedAgentAudit study. Lastly, the use of neuro-symbolic AI principles to bridge the digital divide among semi-literate populations showcases an emerging approach to inclusivity in technology.
Datasets and Evaluation
- ImCoref-CeS evaluated on OntoNotes and WikiCoref, using the Avg. F1 score as a primary metric.
- MedAgentAudit used a large-scale dataset comprising 3,600 interaction logs across six diverse medical datasets and six multi-agent frameworks, focusing on qualitative and quantitative auditing metrics rather than accuracy scores.
- NIM was empirically tested with semantic comprehensibility and learnability metrics, showing significant improvements over benchmarking studies without explicit baseline comparisons.
- Hybrid OCR-LLM Framework assessed across various document formats including PNG, DOCX, XLSX, and PDF, with F1 scores and latency serving as key evaluation metrics.
These studies collectively emphasize the importance of selecting appropriate evaluation metrics that reflect not only the accuracy of AI systems but also their efficiency, reliability, and usability in real-world medical applications.
Topic 11: misc
Topic Overview
The research topic of this collection of papers revolves around the development and enhancement of large language models (LLMs) through various innovative techniques and frameworks. These papers address different facets of LLMs, including their efficiency, adaptability, and ethical considerations. The overarching goal is to improve the practical utility of LLMs in real-world applications by tackling issues such as data scarcity, computational costs, and the need for diverse and context-sensitive responses. This is particularly relevant in domains such as healthcare, education, and finance, where the reliability and effectiveness of AI systems are paramount.
Individual Paper Contributions
-
Zhichao Wang from Inflection AI and colleagues studied inference-time scaling strategies for large language models (LLMs), proposing a systematic review that categorizes these strategies into output-focused and input-focused methods. The main innovation points are the detailed analysis and organization of techniques like CoT, ToT, ReAct, MCTS, and RAG, which enhance model performance without additional training. The value lies in providing a comprehensive framework for understanding and applying these techniques, making LLMs more adaptable and efficient. As a survey, it reports no specific datasets or baseline comparisons 61.
-
Zhichao Xu from University of Utah and colleagues addressed the inefficiency in retrieval-augmented generation (RAG) systems, particularly when trained with reinforcement learning (RL), proposing RECON (REasoning with CONdensation). The main innovation points are the integration of an explicit summarization module to compress retrieved evidence, which is trained in two stages: relevance pretraining on MS MARCO and distillation from GPT-4o-mini. The value lies in demonstrating the effectiveness of summarization in reducing token consumption and improving training speed and inference latency. Experiments on seven public QA benchmarks showed an average EM score improvement of 14.5% for the 3B model and 3.0% for the 7B model, with a 5.2% improvement in training speed and a 30.9% reduction in inference latency 62.
-
Ananya Malik from Northeastern University and colleagues investigated the influence of multi-demographic personas on LLMs’ empathy, proposing a framework to measure affective and cognitive empathy. The main innovation points are the use of the ISEAR dataset to simulate conversations and the DegreeD metric to measure thematic transitions and behavioral patterns. The value lies in providing a methodological advancement for understanding LLM empathy across varied demographics, highlighting the need for improved empathy-awareness in AI systems. Experiments revealed notable variation in empathy across different demographic attributes, with Confucian-culture personas yielding lower emotion-intensity scores and gender-queer personas expressing higher anger 63.
-
Ruize An from Beihang University and colleagues proposed Text2Token, an unsupervised text representation learning framework that uses token target prediction instead of contrastive learning. The main innovation points are the two-stage training strategy using data-driven and model-derived targets, which avoids the need for annotated or generated datasets. The value lies in achieving competitive performance on the MTEB v2 benchmark, especially in clustering, reranking, and retrieval tasks. Text2Token outperformed LLM2Vec-unsup and Llama2Vec (EBAE) in these tasks, demonstrating robustness and alignment during fine-tuning 64.
-
Yijie Xu from The Hong Kong University of Science and Technology (Guangzhou) and colleagues introduced Synergistic Test-time Adaptation (SyTTA), a framework that adapts LLMs to new domains using 4-16 extra tokens per query. The main innovation points are the coupling of input perplexity reduction with output entropy minimization and the use of a Dynamic Importance Weighting rule. The value lies in reducing the dependency on costly domain-specific data and enhancing LLMs’ adaptability to specialized knowledge. Experiments on DomainBench and InstructBench showed SyTTA outperforming baselines such as Tent, EATA, and TLM, with over 120% gain in ROUGE-Lsum on the Agriculture dataset for the Qwen 2.5-7B model 65.
-
Liang Pang from Chinese Academy of Sciences and colleagues provided a survey on LLM sourcing, introducing a unified framework that integrates model and data perspectives. The main innovation points are the categorization of methods into prior-based and posterior-based paradigms and the comprehensive coverage of four dimensions: Model, Model Structure, Training Data, and External Data. The value lies in offering a cohesive perspective on model provenance, which is crucial for accountability, traceability, and risk mitigation. The paper reviewed various datasets like HC3, TweepFake, CHEAT, and CodeContests, noting that no single method outperforms all baselines across every dataset 66.
-
Parthiv Chatterjee from Dhirubhai Ambani University and colleagues proposed PerAugy, a data augmentation method to enhance the diversity of user preference trajectories for personalized text summarization. The main innovation points are the Double Shuffling (DS) and Stochastic Markovian Perturbation (SMP) methods, which create diverse synthetic user interaction graphs. The value lies in improving the performance of user-encoder models and downstream personalized summarization. PerAugy achieved average gains of 24%, 25%, and 18% across AUC, MRR, and nDCG@5 metrics 67.
-
Hengyuan Zhang from The University of Hong Kong and colleagues introduced PerSyn, a personalized data synthesis strategy that shifts from ‘Generate then Select’ to ‘Route then Generate’. The main innovation points are the use of a router-guided mechanism based on the Bradley-Terry model to assign prompts to the best-suited teacher model (a toy routing sketch appears at the end of this list). The value lies in reducing computational costs and improving the efficiency of knowledge distillation from LLMs to smaller student models. PerSyn outperformed traditional baselines such as Strong, Mix, and CAR in instruction tuning and math reasoning, with average improvements of 8.7% on IFEval and 7.5% on MATH 68.
-
Michael Y. Hu from New York University and colleagues developed ECHO (Experience Consolidation via Hindsight Optimization), a framework that enhances the sample efficiency of LM agents by generating and learning from counterfactual trajectories. The main innovation points are the adaptation of hindsight experience replay (HER) from RL to LM agents, enabling them to convert failures into synthetic successes. The value lies in improving the performance of LM agents in environments with sparse rewards and high interaction costs. ECHO outperformed the baseline ReAct LM agent by 80% in average reward on XMiniGrid-Stateful and improved interaction efficiency by reducing the average number of messages needed to resolve queries on PeopleJoinQA-Stateful 69.
-
Jinghao Zhang from University of Science and Technology of China and colleagues proposed RLFR, extending RLVR by incorporating flow rewards derived from the latent space of LLMs. The main innovation points are the use of velocity deviations of policy latents in a flow field and the introduction of a new reward shaping method. The value lies in improving the exploration capabilities of LLMs and mitigating the distribution gap during policy optimization. RLFR outperformed RLVR and other baselines like entropy-based advantage shaping on various language and multimodal reasoning benchmarks 70.
-
Peng Fan from Chengdu University of Technology and colleagues introduced Time Independence Loss (TIL) and Aligned Cross Entropy (AXE) Loss for end-to-end speech recognition with similar-length speech and text. The main innovation points are the replacement of conventional CTC loss and the introduction of frame fusion techniques to preserve more information during downsampling. The value lies in improving the efficiency and performance of ASR models while drastically reducing computational complexity. Experiments on AISHELL-1 and AISHELL-2 showed that the E4 configuration, which uses KFDS with AXE loss, outperformed the baseline B0 on AISHELL-1 with a slight improvement in CER while discarding 87% of frames 71.
-
Kai Zhang and colleagues proposed AssoMem, a scalable memory QA framework that constructs an associative memory graph to anchor dialogue utterances to automatically extracted clues. The main innovation points are the integration of multi-dimensional retrieval signals and the adaptive mutual information (MI)-driven fusion strategy. The value lies in improving context-aware memory recall for AI assistants, enhancing their reliability and user-friendliness. AssoMem outperformed state-of-the-art baselines on LongMemEval_m, LongMemEval_l, and MeetingQA datasets, achieving higher retrieval accuracy and nDCG@10 scores 72.
-
Hakyung Sung from University of Oregon and colleagues introduced the ASC analyzer, a Python package for measuring argument structure construction usage in English texts. The main innovation points are the automatic tagging of ASCs and the computation of 50 indices to capture various aspects of ASC usage. The value lies in providing a scalable and systematic tool for assessing linguistic complexity and proficiency in L2 research. The ASC-based model explained a significant portion of variance in writing scores ($R^{2}_{\text{adj}} = 0.143$) compared to a syntactic complexity-based model ($R^{2}_{\text{adj}} = 0.077$) 73.
-
Luyao Zhuang from Hong Kong Polytechnic University and colleagues proposed LinearRAG, an efficient framework for Graph Retrieval-Augmented Generation (GraphRAG) that constructs a hierarchical graph called Tri-Graph. The main innovation points are the focus on lightweight entity extraction and semantic linking instead of costly relation extraction. The value lies in achieving higher retrieval and generation accuracy while maintaining computational efficiency. LinearRAG showed the highest GPT-based accuracy on the 2Wiki dataset (63.70%) and the highest Contain-based accuracy on the same dataset (70.20%), with minimal indexing and retrieval times 74.
-
Wenqing Wang from Huazhong Agricultural University and colleagues proposed Adaptive Intent-driven Preference Optimization (A-IPO), extending the standard DPO model with an explicit ‘intention module’. The main innovation points are the incorporation of latent user intent and fact-checking of input prompts. The value lies in improving the preference optimization process and handling diverse cultural and adversarial preferences. A-IPO outperformed baselines like DPO, GDPO, few-shot prompts, and supervised fine-tuning (SFT) on various metrics across different datasets 75.
-
Guan-Yan Yang and colleagues introduced ArtPerception, a two-phase black-box jailbreak framework that uses ASCII art to bypass LLM security measures. The main innovation points are the systematic pre-test phase for optimizing ASCII art parameters and the one-shot attack phase. The value lies in providing a strategic approach to reconnaissance and execution, enhancing both efficiency and stealth. ArtPerception achieved high Not Refuse Rates (NRR) and Attack Success Rates (ASR) on datasets like AdvBench and Hex-PHI, outperforming other state-of-the-art jailbreak methods 76.
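As referenced above for PerSyn, here is a toy sketch of Bradley-Terry routing; the teacher scores and the reduction of routing to an argmax are illustrative assumptions, not the paper's trained router.

```python
# Toy Bradley-Terry routing: score each teacher for a prompt, pick the one
# most likely to win any pairwise comparison (the argmax of the scores).
import numpy as np

def bt_win_prob(s_i: float, s_j: float) -> float:
    # Bradley-Terry: probability that teacher i beats teacher j given scores.
    return 1.0 / (1.0 + np.exp(-(s_i - s_j)))

def route(prompt_scores: dict) -> str:
    # Highest score wins every pairwise comparison, so routing is an argmax.
    return max(prompt_scores, key=prompt_scores.get)

scores = {"teacher_a": 1.3, "teacher_b": 0.2, "teacher_c": -0.5}
print(bt_win_prob(scores["teacher_a"], scores["teacher_b"]))  # ~0.75
print("route to:", route(scores))
```

In PerSyn's setting the scores would come from a learned router conditioned on the prompt, so different prompts get routed to different teachers.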
Technical Trends
The technical trends observed in these papers include:
- Inference-Time Scaling Strategies: Focus on enhancing model performance without additional training, covering reasoning, search, and RAG techniques.
- Efficient Data Processing and Retrieval: Techniques such as summarization and context condensation to reduce computational overhead and improve efficiency.
- Personalized and Context-Aware Models: Methods for personalizing data synthesis and handling diverse user preferences and intents.
- Enhanced Exploration in Reinforcement Learning: Frameworks that improve the exploration-exploitation balance in RL, particularly for LLMs.
- Efficient and Scalable ASR Systems: Approaches to align speech and text lengths in ASR systems using novel loss functions and frame fusion techniques.
- Memory and Knowledge Retrieval: Development of associative memory graphs and hierarchical structures to improve memory recall and knowledge integration.
- Empirical Analysis of LLM Vulnerabilities: Exploration of ASCII art-based attacks and the development of robust security measures against such vulnerabilities.
Datasets and Evaluation Metrics
- Datasets and Benchmarks: HotpotQA, Musique, Bamboogle, MS MARCO, OpenReview, PubMed, Yelp, AISHELL-1, AISHELL-2, LongMemEval_m, LongMemEval_l, MeetingQA, AdvBench, Hex-PHI, DomainBench, InstructBench.
- Evaluation Metrics: EM score, AUC, MRR, nDCG@5, ROUGE-Lsum, CER, FID, GPT-based accuracy, Contain-based accuracy, Not Refuse Rate (NRR), Attack Success Rate (ASR), Difficulty-Aware Coefficient Allocation, Initial-Anchored Target Entropy, DegreeD diversity metric, ascMATTR, ascAvFreq, R@10, nDCG@10, pass@1, pass@32, Earth Mover’s Distance.
These papers collectively contribute to advancing the field of LLMs by addressing specific challenges related to scalability, efficiency, personalization, and security, thereby paving the way for more robust and versatile AI systems.
References
- Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data
- A Survey of Inductive Reasoning for Large Language Models
- Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference
- Revisiting Model Interpolation for Efficient Reasoning
- Judge Before Answer: Can MLLM Discern the False Premise in Question?
- Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models
- Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety
- On the Entity-Level Alignment in Crosslingual Consistency
- BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
- HUME: Measuring the Human-Model Performance Gap in Text Embedding Task
- Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
- HiligayNER: A Baseline Named Entity Recognition Model for Hiligaynon
- VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction
- LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System
- FactAppeal: Identifying Epistemic Factual Appeals in News Media
- Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures
- Detecting Hallucinations in Authentic LLM-Human Interactions
- ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios
- DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety
- RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
- The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems
- ADVICE: Answer-Dependent Verbalized Confidence Estimation
- Steering Over-refusals Towards Safety in Retrieval Augmented Generation
- RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
- Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models
- DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems
- CardRewriter: Leveraging Knowledge Cards for Long-Tail Query Rewriting on Short-Video Platforms
- Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
- AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation
- Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning
- BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
- Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization
- Merlin’s Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting
- Weed Out, Then Harvest: Dual Low-Rank Adaptation is an Effective Noisy Label Detector for Noise-Robust Learning
- BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
- Assessing Large Language Models for Structured Medical Order Extraction
- Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
- End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
- CLMN: Concept based Language Models via Neural Symbolic Reasoning
- MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
- FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
- Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
- LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints
- When or What? Understanding Consumer Engagement on Digital Platforms
- STEAM: A Semantic-Level Knowledge Editing Framework for Large Language Models
- UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models
- End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: A Reproducibility Study
- GapDNER: A Gap-Aware Grid Tagging Model for Discontinuous Named Entity Recognition
- Sarcasm Detection Using Deep Convolutional Neural Networks: A Modular Deep Learning Framework
- DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models
- Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default
- A Layered Intuition–Method Model with Scope Extension for LLM Reasoning
- Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation
- ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement
- MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems
- NIM: Neuro-symbolic Ideographic Metalanguage for Inclusive Communication
- Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task
- Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG
- RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation
- Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model’s Empathy
- Text2Token: Unsupervised Text Representation Learning with Token Target Prediction
- You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
- Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers
- Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
- Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
- RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
- End-to-end Speech Recognition with similar length speech and text
- AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval
- ASC analyzer: A Python package for measuring argument structure construction usage in English texts
- LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
- ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test